http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Ipargaru&feedformat=atomstatwiki - User contributions [US]2024-03-28T13:10:17ZUser contributionsMediaWiki 1.41.0http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=5513stat8412009-11-24T00:11:20Z<p>Ipargaru: /* Putting it all together */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== '''Classification - 2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the information age, vast amounts of data are generated constantly, and the goal of classification is to learn from these data. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing, and so on.<br />
<br />
'''Definition''': The problem of predicting a discrete random variable <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math> is called '''classification'''.<br />
<br />
In classification, we attempt to approximate a function <math>\,h</math> using a training data set, so that it will be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a '''classification rule''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, assumed independent and identically distributed (i.i.d.), <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>, <math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{color}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented with its features <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>, i.e., classifies it as either an apple or an orange.<br />
<br />
=== Error rate ===<br />
<br />
:'''True error rate''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:'''Empirical error rate (training error rate)''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, \hat{L}(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is the indicator function <math>\, I(h(X_i) \neq Y_i)= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
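Since the course's code examples are in Matlab, the following is only an illustrative Python/numpy sketch of the empirical error rate; the rule <code>h</code> and the toy data are made up.<br />
<br />
```python
import numpy as np

def empirical_error_rate(h, X, Y):
    """Fraction of training points (X_i, Y_i) with h(X_i) != Y_i."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != np.asarray(Y))

# Toy rule: classify as 1 when the first feature is positive.
h = lambda x: 1 if x[0] > 0 else 0
X = [(1.0, 2.0), (-1.0, 0.5), (2.0, -1.0), (-3.0, 1.0)]
Y = [1, 0, 0, 1]   # h is wrong on the last two points
print(empirical_error_rate(h, X, Y))  # 0.5
```
<br />
Note that this is an estimate of the true error rate computed on the training set; a rule that memorizes the training data can drive this quantity to zero while still generalizing badly.<br />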
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> that <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(x)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes' formula'' we have<br />
<br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimensional cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
The corresponding Bayes classifier is<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remarks:<br />
<br />
1) The Bayes classification rule is optimal; see this [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf proof].<br />
<br />
2) We still need other methods, because in practice the prior probability cannot realistically be determined.<br />
<br />
<br />
'''Example''':<br /><br />
We are going to predict whether a particular student will pass STAT 441/841.<br />
We have data on past student performance. For each student we know:<br />
* whether the student's GPA was greater than 3.0 (G)<br />
* whether the student had a strong math background (M)<br />
* whether the student was a hard worker (H)<br />
* whether the student passed or failed the course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(x)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(x)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0; that is, we predict that the student will fail the course.<br />
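To make the arithmetic concrete, here is a minimal Python sketch of the calculation above; the joint probability 0.1 for the fail class is implied by the stated denominator 0.125.<br />
<br />
```python
# Joint probabilities from the worked example above:
# P(X=(0,1,0)|Y=1) * P(Y=1) = 0.025, and (implied by the denominator 0.125)
# P(X=(0,1,0)|Y=0) * P(Y=0) = 0.100.
joint_pass = 0.025
joint_fail = 0.100

# Bayes' formula: posterior = joint probability / evidence.
r = joint_pass / (joint_pass + joint_fail)
y_hat = 1 if r > 0.5 else 0

print(r)      # 0.2
print(y_hat)  # 0 -> predict "fail"
```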
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN), and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as having an objective existence. In fact, they represent two different schools of thought in statistics.<br />
<br />
In the history of statistics, there have been two major schools of thought: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a unique event that cannot be associated with a frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables that have a given distribution, and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can say that the probability of rain tomorrow is <math>\,50\%</math>.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, one does not see the man (the object), but judges from photos (the label) of this man whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier is optimal, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2 Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density Estimation: Estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math>. <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of this third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the classes have equal priors (<math>\,\pi_k=\pi_l</math>), for example equal numbers of samples from each class, the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
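The boundary coefficients can be read off the final expression above. The following is a hedged Python/numpy sketch (the course's examples use Matlab; the means, covariance, and priors here are invented for illustration):<br />
<br />
```python
import numpy as np

def lda_boundary(mu_k, mu_l, Sigma, pi_k, pi_l):
    """Coefficients (a, b) of the LDA decision boundary a^T x + b = 0,
    read off from the final linear expression derived above."""
    Sigma_inv = np.linalg.inv(Sigma)
    a = Sigma_inv @ (mu_k - mu_l)
    b = (np.log(pi_k / pi_l)
         - 0.5 * mu_k @ Sigma_inv @ mu_k
         + 0.5 * mu_l @ Sigma_inv @ mu_l)
    return a, b

# Two classes with equal priors and a shared covariance.
mu_k = np.array([2.0, 0.0])
mu_l = np.array([0.0, 2.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
a, b = lda_boundary(mu_k, mu_l, Sigma, 0.5, 0.5)

# With equal priors, the midpoint of the two means lies on the boundary.
midpoint = 0.5 * (mu_k + mu_l)
print(a @ midpoint + b)  # ~0
```
<br />
The last line checks the special case just mentioned: with equal priors the midpoint of the two means satisfies the boundary equation.<br />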
<br />
===QDA===<br />
The concept is the same idea of finding a boundary where the error rate for classification between classes are equal, except the assumption that each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math> is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because, with unequal covariance matrices, the <math>\,x^\top\Sigma_k^{-1}x</math> and <math>\,x^\top\Sigma_l^{-1}x</math> terms no longer cancel.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariances of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
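The decision rule in the theorem can be sketched directly. This is an illustrative Python/numpy version (the means, covariances, and priors below are made up):<br />
<br />
```python
import numpy as np

def delta_quadratic(x, mu, Sigma, pi):
    """Quadratic discriminant delta_k from the theorem above."""
    Sigma_inv = np.linalg.inv(Sigma)
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * (x - mu) @ Sigma_inv @ (x - mu)
            + np.log(pi))

def classify(x, mus, Sigmas, pis):
    """h(x) = argmax_k delta_k(x)."""
    deltas = [delta_quadratic(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
    return int(np.argmax(deltas))

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
pis = [0.5, 0.5]

print(classify(np.array([0.2, -0.1]), mus, Sigmas, pis))  # 0
print(classify(np.array([2.8, 3.1]), mus, Sigmas, pis))   # 1
```
<br />
With a common covariance matrix, the same rule could instead use the linear form of <math>\,\delta_k</math> given above, since the terms shared by all classes cancel in the argmax.<br />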
<br />
===In practice===<br />
In practice we do not know the true values of <math>\,\pi_k, \mu_k, \Sigma_k</math>, so we use their sample estimates in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
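These estimators can be sketched as follows (a Python/numpy illustration; the variable names and toy data are invented):<br />
<br />
```python
import numpy as np

def estimate_parameters(X, y):
    """Sample estimates of pi_k, mu_k, Sigma_k and the pooled Sigma,
    following the formulas above (MLE: divide by n_k, not n_k - 1)."""
    n = len(y)
    classes = np.unique(y)
    pis, mus, Sigmas, ns = {}, {}, {}, {}
    for k in classes:
        Xk = X[y == k]
        ns[k] = len(Xk)
        pis[k] = ns[k] / n
        mus[k] = Xk.mean(axis=0)
        centered = Xk - mus[k]
        Sigmas[k] = centered.T @ centered / ns[k]
    # Pooled (common) covariance: weighted average of the class covariances.
    pooled = sum(ns[k] * Sigmas[k] for k in classes) / n
    return pis, mus, Sigmas, pooled

X = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [7.0, 5.0]])
y = np.array([0, 0, 1, 1])
pis, mus, Sigmas, pooled = estimate_parameters(X, y)
print(pis[0], mus[0], pooled)
```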
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
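In this case the classifier reduces to choosing the nearest center after adjusting by the log prior; a small Python sketch (centers and priors are invented for illustration):<br />
<br />
```python
import numpy as np

def classify_identity_cov(x, mus, pis):
    """Case Sigma_k = I: delta_k reduces to
    -(1/2)||x - mu_k||^2 + log(pi_k), so pick the adjusted nearest center."""
    deltas = [-0.5 * np.sum((x - mu) ** 2) + np.log(pi)
              for mu, pi in zip(mus, pis)]
    return int(np.argmax(deltas))

mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]

# With equal priors, a point closer to class 0's center is classified 0 ...
print(classify_identity_cov(np.array([1.9, 0.0]), mus, [0.5, 0.5]))  # 0
# ... but a much larger prior for class 1 can move the boundary.
print(classify_identity_cov(np.array([1.9, 0.0]), mus, [0.1, 0.9]))  # 1
```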
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
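The transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be sketched in a few lines (a Python/numpy illustration with an invented covariance matrix):<br />
<br />
```python
import numpy as np

# Sketch of the whitening transform x* = S^{-1/2} U^T x described above.
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])

# Sigma is symmetric, so its SVD coincides with its eigendecomposition.
U, S, _ = np.linalg.svd(Sigma)
W = np.diag(S ** -0.5) @ U.T   # the map x -> S^{-1/2} U^T x

# Sanity check: the transform turns Sigma into the identity, so the
# Mahalanobis distance under Sigma becomes plain Euclidean distance.
print(np.round(W @ Sigma @ W.T, 6))  # identity matrix
```
<br />
After applying <code>W</code> to every data point, classification proceeds as in Case 1 above.<br />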
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for classification with this method to be applicable. This is why the method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Consider two classes with different shapes, and suppose we transform them to the same shape. Given a data point, which transformation should we use to decide which class the point belongs to? For example, if we use the transformation of class A, we have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class against the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
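A quick sanity check of these counts (illustrative Python; the function names are invented):<br />
<br />
```python
def lda_params(K, d):
    """(K-1)(d+1) parameters for LDA, as derived above."""
    return (K - 1) * (d + 1)

def qda_params(K, d):
    """(K-1)(d(d+3)/2 + 1) parameters for QDA, as derived above."""
    return (K - 1) * (d * (d + 3) // 2 + 1)

# QDA's parameter count grows quadratically in the dimension d,
# which is why QDA is far less robust in high dimensions.
print(lda_params(2, 10))  # 11
print(qda_params(2, 10))  # 66
```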
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of handwritten twos and the last 200 elements are images of handwritten threes. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the <code>princomp</code> function in Matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the full details of this function, but here we analyze its code to see how it differs from the SVD method. The following is the code of <code>princomp</code> with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should note the following differences from the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data of Assignment 1, note that we must take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using SVD and <code>princomp</code> respectively, obtaining the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
We can then verify that <code>y=score</code> and <code>v=U</code>.<br />
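The same equivalence can be sketched in Python with NumPy (our own variable names; random data stands in for 2_3): center the data, take the SVD of the scaled matrix as <code>princomp</code> does, and check that the squared singular values match the eigenvalues of the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # rows = observations, as princomp expects

# Center by subtracting column means (what princomp does internally).
Xc = X - X.mean(axis=0)
m = X.shape[0]

# SVD route: X_c / sqrt(m-1) = U S V'; columns of V are the PC coefficients.
U, s, Vt = np.linalg.svd(Xc / np.sqrt(m - 1), full_matrices=False)
coeff = Vt.T                          # corresponds to princomp's pc output
score = Xc @ coeff                    # representation in PC space
latent = s ** 2                       # eigenvalues of the covariance matrix

# Check: latent matches the eigenvalues of the sample covariance of X.
evals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
print(np.allclose(latent, evals))     # True
```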
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus a second <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> parameters of this additional covariance matrix make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>\,v</math> is a <math>d \times d</math> diagonal matrix, that we cannot estimate directly with LDA.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix containing the original entries squared, to a cubic dimension with the entries cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
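A minimal NumPy sketch of this feature augmentation (the sample matrix here is hypothetical):

```python
import numpy as np

# Hypothetical 2-D sample (rows = observations), standing in for the 2_3 scores.
sample = np.array([[1.0, 2.0],
                   [3.0, -1.0],
                   [0.5, 0.5]])

# Append the squared coordinates as two extra features: x* = [x1, x2, x1^2, x2^2].
X_star = np.hstack([sample, sample ** 2])

print(X_star.shape)        # (3, 4)
print(X_star[0].tolist())  # [1.0, 2.0, 1.0, 4.0]
```

A linear classifier trained on <code>X_star</code> then corresponds to a quadratic decision boundary in the original two dimensions.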
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the prior knowledge that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To separate the two classes we must determine the class whose mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points into a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, so it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say that <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
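This proportionality can be checked numerically. The following NumPy sketch (with illustrative, randomly generated classes) computes a direction proportional to <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> and confirms it matches the top eigenvector of <math>S_{W}^{-1} S_{B}</math>:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two hypothetical 2-D classes (illustrative, not the lecture's data).
X1 = rng.normal(loc=[0.0, 0.0], size=(100, 2))
X2 = rng.normal(loc=[4.0, 2.0], size=(100, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)  # within-class covariance

# FDA direction: w proportional to Sw^{-1} (mu1 - mu2).
w = np.linalg.solve(Sw, mu1 - mu2)
w = w / np.linalg.norm(w)

# Check against the eigenvector of Sw^{-1} S_B with the largest eigenvalue.
Sb = np.outer(mu1 - mu2, mu1 - mu2)
evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
v = np.real(evecs[:, np.argmax(np.real(evals))])
print(np.allclose(abs(w @ v), 1.0))  # the directions coincide (up to sign)
```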
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400, and each column represents an 8*8 image of a "2" or a "3". Here X1 holds all of the "2"s and X2 all of the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(It is more reasonable to have at least 2 directions)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \frac{1}{n_{i}}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to note that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
fixed; since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is another generalization of <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Expanding and noting that the cross terms vanish (since <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i}) = \mathbf{0}</math>), we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
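The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically with scatter matrices (un-normalized sums, matching the derivation above); the data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
# Three hypothetical classes in 2-D (illustrative data).
Xs = [rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [3, 1], [1, 4])]
X = np.vstack(Xs)
mu = X.mean(axis=0)

# Scatter-matrix versions (un-normalized sums, as in the decomposition above).
Sw = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in Xs)
Sb = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in Xs)
St = (X - mu).T @ (X - mu)

print(np.allclose(St, Sw + Sb))  # True: the cross terms vanish exactly
```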
<br />
Recall that in the two class case problem, we had<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
To relate this to the general form, note that with <math>\mathbf{\mu} = \frac{1}{n}(n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2})</math> and <math>\,n = n_{1}+n_{2}</math>,<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2}), \qquad<br />
\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})<br />
\end{align}<br />
</math><br />
<br />
Substituting into the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
= \frac{n_{1}n_{2}}{n}\,\mathbf{S}_{B^{\ast}}<br />
\end{align}<br />
</math><br />
So the general between class covariance is proportional to the two-class version, and the two give the same discriminant directions.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
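This identity is easy to check numerically (a NumPy sketch with a random matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 3))

frob_sq = np.linalg.norm(X) ** 2   # squared Frobenius norm ||X||^2
trace = np.trace(X.T @ X)          # Tr(X^T X)

print(np.isclose(frob_sq, trace))  # True
```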
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\leq k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
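As a quick numerical illustration of the eigenvalue problem above (this sketch is not from the lecture; the data and variable names are made up), we can compute the scatter matrices and the leading eigenvectors of S_W^{-1} S_B in Python:

```python
import numpy as np

# Toy data: k = 3 classes in d = 4 dimensions, so k - 1 = 2 discriminant directions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(50, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 50)

mu = X.mean(axis=0)                       # overall mean
S_B = np.zeros((4, 4))                    # between-class scatter
S_W = np.zeros((4, 4))                    # within-class scatter
for c in (0, 1, 2):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_B += len(Xc) * np.outer(mc - mu, mc - mu)
    S_W += (Xc - mc).T @ (Xc - mc)

# Eigen-decomposition of S_W^{-1} S_B; keep eigenvectors of the largest k - 1 eigenvalues.
evals, evecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(evals.real)[::-1]
W = evecs[:, order[:2]].real              # columns w_1, w_2 span the projection
```

Note that only k - 1 = 2 eigenvalues are (numerically) nonzero here, matching the rank argument above.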
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{d}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Setting the first derivative to zero,<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
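The closed-form solution above can be checked numerically. This is a minimal illustrative sketch (not lecture code); the toy data set is made up so that the fit is exact:

```python
import numpy as np

# n = 5 points, d = 1 feature; the first column of X is all ones for the intercept.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])        # lies exactly on y = 1 + 2x

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
H = X @ np.linalg.inv(X.T @ X) @ X.T           # the hat matrix
y_hat = H @ y                                  # fitted values X beta_hat
```

Since the data lie exactly on a line, the fitted values reproduce y, and H is symmetric and idempotent, as the hat matrix should be.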
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample'; ones(1,400)];<br />
Construct x by adding a row of ones to the transposed data, so that x is a 3-by-400 matrix.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
== Logistic Regression - October 16, 2009 ==<br />
<br />
=== Logistic function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y dx = ln(1 + e^{x})</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>\,x</math>, which slows to linear growth of slope 1/4 near <math>\,x = 0</math>, then approaches <math>\,y = 1</math> with an exponentially decaying gap.<br />
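Properties 1 and 2 above are easy to verify numerically; this is an illustrative sketch (not part of the lecture), with the test point x = 0.7 chosen arbitrarily:

```python
import math

def logistic(x):
    """The logistic (sigmoid) function y = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

# Property 2: y(0) = 1/2.
y0 = logistic(0.0)

# Property 1: dy/dx = y(1 - y), checked with a central finite difference.
x, h = 0.7, 1e-6
numeric = (logistic(x + h) - logistic(x - h)) / (2 * h)
analytic = logistic(x) * (1.0 - logistic(x))
```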
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood, using Pr(Y|X). The maximum likelihood of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the known distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression (2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>. You can check this in the [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html Matrix Reference Manual], a very useful website with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> to one via the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
At each iteration, we therefore perform a weighted linear regression of the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we constructed the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, similar to linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case; however, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. When it does not, only local convergence of the method can be proven, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, choosing an appropriate initial value is rarely a problem: it is uncommon for the initial value to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
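The pseudo code above can be sketched in Python. This is an illustrative implementation, not lecture code; the function name and the small toy data set at the end are made up:

```python
import numpy as np

def irls_logistic(X, Y, max_iter=25, tol=1e-8):
    """Fit logistic regression by the Newton-Raphson / IRLS steps above.

    Follows the lecture's convention: X is d x n (one data point per column),
    Y is a length-n 0/1 label vector, and beta is d x 1 with no separate
    intercept (append a row of ones to X if an intercept is wanted).
    """
    d, n = X.shape
    beta = np.zeros(d)                                      # step 1: beta <- 0
    for _ in range(max_iter):
        P = 1.0 / (1.0 + np.exp(-(X.T @ beta)))             # step 3: P(x_i; beta)
        W = np.diag(P * (1.0 - P))                          # step 4: n x n weights
        Z = X.T @ beta + np.linalg.solve(W, Y - P)          # step 5: adjusted response
        beta_new = np.linalg.solve(X @ W @ X.T, X @ W @ Z)  # step 6
        if np.linalg.norm(beta_new - beta) < tol:           # step 7: stopping rule
            return beta_new
        beta = beta_new
    return beta

# Made-up, non-separable 1-d data with an intercept row of ones.
X = np.vstack([np.ones(8), [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]])
Y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
beta = irls_logistic(X, Y)
P_fit = 1.0 / (1.0 + np.exp(-(X.T @ beta)))
```

At convergence the score equation from the previous lecture holds: X(Y − P) is (numerically) zero.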
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to fit a logistic regression and classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2\geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is the decision boundary found by logistic regression. The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with <math>\,K</math> classes. This model is specified with <math>\,K-1</math> log-odds terms, where the <math>\,K</math>th class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Viewing these equations as a weighted least squares problem makes them easier to derive.<br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the <math>\,K>2</math> class problem isn't as 'nice' as in the 2-class problem, since we don't have the same simplification.<br />
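The posterior formulas above are straightforward to compute. This is an illustrative sketch (not from the lecture); the function name and coefficient values are made up:

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """Posterior probabilities for the K-class logistic model above.

    betas is a (K-1) x d array whose rows are the coefficient vectors
    beta_1, ..., beta_{K-1}; class K is the reference class.
    """
    scores = np.exp(betas @ x)               # exp(beta_i^T x) for i = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores, 1.0) / denom    # last entry is P(Y=K|X=x)

# Hypothetical coefficients for K = 3 classes and d = 2 features.
betas = np.array([[0.5, -1.0],
                  [1.0,  0.2]])
p = multiclass_posteriors(betas, np.array([1.0, 2.0]))
```

All K posteriors are positive and sum to one, as required.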
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
A separating hyperplane classifier tries to separate the data using linear decision boundaries. When the classes overlap, it can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged and transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of least squares regression as a classifier, shown to be identical to LDA. To classify points with least squares, we take the sign of a linear combination of the features and assign a label of +1 or -1.<br />
<br />
Least squares returns the sign of this linear combination as the class label:<br />
<br />
<math>sign(\underline{\beta}^T \underline{x} + \beta_0) = sign(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math><br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the [http://en.wikipedia.org/wiki/Artificial_neural_network Artificial Neural Network] models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Particular to the iterative nature of the solution, the problem is not convex and has no unique global solution. The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between the classes. If the classes are linearly separable, then the algorithm is guaranteed to converge to a separating hyperplane in a finite number of steps. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br />
<br />
<br />
When the classes are separable, the separating hyperplane is not unique, so the perceptron algorithm can return any one of infinitely many solutions.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line ourselves; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the input for the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the total distance between the decision boundary and the misclassified data points, as illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, and it terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary, then<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to the scale factor <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is proportional to the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}</math> (and equal to it when <math>\underline{\beta}</math> has unit length). <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
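As a quick numerical check of this criterion (a sketch in Python rather than the MATLAB used earlier; the function name `phi` and the toy data are ours):<br />

```python
import numpy as np

def phi(beta, beta0, X, y):
    """Perceptron criterion: -sum of y_i * (beta^T x_i + beta0)
    over the misclassified points -- a sum of positive numbers."""
    margins = y * (X @ beta + beta0)   # positive iff point i is correctly classified
    return float(-margins[margins < 0].sum())

# Toy data: the boundary beta=(1,0), beta0=0 classifies by the sign of the first coordinate
X = np.array([[2.0, 1.0], [-1.0, 3.0]])
y = np.array([1, 1])                   # second point is misclassified by this boundary
print(phi(np.array([1.0, 0.0]), 0.0, X, y))   # 1.0
```

Each misclassified point contributes a positive term, so <math>\phi=0</math> exactly when no point is misclassified.<br />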
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a predetermined step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, the algorithm converges quickly; if the gap is small, it converges slowly. This problem can be mitigated with a basis expansion technique: instead of searching for a hyperplane in the original space, we search in the enlarged space obtained by applying some basis functions to the inputs.<br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if the value is too large the algorithm may "skip over" the minimum it is trying to find, possibly oscillating forever between points on either side of the minimum.<br />
#A perfect separation is not always available, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Imagine standing on a mountain peak and wanting to reach the ground as quickly as possible. In which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, depending on the shape of the surface and your starting point, you may come to rest in a local minimum that is not the global one, and get stuck there.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over <math>\,i</math> (all data points). This is a variant of the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, in which we approximate the true gradient by evaluating it on a single training example, so <math>\,{\beta}</math> is improved by the computation of only one sample at a time. With a large data set, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent we can process the problem sample by sample and still get a decent result in practice.<br />
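A single stochastic update can be sketched as follows (Python; the helper name `sgd_step` and the toy numbers are ours, not from the lecture):<br />

```python
import numpy as np

def sgd_step(beta, beta0, x_i, y_i, rho=0.5):
    """One stochastic-gradient perceptron update: examine a single
    training point and adjust the boundary only if it is misclassified."""
    if y_i * (x_i @ beta + beta0) <= 0:   # misclassified (or exactly on the boundary)
        beta = beta + rho * y_i * x_i
        beta0 = beta0 + rho * y_i
    return beta, beta0

beta, beta0 = np.zeros(2), 0.0
x_i, y_i = np.array([1.0, 2.0]), 1        # margin is 0, so this point triggers an update
beta, beta0 = sgd_step(beta, beta0, x_i, y_i)
print(beta, beta0)                        # [0.5 1. ] 0.5
```

A full pass of the algorithm simply repeats this step over randomly chosen training points.<br />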
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <br />
<br />
<ref> Haykin, Simon (2009). Neural Networks and Learning Machines. Pearson Education, Inc. </ref><br />
A neural network resembles the brain in two respects:<br />
#Knowledge is acquired by the network from its environment through a learning process.<br />
#Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.<br />
<br />
<ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network, but it can take many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer, each representing the probability of one class, and each <math>\displaystyle y_k</math> is coded as 0 or 1.<br />
<br />
===Activation Function===<br />
The '''activation function''' is a term frequently used in classification by NN. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of the input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
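A minimal sketch of this activation function and its derivative, using the standard identity <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>, which is part of what makes the logistic function convenient for back-propagation:<br />

```python
import math

def sigma(a):
    """Logistic (inverse-logit) activation: sigma(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def sigma_prime(a):
    """Derivative sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = sigma(a)
    return s * (1.0 - s)

print(sigma(0.0))        # 0.5: a smooth analogue of sign's jump at 0
print(sigma(10.0))       # close to 1: the function saturates for large inputs
print(sigma_prime(0.0))  # 0.25: the slope is largest at a = 0
```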
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates, meaning it has maximum and minimum output values. This property keeps the weights bounded and therefore limits the search time. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; RBF networks, for instance, use non-monotonic basis functions and are also a powerful class of models. <br />
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume for the moment that the output layer has only one unit, so that we are working with a regression problem. Later we will see how this can be extended to more output units and thus turn into a classification problem.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true response and the network output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: the indices i and j are reversed in this figure relative to the text.)''<br />
<br />
Notice that the weighted sum on the output of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math> and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the target output and <math>\hat{y_k}</math> is the actual network output.<br />
#By propagating to the previous layers evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math> where <math>i</math> is associated to the previous layer.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
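The steps above can be sketched numerically for a single hidden layer (a Python sketch with squared error and no intercept terms for brevity; the function name `backprop_step` and the toy weights are ours):<br />

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(u, w, x, y, rho=0.1):
    """One back-propagation update for a one-hidden-layer regression network."""
    # 1. Forward pass: find the output of all nodes.
    a = u @ x                      # inputs to the hidden units
    z = sigma(a)                   # hidden-unit outputs
    y_hat = w @ z                  # network output
    # 2. Delta at the output: derivative of (y - y_hat)^2 w.r.t. y_hat.
    delta_out = -2.0 * (y - y_hat)
    # 3. Propagate back: delta_j = sigma'(a_j) * sum_i delta_i * u_ij.
    delta = sigma(a) * (1.0 - sigma(a)) * (delta_out * w)
    # 4. Derivatives with respect to the weights.
    grad_w = delta_out * z         # d err / d w_j
    grad_u = np.outer(delta, x)    # d err / d u_jl (the inputs play the role of z_l)
    # 5. Gradient-descent step; also return the error before the update.
    return u - rho * grad_u, w - rho * grad_w, (y - y_hat) ** 2

u = np.array([[0.1, 0.2], [0.3, -0.1]])
w = np.array([0.2, -0.3])
x, y = np.array([1.0, 0.5]), 1.0
u, w, err_before = backprop_step(u, w, x, y)
_, _, err_after = backprop_step(u, w, x, y)
print(err_before > err_after)   # True: one step reduced the squared error
```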
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method mentioned in class is to randomize the weights before the first step. This is not likely to be near the optimal solution, but it is simple to implement. More specifically, random values near zero (usually from [-1,1]) are a good choice for the initial weights: in this case the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>: regression is performed on the weights and output using a linear approximation of <math>\,\sigma(a_i)</math>, which gives the optimal weights in the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
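A minimal sketch of this random initialization (Python; the shapes are for a hypothetical 2-input, 3-hidden-unit network):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Random initial weights in [-1, 1], as suggested above: sigma() is close to
# linear near zero, so the network starts out roughly linear and becomes
# nonlinear as training proceeds.
u = rng.uniform(-1, 1, size=(3, 2))   # hidden-layer weights u_ij
w = rng.uniform(-1, 1, size=3)        # output weights w_i
print(u.shape, w.shape)               # (3, 2) (3,)
```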
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation process, <math>\,\rho</math> should decrease as the number of iterations increases.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate leads to a very slow convergence rate (a very long learning phase). However, the advantage of a small learning rate is that it can guarantee convergence. Thus it is generally better to choose a relatively small learning rate to ensure stability; usually <math>\,\rho</math> is chosen between 0.01 and 0.7.<br />
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we mainly discuss how to estimate the number of hidden units at the very beginning. Obviously, we should then adjust it to be more precise using CV, LOO or other complexity control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from some highly complicated mixture model, more hidden units are needed. <br />
<br />
The number of hidden units determines the size of the model, and therefore the total number of weights in the model. Generally speaking, the number of weights should not be larger than the number of training data points, say N. Thus N/10 is sometimes a good choice. However, in practice, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this Neural Network, the number of nodes is reduced until we reach a layer whose number of nodes is the desired dimensionality. (The number of nodes need not be strictly decreasing in the very first few layers, as long as it eventually reaches a layer with fewer nodes.) From this point on, <br />
the previous layers are mirrored, so at the output layer we have the same number of units as in the input layer. Now note that if we feed the network with a point and get an output approximately equal to that input, the input has been reconstructed from the middle-layer units alone. So the output of the middle-layer units can represent the input in fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math> may become negligible and the errors vanish. This is a numerical problem in which it becomes difficult to estimate the errors, so in practice configuring a<br />
Neural Network with back-propagation involves some subtleties.<br />
Deep Neural Networks became popular a few years ago, following work by Geoffrey Hinton and his collaborators on training networks with many layers. A deep neural network training algorithm deals with the training of a Neural Network with a large number of layers.<br />
<br />
The approach of training the deep network is to assume the network has only two layers first and train these two layers. After that we train the next two layers, so on and so forth.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, an idea inspired by statistical physics, where systems settle into their most stable, lowest-energy configuration; or somehow finding the output of the second layer that is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case as an example. A feedforward neural network was trained using back-propagation to assist in the marketing control of airline seat allocations. The neural approach was adaptive, in contrast to fixed rule-based systems, and the resulting system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to model human brains, hence the fancy name "Neural Network". But now we know that they are just layers of logistic regressions stacked on top of each other and have nothing to do with how the brain really works.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic human brains, but this is unfounded. Because of such claims, it is important to keep the right perspective on what this field of study is trying to accomplish: the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and adjust them wisely. Consequently, they are not an active research area in machine learning, though NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model will have very high precision on the training set but very poor ability to predict outcomes for new instances, especially outside the domain of the training set.<br />
<br />
In a Neural Network, if the network is too deep it will have many degrees of freedom and will learn every characteristic of the training data set. That means it will give very precise outcomes on the training set but will not be able to generalize the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model which fits it best. We can find a polynomial of high degree which passes almost exactly through all the points in the training set. But in fact the training set comes from a line model. Although the complex model has less error on the training set, it diverges from the line in regions where we have no training points; because of that, the high-degree polynomial has very poor predictive performance on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that any yellow fruit is a banana, then it is not a good classifier: just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas, but when presented with a new banana of slightly different shape than the existing ones, for example, the classifier fails to detect it. This is the tradeoff: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough, both of which are undesirable. To control complexity, it is necessary to make assumptions about the model before fitting the data; for example, we may assume the model comes from a family of polynomials or from a class of neural networks. There are other ways as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average to get the empirical error rate, an estimate of the probability that <br />
<math>h(x_{i}) \neq y_{i}</math>.<br />
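The empirical error rate can be sketched directly from the formula (Python; the rule `h` and the four toy points are hypothetical, just to illustrate the computation):<br />

```python
import numpy as np

def empirical_error(h, X, y):
    """Empirical error rate: the fraction of points with h(x_i) != y_i."""
    predictions = np.array([h(x) for x in X])
    return float(np.mean(predictions != y))

# Hypothetical classifier: the sign of the first feature
h = lambda x: 1 if x[0] > 0 else -1
X = np.array([[1.0, 2.0], [-0.5, 1.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, -1, -1, -1])
print(empirical_error(h, X, y))   # 0.25: one of the four points is misclassified
```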
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
There is a downward bias to this estimate, meaning that it tends to be less than the true error rate. <br />
<br />
As the complexity of the model increases from low to high, the training error rate always decreases. When we apply the model to test data, however, the error rate decreases up to a point and then starts to increase, since the model has not seen these data before. This can be explained as follows: the training error decreases as we fit the model better by increasing its complexity, but as we have seen, such a complex model will not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is defined as the point where the error rate on the test data is at its minimum; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in \mathbb{R}</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, unbiasedness is not the only property that matters for an estimator: the mean squared error is often more important. There are problems for which a slightly biased estimator is preferable, because it may have a smaller mean squared error, or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). Median-unbiasedness is invariant under transformations, while mean-unbiasedness may be lost under nonlinear transformations. For example, using an unbiased estimator with a large mean squared error risks a large estimation error, whereas a biased estimator with a small mean squared error can give more precise predictions.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a fixed mean squared error, lowering the bias forces the variance up, and vice versa: this is the bias-variance tradeoff.<br />
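This decomposition follows by adding and subtracting <math>E(\hat{f})</math> inside the square; the cross term vanishes because <math>E[\hat{f} - E(\hat{f})] = 0</math>:<br />
<br />
<math>MSE(\hat{f}) = E[(\hat{f} - E(\hat{f}) + E(\hat{f}) - f)^2] = E[(\hat{f} - E(\hat{f}))^2] + (E(\hat{f}) - f)^2 = Variance(\hat{f}) + Bias^2(\hat{f})</math><br />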
<br />
Test error is a good estimate of the MSE. We therefore aim for a balanced tradeoff: some bias is acceptable if it keeps both bias and variance moderate.<br />
<br />
<br />
Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
> x <- rnorm(200,0,1)<br />
> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
> xtest <- rnorm(50,1,1)<br />
> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
> p1 <- lm(y~x)<br />
> p2 <- lm(y ~ poly(x,2))<br />
> pn <- lm(y ~ poly(x,10))<br />
> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are based on a <math>\,N(0,1)</math> distribution, while the test set has a <math>\,N(1,1)</math> distribution. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random noise. Polynomial least-squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
> # calculate the mean squared error of degree 1 poly<br />
> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
[1] 1.576042<br />
> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 7.727615<br />
: Training and test mean squared errors for the linear fit. Both are quite high; since the data is non-linear, the shifted mean of the test data increases the error considerably.<br />
> # calculate the mean squared error of degree 2 poly<br />
> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
[1] 0.08608467<br />
> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 0.08407432<br />
: This fit is far better, and there is little difference between the training and test error.<br />
> # calculate the mean squared error of degree 10 poly<br />
> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
[1] 0.07967558<br />
> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, though only slightly, while the test error has risen sharply: the overfitted model is a poor predictor. As the degree rises further, floating-point accuracy also becomes an issue, and a good fit is not even consistently produced for the training data.<br />
> # calculate mse of sin/cos fit<br />
> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
[1] 0.1105446<br />
> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 1.320404<br />
: Fitting a function of the form <math>\,\sin(x)+\cos(x)</math> works fairly well on the training set, but because it is not the true underlying function, it fails on test data drawn from a different region of the domain.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V: the model has not seen V, yet we know the true label of every element in V, so we can count how many were misclassified.<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
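The split-and-validate idea can be sketched as follows (Python; the deterministic even/odd split and the nearest-class-mean rule are illustrative simplifications, not part of the course material):<br />

```python
def holdout_error(xs, ys, train_fn):
    """Train on even-indexed points (T) and estimate the error rate on the
    odd-indexed, held-out points (V). Real splits are random; a fixed split
    keeps the sketch short and deterministic."""
    h = train_fn(xs[::2], ys[::2])
    val_x, val_y = xs[1::2], ys[1::2]
    return sum(1 for x, y in zip(val_x, val_y) if h(x) != y) / len(val_y)

def nearest_mean(xs, ys):
    """Toy training rule: classify a point by the nearer of the two class means."""
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1

xs = [0.1, 0.3, 2.1, 2.3, 0.2, 0.4, 1.9, 2.2]
ys = [0, 0, 1, 1, 0, 0, 1, 1]
holdout_error(xs, ys, nearest_mean)   # 0.0 on this well-separated toy set
```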
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It stems from the observation that although training error always decreases with increasing model complexity, the test error starts to increase from a certain point, which is known as overfitting (see [[#prediction-error|figure 2]] above). Since test error is the best estimate of the MSE (mean squared error), the idea is to divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and select the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{x_i \in \nu}I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set <math>\,\nu</math>.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in reality, it may be hard to collect data ??and we usually suffer from the curse of dimensionality??, and a larger data set may be hard to come by. Consequently, we may not be able to afford to sacrifice part of the limited resources. In this case we use another method that addresses this problem, K-fold cross-validation. We divide the data set into <math>\,K</math> subsets roughly equal in size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
If <math>\,K=n</math> (leave-one-out): low bias, high variance. Each subset contains a single element, so the model is trained on all but one point and then validated on that point.<br />
<br />
If <math>\,K</math> is small (say 2-fold or 5-fold): higher bias, lower variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For each <math>\,k</math> <math>( \,k \in \{ 1, \dots, K \} )</math>, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>. The overall estimate is the average<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. Next we use the first three subsets as the training set and the last as the validation set, then the 1st, 2nd, and 4th subsets as the training set and the 3rd as the validation set, and so on until every subset has served as the validation set exactly once (all observations are used for both training and validation). After we obtain <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for a degree-<math>\,n</math> model and plot error against degree; we then choose the degree corresponding to the minimum error. The same method can be used to find the optimal number of hidden units in a neural network: begin with 1 unit, then 2, 3, and so on, and pick the number with the lowest average error.<br />
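The folding machinery above can be sketched generically (Python; the constant-mean "model" and squared-error loss are stand-ins so the K-fold loop itself stays visible):<br />

```python
def k_fold_indices(n, K):
    """Split indices 0..n-1 into K contiguous folds of near-equal size."""
    folds, start = [], 0
    for k in range(K):
        size = n // K + (1 if k < n % K else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def k_fold_cv(xs, ys, fit, predict, K=4):
    """Average validation loss over K folds: train on K-1 parts, test on the k-th."""
    n = len(xs)
    losses = []
    for fold in k_fold_indices(n, K):
        train = [i for i in range(n) if i not in fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        losses.append(sum((predict(model, xs[i]) - ys[i]) ** 2 for i in fold) / len(fold))
    return sum(losses) / K

# Toy "model": predict the mean of the training responses, ignoring x entirely.
fit = lambda xs, ys: sum(ys) / len(ys)
predict = lambda m, x: m
xs, ys = list(range(8)), [1.0] * 8
k_fold_cv(xs, ys, fit, predict)   # 0.0: a constant response is predicted perfectly
```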
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be denoted by <math>\mathbf{y}</math> and the vector of fitted values by <math>\hat{\mathbf{y}}</math>, so that<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>.<br />
<br />
For such a linear fit, the leave-one-out cross-validation error has the closed form<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}\right]^{2}</math>, where <math>\hat f^{-i}</math> denotes the fit computed with the <math>i</math>th observation left out.<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that only the trace of <math>\mathbf{H}</math> is needed, which is often easier to compute than the individual diagonal elements.<br />
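A sketch of the GCV computation for an ordinary least-squares fit (Python with NumPy; the data are made up, and an exact linear relationship is used so the score is essentially zero):<br />

```python
import numpy as np

def gcv_score(X, y):
    """GCV(f_hat) = (1/N) * sum_i [(y_i - f_hat(x_i)) / (1 - trace(H)/N)]^2
    for a linear smoother with hat matrix H = X (X^T X)^{-1} X^T."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    resid = y - H @ y
    N = len(y)
    return np.mean((resid / (1 - np.trace(H) / N)) ** 2)

# Design matrix with an intercept column; y lies exactly on a line,
# so the residuals (and hence the GCV score) are essentially zero.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
gcv_score(X, y)   # ~0
```

Note that <math>trace(\mathbf{H})</math> here equals 2, the number of fitted parameters.<br />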
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point in the original training data set to train the model, then using the left-out point to estimate the true error. By repeating this process for every data point in the original data set, we obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every possible value of the parameter, we must train the model <math>\,K</math> times. This deficiency is even more obvious in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, where <math>\,n</math> is the number of data points in the data set.<br />
<br />
Fortunately, when adding data points to the classifier is reversible, computing the difference between two classifiers is cheaper than computing the two classifiers separately. So, if the classifier trained on all the data points is known, we can simply undo the changes due to one data point, repeating this <math>\,n</math> times, to calculate the leave-one-out cross-validation error rate.<br />
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as a way to obtain a robust neural network that is insensitive to noise. Since the number of hidden units in a neural network is usually decided by domain knowledge, the network can easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function shows linear behavior, and the NN then collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights close to zero.<br />
<br />
Formally, we penalize large weights by adding a penalty term to the error function. The regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
Usually, too large a <math>\,\lambda</math> will make the weights <math>\,w_i</math> and <math>\,u_{jk}</math> too small. We can use cross-validation to estimate <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the linearity. Actually, we may also set <math>\,\lambda</math> by cross-validation. The tuning parameter is important since weights of zero will lead to zero derivatives and the algorithm will not change. On the other hand, starting with weights that are too large means starting with a nonlinear model which can often lead to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
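The update rules above can be sketched in a few lines (Python; the numbers are arbitrary, chosen only to show the shrinkage effect of the penalty):<br />

```python
def decay_step(w, grad_err, lam, rho):
    """One gradient step on REG = err + lambda * sum(w_i^2):
    w_i <- w_i - rho * (d err / d w_i + 2 * lambda * w_i), elementwise."""
    return [wi - rho * (g + 2 * lam * wi) for wi, g in zip(w, grad_err)]

# With a zero error gradient, the penalty alone shrinks every weight
# toward zero by a factor of (1 - 2 * rho * lambda) per step.
w = decay_step([1.0, -2.0], grad_err=[0.0, 0.0], lam=0.25, rho=0.5)
w   # [0.75, -1.5]
```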
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with a single hidden layer and an output layer, with weights only from the hidden layer to the output layer; it can be trained without back-propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. One widely used choice is radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data. The radii of the RBF functions may differ, and the centers and radii can be determined through clustering or an EM algorithm. When the vector <math>\,x</math> is given from the input layer, each hidden neuron computes the radial distance from the neuron's center point and then applies the RBF function to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
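The forward pass just described can be sketched directly (Python; the centers, widths, and weights are arbitrary illustration values):<br />

```python
import math

def rbf_output(x, centers, sigmas, weights):
    """Single-output RBF network: y = sum_j w_j * exp(-(x - mu_j)^2 / (2 sigma_j^2))."""
    phis = [math.exp(-((x - mu) ** 2) / (2 * s ** 2))
            for mu, s in zip(centers, sigmas)]
    return sum(w * p for w, p in zip(weights, phis))

# At x = 0 the first unit fires fully (phi = 1) and the distant second
# unit contributes almost nothing, so the output is close to its weight, 2.
rbf_output(0.0, centers=[0.0, 5.0], sigmas=[1.0, 1.0], weights=[2.0, 3.0])
```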
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
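The closed-form solution can be checked numerically (Python with NumPy; the random <code>Phi</code> and <code>W_true</code> below are made up, and <code>np.linalg.solve</code> is used rather than forming the inverse explicitly):<br />

```python
import numpy as np

def rbf_weights(Phi, Y):
    """Least-squares weights W = (Phi^T Phi)^{-1} Phi^T Y,
    computed by solving the normal equations (Phi^T Phi) W = Phi^T Y."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# Sanity check: if Y is generated exactly as Phi @ W_true, we recover W_true.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 3))        # n = 20 points, m = 3 hidden units
W_true = np.array([[1.0], [-2.0], [0.5]])
W = rbf_weights(Phi, Phi @ W_true)
```

Solving the normal equations is numerically preferable to computing the inverse, though it realizes the same closed form.<br />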
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the <math>n \times k</math> matrix of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the <math>n \times (M+1)</math> matrix of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the <math>(M+1) \times k</math> matrix of weights.<br />
<br />
where the extra basis function <math>\Phi_{0}</math> is set to 1.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function. Giving the familiar form,<br /><br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
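A small numeric check of the normalized form (Python; the two identical units are made up so that the normalized basis values are exactly <math>\tfrac{1}{2}</math> each):<br />

```python
import math

def normalized_rbf_output(x, centers, sigmas, weights):
    """Normalized RBF output: the basis values phi_j / sum_r phi_r form a
    convex combination, so the output is a weighted average of the weights."""
    phis = [math.exp(-((x - mu) ** 2) / (2 * s ** 2))
            for mu, s in zip(centers, sigmas)]
    total = sum(phis)
    return sum(w * p / total for w, p in zip(weights, phis))

# Halfway between two identical units each normalized basis value is 1/2,
# so the output is the average of the two weights: (4 + 8) / 2 = 6.
normalized_rbf_output(1.0, centers=[0.0, 2.0], sigmas=[1.0, 1.0], weights=[4.0, 8.0])
```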
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit can then be thought of as representing a feature; if there are more hidden units than input units, we essentially project into a higher-dimensional space, as in our earlier trick. This does not mean that an RBF network will actually do this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Precisely because of this power, however, overfitting becomes a more pressing concern: we must control the model's complexity so that it captures the general pattern rather than an arbitrary fit to the training data.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can replace the class conditional density in the above conditional probability expression by marginalizing <math>\,x</math> over <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When using an RBF network to do classification, we usually treat it as a regression problem: we set a threshold on the output to decide class membership. However, to gain insight into what an RBF network is doing in classification terms, we can think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as shown in the graph on the right, that we have three random variables: <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is the observation, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that each class can trigger a different hidden random variable <math>\displaystyle j</math>. For instance, suppose each <math>\displaystyle j</math> corresponds to a Gaussian distribution (any other distribution would also work), all of the same form but with different parameters. From each Gaussian triggered by each class, we sample some data points. In the end, we obtain a data set that is not a single Gaussian but a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
We multiply each term of the sum by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})*\frac {Pr(j)}{Pr(j)}}{Pr(x)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
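The identity can be verified on a tiny discrete example (Python; all probabilities below are made-up numbers for a chain <math>\displaystyle y \rightarrow j \rightarrow x</math>, in which <math>\displaystyle x</math> is conditionally independent of <math>\displaystyle y</math> given <math>\displaystyle j</math>, the assumption used in the derivation):<br />

```python
from itertools import product

# Made-up chain y -> j -> x: Pr(y), Pr(j|y), Pr(x|j).
p_y = {0: 0.4, 1: 0.6}
p_j_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x_j = {0: {'a': 0.9, 'b': 0.1}, 1: {'a': 0.3, 'b': 0.7}}

def joint(y, j, x):
    return p_y[y] * p_j_y[y][j] * p_x_j[j][x]

def posterior_direct(y, x):
    """Pr(y|x) computed straight from the joint distribution (Bayes' Rule)."""
    return (sum(joint(y, j, x) for j in (0, 1)) /
            sum(joint(yy, j, x) for yy, j in product((0, 1), (0, 1))))

def pr_j(j):
    return sum(p_y[y] * p_j_y[y][j] for y in (0, 1))

def pr_j_given_x(j, x):
    return p_x_j[j][x] * pr_j(j) / sum(p_x_j[r][x] * pr_j(r) for r in (0, 1))

def pr_y_given_j(y, j):
    return p_j_y[y][j] * p_y[y] / pr_j(j)

def posterior_rbf(y, x):
    """Pr(y|x) = sum_j Pr(j|x) * Pr(y|j), as derived above."""
    return sum(pr_j_given_x(j, x) * pr_y_given_j(y, j) for j in (0, 1))
```

Both computations agree: for example, <code>posterior_direct(0, 'a')</code> equals <code>posterior_rbf(0, 'a')</code> up to floating-point rounding.<br />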
<br />
Interestingly, this is just a sum over <math>\displaystyle j</math> of the product of two posteriors.<br />
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results that we derived above to our RBF Network. In a RBF Network, as we can see on the right hand side, we have a set of data, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, and the hidden basis function, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and then we have some output, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. Also, we have weights from the hidden layer to output layer. The output is just the linear sum of <math>\displaystyle \phi</math>’s. <br />
<br />
Now consider probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weights <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in one dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It's as if we put some Gaussian over data. And for each Gaussian, we consider the center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
We can see the graph on the left, which plots the density of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance: as a point moves far from the center <math>\displaystyle \mu_{1}</math>, <math>\displaystyle \phi_{1}</math> shrinks to nearly zero. Remember that we can usually obtain a non-linear regression or classification in the input space by performing a linear one in some extended space or feature space (more details in the Aside). Here, the <math>\displaystyle \phi</math>'s actually produce that feature space. <br />
<br />
So, one way to look at this is that <math>\displaystyle \phi</math> tells us, given an input, how likely a particular feature is to be present. Say, for example, we define the features as the centers of these Gaussian distributions. Then the <math>\displaystyle \phi</math> function computes, for a given data point, the possibility of that feature appearing. If the data point is right at the center, the value of that <math>\displaystyle \phi</math> is one, i.e. the probability is 1. If the point is far from the center, the probability (the <math>\displaystyle \phi</math> value) is close to zero, that is, the feature is unlikely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given the data. <br />
<br />
Once we have those features, <math>\displaystyle y</math> is a linear combination of them. Hence each weight <math>\displaystyle w_{jk}</math>, which equals <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> is to appear given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> represents the probability of class membership given the feature. <br />
<br />
Hence, we have found a probabilistic point of view from which to look at the RBF network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
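To make the correspondence concrete, here is a minimal Python sketch of reading an RBF network's forward pass as <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)w_{jk}</math>. The centers, width, and weight matrix below are illustrative choices (with rows of <math>W</math> summing to 1 so that the weights can be read as probabilities), not values from the lecture.<br />

```python
import numpy as np

# Sketch: the RBF network's output read as Pr(y_k | x) = sum_j phi_j(x) * w_jk.
# Centers, width, and W are illustrative (rows of W sum to 1).

def rbf_features(x, centers, sigma):
    # phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2)) for each center mu_j
    d2 = np.sum((x - centers) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

centers = np.array([[0.0], [4.0]])   # mu_1, mu_2 (M = 2 basis functions)
sigma = 1.0
W = np.array([[0.9, 0.1],            # row j holds w_jk = Pr(y_k | j)
              [0.2, 0.8]])

x = np.array([0.0])                  # a point sitting at the first center
phi = rbf_features(x, centers, sigma)
posterior = phi @ W                  # Pr(y_k | x) = sum_j phi_j(x) w_jk
print(posterior)
```

With <math>x</math> at the first center, <math>\phi_1(x)=1</math>, <math>\phi_2(x)\approx 0</math>, and the posterior is dominated by the first row of <math>W</math>.<br />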
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose, we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>. Each data point has d features. And these n data points consist of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we do not perform least squares on this <math>\displaystyle X</math> matrix; we perform least squares on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then each <math>\displaystyle \phi_j</math> checks the similarity of the data point to its center. Hence, the new set of features actually represents M centers in our data set, and for each data point, its new features check how similar the point is to the first center, to the second center, and so on up to the <math>\displaystyle M^{th}</math> center. This checking process applies to all data points. Therefore, the feature space gives another representation of our data set. <br />
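The construction of <math>\displaystyle \Phi^{T}</math> above can be sketched in Python; the data points, centers, and width below are illustrative choices, not values from the lecture.<br />

```python
import numpy as np

# Sketch of the feature-space matrix Phi^T: entry (j, i) is phi_j(x_i),
# the similarity of data point x_i to center mu_j. Data and centers are
# illustrative.

def phi_matrix(X, centers, sigma):
    # X is d-by-n (columns are data points); centers is d-by-M
    diffs = X[:, None, :] - centers[:, :, None]   # shape d x M x n
    d2 = np.sum(diffs ** 2, axis=0)               # shape M x n
    return np.exp(-d2 / (2.0 * sigma ** 2))       # Phi^T, M x n

X = np.array([[0.0, 1.0, 4.0]])                   # d = 1, n = 3 points
centers = np.array([[0.0, 4.0]])                  # M = 2 centers
PhiT = phi_matrix(X, centers, 1.0)
print(PhiT.shape)                                 # M rows, n columns
```

Each column of <math>\displaystyle \Phi^{T}</math> re-describes one data point by its similarity to the M centers.<br />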
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit to the training data, we often want to increase the complexity of our RBF network. By construction, the only way to change the complexity of an RBF network is to add or remove basis functions: a larger number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF network can fit any training set exactly; however, this does not mean the model generalizes well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is estimate the training error. Working through the training error, we will see that it can in fact be decomposed, and that one component of the decomposition is the Mean Squared Error (MSE). In the later notes, we will find that our final goal is to get a good estimate of the MSE; moreover, in order to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn’t always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>~N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model fits only the training data, and is useless for prediction when new data points are introduced.<br />
<br />
> n<-20;<br />
> x<-seq(1,10,length=n);<br />
> alpha<-2.5;<br />
> beta<-1.75;<br />
> y<-alpha+beta*x+rnorm(n);<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');<br />
> lines(spline(x, y), col = 2);<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection (Stein's Unbiased Risk Estimate) - November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for a given data set. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on the training data, although our real goal is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
But training error and testing error do not demonstrate a linear relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error can increase dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that, in the process of minimizing training error, after a certain point the model begins to over-fit the training set. Over-fitting in this context means fitting the model to the training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, it does not capture the representative features of the true underlying data source, and this is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model typically adapts to the training data, and hence the training error is an overly optimistic estimate of the testing error. An obvious way to estimate the testing error well is to add a penalty term to the training error to compensate. SURE is developed from exactly this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the total squared loss over the <math>\displaystyle N</math> training samples.<br />
*<math>\displaystyle Err=\sum_{i=1}^m (\hat y_i-y_i)^2 </math> denote the ''test error'', the total squared loss over an independent test sample of size <math>\displaystyle m</math>.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations are <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}_{i=1}^N</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>, then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, where <math>\displaystyle y_i</math> and <math>\hat f_i</math> both have the same mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E(g'(Z))</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>: Let <math>\,Z = \epsilon_i</math>. Then <math>g(Z) = \hat f_i-f_i</math>; since <math>y_i = f_i + \epsilon_i</math> and <math>\,f_i</math> is a constant, we have <math>\,\mu = 0</math> and <math>\,\sigma^2</math> is the variance of <math>\,\epsilon_i</math>.<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model, not the function of the observations <math>\displaystyle y_i</math>, then <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
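As a sanity check on Stein's lemma, a quick Monte Carlo experiment with the illustrative choice <math>g(z)=z^2</math> (so <math>g'(z)=2z</math>) compares the two sides of the identity; the values of <math>\mu</math> and <math>\sigma</math> are arbitrary.<br />

```python
import numpy as np

# Monte Carlo check of Stein's lemma E[g(Z)(Z - mu)] = sigma^2 E[g'(Z)]
# with the illustrative choice g(z) = z^2, so g'(z) = 2z.

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
Z = rng.normal(mu, sigma, 1_000_000)

lhs = np.mean(Z ** 2 * (Z - mu))          # E[g(Z)(Z - mu)]
rhs = sigma ** 2 * np.mean(2 * Z)         # sigma^2 E[g'(Z)]
print(lhs, rhs)                           # both near 2 * mu * sigma^2 = 8
```

Both sides estimate the same quantity, here <math>2\mu\sigma^2 = 8</math>.<br />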
<br />
====Two Different Cases====<br />
The application of SURE to RBF networks is developed in<br />
[http://www.cs.ualberta.ca/~papersdb/uploaded_files/801/paper_automatic-basis-selection-for.pdf Automatic basis selection for RBF networks using Stein’s unbiased risk estimator], Ali Ghodsi and Dale Schuurmans.<br />
<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is introduced to the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent, so <math>\displaystyle cov(y_i,\hat f)=0</math> (or consider <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, <math>\hat f</math> does not depend on it, because <math>\hat f</math> is estimated from the training data alone, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>), and <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Using the notation defined above, we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross-validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross-validation, to avoid overfitting or underfitting, the validation data set is kept independent of the estimated model.<br />
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing the model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored, because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can instead be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the optimum number of basis functions should be the one with the minimum generalization error. For the radial basis function network, setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, we can tell that <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Thus, we obtain the further simplification <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=M</math>, the number of columns of <math>\displaystyle \Phi</math>, since <math>\displaystyle \Phi</math> maps the input matrix <math>\,X</math> into a space spanned by the <math>\,M</math> basis functions. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
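The trace identity used here can be checked numerically; the design matrix below is a random illustrative choice, not data from the lecture.<br />

```python
import numpy as np

# Numerical check that Trace(H) equals the number of columns of Phi
# (M basis functions plus an intercept), independent of the data.
# Phi here is a random illustrative design matrix.

rng = np.random.default_rng(0)
N, M = 50, 4
Phi = rng.standard_normal((N, M))
Phi = np.hstack([np.ones((N, 1)), Phi])         # prepend intercept column

H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T    # hat matrix
print(np.trace(H))                              # M + 1 = 5
```

The trace of a projection matrix equals the dimension of the space it projects onto, so it does not depend on the particular data values.<br />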
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimal number of basis functions by choosing the model with the smallest MSE over the set of models considered. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\displaystyle M</math>, with training error <math>\displaystyle err(M)</math>: <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples and the noise,<math>\sigma^2</math>, can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Please see Figure 27.1 on the right, which shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math>.<br />
<br />
sure_Err = error - num_data_point * sigma .^ 2 + 2 * sigma .^2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated from the training error, as in the estimate <math>\hat \sigma^2</math> above.<br />
<br />
error = sum((output - expected_output) .^ 2);  % total squared training error<br />
sigma2 = error / (num_data_point - 1);         % estimated noise variance<br />
sure_Err = error - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) <math>= \sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) <math>= \sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units) <math>= \sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then the term <math>\displaystyle -N\sigma^2</math> is a constant and has no impact on the comparison, so we can ignore it. We only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and the estimate <math>\,\hat \sigma</math> changes when <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math>.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
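The algebra from (28.1) to (28.3) can be verified with a quick numeric check; the values of err, N, and M below are illustrative.<br />

```python
# Check that substituting sigma^2 = err / (N - 1) into (28.1) reproduces
# the closed form (28.3). The values of err, N, M are illustrative.

def mse_plugin(err, N, M):
    sigma2 = err / (N - 1)                       # estimate (28.2)
    return err - N * sigma2 + 2 * sigma2 * (M + 1)

err, N, M = 10.0, 21, 3
direct = mse_plugin(err, N, M)
closed = err * (2 * M + 1) / (N - 1)             # formula (28.3)
print(direct, closed)                            # both 3.5
```

The two expressions agree for any choice of err, N and M, confirming the simplification.<br />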
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error will decrease and the MSE will increase when increasing the number of hidden units (i.e. the model is more complex).<br />
<br />
<br />
When the number of hidden units gets larger and larger, the training error decreases until it approaches <math>\displaystyle 0 </math>. If the training error equals <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) the estimate of the MSE approaches <math>\displaystyle 0 </math> as well. In fact this does not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the MSE should increase instead of approaching <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
We can see that this estimate of <math>\displaystyle \sigma^2 </math> is proportional to <math>\displaystyle err </math>, which causes the problem above. To deal with it, we can average the estimates of <math>\displaystyle \sigma^2</math> over models with different numbers of hidden units; for example, we can fit models with 1 up to 10 hidden units and average their estimates. Since in reality <math>\, \sigma^2</math> is a constant property of the data and does not depend on <math>\,M+1</math>, using the average <math>\,\sigma^2</math> value over 1 to 10 hidden units has a firm theoretical basis.<br />
<br />
We can also see that, unlike the classical cross-validation (CV) or leave-one-out (LOO) techniques, the SURE technique does not need a validation step to find the optimal model. Hence, SURE uses less data than CV or LOO, which makes it suitable when there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering K-means clustering] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units is the same as the number of clusters; each cluster gives one basis function <math>\displaystyle \phi_j </math><br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, where we set the width the same for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. for each existing center, re-identify its cluster (every point in the cluster should be closer to this center than to any other center).<br />
<br />
2. compute the mean of each cluster and use it as that cluster's new center.<br />
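The two alternating steps above can be sketched in Python; the two-cloud data set and the choice of one seed point from each cloud are illustrative assumptions, not part of the lecture.<br />

```python
import numpy as np

# Minimal sketch of the two alternating K-means steps above. The two
# Gaussian clouds and the choice of one seed point per cloud are
# illustrative assumptions.

def kmeans(X, centers, iters=20):
    centers = centers.copy()
    for _ in range(iters):
        # step 1: assign every point to its nearest current center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 2: recompute each center as the mean of its cluster
        centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)),    # cloud around (0, 0)
               rng.normal(5.0, 0.1, (10, 2))])   # cloud around (5, 5)
labels, centers = kmeans(X, X[[0, -1]])          # one seed from each cloud
print(np.sort(centers[:, 0]))
```

With well-separated clouds and one seed in each, the assignment converges immediately and the centers move to the cloud means.<br />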
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden values)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
>> 30 1<br />
>> size(C) <br />
>> 2 80<br />
>> size(sumD) <br />
>> 2 1<br />
>> c1=sum(IDX==1)<br />
>> 14<br />
>> c2=sum(IDX==2)<br />
>> 16<br />
>> sumD<br />
>> 85.6643<br />
>> 101.0419<br />
>> v1=sumD(1,1)/c1 <br />
>> 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
>> 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points and 80 dimensions (MATLAB's kmeans treats each row as an observation), and then apply the kmeans method to separate <math>X</math> into 2 clusters. IDX is a vector containing 1 or 2, indicating the cluster of each point, and its size is 30*1. <math>\displaystyle C </math> holds the center (mean) of each cluster, with size 2*80; sumD is the sum of squared distances between the data points and the centers of their clusters. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the numbers of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the variance of the first cluster <math>\displaystyle (v1 \approx \sigma_1^2)</math>; <math>\displaystyle v2 </math> is the variance of the second cluster <math>\displaystyle (v2 \approx \sigma_2^2)</math>. Now we can get <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> from the following equations. Finally, we can compute the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
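The fitting equations above can be sketched end-to-end in Python; the toy data, centers, and width are illustrative, and here we fit and evaluate on the training set only.<br />

```python
import numpy as np

# Sketch of the fitting equations above: build Phi from given centers,
# solve W = (Phi^T Phi)^{-1} Phi^T Y, and form Y_hat = Phi W = H Y.
# The toy data, centers, and width are illustrative.

def rbf_fit_predict(X, Y, centers, sigma):
    # Phi is n-by-M: Phi[i, j] = phi_j(x_i)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))
    W = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)   # least squares weights
    return Phi @ W                                # Y_hat = H Y

X = np.linspace(-3, 3, 30)[:, None]               # 30 one-dimensional inputs
Y = np.sin(X[:, 0])
centers = np.array([[-2.0], [0.0], [2.0]])        # M = 3 centers
Y_hat = rbf_fit_predict(X, Y, centers, 1.0)
print(np.mean((Y_hat - Y) ** 2))                  # small training error
```

In practice the centers would come from K-means as described above; here they are fixed by hand to keep the sketch short.<br />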
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is the EM algorithm for the Gaussian mixture model. Generally speaking, the Gaussian mixture model gives a soft clustering while <math>K</math>-means gives a hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
E-step: each point is assigned a weight for each cluster, based on its likelihood under the corresponding Gaussian. (By contrast, in hard clustering an observation is assigned 1 for the cluster whose center is closest and 0 for all other clusters.)<br />
<br />
M-step: compute the weighted means and covariances and use them as the new means and covariances of each cluster.<br />
<br />
>>[P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier, which constructs linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized in what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a high-dimensional, transformed version of the feature space.<br />
<br />
The original basis for the SVM was published in the 1960s by [http://en.wikipedia.org/wiki/Vapnik Vapnik], Chervonenkis and colleagues; however, the ideas did not gain wide attention until strong results were shown in the early 1990s.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support vector machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a maximum-margin hyperplane, or set of hyperplanes, in a high- or infinite-dimensional space. The set of points near the class boundaries, the support vectors, defines the model, which can be used for classification, regression, or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separated by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
We can see that the data points, in two classes in <math>\displaystyle \mathbb{R}^{2} </math>, can be separated by a linear boundary. If a dataset is indeed linearly separable, then there exist infinitely many possible separating hyperplanes for the training data, including the two black lines shown in the figure. However, which solution is the best when new data are introduced? <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers such as (28.4) that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: <math>\displaystyle d_i</math> is a signed quantity: it is positive for points on the <math>\displaystyle +1 </math> side of the hyperplane and negative for points on the <math>\displaystyle -1 </math> side.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>. If <math>\displaystyle \beta,\beta_0 </math> are known, we can classify a point as <math>\displaystyle sign\{d_i\}</math>.<br /><br />
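To make the pieces above concrete, here is a small numeric sketch (in Python with NumPy rather than the Matlab used later in these notes; the hyperplane and points are made up for illustration) of classifying points by the sign of <math>\displaystyle \beta^{T}x+\beta_0</math>:<br />

```python
import numpy as np

# A hypothetical separating line in R^2: x1 + x2 - 1 = 0
beta = np.array([1.0, 1.0])
beta0 = -1.0

points = np.array([[2.0, 2.0],    # beta'x + beta0 = 3  -> class +1
                   [0.0, 0.0],    # beta'x + beta0 = -1 -> class -1
                   [3.0, -1.0]])  # beta'x + beta0 = 1  -> class +1

d = points @ beta + beta0          # signed values beta'x_i + beta_0
labels = np.sign(d).astype(int)
print(labels)                      # [ 1 -1  1]
```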
<br />
====Maximum Margin Classifiers in the Linearly separable case====<br />
Choose the line farthest from both classes; that is, the line with the maximum distance from the closest point (i.e., maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since the points are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Let us rewrite <math>\displaystyle Margin=min\{y_id_i\}</math> by using the following properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Take two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence,<math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and<math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
That is, for any point <math>\displaystyle x_0</math> on the hyperplane, <math>\displaystyle \beta^{T}x_0</math> gives the negative of the intercept of the hyperplane. <br/><br />
<br />
<br />
3. The signed distance from any point <math>\displaystyle x_i </math> to the hyperplane is <math>\displaystyle d_i=\beta^{T}(x_i-x_0)</math> when <math>\displaystyle \beta </math> is a unit vector; <br/>in general, we project onto the unit normal by dividing by <math>\displaystyle \|\beta\| </math>:<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
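The properties above can be checked numerically. The following sketch (Python/NumPy, with a made-up hyperplane) verifies that <math>\displaystyle \beta </math> is orthogonal to vectors lying in the hyperplane and computes the signed distance of a point:<br />

```python
import numpy as np

# Hypothetical hyperplane in R^2: 3*x1 + 4*x2 - 12 = 0
beta = np.array([3.0, 4.0])
beta0 = -12.0

# Two points on the hyperplane (beta'x + beta0 = 0 for both)
x1 = np.array([4.0, 0.0])
x2 = np.array([0.0, 3.0])

# Property 1: beta is orthogonal to any vector lying in the hyperplane
print(beta @ (x1 - x2))                          # 0.0

# Property 3: signed distance of a point to the hyperplane
x = np.array([3.0, 4.0])
d = (beta @ x + beta0) / np.linalg.norm(beta)    # (9 + 16 - 12) / 5
print(d)                                         # 2.6
```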
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
We had <math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math>, and since we now know how to compute <math>\displaystyle d_i \Rightarrow</math> <br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane. Then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta </math> only the direction matters, and <math>\displaystyle \frac{\beta}{c} </math> has the same direction as <math>\displaystyle \beta </math>, so the hyperplane is unchanged.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=min_i\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\}=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
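As a numeric sanity check (Python/NumPy; hyperplane and data invented for illustration), for a hyperplane in canonical form the margin <math>\displaystyle min_i\{y_id_i\}</math> does come out as <math>\displaystyle 1/\|\beta\| </math>:<br />

```python
import numpy as np

# A hypothetical canonical hyperplane: the closest points satisfy y_i(beta'x_i + beta_0) = 1
beta = np.array([2.0, 0.0])
beta0 = 0.0

X = np.array([[0.5, 1.0],     # y = +1, on the margin
              [1.5, -1.0],    # y = +1
              [-0.5, 0.0],    # y = -1, on the margin
              [-2.0, 2.0]])   # y = -1
y = np.array([1, 1, -1, -1])

margin = (y * (X @ beta + beta0) / np.linalg.norm(beta)).min()
print(margin)                        # 0.5
print(1.0 / np.linalg.norm(beta))    # 0.5, the same value
```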
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie, T., Tibshirani, R., Friedman, J. (2008). The Elements of Statistical Learning, pp. 129-130<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
So far we have derived the Support Vector Machine for the case where the two classes are separable in the given feature space. The margin can be written as <math>\,min\{y_id_i\}</math>, where <math>\,d_i</math> is the distance of point <math>\,x_i</math> from the hyperplane and <math>\,y_i</math> supplies the sign.<br />
===Margin Maximizing Problem for the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, but <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is not on the hyperplane.<br />
<br />
This implies <math>\,\exists C>0</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> define a hyperplane; since only their direction matters, dividing through by a constant does not change the hyperplane. Thus, by rescaling <math>\,\beta, \beta_0</math> we absorb C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>. This makes <math>\displaystyle 1</math> the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math>.<br />
<br />
Now in order to maximize the margin <math>\,\frac{1}{\|\beta\|}</math>, we simply need to minimize <math>\,\|\beta\|</math>. <br />
<br />
In other words, our optimization problem is to minimize <math>\,\|\beta\|</math>, under the constraint that <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. There are many possible choices of norm, in general the [http://en.wikipedia.org/wiki/P-norm#p-norm p-norm]. The 1-norm of a vector is the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance), but its derivative is discontinuous, which makes it awkward to optimize. The 2-norm, or Euclidean norm (the intuitive measure of the length of a vector), is easier to work with: <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>; the constant 1/2 simplifies the derivative, and minimizing the squared norm is equivalent to minimizing the norm itself.<br />
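For instance (Python/NumPy, arbitrary made-up vector), the two norms and the objective look like:<br />

```python
import numpy as np

beta = np.array([3.0, -4.0])

norm1 = np.abs(beta).sum()      # 1-norm (taxicab distance): 7.0
norm2 = np.sqrt(beta @ beta)    # 2-norm (Euclidean length): 5.0
obj = 0.5 * (beta @ beta)       # the objective (1/2)*||beta||_2^2 = 12.5
print(norm1, norm2, obj)
```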
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the constraints are satisfied while finding an optimal solution (the optimal saddle point of the Lagrangian for this classic quadratic optimization). The problem will be solved in the dual space by introducing multipliers <math>\,\alpha_i</math> for the constraints; this is in contrast to solving the problem in the primal space as a function of the betas. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left[y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right]}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is the dual representation of the maximum-margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \ \forall i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
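The algebra above can be spot-checked numerically: for any feasible <math>\,\alpha</math> (with <math>\,\sum\alpha_iy_i=0</math>) and <math>\,\beta=\sum\alpha_iy_ix_i</math>, the primal Lagrangian and the dual form agree, for any <math>\,\beta_0</math>. A sketch in Python/NumPy with invented data:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                  # made-up training points
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# any alpha >= 0 satisfying sum(alpha_i * y_i) = 0
alpha = np.array([0.5, 1.0, 0.5, 1.0, 0.5, 0.5])
beta = (alpha * y) @ X                       # beta = sum_i alpha_i y_i x_i
beta0 = 0.7                                  # arbitrary: it drops out

primal = 0.5 * beta @ beta - np.sum(alpha * (y * (X @ beta + beta0) - 1.0))

S = np.outer(y, y) * (X @ X.T)               # S_ij = y_i y_j x_i'x_j
dual = alpha.sum() - 0.5 * alpha @ S @ alpha

print(abs(primal - dual) < 1e-10)            # True
```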
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the equation that we need to minimize, as well as two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are looking to solve for <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector Machine algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> has entries <math>\,S_{ij} = y_iy_jx_i^Tx_j</math>; equivalently, <math>\,S = (\mathrm{diag}(y)X)(\mathrm{diag}(y)X)^T</math> where the rows of <math>\,X</math> are the <math>\,x_i^T</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's built in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
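The matrix <math>\,S</math> can be built in one shot; a short Python/NumPy sketch (toy numbers) of the construction, with a spot-check against the definition:<br />

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [-1.0, -1.0]])   # rows are the training points x_i
y = np.array([1.0, 1.0, -1.0])

# S_ij = y_i y_j x_i'x_j for all pairs at once
S = np.outer(y, y) * (X @ X.T)

# spot-check one entry against the definition
print(S[0, 2], y[0] * y[2] * (X[0] @ X[2]))   # 3.0 3.0
```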
<br />
===Quadprog example===<br />
<br />
Let's use <code>quadprog</code> to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's <code>quadprog</code> function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. Since <code>quadprog</code> minimizes, and we want to maximize <math>\,L(\alpha)</math>, we pass it the negated linear term: minimizing <math>\,\frac{1}{2}\alpha^TS\alpha - \underline{1}^T\alpha</math> is the same as maximizing <math>\,\underline{1}^T\alpha - \frac{1}{2}\alpha^TS\alpha</math>.<br />
<br />
We'll use a simple one-dimensional data set: x is essentially -1 or +1 plus Gaussian noise, with matching labels y. (Note: you could put the values straight into the quadprog call; they are separated for clarity.)<br />
<br />
x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]';<br />
y = [-ones(100,1); ones(100,1)];<br />
S = (x'.*y) * (x'.*y)';   % S(i,j) = y_i*y_j*x_i'*x_j<br />
f = -ones(200,1);         % negated: quadprog minimizes, we maximize L(alpha)<br />
A = [];                   % no inequality constraints beyond the bounds<br />
b = [];<br />
Aeq = y';                 % sum(alpha_i*y_i) = 0<br />
beq = 0;<br />
lb = zeros(200,1);        % alpha_i >= 0 for every i<br />
ub = [];                  % there is no upper bound<br />
alpha = quadprog(S,f,A,b,Aeq,beq,lb,ub);<br />
<br />
This gives us the optimal <math>\,\alpha</math>. Most entries come back (numerically) zero; the handful of strictly positive <math>\,\alpha_i</math> correspond to the support vectors. (Entries that are tiny and negative are numerical noise and can be treated as zero.)<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{i}</math>. If <math>\,f</math> and <math>\,g</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial \hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum{\alpha_ig'(\hat{x})}=0</math><br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> cannot be a local minimum.<br />
<br />
These are all straightforward except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that actually determine the optimal separating hyperplane we are looking for. They are also the most difficult points to classify, and the most informative for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br\>In order for this condition to be satisfied either <br/><math>\,\alpha_i= 0</math> or <br/><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
Every point <math>\,x_i</math> satisfies <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1</math>: it is either exactly on the margin (equality) or beyond it.<br />
<br />
'''Case 1: a point with <math>\displaystyle y_i(\beta^Tx_i+\beta_0) > 1</math> (beyond the margin)'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point with <math>\displaystyle y_i(\beta^Tx_i+\beta_0) = 1</math> (on the margin)'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
<br\>If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
Points on the margin, with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
<br />
===Using support vectors===<br />
<br />
Support vectors are important because they mean the solution depends on only a small part of the data. If <math>\,\alpha_i = 0</math>, the corresponding term contributes nothing to the solution of the SVM problem; only points on the margin -- the support vectors -- contribute. Hence the model given by SVM is entirely defined by the set of support vectors, a subset of the entire training set. This is interesting because in earlier neural network methods (and, more generally, in classical statistical learning) the configuration of the network had to be specified in advance. Here we have a data-driven, 'nonparametric' model: the training set and the algorithm determine the support vectors.<br />
<br />
References:<br />
Wang, L. (2005). Support Vector Machines: Theory and Applications. Springer, p. 3<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
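For the smallest possible worked instance of these three steps (Python/NumPy, with two invented training points, one per class), the equality constraint forces <math>\,\alpha_1=\alpha_2</math>, so the quadratic program of step 1 can be solved in closed form instead of with <code>quadprog</code>:<br />

```python
import numpy as np

# Toy training set: one point per class
X = np.array([[1.0, 1.0],
              [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# Step 1: with two points, sum(alpha_i y_i) = 0 forces alpha_1 = alpha_2 = a,
# so L(a) = 2a - (1/2) a^2 * sum(S), maximized at a = 2 / sum(S)
S = np.outer(y, y) * (X @ X.T)
a = 2.0 / S.sum()
alpha = np.array([a, a])                  # a = 0.25 here

# Step 2: beta = sum_i alpha_i y_i x_i
beta = (alpha * y) @ X                    # [0.5, 0.5]

# Step 3: beta_0 from a support vector: y_1 (beta'x_1 + beta_0) = 1
beta0 = 1.0 / y[0] - beta @ X[0]          # 0.0

print(y * (X @ beta + beta0))             # [1. 1.] -- both points on the margin
print(1.0 / np.linalg.norm(beta))         # the margin, sqrt(2)
```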
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab built-in SVM routines (found in the Bioinformatics toolkit) to do classification through support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]<br />
=='''Non-linear hypersurfaces and Non-Separable classes - November 20, 2009'''==<br />
==='''Kernel Trick'''===<br />
We talked about the curse of dimensionality at the beginning of this course; now we turn to the power of high dimensions in order to find a hyperplane that linearly separates two classes of data points. To understand this, imagine a two-dimensional prison constraining a two-dimensional person. If we magically give the person a third dimension, he can escape from the prison: the prison and the person are now linearly separable with respect to the third dimension. The intuition behind the "kernel trick" is to map data to a higher dimension so that the classes become separable by a hyperplane.<br />
<br />
We have seen SVM as a linear classification problem finding the max-margin hyperplane in the given input space. However, many real-world problems require a more complex decision boundary. The following simple method was devised in order to solve the same linear classification problem, but in a (usually higher-dimensional) 'feature space' in which the max-margin hyperplane is better suited.<br />
<br />
Let <math>\,\phi</math> be a mapping,<br />
<br />
<math>\phi:\Re^d \rightarrow \Re^D </math><br /><br /><br />
<br />
We wish to find a <math>\,\phi</math> such that the transformed data are suited for separation by a hyperplane. Given this function, we are led to solving the previous constrained quadratic optimization on the transformed dataset,<br /><br /><br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br /><br /><br />
<br />
The solution to this optimization problem is now well known; however, a workable <math>\,\phi</math> must be determined. Possibly the largest drawback of this method is that we must compute the inner product of two vectors in the high-dimensional space. As the number of dimensions increases, the inner product becomes computationally intensive or impossible to compute directly.<br />
<br />
However, we have a very useful result that says that there exists a class of functions, <math>\,\Phi</math>, which satisfy the above requirements and that for any function <math>\,\phi \in \Phi</math>,<br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j) </math><br /><br /><br />
<br />
Where K is a kernel function in the input space satisfying [http://en.wikipedia.org/wiki/Mercer%27s_condition Mercer's condition] (which guarantees that it indeed corresponds to some mapping function <math>\,\phi</math>). As a result, if the objective function depends on the data only through inner products, we can use the kernel function to compute them implicitly in the feature space, without ever representing the high-dimensional vectors. Not only does this solve the computational problem, it also frees us from explicitly determining a specific mapping function. In fact, it is possible to use an infinite-dimensional feature space in SVM without even knowing the function <math>\,\phi</math>.<br />
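A concrete (made-up) instance of this identity in Python/NumPy: for the degree-2 polynomial kernel on <math>\Re^2</math>, the explicit feature map <math>\,\phi(x)=(x_1^2,\sqrt{2}x_1x_2,x_2^2)</math> reproduces <math>\,K(x,z)=(x^Tz)^2</math>, so in practice we never need to form <math>\,\phi</math> at all:<br />

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map on R^2
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def K(x, z):
    # homogeneous polynomial kernel of degree 2
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# the two quantities agree up to floating-point rounding
print(phi(x) @ phi(z), K(x, z))
```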
<br />
==='''Mercer's Theorem in detail'''===<br />
Let <math>\,\phi</math> be a mapping to a high dimensional [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] <math>\,H</math><br /><br />
<br />
<br />
<math>\phi:x \in \Re^d \rightarrow H </math><br /><br /><br />
<br />
The transformed coordinates can be defined as,<br /><br />
<br />
<math>\phi_1(x)\dots\phi_d(x)\dots </math><br /><br /><br />
<br />
By Hilbert - Schmidt theory we can represent an inner product in Hilbert space as,<br /><br /><br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = \sum_{r=1}^{\infty}a_r\phi_r(x_i)\phi_r(x_j) \Leftrightarrow K(x_i,x_j), \ a_r \ge 0 </math><br /><br /><br />
where K is symmetric, then Mercer's theorem gives necessary and sufficient conditions on K for it to satisfy the above relation.<br><br><br />
<br />
'''Mercer's Theorem'''<br />
<br />
Let C be a compact subset of <math>\Re^d</math> and K a function <math> \in L^2(C) </math>, if<br /><br /><br />
<br />
<math>\, \int_C\int_C K(u,v)g(u)g(v)dudv \ge 0, \ \forall g \in L^2(C)</math> <br /><br /><br />
<br />
then,<br /><br /><br />
<br />
<math>\sum_{r=1}^{\infty}a_r\phi_r(u)\phi_r(v)</math> converges absolutely and uniformly to a symmetric function <math>\,K(u,v)</math><br />
<br />
References:<br />
Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons, p. 423<br />
<br />
==='''Kernel Functions'''===<br />
There are various kernel functions, for example:<br />
<br />
* Linear kernel: <math>\,k(x,y)=x \cdot y</math><br />
* Polynomial kernel: <math>\,k(x,y)=(x \cdot y)^d</math><br />
* Gaussian kernel: <math>\,k(x,y)=e^{-\frac{\|x-y\|^2}{2\sigma^2}}</math><br />
<br />
If <math>\,X</math> is a <math>\,d \times n</math> matrix in the original space, and <math>\,\phi(X)</math> is a <math>\,D \times n</math> matrix in the [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] (good explanation video: [http://www.youtube.com/watch?v=V2pBdH7YzX0 part 1] [http://www.youtube.com/watch?v=YRY5xlk3TC0 part 2]), then <math>\,\phi^T(X) \cdot \phi(X)</math> is an <math>\,n \times n</math> matrix. <br />
The inner product is also illustrated as correlation, which measures the similarity between data points. This gives us some insight in how to choose the kernel. The choice depends on certain prior knowledge of the problem and on how we believe the similarity of our data should be measured. In practice, the Gaussian (RBF) kernel usually works best. Besides the most common kernel functions mentioned above, many novel kernels are also suggested for different problem domains like text classification, gene classification and so on.<br />
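A sketch (Python/NumPy, toy points) of building the Gaussian (RBF) Gram matrix, which is all an SVM needs from the data:<br />

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 3.0]])
K = rbf_gram(X)

print(np.allclose(K, K.T))           # symmetric: True
print(np.allclose(np.diag(K), 1.0))  # K(x,x) = 1 for the Gaussian kernel: True
```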
<br />
These kernel functions can be applied to many algorithms to derive the "kernel version". For example, kernel PCA, kernel LDA, etc..<br />
<br />
==='''SVM: non-separable case'''===<br />
We have seen how SVMs find an optimal separating hyperplane between two separable classes of data, in which case the margin contains no data points. In the real world, however, data from different classes usually mix together at the boundary, and it is often impossible to find a perfect boundary that totally separates them. To address this problem, we relax the classification rule to allow some data points to cross the margin. Mathematically, the problem becomes:<br />
:<math>\min_{\beta, \beta_0} \frac{1}{2}|\beta|^2</math><br />
:<math>\,y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math><br />
:<math>\xi_i \geq 0</math><br />
<br />
Now each data point is allowed some error <math>\,\xi_i</math>. However, we only want points to cross the boundary when they must, with minimum total sacrifice; thus a penalty term is added to the objective function to limit how far points stray past the margin. The optimization problem becomes:<br />
<br />
:<math>\min_{\beta, \beta_0, \xi} \frac{1}{2}|\beta|^2+\gamma\sum_{i=1}^n{\xi_i}</math><br />
:<math>\,s.t.</math> <math>y_i(\beta^Tx+\beta_0) \geq 1-\xi_i</math> <br />
:<math>\xi_i \geq 0</math><br />
<br />
[[File:non-separable.JPG|350px|thumb|right|Figure non-separable case]]<br />
<br />
<br\>Note that <math>\,\xi_i</math> is not necessarily smaller than one, which means data can not only enter the margin but can also cross the separating hyperplane.<br />
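Given a candidate hyperplane, the smallest slack satisfying the constraint is <math>\,\xi_i=max(0,\ 1-y_i(\beta^Tx_i+\beta_0))</math>. A quick Python/NumPy illustration (hyperplane and points made up) showing the three regimes -- outside the margin, inside the margin, and across the separating hyperplane:<br />

```python
import numpy as np

# A hypothetical fixed hyperplane and three points, all of class +1
beta = np.array([1.0, 0.0])
beta0 = 0.0

X = np.array([[2.0, 0.0],    # outside the margin:  xi = 0
              [0.5, 1.0],    # inside the margin:   0 < xi < 1
              [-1.0, 0.0]])  # wrong side entirely: xi > 1
y = np.array([1.0, 1.0, 1.0])

xi = np.maximum(0.0, 1.0 - y * (X @ beta + beta0))
print(xi)    # [0, 0.5, 2]
```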
<br />
References:<br />
<br />
Mercer, J. (1909). Functions of positive and negative type and their connection<br />
with the theory of integral equations. Philos. Trans. Roy. Soc. London, A<br />
209:415-446<br />
<br />
==Support Vector Machine algorithm for non-separable cases - November 23, 2009==<br />
<br />
With the program formulation above, we can form the lagrangian, apply KKT conditions, and come up with a new function to optimize. As we will see, the equation that we will attempt to optimize in the SVM algorithm for non-separable data sets is the same as the optimization for the separable case, with slightly different conditions.<br />
<br />
===Forming the Lagrangian===<br />
<br />
:<math>L: \frac{1}{2} |\beta|^2 + \gamma \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]-\sum_{i=1}^n \lambda_i \xi_i</math><br />
:<math>\alpha_i \geq 0, \lambda_i \geq 0</math><br />
<br />
===Applying KKT conditions===<br />
# <math>\frac{\partial L}{\partial \beta}=\beta - \sum_{i=1}^n \alpha_i y_i x_i = 0 \Rightarrow \beta=\sum_{i=1}^n\alpha_i y_i x_i</math> <br\><math>\frac{\partial L}{\partial \beta_0}=-\sum_{i=1}^n \alpha_i y_i =0 \Rightarrow \sum_{i=1}^n \alpha_i y_i =0</math> since the sign does not make a difference<br />
#<math>\frac{\partial L}{\partial \xi_i}=\gamma - \alpha_i - \lambda_i \Rightarrow \gamma = \alpha_i+\lambda_i</math><br />
#<math>\,\alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]=0</math> and <math>\,\lambda_i \xi_i=0</math><br />
<br />
===Putting it all together===<br />
<br />
With our KKT conditions and the Lagrangian equation, we can now use quadratic programming to find <math>\,\alpha</math>. <br /> As in the separable case, after applying the KKT conditions we substitute the primal variables, expressed in terms of the dual variables, back into the Lagrangian and simplify.<br />
<br />
<br />
In matrix form, we want to solve the following optimization:<br />
:<math>\max_{\alpha} L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:<math>\,s.t.</math> <math>\underline{0} \leq \underline{\alpha} \leq \gamma</math>, <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Solving this gives us <math>\,\underline{\alpha}</math>, which we can use to find <math>\,\underline{\beta}</math> as before:<br />
:<math>\,\underline{\beta} = \sum{\alpha_i y_i \underline{x_i}}</math><br />
<br />
However, we cannot find <math>\,\beta_0</math> in the same way as before, even if we choose a point with <math>\,\alpha_i > 0</math>, because we do not know the value of <math>\,\xi_i</math> in the equation<br />
:<math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 + \xi_i = 0</math><br />
<br />
From our discussion on the KKT conditions, we know that <math>\,\lambda_i \xi_i = 0</math> and <math>\,\gamma = \alpha_i + \lambda_i</math>.<br />
<br />
So, if <math>\,\alpha_i < \gamma</math> then <math>\,\lambda_i > 0</math> and consequently <math>\,\xi_i = 0</math>.<br />
<br />
Therefore, we can solve for <math>\,\beta_0</math> if we choose a point where:<br />
:<math>\,0 < \alpha_i < \gamma</math><br />
<br />
====The SVM algorithm for non-separable data sets====<br />
<br />
The algorithm, then, for non-separable data sets is:<br />
<br />
# Use <code>quadprog</code> (or another quadratic programming technique) to solve the above optimization and find <math>\,\alpha</math><br />
# Find <math>\,\underline{\beta}</math> by solving <math>\,\underline{\beta} = \sum{\alpha_i y_i x_i}</math><br />
# Find <math>\,\beta_0</math> by choosing a point where <math>\,0 < \alpha_i < \gamma</math> and then solving <math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 = 0</math></div>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classfication-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data-mining, bioinformatics, and machine learning, classification has becomes a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to learn from data. Potential application areas include handwritten post codes recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
'''Definition''': The problem of Prediction a discrete random variable <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math> is called Classification.<br />
<br />
In classification,, we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a '''classification rule''' <math>\,h</math> such that<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, drawn independently from identical distributions (i.i.d.), <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>, <math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''': Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{color}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented, our rule <math>\,h</math> can classify it based on its features <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> gives the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:The '''true error rate''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:The '''empirical error rate (training error rate)''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, \hat{L}(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is the indicator <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
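The empirical error rate above can be sketched in a few lines. This is an illustrative sketch; the classifier and labels below are hypothetical, not from the course data.<br />

```python
# Empirical (training) error rate: fraction of training points misclassified by h.
def empirical_error_rate(h, X, Y):
    """h: classification rule, X: inputs, Y: true labels."""
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(Y)

# Hypothetical one-dimensional rule: classify as 1 when x > 0.
h = lambda x: 1 if x > 0 else 0
X = [-2.0, -0.5, 0.3, 1.7]
Y = [0, 1, 1, 1]  # the second point is misclassified by h
print(empirical_error_rate(h, X, Y))  # 0.25
```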
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y\in \mathcal{Y}</math> that <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>. Define <math>\,r(X)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
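The two-class Bayes rule can be sketched directly from the formula for <math>\,r(X)</math>. This assumes the likelihoods and priors are known; the numerical values are hypothetical.<br />

```python
def posterior_r(lik1, lik0, prior1):
    """r(x) = P(Y=1|X=x) via Bayes' formula, from the class-conditional
    likelihoods P(X=x|Y=1), P(X=x|Y=0) and the prior P(Y=1)."""
    prior0 = 1.0 - prior1
    return lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

def bayes_rule(lik1, lik0, prior1):
    # Classify as 1 exactly when r(x) > 1/2.
    return 1 if posterior_r(lik1, lik0, prior1) > 0.5 else 0

# Hypothetical likelihood values at some observed x:
print(posterior_r(0.3, 0.1, 0.5))  # 0.75, so the rule outputs class 1
```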
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimensional settings)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remarks:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since in practice the prior probability cannot realistically be determined.<br />
<br />
<br />
'''Example''':<br /><br />
We’re going to predict whether a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
whether the student’s GPA > 3.0 (G);<br />
whether the student had a strong math background (M);<br />
whether the student is a hard worker (H);<br />
whether the student passed or failed the course.<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0; that is, we predict that he will fail the course.<br />
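The arithmetic of this worked example can be checked in a short sketch. The specific class-conditional values used here, <math>\,P(X=(0,1,0)|Y=1)=0.05</math> and <math>\,P(X=(0,1,0)|Y=0)=0.2</math>, are assumptions back-solved from the quoted 0.025 and 0.125, since the underlying table is in the image above.<br />

```python
# Assumed (back-solved) class-conditional probabilities for X = (G=0, M=1, H=0):
p_x_given_pass, p_x_given_fail = 0.05, 0.20
prior_pass = prior_fail = 0.5

num = p_x_given_pass * prior_pass         # 0.025, the numerator above
den = num + p_x_given_fail * prior_fail   # 0.125, the denominator above
r = num / den                             # r(X) = 0.2 < 1/2
print(r, "-> predict fail" if r <= 0.5 else "-> predict pass")
```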
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian-network-augmented naive Bayes (BAN), and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as having an objective existence. In fact, they represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics there have been two major schools: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between Bayesians and frequentists.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables that have a given distribution, and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, such as a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (object) and then judges whether his name is Jack (label). In the frequentist method, by contrast, one does not see the man (object), but judges from photos (label) of him whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
<math>\,Y</math> takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Although it represents the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math>. <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of this third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known, so some estimate of them must be made if we want to classify data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
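The boundary derived above can be sketched numerically. This is an illustrative example, not the course code: the shared covariance is chosen diagonal so that its inverse is elementwise, and the means and priors are hypothetical. With equal priors, the midpoint between the two means lies exactly on the linear boundary <math>\,a^\top x+b=0</math>.<br />

```python
from math import log

# Shared diagonal covariance (diagonal, so the inverse is elementwise).
sigma_diag = [2.0, 0.5]
inv_sigma = [1.0 / s for s in sigma_diag]
mu_k, mu_l = [1.0, 2.0], [3.0, -1.0]
pi_k = pi_l = 0.5

# Boundary a^T x + b = 0, read off from the derivation:
#   a = Sigma^{-1} (mu_k - mu_l)
#   b = log(pi_k/pi_l) - (mu_k^T Sigma^{-1} mu_k - mu_l^T Sigma^{-1} mu_l) / 2
a = [iv * (mk - ml) for iv, mk, ml in zip(inv_sigma, mu_k, mu_l)]
quad = lambda mu: sum(iv * m * m for iv, m in zip(inv_sigma, mu))
b = log(pi_k / pi_l) - (quad(mu_k) - quad(mu_l)) / 2.0

# With equal priors, the midpoint between the means is on the boundary.
midpoint = [(mk + ml) / 2.0 for mk, ml in zip(mu_k, mu_l)]
value = sum(ai * xi for ai, xi in zip(a, midpoint)) + b
print(a, b, value)  # value is 0: the midpoint satisfies a^T x + b = 0
```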
<br />
===QDA===<br />
The concept is the same idea of finding a boundary where the error rate for classification between classes are equal, except the assumption that each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math> is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because, unlike in LDA, the <math>\,x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x</math> term does not vanish when the class covariances differ.<br />
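Evaluating the discriminant scores directly illustrates the effect of unequal covariances. The following is a sketch with hypothetical classes; diagonal covariances keep the determinant and inverse simple, and the point is classified by comparing <math>\,\delta_k(x)</math> across classes.<br />

```python
from math import log

def delta(x, mu, sigma_diag, prior):
    """Quadratic discriminant score for one class with diagonal covariance:
    -log|Sigma|/2 - (x-mu)^T Sigma^{-1} (x-mu)/2 + log(prior)."""
    log_det = sum(log(s) for s in sigma_diag)
    maha = sum((xi - mi) ** 2 / s for xi, mi, s in zip(x, mu, sigma_diag))
    return -0.5 * log_det - 0.5 * maha + log(prior)

# Two hypothetical classes with different covariances: boundary is quadratic.
classes = {
    "k": ([0.0, 0.0], [1.0, 1.0], 0.5),  # (mean, diagonal covariance, prior)
    "l": ([4.0, 0.0], [4.0, 4.0], 0.5),
}

def classify(x):
    return max(classes, key=lambda c: delta(x, *classes[c]))

print(classify([0.5, 0.0]))  # near class k's mean
print(classify([6.0, 0.0]))  # near class l's mean
```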
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
We do not know the true parameters, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
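The sample estimates above can be sketched on a small hypothetical one-dimensional data set. The data below are illustrative only; note the ML covariance estimate divides by <math>\,n_k</math> rather than <math>\,n_k-1</math>, and the common covariance pools the class estimates weighted by class size.<br />

```python
# Hypothetical 1-D training data with class labels.
data = [(1.0, "k"), (2.0, "k"), (3.0, "k"), (10.0, "l"), (12.0, "l")]
n = len(data)

pi_hat, mu_hat, var_hat, n_c = {}, {}, {}, {}
for c in {y for _, y in data}:
    xs = [x for x, y in data if y == c]
    n_c[c] = len(xs)
    pi_hat[c] = len(xs) / n                 # prior estimate n_k / n
    mu_hat[c] = sum(xs) / len(xs)           # class mean
    # ML covariance estimate: divide by n_k, not n_k - 1.
    var_hat[c] = sum((x - mu_hat[c]) ** 2 for x in xs) / len(xs)

# Common (pooled) covariance for LDA: class estimates weighted by n_k.
var_pooled = sum(n_c[c] * var_hat[c] for c in n_c) / n

print(pi_hat, mu_hat, var_pooled)
```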
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
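The transformation can be sketched for the simple case where <math>\,\Sigma</math> is already diagonal, so that <math>\,U=I</math> and <math>\,S</math> holds the variances (a hypothetical example, chosen to avoid a full eigendecomposition): scaling each coordinate by <math>\,S^{-1/2}</math> turns the Mahalanobis distance into a plain Euclidean distance, exactly as in Case 1.<br />

```python
from math import sqrt

sigma_diag = [4.0, 0.25]  # diagonal Sigma, so U = I and S = diag(sigma_diag)
mu = [1.0, 2.0]
x = [3.0, 3.0]

# Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu) in the original space.
maha = sum((xi - mi) ** 2 / s for xi, mi, s in zip(x, mu, sigma_diag))

# Transform x* = S^{-1/2} U^T x (here U = I), then use Euclidean distance.
star = lambda v: [vi / sqrt(s) for vi, s in zip(v, sigma_diag)]
x_star, mu_star = star(x), star(mu)
eucl = sum((a - b) ** 2 for a, b in zip(x_star, mu_star))

print(maha, eucl)  # equal: classification proceeds as in Case 1
```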
<br />
Note that when we have multiple classes, they must all use the same transformation; otherwise, we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for this method of classification to be applicable, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose you have two classes with different shapes, and consider transforming them to the same shape. Given a data point, which transformation should you use to decide its class? For example, if you use the transformation of class A, then you have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare a given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
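The two counts above can be computed directly; the second pair of values below uses <math>\,d=64</math>, the dimensionality of the 2_3 data before PCA, as an illustration.<br />

```python
def lda_params(d, K):
    # Each of the K-1 pairwise differences a^T x + b needs d + 1 parameters.
    return (K - 1) * (d + 1)

def qda_params(d, K):
    # Each difference x^T a x + b^T x + c needs d(d+3)/2 + 1 parameters.
    return (K - 1) * (d * (d + 3) // 2 + 1)

print(lda_params(2, 2), qda_params(2, 2))    # 3 6
print(lda_params(64, 2), qda_params(64, 2))  # 65 2145
```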
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: an analysis of the <code>princomp</code> function in Matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function, but here we analyze its code to see how it differs from the SVD method. The following is the code of <code>princomp</code> with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in<br />
% SCORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing it with the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that <code>y = score</code> and <code>v = U</code>.<br />
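The same equivalence can be checked outside Matlab. Below is a small Python/NumPy sketch (all variable names are ours) showing that the loadings from an SVD of the centered data match the eigenvectors of the sample covariance matrix up to a sign per column, which is exactly why <code>princomp</code> and the SVD method agree.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))       # rows are observations, as princomp expects

# Center by subtracting column means (what princomp does internally)
Xc = X - X.mean(axis=0)

# SVD route: right singular vectors of the centered data are the loadings
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
coeff_svd = Vt.T                    # columns play the role of princomp's pc

# Eigendecomposition route: eigenvectors of the sample covariance
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
coeff_eig = evecs[:, np.argsort(evals)[::-1]]   # sort by decreasing variance

# The two loading matrices agree up to a sign per column
for j in range(X.shape[1]):
    assert abs(abs(coeff_svd[:, j] @ coeff_eig[:, j]) - 1) < 1e-8
print("SVD and covariance-eigendecomposition loadings agree (up to sign)")
```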
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> covariance parameters to estimate make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector and <math>\,x \in \mathbb{R}^d</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
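To make the augmentation concrete, here is a small Python/NumPy sketch (the helper <code>augment_quadratic</code> is our own, not a library function). It builds <math>x^{*}</math> by appending squared features to a one-dimensional toy problem that no linear boundary can separate, and then a plain linear discriminant direction in the augmented space separates the classes.<br />

```python
import numpy as np

def augment_quadratic(X):
    """Append the elementwise square of each feature: x* = [x_1..x_d, x_1^2..x_d^2]."""
    return np.hstack([X, X ** 2])

# A toy 1-D problem no linear boundary separates: class 1 sits between class 0
X = np.array([[-3.0], [-2.5], [2.5], [3.0], [-0.5], [0.0], [0.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1])

Xs = augment_quadratic(X)              # now 2-D: (x, x^2)

# Simple two-class LDA direction: w = Sw^{-1}(mu1 - mu0), with pooled covariance
mu0, mu1 = Xs[y == 0].mean(axis=0), Xs[y == 1].mean(axis=0)
Sw = np.cov(Xs[y == 0], rowvar=False) + np.cov(Xs[y == 1], rowvar=False)
w = np.linalg.solve(Sw, mu1 - mu0)
thresh = w @ (mu0 + mu1) / 2           # midpoint between projected means
pred = (Xs @ w > thresh).astype(int)
print((pred == y).mean())              # the augmented linear rule separates the classes
```

The linear boundary in <math>(x, x^2)</math> space collapses to a quadratic boundary (an interval) in the original <math>x</math> space.<br />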
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the pre-knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To be able to separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, and so it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say that the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
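This proportionality is easy to verify numerically. The following Python/NumPy sketch (variable names are ours) generates two Gaussian clouds like those in the Matlab example and checks that the top eigenvector of <math>S_{W}^{-1}S_{B}</math> is parallel to the closed-form direction <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1, 1], Sigma, 300)
X2 = rng.multivariate_normal([5, 3], Sigma, 300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)   # within-class covariance
Sb = np.outer(mu1 - mu2, mu1 - mu2)                        # between-class covariance

# Top eigenvector of Sw^{-1} Sb ...
evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
w_eig = np.real(evecs[:, np.argmax(np.real(evals))])
w_eig /= np.linalg.norm(w_eig)

# ... is parallel to the closed-form direction Sw^{-1}(mu1 - mu2)
w_closed = np.linalg.solve(Sw, mu1 - mu2)
w_closed /= np.linalg.norm(w_closed)

print(abs(w_eig @ w_closed))   # |cosine| of the angle between the two directions
```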
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation for each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(It is more reasonable to have at least 2 directions)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \frac{1}{n_{i}}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not easy to obtain. One of the simplifications<br />
is that we may assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is another generalization of <math>\mathbf{S}_{B}</math>. Denote a<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
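Note that this decomposition holds exactly for the un-normalized scatter matrices (sums of outer products, without the <math>\frac{1}{n_{i}}</math> factors). A quick Python/NumPy check (variable names are ours):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
k, d = 3, 4
# Three classes of different sizes, with different means
X = np.vstack([rng.normal(c, 1.0, size=(50 + 10 * c, d)) for c in range(k)])
y = np.concatenate([np.full(50 + 10 * c, c) for c in range(k)])
mu = X.mean(axis=0)                      # total mean

St = (X - mu).T @ (X - mu)               # total scatter
Sw = np.zeros((d, d))
Sb = np.zeros((d, d))
for c in range(k):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    Sw += (Xc - mu_c).T @ (Xc - mu_c)    # within-class scatter
    Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter

print(np.allclose(St, Sw + Sb))          # total = within + between
```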
<br />
Recall that in the two-class problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Apparently, they are very similar.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem, a Lagrange multiplier <math>\Lambda</math>, which is actually a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\mathbf{k-1}</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\le k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
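The whole multi-class procedure can be sketched compactly. The following Python/NumPy illustration (the helper <code>fda_multiclass</code> is our own name, and it uses un-normalized scatter matrices) builds <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math> and keeps the eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> with the largest <math>k-1</math> eigenvalues.<br />

```python
import numpy as np

def fda_multiclass(X, y, k_out=None):
    """Project onto the top eigenvectors of Sw^{-1} Sb (un-normalized scatter)."""
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)               # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    if k_out is None:
        k_out = len(classes) - 1        # at most k-1 meaningful directions
    return evecs[:, order[:k_out]].real

rng = np.random.default_rng(4)
# Three 5-dimensional classes, separated only in the first two coordinates
means = np.zeros((3, 5))
means[0, 0] = 6.0
means[1, 1] = 6.0
means[2, :2] = -6.0
X = np.vstack([rng.normal(m, 1.0, size=(60, 5)) for m in means])
y = np.repeat([0, 1, 2], 60)

W = fda_multiclass(X, y)                # a 5 x 2 projection matrix
print(W.shape)
```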
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,x_{1}, ..., x_{d}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that the vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, e.g. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, e.g. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Setting the first derivative to zero,<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
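As a numerical check, the closed-form solution and the hat matrix can be verified on synthetic data; a Python/NumPy sketch (the data and noise level are illustrative assumptions):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # 1 in the first position
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + 0.01 * rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed via a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# hat matrix H = X (X^T X)^{-1} X^T puts the "hat" on y
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
print(np.allclose(y_hat, X @ beta_hat))  # True
```

The hat matrix is a projection, so <math>\mathbf{H}\mathbf{H}=\mathbf{H}</math>, and the fitted values agree with <math>\mathbf{X}\hat\beta</math> up to floating-point error.<br />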
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample'; ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, so that each column is an input vector with an intercept component.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== The Logistic Function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y\, dx = \ln(1 + e^{x})</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
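These properties are easy to verify numerically; a small Python/NumPy sketch (the grid and tolerances are illustrative assumptions):<br />

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 601)          # grid with step 0.01
y = sigmoid(x)

# property 1: dy/dx = y(1 - y), checked at interior grid points
dy = np.gradient(y, x)
assert np.allclose(dy[1:-1], (y * (1 - y))[1:-1], atol=1e-4)

# property 2: y(0) = 1/2
assert sigmoid(0.0) == 0.5

# property 4: Taylor series 1/2 + x/4 - x^3/48 + x^5/480 near 0
mask = np.abs(x) < 1
series = 0.5 + x / 4 - x**3 / 48 + x**5 / 480
assert np.allclose(y[mask], series[mask], atol=1e-3)
print("all properties verified")
```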
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
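Explicitly, exponentiating the log odds and solving for <math>\,P(Y=1|X=x)</math> gives<br />
<br />
:<math><br />
\begin{align}<br />
\frac{P(Y=1|X=x)}{1-P(Y=1|X=x)}&=\exp(\beta^Tx)\\<br />
P(Y=1|X=x)&=\exp(\beta^Tx)\left(1-P(Y=1|X=x)\right)\\<br />
P(Y=1|X=x)\left(1+\exp(\beta^Tx)\right)&=\exp(\beta^Tx)\\<br />
P(Y=1|X=x)&=\frac{\exp(\beta^Tx)}{1+\exp(\beta^Tx)}<br />
\end{align}<br />
</math><br />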
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood, using Pr(Y|X). The maximum likelihood of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the known distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained if we first reduce the occurrences of <math>\underline{\beta}</math> using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiate <math>\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math> with respect to <math>\underline{\beta}^T</math>.<br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
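The matrix and summation forms of the WLS estimator agree, which can be checked in a short Python/NumPy sketch (synthetic data and weights are illustrative assumptions; <math>X</math> stores one observation per column, matching the notation above):<br />

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 40
X = rng.normal(size=(d, n))            # d x n: one observation per column
beta_true = np.array([1.0, -2.0, 0.5])
y = X.T @ beta_true + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)      # positive weights w_i > 0
W = np.diag(w)

# matrix form: beta_WLS = (X W X^T)^{-1} X W y
beta_wls = np.linalg.solve(X @ W @ X.T, X @ W @ y)

# summation form: [sum w_i x_i x_i^T]^{-1} [sum w_i x_i y_i]
A = sum(w[i] * np.outer(X[:, i], X[:, i]) for i in range(n))
b = sum(w[i] * X[:, i] * y[i] for i in range(n))
print(np.allclose(beta_wls, np.linalg.solve(A, b)))  # True
```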
<br />
Each Newton step thus performs a weighted linear regression on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:'''Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model like <math>\underline{\beta}^T\underline{x}</math>. If we construct the model like <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then similar to linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, convergence is not guaranteed in general. The procedure will usually converge, since the log-likelihood function is concave; otherwise, only local convergence of the method can be proven, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, choosing an appropriate initial value is rarely a problem: it is uncommon for an initial point to lie so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
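The pseudo code above can be sketched in Python with NumPy (a toy illustration with simulated data; <math>X</math> is stored as a <math>d\times n</math> matrix as in this section, and the data-generating parameters are assumptions):<br />

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_irls(X, y, tol=1e-8, max_iter=100):
    """Iteratively reweighted least squares for logistic regression.
    X is d x n (one observation per column), y is a 0/1 vector of length n."""
    d, n = X.shape
    beta = np.zeros(d)                       # step 1: beta <- 0
    for _ in range(max_iter):
        p = sigmoid(X.T @ beta)              # step 3: P(x_i; beta)
        w = p * (1 - p)                      # step 4: diagonal of W
        z = X.T @ beta + (y - p) / w         # step 5: adjusted response Z
        beta_new = np.linalg.solve((X * w) @ X.T, (X * w) @ z)  # step 6
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new                  # step 7: converged
        beta = beta_new
    return beta

# toy data: two Gaussian clouds; first row of X is the constant 1 (intercept)
rng = np.random.default_rng(3)
n_half = 100
pts = np.vstack([rng.normal(-1, 1, size=(n_half, 2)),
                 rng.normal(+1, 1, size=(n_half, 2))])
X = np.vstack([np.ones(2 * n_half), pts.T])  # 3 x 200
y = np.repeat([0.0, 1.0], n_half)
beta = logistic_irls(X, y)
accuracy = np.mean((sigmoid(X.T @ beta) > 0.5) == y)
print(accuracy > 0.8)
```

Note that <math>(X*w)X^T</math> computes <math>XWX^T</math> without forming the <math>n\times n</math> diagonal matrix explicitly.<br />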
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. the dimension.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. the dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to fit a logistic regression and classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Viewing the fitting of these equations as a weighted least squares problem makes the estimates easier to derive.<br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem, since we don't have the same simplification.<br />
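The K-class posterior computation can be sketched in Python with NumPy (the function and example coefficients are illustrative assumptions; the K-th class is the reference, with its coefficient vector implicitly zero):<br />

```python
import numpy as np

def multiclass_posteriors(B, x):
    """B is (K-1) x d, one row beta_i per non-reference class; x has length d.
    Returns the K posteriors P(Y=i|X=x), i = 1..K (reference class last)."""
    scores = np.exp(B @ x)                         # exp(beta_i^T x), i = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)  # class K gets 1/denom

B = np.array([[1.0, -0.5],
              [0.2,  0.3]])                        # K = 3 classes, d = 2
p = multiclass_posteriors(B, np.array([0.4, 1.0]))
print(np.isclose(p.sum(), 1.0))  # posteriors sum to one
```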
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
A separating hyperplane classifier tries to separate the data using linear decision boundaries. When the classes overlap, it can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged and transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>sign(\underline{\beta}^T \underline{x} + {\beta}_0) = sign(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math><br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the [http://en.wikipedia.org/wiki/Artificial_neural_network Artificial Neural Network] models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Due to the iterative nature of the solution, the problem is not convex and has no unique global minimum. The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable, the algorithm is shown to converge to a solution in a finite number of steps. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br />
<br />
<br />
If the separating hyperplane between the 2 classes is not unique, the perceptron algorithm may return any of the infinitely many solutions.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line. We just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> corresponds to the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by<math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math>, up to the scaling factor <math>1/\|\underline{\beta}\|</math>, which does not affect the argument and is dropped here. <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is (proportional to) the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
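In code, the whole procedure can be sketched as follows (an illustrative Python sketch, not the course code; the labels are assumed to be coded as ±1):

```python
import numpy as np

def perceptron_train(X, y, rho=0.1, max_iter=1000):
    """Perceptron algorithm: X is an (n, d) data matrix, y holds labels in {-1, +1}.
    Returns (beta, beta0). A sketch; it converges only if the data are separable."""
    n, d = X.shape
    beta = np.zeros(d)      # initial guess: an arbitrary hyperplane
    beta0 = 0.0
    for _ in range(max_iter):
        misclassified = 0
        for i in range(n):
            # y_i * (beta^T x_i + beta0) <= 0 means x_i is on the wrong side
            if y[i] * (X[i] @ beta + beta0) <= 0:
                beta = beta + rho * y[i] * X[i]   # step opposite the gradient of phi
                beta0 = beta0 + rho * y[i]
                misclassified += 1
        if misclassified == 0:                    # converged: no misclassified points
            break
    return beta, beta0
```

On linearly separable data the loop stops once every point satisfies <math>y_i(\underline{\beta}^T\underline{x_i}+\beta_0)>0</math>.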
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, the algorithm converges quickly; if the gap is small, it converges slowly. This problem can be alleviated by using the basis expansion technique. To be specific, we try to find a hyperplane not in the original space, but in an enlarged space obtained by applying some basis functions.<br />
#If the classes are separable, there exist infinitely many solutions to the Perceptron problem, all of which are separating hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if the value is too large, the algorithm may "skip over" the minimum it is trying to find and oscillate forever between points on either side of it.<br />
#A perfect separation is not always achievable, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Imagine standing on a mountain peak and wanting to get down to the plain as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and you start in the middle, you may end up at the saddle point, where the gradient vanishes, and get stuck there, just as gradient descent can get stuck in a local minimum.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we dropped the summation over <math>\,i</math> (all data points). This is a variant of the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, in which we approximate the true gradient by evaluating it on a single training example; <math>\,{\beta}</math> is then improved by the computation on only one sample. When the data set is large, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent, we can process the data sample by sample and still get decent results in practice.<br />
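The difference between the two variants can be made concrete (an illustrative Python sketch; here the intercept is assumed to be absorbed into <math>\underline{\beta}</math> by appending a constant 1 to each <math>\underline{x_i}</math>):

```python
import numpy as np

def batch_step(beta, X, y, rho):
    """One batch gradient descent step: sum the gradient of phi over ALL
    misclassified points before updating beta."""
    M = y * (X @ beta) <= 0                      # mask of misclassified points
    grad = -(y[M, None] * X[M]).sum(axis=0)      # d(phi)/d(beta) = -sum_{i in M} y_i x_i
    return beta - rho * grad

def stochastic_step(beta, x_i, y_i, rho):
    """One stochastic step: update beta using a single sample, if misclassified."""
    if y_i * (x_i @ beta) <= 0:
        beta = beta + rho * y_i * x_i
    return beta
```

Batch descent is exact but touches every sample per step; the stochastic version trades gradient accuracy for much cheaper updates.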
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <br />
<br />
<ref> Haykin, Simon (2009). Neural Networks and Learning Machines. Pearson Education, Inc. </ref><br />
A neural network resembles the brain in two respects:<br />
#Knowledge is acquired by the network from its environment through a learning process.<br />
#Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.<br />
<br />
<ref><br />
Theory of the Backpropagation Neural Network, R. Necht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network but it can have many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a <math>\,k</math>-class classification problem, there are usually <math>\,k</math> units in the output layer, each representing the probability of one class, and each <math>\displaystyle y_k</math> is coded as 0 or 1.<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
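For back-propagation later on, a useful fact is that the logistic function has the simple derivative <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>. An illustrative Python sketch:

```python
import numpy as np

def sigmoid(a):
    """Logistic (inverse-logit) activation: a smooth replacement for the sign function."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """Derivative sigma'(a) = sigma(a) * (1 - sigma(a)), used by back-propagation."""
    s = sigmoid(a)
    return s * (1.0 - s)
```

Note how the function saturates: for large positive or negative inputs the output flattens toward 1 or 0 and the derivative shrinks toward zero.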
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates, meaning it has maximum and minimum output values. This property keeps the unit outputs bounded and therefore limits the searching time. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; RBF networks, for example, use non-monotonic activations and are also powerful models. <br />
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output units and thus turn into a classification problem.<br />
<br />
For simplicity, there is only 1 unit at the end and assume for the moment we are doing regression.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math>'s are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First, find the derivative of the model error with respect to the output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: the labels i and j are reversed in this figure.)''<br />
<br />
Notice that the weighted sum on the output of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math> and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the target output and <math>\hat{y_k}</math> is the output actually produced by the network.<br />
#By propagating to the previous layers evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math> where <math>i</math> is associated to the previous layer.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
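The steps above can be sketched for a network with a single hidden layer and one output unit (an illustrative Python sketch with assumed array shapes, not the course code; the factor of 2 from the squared error is absorbed into the learning rate):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y, U, w, rho):
    """One back-propagation step for a one-hidden-layer regression network.
    x: (d,) input, y: scalar target, U: (p, d) hidden weights, w: (p,) output weights."""
    # 1. Forward pass: find the output of all the nodes.
    a = U @ x                 # hidden pre-activations a_j
    z = sigmoid(a)            # hidden outputs z_j
    y_hat = w @ z             # network output (linear output unit)
    # 2. Error signal at the output unit (up to a factor of 2).
    delta_out = y_hat - y
    # 3. Propagate back: delta_j = sigma'(a_j) * sum_i delta_i * u_ij
    delta_hidden = z * (1 - z) * (delta_out * w)
    # 4./5. Gradients d(err)/d(u_jl) = delta_j * z_l, then step down the gradient.
    w_new = w - rho * delta_out * z
    U_new = U - rho * np.outer(delta_hidden, x)
    return U_new, w_new
```

Feeding the training points through this update one at a time, as in step 6, steadily reduces the squared error.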
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method mentioned in class was to randomize the weights before the first step. This is not likely to be near the optimal solution, but it is simple to implement. To be more specific, random values near zero (usually from [-1,1]) are a good choice for the initial weights. In this case, the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output using a linear approximation of <math>\,\sigma(a_i)</math>, which gives the optimal weights in the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the number of iterations increases.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate may lead to very slow convergence (a very long learning phase). The advantage of a small learning rate, however, is that it can guarantee convergence. Thus it is generally better to choose a relatively small learning rate to ensure stability; usually, <math>\,\rho</math> is chosen between 0.01 and 0.7.<br />
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we mainly discuss how to estimate the number of hidden units at the outset. Obviously, it should then be adjusted to be more precise using CV, LOO or other complexity-control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from a highly complicated mixture model, more hidden units are needed. <br />
<br />
Actually, the number of hidden units determines the size of the model, and therefore the total number of weights in the model. Typically, the number of weights should not be larger than the number of training points, say <math>\,N</math>; thus <math>\,N/10</math> is sometimes a good choice. However, in practice, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this neural network, the number of nodes is reduced until we reach a layer whose number of nodes is the desired dimensionality. (Note that in the first few layers the number of nodes need not be strictly decreasing, as long as the network eventually reaches a layer with fewer nodes.) From this middle layer onward,<br />
the previous layers are mirrored, so the output layer has the same number of units as the input layer. Now note that if we feed the network each point and get an output approximately equal to the fed input, this means the input has been reconstructed at the output from the middle-layer units alone. So the output of the middle-layer units can represent the input with fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
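A minimal sketch of this idea, using linear units for simplicity (an illustrative Python sketch; with linear units the bottleneck recovers a PCA-like subspace rather than a nonlinear embedding):

```python
import numpy as np

def train_autoencoder(X, k, rho=0.01, epochs=500, seed=0):
    """Linear bottleneck autoencoder: encode (n, d) data X down to k dimensions
    and decode back, trained by gradient descent on the reconstruction error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(0, 0.1, (k, d))   # first half of the network
    W_dec = rng.normal(0, 0.1, (d, k))   # mirrored second half
    for _ in range(epochs):
        Z = X @ W_enc.T                  # low-dimensional middle-layer output
        X_hat = Z @ W_dec.T              # reconstruction at the output layer
        R = X_hat - X                    # reconstruction residual
        # back-propagate the reconstruction error through both halves
        grad_dec = R.T @ Z / n
        grad_enc = (R @ W_dec).T @ X / n
        W_dec -= rho * grad_dec
        W_enc -= rho * grad_enc
    return W_enc, W_dec
```

The low-dimensional mapping is `X @ W_enc.T` (the middle-layer output), and reconstruction is the second half, `Z @ W_dec.T`.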
<br />
=== Deep Neural Network ===<br />
Back-propagation may not work well in practice when there are too many hidden layers, since the <math>\,\delta</math>'s may become negligible as they are propagated back and the error signal effectively vanishes. This is a numerical problem that makes it difficult to estimate the errors in the early layers, so in practice configuring a<br />
neural network trained with back-propagation involves some subtleties.<br />
Deep neural networks became popular a few years before these notes, following work by Geoffrey Hinton and collaborators on layer-wise training. A deep neural network training algorithm deals with the training of a neural network with a large number of layers.<br />
<br />
The approach of training the deep network is to assume the network has only two layers first and train these two layers. After that we train the next two layers, so on and so forth.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, an approach inspired by statistical physics, where the most stable configuration is the one of lowest energy; or somehow finding the output of the second layer that is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case as an example. A feed-forward neural network was trained using back-propagation to assist with the marketing control of airline seat allocations. The neural approach adapted to the booking rules, and the system was used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When neural networks were first introduced, they were thought to model the human brain, hence the fancy name "Neural Network". But now we know that they are just logistic regression layers stacked on top of each other, and have nothing to do with the real functional principles of the brain.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. Because of such claims, it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, neural networks lack a strong learning theory to back up their "success", so it is hard to apply and tune them wisely. For this reason, they are no longer an active research area in machine learning, although NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model will have very high accuracy on the training set but will show very poor ability to predict outcomes for new instances, especially outside the domain of the training set.<br />
<br />
In a neural network, if the depth is too great, the network will have many degrees of freedom and will learn every characteristic of the training data set. That means it will be very accurate on the training set, but will not be able to generalize the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and want to find the model that fits it best. We can find a polynomial of high degree that passes through almost all the points in the training set. But the training set in fact comes from a line model. Although the complex model has less error on the training set, it diverges from the line in the ranges where we have no training points. Because of that, the high-degree polynomial has very poor predictive performance on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier. The reason is that just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas; but if we are then presented with a new banana of slightly different shape than the existing ones, it cannot be detected. This is the tradeoff: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data, for example by restricting it to a family of polynomials or to a particular neural network architecture. There are other ways as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average number of misclassifications to get the empirical error rate, an estimate of the probability that <br />
<math>h(x_{i}) \neq y_{i}</math>.<br />
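In code, the empirical error rate above is simply the fraction of points the classifier gets wrong (an illustrative Python sketch):

```python
import numpy as np

def empirical_error_rate(h, X, y):
    """L_h = (1/n) * sum_i I(h(x_i) != y_i): the fraction of points h misclassifies."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)
```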
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
There is a downward bias to this estimate, meaning that it tends to be less than the true error rate. <br />
<br />
As we increase our model's complexity from low to high, the training error rate always decreases. When we apply our model to test data, the error rate will decrease up to a point, but then it will increase, since the model has not seen these data before. This can be explained as follows: training error decreases as we fit the model better by increasing its complexity, but as we have seen, an overly complex model does not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is where the error rate on the test data is minimal; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in R</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a more important property for an estimator than just being unbiased: a small mean squared error. In statistics, there are problems for which it may be good to use an estimator with a small bias. In some cases, an estimator with a small bias may have a smaller mean squared error, or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under transformations, while the property of mean-unbiasedness may be lost under nonlinear transformations. For example, using an unbiased estimator with a large mean squared error to estimate the parameter, we risk a big error; in contrast, a biased estimator with a small mean squared error can improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus given the Mean Squared Error (MSE), if we have a low bias, then we will have a high variance and vice versa.<br />
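The decomposition <math>MSE = Variance + Bias^2</math> can be checked by simulation, e.g. for a hypothetical shrinkage estimator of a normal mean (an illustrative Python sketch):

```python
import numpy as np

def bias_variance_mse(estimator, f, n, trials=20000, seed=0):
    """Simulate the bias, variance, and MSE of an estimator of the parameter f,
    computed from n samples drawn from N(f, 1); returns (bias, variance, mse)."""
    rng = np.random.default_rng(seed)
    estimates = np.array([estimator(rng.normal(f, 1.0, n)) for _ in range(trials)])
    bias = estimates.mean() - f        # E(f_hat) - f
    variance = estimates.var()         # E[(f_hat - E(f_hat))^2]
    mse = np.mean((estimates - f) ** 2)
    return bias, variance, mse
```

Shrinking the sample mean toward zero introduces bias but reduces variance, and the decomposition holds exactly for the simulated moments.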
<br />
Test error is a good estimate of the MSE. We want bias and variance to be balanced (neither one too high), even though this means accepting some bias.<br />
<br />
<br />
Referring to Figure 2, overfitting begins at the point where the training error (training sample line) continues to decrease while the test error (test sample line) starts to increase. There are two main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimate<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimate<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
>> x <- rnorm(200,0,1)<br />
>> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
>> xtest <- rnorm(50,1,1)<br />
>> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
>> p1 <- lm(y~x)<br />
>> p2 <- lm(y ~ poly(x,2))<br />
>> pn <- lm(y ~ poly(x,10))<br />
>> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are drawn from a <math>\,N(0,1)</math> distribution, while the test set uses <math>\,N(1,1)</math>. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random noise. Polynomial least squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
>> # calculate the mean squared error of degree 1 poly<br />
>> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
>> [1] 1.576042<br />
>> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 7.727615<br />
: Training and test mean squared errors for the linear fit. Both are quite high - and since the data is non-linear, the different mean of the test data increases the error considerably.<br />
>> # calculate the mean squared error of degree 2 poly<br />
>> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
>> [1] 0.08608467<br />
>> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 0.08407432<br />
: This fit is far better - and there is not much difference between the training and test error, either.<br />
>> # calculate the mean squared error of degree 10 poly<br />
>> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
>> [1] 0.07967558<br />
>> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but only slightly - while the test error has risen sharply. The overfit model is a poor predictor. As the degree rises further, numerical precision becomes an issue - and a good fit is not even consistently produced for the training data.<br />
>> # calculate mse of sin/cos fit<br />
>> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
>> [1] 0.1105446<br />
>> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 1.320404<br />
: Fitting a function of the form sin(x)+cos(x) works fairly well on the training set, but because it is not the true underlying function, it fails on test data that does not lie in the same range.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V: since the model has not seen V, and we know the true labels of all elements in V, we can count how many were misclassified<br />
<br />
CV has different implementations which can reduce the variance of the estimated error rate, sometimes at the cost of a higher computation time.<br />
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It is motivated by the observation that although training error always decreases as model complexity increases, the test error starts to increase from a certain point, which marks overfitting (see [[#prediction-error|figure 2]] above). Since test error is the best estimate of the MSE (mean squared error), the idea is to divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set to decide the parameters and the optimal model, and the test set to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{(x_i, y_i) \in \nu}I(h(x_i) \neq y_i)</math>, where <math>\,I(\cdot)</math> is the indicator function and <math>\,|\nu|</math> is the cardinality of the validation set.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in reality data may be hard to collect, we often suffer from the curse of dimensionality, and a larger data set may be hard to come by. Consequently, we may not be able to afford to sacrifice part of our limited data. In this case we use another method that addresses this problem, K-fold cross-validation: we divide the data set into <math>\,K</math> subsets of roughly equal size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
if <math>\,K=n</math> (leave-one-out): low bias, high variance. Each subset contains a single element, so the model is trained on all but one point and then validated on that point.<br />
<br />
if <math>\,K</math> is small (say <math>\,K=2</math> or <math>\,K=5</math>): higher bias, lower variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th part <math>( \,k \in \{1, \dots, K\} )</math>, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>; the overall estimate is the average<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. Next we use the first three sets as training sets and the last as validation set, then the 1st, 2nd, 4th subsets as training set and the 3rd as validation set, so on and so forth until all the subsets have been the validation set once (all observations are used for both training and validation). After we get <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for degree 1 model. Similarly, we can estimate the error for n degree model and generate a simulating curve. Now we are able to choose the right degree which corresponds to the minimum error. Also, we can use this method to find the optimal unit number of hidden layers of neural networks. We can begin with 1 unit number, then 2, 3 and so on and so forth. Then find the unit number of hidden layers with lowest average error.<br />
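The procedure just described can be sketched in a few lines of Python (numpy; an illustrative translation of the idea, not code from the course, with 4 folds and the same quadratic toy data as the earlier R example):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 4
x = rng.normal(0, 1, n)
y = x**2 - 0.5 * x + rng.normal(0, 0.3, n)       # quadratic data with noise

folds = np.array_split(rng.permutation(n), K)    # K roughly equal subsets

def cv_error(degree):
    """Average validation MSE over the K folds for a polynomial fit."""
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[i] for i in range(K) if i != k])
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[val])
        errors.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(errors))

# The true model is quadratic, so the CV error should drop sharply
# from degree 1 to degree 2 and stay roughly flat afterwards.
errors = {d: cv_error(d) for d in (1, 2, 3, 10)}
```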
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be denoted by <math>\mathbf{y}</math> and the vector of fitted values by <math>\hat{\mathbf{y}}</math>. Then<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
For such a linear smoother, the leave-one-out cross-validation error satisfies the identity<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}]^{2}</math>,<br />
<br />
where <math>\hat f^{-i}</math> denotes the fit obtained with the <math>i</math>th observation removed.<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that it only requires the trace of <math>\mathbf{H}</math>, which is often easier to compute than the individual diagonal elements.<br />
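For linear regression the hat matrix is available in closed form, so both the leave-one-out identity and the GCV approximation can be checked numerically. A small Python sketch (numpy; the data are an arbitrary illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix with intercept
y = 2.0 + 3.0 * X[:, 1] + rng.normal(0, 0.5, n)

# Hat matrix H = X (X^T X)^{-1} X^T and the full-data fit
H = X @ np.linalg.inv(X.T @ X) @ X.T
fitted = H @ y

# Left-hand side: refit with each point removed, predict at the held-out x_i
loo_sq_errors = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    loo_sq_errors[i] = (y[i] - X[i] @ beta_i) ** 2
loo_cv = loo_sq_errors.mean()

# Right-hand side: the shortcut using the diagonal of H
shortcut = np.mean(((y - fitted) / (1 - np.diag(H))) ** 2)

# GCV replaces each H_ii by trace(H)/N (= p/N for linear regression)
gcv = np.mean(((y - fitted) / (1 - np.trace(H) / n)) ** 2)
```

The two sides of the identity agree to machine precision, while GCV gives a close approximation without refitting the model.<br />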
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point of the original training set to train the model, then using the left-out point to estimate the true error. By repeating this process for every data point in the original data set, we obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every possible value of the parameter, we must train the model <math>\,K</math> times. This cost is even more pronounced in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, <math>\,n</math> being the number of data points in the data set.<br />
<br />
Fortunately, when adding data points to the classifier is reversible, computing the difference between two classifiers is cheaper than computing the two classifiers separately. So, if the classifier trained on all the data points is known, we can simply undo the changes contributed by each data point in turn (<math>\,n</math> times) to calculate the leave-one-out cross-validation error rate.<br />
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as a way of obtaining a robust neural network that is insensitive to noise. Since the number of hidden units in a neural network is usually decided by domain knowledge, the network may easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function is approximately linear, and the network collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights close to zero.<br />
<br />
Formally, we penalize large weights by adding a penalty term to the error function. The regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
If <math>\,\lambda</math> is too large, the weights <math>\,w_i</math> and <math>\,u_{jk}</math> will be shrunk too close to zero. We can use cross-validation to choose <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the degree of linearity; as noted above, it can be set by cross-validation. The starting weights are important: weights of exactly zero lead to zero derivatives, so the algorithm never moves, while starting weights that are too large mean starting with a nonlinear model, which often leads to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning (Springer 2009), p. 398</ref><br />
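A toy numpy sketch of the penalized update rule above, for a single linear layer (the data, <math>\,\rho</math>, and <math>\,\lambda</math> values are arbitrary illustrations): gradient descent with the extra <math>\,2\lambda w</math> term shrinks the learned weights toward zero relative to the unpenalized fit.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

def train(lam, rho=0.01, steps=2000):
    """Gradient descent on mean squared error + lam * ||w||^2 (weight decay)."""
    w = rng.normal(0, 0.01, 3)                   # start with small random weights
    for _ in range(steps):
        grad_err = 2 * X.T @ (X @ w - y) / len(y)
        w = w - rho * (grad_err + 2 * lam * w)   # the extra 2*lam*w is the decay term
    return w

w_plain = train(lam=0.0)   # no regularization: close to the true coefficients
w_decay = train(lam=5.0)   # heavy weight decay: noticeably shrunk weights
```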
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with an output layer and a single hidden layer, with weights from the hidden layer to the output layer, and can be trained without back propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensionality as the input data. The radii of the RBFs may differ, and the centers and radii can be determined through clustering or an EM algorithm. When the vector x is given from the input layer, each hidden neuron computes the radial distance from its center point and applies the RBF to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This is a nice result: the solution has a closed form, so we do not have to worry about non-convex optimization in this case.<br />
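The closed-form solution can be sketched in a few lines of Python (numpy; the 1-D target, the centers, and the shared width here are hand-picked purely for illustration, whereas in practice they would come from clustering or EM as noted above):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 80)
y = np.sin(x) + rng.normal(0, 0.05, 80)    # 1-D toy target

mu = np.linspace(-3, 3, 10)                # hand-picked centers
sigma = 0.8                                # shared width

# n-by-m matrix of Gaussian radial basis functions
Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))

# W = (Phi^T Phi)^{-1} Phi^T Y, the closed-form least-squares solution
W = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
y_hat = Phi @ W

train_mse = float(np.mean((y - y_hat) ** 2))
```

With 10 Gaussian bases the fit recovers the smooth target down to roughly the noise level, with no iterative training at all.<br />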
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the (<math>n \times k</math>) matrix of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the (<math>n \times (M+1)</math>) matrix of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the (<math>(M+1) \times k</math>) matrix of weights.<br />
<br />
where the extra basis function <math>\phi_{0}</math> is set to 1.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function, giving the familiar form<br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
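A quick numpy check of this normalization (the centers, width, and weights are arbitrary illustrations): each row of the normalized basis matrix sums to one, so every prediction is a weighted average of the <math>w_{jk}</math> and therefore stays within their range.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 50)
mu = np.array([-1.0, 0.0, 1.0])   # arbitrary centers
sigma = 0.5

Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
Phi_star = Phi / Phi.sum(axis=1, keepdims=True)   # normalized basis functions

row_sums = Phi_star.sum(axis=1)   # each row now sums to 1

w = np.array([0.0, 1.0, 2.0])     # arbitrary weights for a single output
y_hat = Phi_star @ w              # each prediction is a weighted average of w
```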
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit, then, can be thought of as representing a feature; if there are more hidden units than input units, we essentially project to a higher-dimensional space, as in our earlier trick. This does not mean that an RBF network actually does this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Precisely because of this power, however, overfitting becomes a greater concern: we have to control the model's complexity so that it captures the general pattern rather than the particulars of the training set.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can replace the class conditional density in the above conditional probability expression by marginalizing <math>\,x</math> over <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When using an RBF network to do classification, we usually treat it as a regression problem and set a threshold to decide class membership. However, to gain some insight into what the RBF network is doing when it classifies, we often think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as shown in the graph on the right-hand side, that we have three random variables: <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. The generative process is as follows: there are different classes, and each class can trigger a different hidden random variable <math>\displaystyle j</math>. For instance, we can assume that each <math>\displaystyle j</math> corresponds to a Gaussian distribution (any other family would work as well), all of the same form but with different parameters. From each Gaussian triggered by each class, we sample some data points. In the end, we obtain a data set that is not strictly Gaussian, but is a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
We multiply each term of the summation by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})*\frac {Pr(j)}{Pr(j)}}{Pr(x)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, the posterior is just a sum over <math>\displaystyle j</math> of the product of two posteriors.<br />
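The derivation relies on the Markov structure <math>y \to j \to x</math> of the graphical model (so that <math>x</math> and <math>y_k</math> are independent given <math>j</math>). The final identity can be checked numerically on an arbitrary discrete distribution (a Python sketch; all table sizes and numbers are illustrative):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
K, J, D = 2, 3, 4          # classes, hidden components, observed values

# Random conditional tables for the chain  y -> j -> x
p_y = rng.dirichlet(np.ones(K))                   # Pr(y)
p_j_given_y = rng.dirichlet(np.ones(J), size=K)   # Pr(j | y), rows sum to 1
p_x_given_j = rng.dirichlet(np.ones(D), size=J)   # Pr(x | j), rows sum to 1

# Joint Pr(y, j, x) under the Markov structure
joint = p_y[:, None, None] * p_j_given_y[:, :, None] * p_x_given_j[None, :, :]

p_x = joint.sum(axis=(0, 1))                      # Pr(x), shape (D,)
p_j = joint.sum(axis=(0, 2))                      # Pr(j), shape (J,)
p_jx = joint.sum(axis=0)                          # Pr(j, x), shape (J, D)

direct = joint.sum(axis=1) / p_x[None, :]         # Pr(y | x), shape (K, D)

p_j_given_x = p_jx / p_x[None, :]                 # Pr(j | x), shape (J, D)
p_y_given_j = joint.sum(axis=2).T / p_j[:, None]  # Pr(y | j), shape (J, K)
via_sum = p_j_given_x.T @ p_y_given_j             # sum_j Pr(j|x) Pr(y|j), shape (D, K)
```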
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results derived above to our RBF network. In an RBF network, as we can see on the right-hand side, we have a set of inputs, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, the hidden basis functions, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and some outputs, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. We also have weights from the hidden layer to the output layer, and each output is just a linear sum of the <math>\displaystyle \phi</math>'s. <br />
<br />
Now, if we interpret the probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> as <math>\displaystyle \phi_{j}(x)</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> as the weight <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in one dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function: it is as if we place a Gaussian over the data, and for each Gaussian we consider its center <math>\displaystyle \mu</math>. What <math>\displaystyle \phi</math> computes is then the similarity of any data point to that center. <br />
<br />
We can see the graph on the left, which plots the densities of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math>, for instance: as a point moves away from the center <math>\displaystyle \mu_{1}</math>, <math>\displaystyle \phi_{1}</math> drops to nearly zero. Remember that we can usually obtain a non-linear regression or classification of the input space by doing a linear one in some extended space or feature space (more details in the Aside). Here, the <math>\displaystyle \phi</math>'s actually produce that feature space. <br />
<br />
So, one way to look at this is that <math>\displaystyle \phi</math> tells us, given an input, how likely the presence of a particular feature is. Say, for example, we define the features as the centers of these Gaussian distributions. Then the <math>\displaystyle \phi</math> function computes, for a given data point, how strongly this feature appears. If the data point is right at the center, the value of that <math>\displaystyle \phi</math> is one, i.e. the probability is 1. If the point is far from the center, the probability (the <math>\displaystyle \phi</math> value) will be close to zero; that is, the feature is less likely to be present. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given the data. <br />
<br />
Once we have those features, <math>\displaystyle y</math> is a linear combination of the features. Hence any weight <math>\displaystyle w</math>, which is equal to <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> is to appear given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> represents the probability of class membership given a feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view: nothing forces <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)w_{jk}</math> to lie between 0 and 1. So if least squares is used to solve for the weights, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
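This caveat is easy to see numerically. In the sketch below (Python, our own illustration; the toy labels and centers are made up), Gaussian RBFs fit 0/1 labels exactly by least squares, yet at a new input the fitted “probability” turns negative:<br />

```python
import numpy as np

def rbf(x, centers, sigma=1.0):
    """Gaussian RBF features: phi_j(x) = exp(-(x - mu_j)^2 / (2 sigma^2))."""
    return np.exp(-(np.atleast_1d(x)[:, None] - centers[None, :]) ** 2
                  / (2.0 * sigma ** 2))

x = np.array([0.0, 1.0, 2.0])   # training inputs
y = np.array([0.0, 1.0, 0.0])   # 0/1 class labels used as regression targets
centers = x                     # one basis function per training point

Phi = rbf(x, centers)
w = np.linalg.solve(Phi, y)     # least squares is exact here (square, invertible Phi)

print(Phi @ w)                  # reproduces the labels: ~ [0, 1, 0]
print(float(rbf(3.0, centers) @ w))   # a new input: negative (about -0.55)
```

So the fitted values behave like probabilities at the training points but not elsewhere, which is the inconsistency noted above.<br />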
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>, each with d features. These n data points form the columns of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we do not perform least squares on this <math>\displaystyle X</math> matrix; we perform least squares on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point with its center. Hence, the new set of features are actually representing M centers in our data set, and for each data point, its new features check how this point is similar to the first center; how it is similar to the second center; and how it is similar to the <math>\displaystyle M^{th}</math> center. And this checking process will apply to all data points. Therefore, feature space gives another representation of our data set. <br />
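As a sketch of this construction (Python, as an illustration; the data and the M centers below are made up), we can build <math>\displaystyle \Phi^{T}</math> directly and check its dimensions:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, M = 6, 4, 3                       # n points with d original features, M centers
X = rng.standard_normal((d, n))         # columns are data points, as in the notes
centers = rng.standard_normal((d, M))   # M hypothetical centers (e.g. from clustering)
sigma = 1.0

# Phi^T is M x n: entry (j, i) is phi_j(x_i), the similarity of point i to center j
sq_dist = ((X[:, None, :] - centers[:, :, None]) ** 2).sum(axis=0)  # shape (M, n)
PhiT = np.exp(-sq_dist / (2.0 * sigma ** 2))

print(PhiT.shape)   # (3, 6): each point is now described by M features instead of d
```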
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit for the training data, we often want to increase the complexity of our RBF network. By construction, the only way to change the complexity of an RBF network is to add or remove basis functions; a larger number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF network can fit any training set; however, this does not mean the model can generalize well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is estimate the training error. Working through the training error, we will see that it can in fact be decomposed, and one component of it is the Mean Squared Error (MSE). In the later notes, we will find that our final goal is to get a good estimate of the MSE; to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn’t always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left-hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>~N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model only fits the training data, and is useless for prediction when new data points are introduced.<br />
<br />
> n<-20;<br />
> x<-seq(1,10,length=n);<br />
> alpha<-2.5;<br />
> beta<-1.75;<br />
> y<-alpha+beta*x+rnorm(n);<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');<br />
> lines(spline(x, y), col = 2);<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for the given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on the training data, although our real goal is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficiently large number of basis functions.<br />
<br />
<br />
But training error and testing error do not have a simple monotone relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error can increase dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, the model does not represent the features of the true underlying data source, and this is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model typically adapts to the training data, and hence the training error is an overly optimistic estimate of the testing error. An obvious way to estimate the testing error well is to add a penalty term to the training error to compensate. SURE is developed based on this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the total squared loss over the <math>\displaystyle N</math> training samples.<br />
*<math>\displaystyle Err=\sum_{i=1}^M (\hat y_i-y_i)^2 </math> denote the ''test error'', the total squared loss over an independent test sample of size <math>\displaystyle M</math>.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}^N_{i=1}</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>, then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, since <math>\displaystyle y_i</math> has mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E[g'(Z)]</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>: Let <math>\,Z = \epsilon</math>. Then <math>g(Z) = \hat f-f</math>, since <math>\,y = f + \epsilon</math>, and <math>\,f</math> is a constant. So <math>\,\mu = 0</math> and <math>\,\sigma^2</math> is the variance of <math>\,\epsilon</math>.<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model and not a function of the observations <math>\displaystyle y_i</math>, we have <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
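Stein's lemma itself is easy to check by simulation. The sketch below (Python, our own illustration) takes <math>\displaystyle g(z)=z^3</math> with <math>\,Z \sim N(0,1)</math>, for which both sides of the lemma equal <math>\displaystyle E[Z^4]=3\sigma^4=3</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
z = rng.normal(mu, sigma, 1_000_000)

g = z ** 3                  # a smooth choice of g with E|g'(Z)| finite
g_prime = 3 * z ** 2

lhs = float(np.mean(g * (z - mu)))          # E[g(Z)(Z - mu)] = E[Z^4]
rhs = float(sigma ** 2 * np.mean(g_prime))  # sigma^2 E[g'(Z)] = 3 sigma^2 E[Z^2]

print(lhs, rhs)   # both close to 3
```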
<br />
====Two Different Cases====<br />
SURE in RBF,<br />
[http://www.cs.ualberta.ca/~papersdb/uploaded_files/801/paper_automatic-basis-selection-for.pdf Automatic basis selection for RBF networks using Stein’s unbiased risk estimator,Ali Ghodsi Dale Schuurmans]<br />
<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is introduced to the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent, and therefore <math>\displaystyle cov(y_i,\hat f)=0</math>. (Equivalently, think about <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, it has no influence on <math>\hat f</math>, because <math>\hat f</math> is estimated from the training data alone, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>.) In this case, <math>\displaystyle (1)</math> can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Based on the notation defined above, we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross validation, to avoid overfitting or underfitting, a validation data set is kept independent of the estimated model.<br />
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing a model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can instead be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>. Then, by applying Stein's lemma as proved above in <math>\displaystyle (2)</math>, equation <math>\displaystyle (1)</math> becomes:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the number of basis functions should be chosen to minimize the estimated generalization error. For the Radial Basis Function Network, setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math> <math>\displaystyle (4)</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, we can tell that <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Thus, we obtain the further simplification <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=M</math>, where <math>\,M</math> is the number of basis functions, i.e. the number of columns of <math>\displaystyle \Phi</math>, the projection of the input matrix <math>\,X</math> onto the basis set. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then, <math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
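The identity <math>\,Trace(H)=M</math> (or <math>\,M+1</math> with an intercept) can be checked numerically. The sketch below (Python, an illustration) uses a random full-rank matrix as a stand-in for <math>\displaystyle \Phi</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4                              # N training points, M basis functions

Phi = rng.standard_normal((N, M))         # stand-in for a full-rank N x M design matrix
Phi1 = np.hstack([np.ones((N, 1)), Phi])  # the same design with an intercept column

def hat_trace(A):
    """Trace of the hat matrix H = A (A^T A)^{-1} A^T."""
    return float(np.trace(A @ np.linalg.inv(A.T @ A) @ A.T))

print(hat_trace(Phi))    # M = 4
print(hat_trace(Phi1))   # M + 1 = 5
```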
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimum number of basis functions by choosing the model with the smallest estimated MSE over the set of models considered. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\displaystyle M</math>, let <math>\displaystyle err(M)</math> denote the corresponding training error. <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples and the noise variance, <math>\sigma^2</math>, can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Please see figure 27.1 on the right, which shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math> (here err denotes the total squared training error):<br />
<br />
err = sum((output - expected_output) .^ 2); % total squared training error<br />
sure_Err = err - num_data_point * sigma ^ 2 + 2 * sigma ^ 2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated from the training error:<br />
<br />
err = sum((output - expected_output) .^ 2); % total squared training error<br />
sigma2 = err / (num_data_point - 1); % estimated noise variance<br />
sure_Err = err - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) = <math>\sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) = <math>\sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units, including the intercept) = <math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then the term <math>\displaystyle N\sigma^2</math> is a constant and has no impact on the choice of model,<br />
so we can ignore it. We only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and the estimate <math>\,\hat \sigma</math> changes when <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math> as follows.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \hat\sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error will decrease and the MSE will increase when increasing the number of hidden units (i.e. the model is more complex).<br />
<br />
<br />
When the number of hidden units gets larger and larger, the training error decreases until it approaches <math>\displaystyle 0 </math>. If the training error equals <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) the estimate of MSE approaches <math>\displaystyle 0 </math> as well. In fact this does not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the MSE should increase instead of approaching <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
Note that the estimate <math>\displaystyle \hat\sigma^2 </math> in (28.2) is proportional to <math>\displaystyle err </math>, and therefore changes with the number of hidden units. To deal with this problem, we can average the estimates of <math>\displaystyle \sigma^2 </math> over several models: for example, fit models with 1 hidden unit up to 10 hidden units and average the resulting estimates. Since in reality <math>\, \sigma^2</math> is a constant property of the data and does not depend on <math>\,M+1</math>, using this common averaged value for 1 to 10 hidden units has a firm theoretical basis.<br />
<br />
We can also see that unlike the classical Cross Validation (CV) or Leave one out (LOO) techniques, the SURE technique does not need to do the validation to find the optimal model. Hence, SURE technique uses less data than CV or LOO. It is suitable for the case that there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
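As an end-to-end illustration of this procedure, the sketch below (Python; the data, the evenly spaced centers, and the candidate range of M are all made up) fits RBF models of increasing size by least squares and picks the M minimizing the estimated MSE from (28.3):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: a smooth signal plus Gaussian noise (illustrative, not the lecture's data)
N = 60
x = np.linspace(0.0, 10.0, N)
y = np.sin(x) + 0.3 * rng.standard_normal(N)

def train_err(M):
    """Total squared training error of an RBF fit with M evenly spaced centers."""
    centers = np.linspace(0.0, 10.0, M)
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / 2.0)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return float(np.sum((Phi @ w - y) ** 2))

# estimated MSE from (28.3): err * (2M + 1) / (N - 1)
mse_est = {M: train_err(M) * (2 * M + 1) / (N - 1) for M in range(1, 16)}
best = min(mse_est, key=mse_est.get)
print(best)   # the M with the smallest estimated MSE
```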
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering K-means clustering] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units (basis functions <math>\displaystyle \phi_j </math>) is the same as the number of clusters.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, which takes the same form for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are as follows.<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. For each existing center, we reidentify its cluster (every point in this cluster should be closer to this center than to any other).<br />
<br />
2. Compute the mean of each cluster and make it the new center for that cluster.<br />
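The two iterated steps can be written down directly. The sketch below (Python rather than the MATLAB used in the example, purely as an illustration; the blob data are made up) implements them:<br />

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means on an (n, d) array; returns cluster labels and (k, d) centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # step 1: re-assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # step 2: move each center to the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break                                           # converged
        centers = new_centers
    return labels, centers

# two well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])
labels, centers = kmeans(X, 2)
print(np.bincount(labels))   # sizes of the two clusters
```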
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden values)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
>> 30 1<br />
>> size(C) <br />
>> 2 80<br />
>> size(sumD) <br />
>> 2 1<br />
>> c1=sum(IDX==1)<br />
>> 14<br />
>> c2=sum(IDX==2)<br />
>> 16<br />
>> sumD<br />
>> 85.6643<br />
>> 101.0419<br />
>> v1=sumD(1,1)/c1 <br />
>> 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
>> 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set of 30 data points in 80 dimensions (MATLAB's kmeans treats rows as observations), and then apply the "kmeans" function to separate X into 2 clusters. IDX is a vector containing 1s and 2s which indicates the 2 clusters, and its size is 30*1. <math>\displaystyle C </math> holds the center (mean) of each cluster, with size 2*80; sumD is the sum of squared distances between the data points and the centers of their clusters. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the numbers of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the variance estimate for the first cluster <math>\displaystyle (v1=\hat\sigma_1^2)</math>; <math>\displaystyle v2 </math> is the variance estimate for the second cluster <math>\displaystyle (v2=\hat\sigma_2^2)</math>. Now we can compute <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> by the following equations. Finally, we can compute the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
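Putting these equations together in code (Python; the data and the centers standing in for the k-means output are made up for illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D regression data (stand-ins for the lecture's training set)
N = 40
x = np.linspace(0.0, 2.0 * np.pi, N)
y = np.sin(x) + 0.1 * rng.standard_normal(N)

# pretend k-means gave us these centers and a common width sigma
centers = np.array([1.0, 2.5, 4.0, 5.5])
sigma = 1.0

Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * sigma ** 2))  # N x M
W = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # W = (Phi^T Phi)^{-1} Phi^T Y
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T  # hat matrix
Y_hat = Phi @ W                               # fitted values; equals H @ y

print(np.allclose(Y_hat, H @ y))   # True
```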
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is the EM algorithm for the Gaussian mixture model. Generally speaking, the Gaussian mixture model is referred to as soft clustering while <math>K</math>-means is hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
E-step: each point is assigned a weight (responsibility) for each cluster, based on its likelihood under the corresponding Gaussian. Unlike <math>K</math>-means, which would assign a point 1 for the cluster whose center is closest and 0 for all the others, these weights are soft values between 0 and 1. <br />
<br />
M-step: compute the weighted means and covariances and make them the new means and covariances for every cluster.<br />
<br />
>> [P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier: they construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized into what is known as the support vector machine. It produces nonlinear boundaries by constructing a linear boundary in a high-dimensional, transformed version of the feature space.<br />
<br />
The original basis for the SVM was published in the 1960s by [http://en.wikipedia.org/wiki/Vapnik Vapnik], Chervonenkis and colleagues; however, the ideas did not gain much attention until strong results were shown in the early 1990s.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a maximum-margin hyperplane, or set of hyperplanes, in a high- or infinite-dimensional space. The set of points near the class boundaries, the support vectors, define the model, which can be used for classification, regression or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separated by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
The figure shows data points from two classes in <math>\displaystyle \mathbb{R}^{2} </math> that can be separated by a linear boundary. If a dataset is indeed linearly separable, then there exist infinitely many possible separating hyperplanes for the training data, including the two black lines in the figure. However, which solution will be the best when new data are introduced? <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem,obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle \{X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0\}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers such as (28.4) that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,\dots,N</math><br />
<br />
Note: <math>\displaystyle d_i</math> is a signed quantity: it is positive for data on the <math>\displaystyle +1 </math> side and negative for data on the <math>\displaystyle -1 </math> side, so <math>\displaystyle y_id_i</math> is positive whenever a point is correctly classified.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>; we can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers in the Linearly separable case====<br />
Choose the line farthest from both classes, i.e. the line with the maximum distance from the closest point (maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math>, <math>\displaystyle i=1,2,\dots,N </math> <br />
where <math>\displaystyle y_i </math> is the label and <math>\displaystyle d_i </math> is the distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since we are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Let us rewrite <math>\displaystyle Margin=min\{y_id_i\}</math> by using the following properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Consider two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence, <math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and <math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the unit vector normal to the hyperplane.<br /><br />
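This orthogonality is easy to check numerically. The following Python sketch (the hyperplane and points are made up for illustration; the lecture's own code is in Matlab) verifies property 1 for a concrete line in <math>\Re^2</math>:

```python
# Hyperplane in R^2: 3*x + 4*y - 10 = 0, i.e. beta = (3, 4), beta_0 = -10
beta = (3.0, 4.0)
beta0 = -10.0

# Two points lying on the hyperplane (each satisfies beta^T x + beta_0 = 0)
x1 = (2.0, 1.0)    # 3*2  + 4*1 - 10 = 0
x2 = (-2.0, 4.0)   # 3*(-2) + 4*4 - 10 = 0

diff = (x1[0] - x2[0], x1[1] - x2[1])
dot = beta[0] * diff[0] + beta[1] * diff[1]
print(dot)        # 0.0 -- beta is orthogonal to (x1 - x2)

# Unit normal beta* = beta / ||beta||
norm = (beta[0] ** 2 + beta[1] ** 2) ** 0.5
beta_star = (beta[0] / norm, beta[1] / norm)
print(beta_star)  # (0.6, 0.8)
```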
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
That is, for any point <math>\displaystyle x_0</math> on the hyperplane, <math>\displaystyle \beta^{T}x_0</math> equals the negative of the intercept <math>\displaystyle \beta_0</math>. <br/><br />
<br />
<br />
3. The signed distance from any point <math>\displaystyle x_i </math> to the hyperplane is the projection of <math>\displaystyle (x_i-x_0)</math> onto the unit normal, where <math>\displaystyle x_0 </math> is any point on the hyperplane. Since <math>\displaystyle \beta </math> need not have unit length, we divide by its norm:<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
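As a quick numerical check of this distance formula, the following Python sketch (with a made-up hyperplane and point, not lecture data) computes a signed distance and confirms that stepping back along the unit normal lands on the hyperplane:

```python
# Signed distance d_i = (beta^T x_i + beta_0) / ||beta|| for the hyperplane
# 3*x + 4*y - 10 = 0 (a made-up example)
beta = (3.0, 4.0)
beta0 = -10.0
norm = (beta[0] ** 2 + beta[1] ** 2) ** 0.5   # ||beta|| = 5

xi = (6.0, 3.0)
d = (beta[0] * xi[0] + beta[1] * xi[1] + beta0) / norm
print(d)  # 4.0 -- the point is 4 units from the line, on the positive side

# Stepping d units back along the unit normal lands exactly on the hyperplane
foot = (xi[0] - d * beta[0] / norm, xi[1] - d * beta[1] / norm)
residual = beta[0] * foot[0] + beta[1] * foot[1] + beta0
print(abs(residual) < 1e-12)  # True
```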
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
We had <math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math>, and since we now know how to compute <math>\displaystyle d_i \Rightarrow</math> <br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane. Then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
Only the direction of <math>\displaystyle \beta </math> matters, and rescaling to <math>\displaystyle \frac{\beta}{c} </math> does not change that direction, so the hyperplane itself is unchanged.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math><br />
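The claim that rescaling <math>\,(\beta, \beta_0)</math> leaves the classifier unchanged can be checked directly; here is an illustrative Python sketch with made-up numbers:

```python
# Rescaling (beta, beta_0) by any c > 0 leaves sign(beta^T x + beta_0),
# and hence the classification of every point, unchanged.
beta, beta0 = (3.0, 4.0), -10.0
c = 7.5
beta_c = (beta[0] / c, beta[1] / c)
beta0_c = beta0 / c

def sign(v):
    return 1 if v > 0 else -1

points = [(6.0, 3.0), (0.0, 0.0), (-1.0, 2.0), (2.0, 1.5)]
signs_orig = [sign(beta[0] * a + beta[1] * b + beta0) for (a, b) in points]
signs_scaled = [sign(beta_c[0] * a + beta_c[1] * b + beta0_c) for (a, b) in points]
print(signs_orig)                  # [1, -1, -1, 1]
print(signs_orig == signs_scaled)  # True
```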
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie, T., Tibshirani, R., Friedman, J. (2008). The Elements of Statistical Learning: 129-130<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
Here we derive the Support Vector Machine for the case where the two classes are separable in the given feature space. The margin can be written as <math>\,min\{y_id_i\}</math>, the smallest signed distance of any training point from the hyperplane, where <math>\,d_i</math> is the distance and <math>\,y_i</math> supplies the sign.<br />
===Margin Maximizing Problem for the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, but <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane.<br />
<br />
This implies <math>\,\exists C>0</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> determine the hyperplane only through their direction, and dividing through by a constant does not change that direction. Thus, by assuming scaled values for <math>\,\beta, \beta_0</math> we eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>; that is, the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math>.<br />
<br />
Now, since the margin equals <math>\,\frac{1}{|\beta|}</math>, in order to maximize the margin we simply need to maximize <math>\,\frac{1}{|\beta|}</math>. <br />
<br />
In other words, our optimization problem is to minimize <math>\,|\beta|</math>, under the constraint that <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. There are many possible choices of norm, in general the [http://en.wikipedia.org/wiki/P-norm#p-norm p-norm]. The 1-norm of a vector is the sum of the absolute values of its elements (also known as the taxicab or Manhattan norm); it can yield sparser solutions, but it has a discontinuity in its derivative. The 2-norm, or Euclidean norm (the intuitive measure of the length of a vector), is easier to work with: <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>, where the constant 1/2 has been added for simplification; minimizing <math>\,\frac{1}{2}\beta^T\beta</math> is equivalent to minimizing <math>\,\|\beta\|_2</math>.<br />
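To make the two norms concrete, a small Python illustration (with a made-up <math>\,\beta</math>):

```python
beta = [3.0, -4.0]

# 1-norm (taxicab / Manhattan): sum of absolute values
norm1 = sum(abs(b) for b in beta)
print(norm1)      # 7.0

# 2-norm (Euclidean): sqrt(beta^T beta)
norm2 = sum(b * b for b in beta) ** 0.5
print(norm2)      # 5.0

# The objective actually optimized: (1/2) * beta^T beta
objective = 0.5 * sum(b * b for b in beta)
print(objective)  # 12.5
```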
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the constraints are satisfied while finding an optimal solution (the optimal saddle point of the Lagrangian for this classic quadratic optimization). The problem will be solved in dual space by introducing the <math>\,\alpha_i</math> as dual variables; this is in contrast to solving the problem in primal space as a function of the betas. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left(y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right)-1\right)}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is the dual representation of the maximum margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0, \,\forall\, i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
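In tiny cases the dual can be solved by hand. For two 1-D points <math>\,x_1=-1</math> (with <math>\,y_1=-1</math>) and <math>\,x_2=+1</math> (with <math>\,y_2=+1</math>), the constraint <math>\,\sum\alpha_iy_i=0</math> forces <math>\,\alpha_1=\alpha_2=a</math>, so <math>\,L(a)=2a-2a^2</math>, which peaks at <math>\,a=1/2</math>. The following Python sketch (a hypothetical toy example, not lecture data) confirms this by grid search and then recovers <math>\,\beta</math> and <math>\,\beta_0</math>:

```python
# Toy dual problem: x1 = -1 with y1 = -1, and x2 = +1 with y2 = +1.
# The constraint sum(alpha_i * y_i) = 0 forces alpha1 = alpha2 = a, so
# L(a) = sum(alpha) - 1/2 * sum_ij alpha_i alpha_j y_i y_j x_i x_j = 2a - 2a^2.
x = [-1.0, 1.0]
y = [-1.0, 1.0]

def L(a):
    quad = sum(a * a * y[i] * y[j] * x[i] * x[j]
               for i in range(2) for j in range(2))
    return 2 * a - 0.5 * quad

# Grid search over a >= 0
best_a = max((i / 100 for i in range(201)), key=L)
print(best_a)   # 0.5

# Recover beta = sum_i alpha_i y_i x_i
beta = sum(best_a * y[i] * x[i] for i in range(2))
print(beta)     # 1.0

# Recover beta_0 from y_i(beta*x_i + beta_0) = 1 at a support vector;
# since y_i is +1 or -1, beta_0 = y_i - beta*x_i.
beta0 = y[1] - beta * x[1]
print(beta0)    # 0.0
```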
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the equation that we need to minimize, as well as two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are looking for the optimal <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> has entries <math>\,S_{ij} = y_iy_jx_i^Tx_j</math>; if <math>\,Z</math> is the matrix whose <math>\,i</math>th column is <math>\,y_ix_i</math>, then <math>\,S = Z^TZ</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's built in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use <code>quadprog</code> to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's <code>quadprog</code> function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. One detail needs care: <code>quadprog</code> ''minimizes'' its objective, while we want to ''maximize'' <math>\,L(\alpha)</math>. Maximizing <math>\,\underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math> is the same as minimizing <math>\,\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{\alpha}^T\underline{1}</math>, so we pass <math>\,H = S</math> and <math>\,f = -\underline{1}</math>.<br />
<br />
We'll use a simple one-dimensional data set: 100 points centred near -1 (labelled -1) and 100 points centred near +1 (labelled +1), each with a little Gaussian noise. (Note: you could put the values straight into the <code>quadprog</code> call; they are separated for clarity.)<br />
<br />
 x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]'; % x is 1-by-200
 y = [-ones(100,1); ones(100,1)];                        % class labels
 z = x' .* y;                                            % z_i = y_i * x_i
 S = z * z';                                             % S_ij = y_i y_j x_i x_j
 f = -ones(200,1);          % quadprog minimizes, so negate the linear term
 Aeq = y';                  % equality constraint: sum(alpha_i * y_i) = 0
 beq = 0;
 lb = zeros(200,1);         % alpha_i >= 0
 ub = [];                   % there is no upper bound
 alpha = quadprog(S,f,[],[],Aeq,beq,lb,ub);
<br />
This gives us the optimal <math>\,\alpha</math>. (Some of the returned <math>\,\alpha_i</math> may be tiny negative numbers; this is numerical tolerance in <code>quadprog</code>, and values within tolerance of zero can be treated as zero.)<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{x}</math>. If <math>\,f</math> and <math>\,g</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial \hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum{\alpha_ig_i'(\hat{x})}=0</math> (Stationarity)<br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> cannot be an optimal solution.<br />
<br />
These are all fairly direct except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Intuitively, the support vectors are the training points that actually determine the optimal separating hyperplane we are looking for. They are also the most difficult points to classify, and the most informative ones for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br\>In order for this condition to be satisfied either <br/><math>\,\alpha_i= 0</math> or <br/><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
In the canonical representation, every point <math>\,x_i</math> satisfies <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1</math>; that is, each point lies either on the margin or strictly outside it.<br />
<br />
'''Case 1: a point with <math>\displaystyle y_i(\beta^Tx_i+\beta_0) > 1</math> (strictly outside the margin)'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point with <math>\displaystyle y_i(\beta^Tx_i+\beta_0) = 1</math> (on the margin)'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
<br\>If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
Points on the margin, with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
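In code, identifying the support vectors amounts to finding the indices with (numerically) positive <math>\,\alpha_i</math>; a Python sketch with a made-up <math>\,\alpha</math> vector:

```python
# Identify support vectors: indices whose alpha_i is positive (up to tolerance).
alpha = [0.0, 0.7, 1e-12, 0.3, 0.0]   # hypothetical output of the QP solver
tol = 1e-8                            # treat solver noise below tol as zero

support = [i for i, a in enumerate(alpha) if a > tol]
print(support)  # [1, 3] -- only these points determine the hyperplane
```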
<br />
===Using support vectors===<br />
<br />
Support vectors are important because they make the support vector machine insensitive to all the other points. If <math>\,\alpha_i = 0</math>, the corresponding point contributes nothing to the solution; only points on the margin -- the support vectors -- contribute. Hence the model given by SVM is entirely defined by the set of support vectors, a subset of the entire training set. This is interesting because in the neural network methods that preceded it (and in classical statistical learning generally), the configuration of the model needed to be specified in advance. Here we have a data-driven, 'nonparametric' model: the training set and the algorithm together determine the support vectors.<br />
<br />
References:<br />
Wang, L. (2005). Support Vector Machines: Theory and Applications. Springer, p. 3<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
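Steps 2 and 3 of the algorithm can be sketched as follows, assuming the quadratic programming step has already produced <math>\,\alpha</math> (the points, labels, and <math>\,\alpha</math> below are made up for illustration, written in Python rather than the lecture's Matlab):

```python
# Hypothetical 2-D training set with alpha already obtained from the QP step.
X = [(1.0, 0.0), (-1.0, 0.0), (2.0, 2.0), (-2.0, -2.0)]
y = [1.0, -1.0, 1.0, -1.0]
alpha = [0.5, 0.5, 0.0, 0.0]   # only the first two points are support vectors

# Step 2: beta = sum_i alpha_i y_i x_i
beta = [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X))) for d in range(2)]
print(beta)   # [1.0, 0.0]

# Step 3: pick a support vector (alpha_i > 0), solve y_i(beta^T x_i + beta_0) = 1;
# since y_i is +1 or -1, this gives beta_0 = y_i - beta^T x_i.
i = alpha.index(max(alpha))
beta0 = y[i] - sum(beta[d] * X[i][d] for d in range(2))
print(beta0)  # 0.0

# Classify a new point by sign(beta^T x + beta_0)
xnew = (3.0, -1.0)
score = sum(beta[d] * xnew[d] for d in range(2)) + beta0
pred = 1 if score > 0 else -1
print(pred)   # 1
```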
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab's built-in SVM routines (found in the Bioinformatics Toolbox) to do classification through support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]<br />
=='''Non-linear hypersurfaces and Non-Separable classes - November 20, 2009'''==<br />
==='''Kernel Trick'''===<br />
We talked about the curse of dimensionality at the beginning of this course; now we turn to the power of high dimensions in order to find a separating hyperplane between two classes of data points. To understand this, imagine a two-dimensional prison constraining a two-dimensional person. If we magically give the person a third dimension, he can escape from the prison; in other words, the prison and the person are now linearly separable with respect to the third dimension. The intuition behind the "kernel trick" is to map the data to a higher dimension in which they are linearly separable by a hyperplane.<br />
<br />
We have seen SVM as a linear classification problem that finds the max-margin hyperplane in the given input space. However, many real-world problems require a more complex decision boundary. The following simple method was devised in order to solve the same linear classification problem in a (usually higher-dimensional) 'feature space', in which the max-margin hyperplane is better suited to the data.<br />
<br />
Let <math>\,\phi</math> be a mapping,<br />
<br />
<math>\phi:\Re^d \rightarrow \Re^D </math><br /><br /><br />
<br />
We wish to find a <math>\,\phi</math> such that our data will be suited for separation by a hyperplane. Given this function, we are led to solving the previous constrained quadratic optimization on the transformed data set,<br /><br /><br />
<br />
<math>\min_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br /><br /><br />
<br />
The solution to this optimization problem is now well known; however, a workable <math>\,\phi</math> must still be determined. Possibly the largest drawback of this method is that we must compute the inner product of two vectors in the high-dimensional feature space. As the dimension of the feature space increases, these inner products become computationally intensive or impossible to evaluate directly.<br />
<br />
However, we have a very useful result that says that there exists a class of functions, <math>\,\Phi</math>, which satisfy the above requirements and that for any function <math>\,\phi \in \Phi</math>,<br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j) </math><br /><br /><br />
<br />
Where K is a kernel function in the input space satisfying [http://en.wikipedia.org/wiki/Mercer%27s_condition Mercer's condition] (which guarantees that it indeed corresponds to some mapping function <math>\,\phi</math>). As a result, if the objective function depends on the data only through inner products, we can use the kernel function to compute implicitly in the feature space without ever storing the high-dimensional vectors. Not only does this solve the computational problem, but it also no longer requires us to explicitly determine a specific mapping function in order to use this method. In fact, it is now possible to use an infinite-dimensional feature space in SVM without even knowing the function <math>\,\phi</math>.<br />
<br />
==='''Mercer's Theorem in detail'''===<br />
Let <math>\,\phi</math> be a mapping to a high dimensional [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] <math>\,H</math><br /><br />
<br />
<br />
<math>\phi:x \in \Re^d \rightarrow H </math><br /><br /><br />
<br />
The transformed coordinates can be defined as,<br /><br />
<br />
<math>\phi_1(x)\dots\phi_d(x)\dots </math><br /><br /><br />
<br />
By Hilbert - Schmidt theory we can represent an inner product in Hilbert space as,<br /><br /><br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = \sum_{r=1}^{\infty}a_r\phi_r(x_i)\phi_r(x_j) = K(x_i,x_j), \ a_r \ge 0 </math><br /><br /><br />
where K is symmetric. Mercer's theorem gives necessary and sufficient conditions on K for such a representation to exist.<br><br><br />
<br />
'''Mercer's Theorem'''<br />
<br />
Let C be a compact subset of <math>\Re^d</math> and K a symmetric function in <math> L^2(C \times C) </math>; if<br /><br /><br />
<br />
<math>\, \int_C\int_C K(u,v)g(u)g(v)dudv \ge 0, \ \forall g \in L^2(C)</math> <br /><br /><br />
<br />
then,<br /><br /><br />
<br />
<math>\sum_{r=1}^{\infty}a_r\phi_r(u)\phi_r(v)</math> converges absolutely and uniformly to <math>\,K(u,v)</math><br />
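A discrete analogue of Mercer's condition is that the kernel (Gram) matrix <math>\,K</math> with <math>\,K_{ij}=K(x_i,x_j)</math> satisfies <math>\,g^TKg \ge 0</math> for every vector <math>\,g</math>. A quick Python sanity check for the Gaussian kernel on made-up points:

```python
import math
import random

# Gaussian kernel with sigma = 1 (a known-valid Mercer kernel)
def k(u, v):
    return math.exp(-((u - v) ** 2) / 2.0)

random.seed(0)
pts = [random.uniform(-3, 3) for _ in range(8)]
K = [[k(u, v) for v in pts] for u in pts]

# Check g^T K g >= 0 for many random g -- the discrete form of the condition
ok = True
for _ in range(100):
    g = [random.uniform(-1, 1) for _ in pts]
    quad = sum(g[i] * K[i][j] * g[j] for i in range(8) for j in range(8))
    if quad < -1e-10:
        ok = False
print(ok)  # True
```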
<br />
References:<br />
Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons, p. 423<br />
<br />
==='''Kernel Functions'''===<br />
There are various kernel functions, for example:<br />
<br />
* Linear kernel: <math>\,k(x,y)=x \cdot y</math><br />
* Polynomial kernel: <math>\,k(x,y)=(x \cdot y)^d</math><br />
* Gaussian kernel: <math>\,k(x,y)=e^{-\frac{\|x-y\|^2}{2\sigma^2}}</math><br />
<br />
If <math>\,X</math> is a <math>\,d \times n</math> matrix in the original space, and <math>\,\phi(X)</math> is a <math>\,D \times n</math> matrix in the [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] (good explanation video: [http://www.youtube.com/watch?v=V2pBdH7YzX0 part 1] [http://www.youtube.com/watch?v=YRY5xlk3TC0 part 2]), then <math>\,\phi^T(X) \cdot \phi(X)</math> is an <math>\,n \times n</math> matrix. <br />
The inner product can also be interpreted as a correlation, measuring the similarity between data points. This gives us some insight into how to choose the kernel: the choice depends on prior knowledge of the problem and on how we believe the similarity of our data should be measured. In practice, the Gaussian (RBF) kernel usually works best. Besides the common kernel functions mentioned above, many novel kernels have been suggested for particular problem domains such as text classification and gene classification.<br />
<br />
These kernel functions can be applied to many algorithms to derive the "kernel version". For example, kernel PCA, kernel LDA, etc..<br />
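For the degree-2 polynomial kernel in two input dimensions, the feature map can be written out explicitly as <math>\,\phi(x)=(x_1^2, \sqrt{2}x_1x_2, x_2^2)</math>, and the identity <math>\,\phi(x)^T\phi(z)=(x \cdot z)^2</math> can be verified directly. A Python sketch with made-up vectors:

```python
import math

# Explicit feature map for the degree-2 polynomial kernel in 2 dimensions:
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so that phi(x).phi(z) = (x.z)^2
def phi(x):
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def poly_kernel(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
lhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # inner product in feature space
rhs = poly_kernel(x, z)                           # kernel in input space
print(rhs)                    # 121.0
print(abs(lhs - rhs) < 1e-9)  # True
```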
<br />
==='''SVM: non-separable case'''===<br />
We have seen how SVMs are able to find an optimal separating hyperplane between two separable classes of data, in which case the margin contains no data points. In the real world, however, data from different classes are usually mixed together at the boundary, and it is hard to find a perfect boundary that separates them completely. To address this problem, we relax the classification rule to allow some data points to cross the margin. Mathematically, the problem becomes<br />
:<math>\min_{\beta, \beta_0} \frac{1}{2}|\beta|^2</math><br />
:<math>\,y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math><br />
:<math>\xi_i \geq 0</math><br />
<br />
Now each data point is allowed some error <math>\,\xi_i</math>. However, we only want points to cross the boundary when they must, and with the minimum sacrifice; thus, a penalty term is added to the objective function to limit the total amount by which points violate the margin. The optimization problem becomes:<br />
<br />
:<math>\min_{\beta,\beta_0,\xi} \frac{1}{2}|\beta|^2+\gamma\sum_{i=1}^n{\xi_i}</math><br />
:<math>\,s.t.</math> <math>y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math> <br />
:<math>\xi_i \geq 0</math><br />
<br />
[[File:non-separable.JPG|350px|thumb|right|Figure non-separable case]]<br />
<br />
<br />Note that <math>\,\xi_i</math> is not necessarily smaller than one: a point can not only enter the margin but can also cross the separating hyperplane, which corresponds to <math>\,\xi_i > 1</math>.<br />
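The slack of each point is <math>\,\xi_i=max(0,\ 1-y_i(\beta^Tx_i+\beta_0))</math>: zero outside the margin, between 0 and 1 inside the margin but still correctly classified, and greater than 1 on the wrong side of the hyperplane. A 1-D Python sketch (the classifier and points are made up):

```python
# 1-D classifier: beta = 1, beta_0 = 0, so the decision boundary is x = 0
beta, beta0 = 1.0, 0.0

def slack(x, y):
    # xi = max(0, 1 - y*(beta*x + beta0))
    return max(0.0, 1.0 - y * (beta * x + beta0))

print(slack(2.0, 1.0))    # 0.0 : outside the margin, no slack needed
print(slack(0.5, 1.0))    # 0.5 : inside the margin, still correctly classified
print(slack(-0.5, 1.0))   # 1.5 : xi > 1, the point crossed the hyperplane
```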
<br />
References:<br />
<br />
Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209: 415-446<br />
<br />
==Support Vector Machine algorithm for non-separable cases - November 23, 2009==<br />
<br />
With the problem formulation above, we can form the Lagrangian, apply the KKT conditions, and arrive at a new function to optimize. As we will see, the function that we optimize in the SVM algorithm for non-separable data sets is the same as in the separable case, with slightly different constraints.<br />
<br />
===Forming the Lagrangian===<br />
<br />
:<math>L: \frac{1}{2} |\beta|^2 + \gamma \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]-\sum_{i=1}^n \lambda_i \xi_i</math><br />
:<math>\alpha_i \geq 0, \lambda_i \geq 0</math><br />
<br />
===Applying KKT conditions===<br />
# <math>\frac{\partial L}{\partial \beta}=\beta - \sum_{i=1}^n \alpha_i y_i x_i = 0 \Rightarrow \beta=\sum_{i=1}^n\alpha_i y_i x_i</math> <br\><math>\frac{\partial L}{\partial \beta_0}=-\sum_{i=1}^n \alpha_i y_i =0 \Rightarrow \sum_{i=1}^n \alpha_i y_i =0</math> since the sign does not make a difference<br />
#<math>\frac{\partial L}{\partial \xi_i}=\gamma - \alpha_i - \lambda_i = 0 \Rightarrow \gamma = \alpha_i+\lambda_i</math><br />
#<math>\,\alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]=0</math> and <math>\,\lambda_i \xi_i=0</math><br />
<br />
===Putting it all together===<br />
<br />
With our KKT conditions and the Lagrangian equation, we can now use quadratic programming to find <math>\,\alpha</math>.<br />
<br />
In matrix form, we want to solve the following optimization (maximizing over <math>\,\underline{\alpha}</math>):<br />
:<math>L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:<math>\,s.t.</math> <math>\underline{0} \leq \underline{\alpha} \leq \gamma</math>, <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Solving this gives us <math>\,\underline{\alpha}</math>, which we can use to find <math>\,\underline{\beta}</math> as before:<br />
:<math>\,\underline{\beta} = \sum{\alpha_i y_i \underline{x_i}}</math><br />
<br />
However, we cannot find <math>\,\beta_0</math> in the same way as before, even if we choose a point with <math>\,\alpha_i > 0</math>, because we do not know the value of <math>\,\xi_i</math> in the equation<br />
:<math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 + \xi_i = 0</math><br />
<br />
From our discussion on the KKT conditions, we know that <math>\,\lambda_i \xi_i = 0</math> and <math>\,\gamma = \alpha_i + \lambda_i</math>.<br />
<br />
So, if <math>\,\alpha_i < \gamma</math> then <math>\,\lambda_i > 0</math> and consequently <math>\,\xi_i = 0</math>.<br />
<br />
Therefore, we can solve for <math>\,\beta_0</math> if we choose a point where:<br />
:<math>\,0 < \alpha_i < \gamma</math><br />
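In code, this selection amounts to picking an index with <math>\,\alpha_i</math> strictly between the bounds, up to numerical tolerance; a small Python sketch with a made-up <math>\,\alpha</math>:

```python
# Pick a point with 0 < alpha_i < gamma in order to solve for beta_0.
gamma = 1.0
alpha = [0.0, 1.0, 0.4, 1.0, 0.0]   # hypothetical QP output; 1.0 means "at the bound"
tol = 1e-8

candidates = [i for i, a in enumerate(alpha) if tol < a < gamma - tol]
print(candidates)  # [2] -- only this point has xi = 0, so beta_0 can be solved exactly
```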
<br />
====The SVM algorithm for non-separable data sets====<br />
<br />
The algorithm, then, for non-separable data sets is:<br />
<br />
# Use <code>quadprog</code> (or another quadratic programming technique) to solve the above optimization and find <math>\,\alpha</math><br />
# Find <math>\,\underline{\beta}</math> by solving <math>\,\underline{\beta} = \sum{\alpha_i y_i x_i}</math><br />
# Find <math>\,\beta_0</math> by choosing a point where <math>\,0 < \alpha_i < \gamma</math> and then solving <math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 = 0</math></div>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classfication-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data-mining, bioinformatics, and machine learning, classification has becomes a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to learn from data. Potential application areas include handwritten post codes recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
'''Definition''': The problem of Prediction a discrete random variable <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math> is called Classification.<br />
<br />
In classification,, we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors and <math> \mathcal{Y} </math>, a finite set of labels, We try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data which are identical independent distributions, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X{_\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> be the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier(h) is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:The '''empirical error rate (training error rate)''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, \hat{L}(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is the indicator <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
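The empirical error rate is straightforward to compute from predictions and true labels. Below is a minimal Python sketch (the toy classifier <code>h</code> and the data are hypothetical, purely for illustration):<br />

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training points that the classifier h mislabels."""
    n = len(X)
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / n

# Hypothetical toy rule: classify as 1 when the input is non-negative.
h = lambda x: 1 if x >= 0 else 0

X = [-2, -1, 0.5, 3, -0.5]
Y = [0, 1, 1, 1, 0]
print(empirical_error_rate(h, X, Y))  # one mistake out of five -> 0.2
```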
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(x)=P(Y=1|X=x)</math>. By ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
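The two-class rule can be sketched directly from <math>\,r(x)</math>. The following Python sketch uses hypothetical class-conditional densities (two unit-variance Gaussian shapes) and equal priors, just to make the rule concrete:<br />

```python
from math import exp, pi, sqrt

def r(x, f1, f0, pi1, pi0):
    """Posterior P(Y=1 | X=x) via Bayes formula."""
    num = f1(x) * pi1
    return num / (num + f0(x) * pi0)

def bayes_rule(x, f1, f0, pi1=0.5, pi0=0.5):
    """Classify as 1 exactly when r(x) > 1/2."""
    return 1 if r(x, f1, f0, pi1, pi0) > 0.5 else 0

# Hypothetical class-conditional densities: unit-variance Gaussians at +-2.
f1 = lambda x: exp(-(x - 2) ** 2 / 2) / sqrt(2 * pi)
f0 = lambda x: exp(-(x + 2) ** 2 / 2) / sqrt(2 * pi)

print(bayes_rule(1.5, f1, f0))   # closer to mean 2  -> 1
print(bayes_rule(-0.3, f1, f0))  # closer to mean -2 -> 0
```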
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimensional cases).<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of its being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remark:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since in realistic settings we generally cannot know the prior probability.<br />
<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
* whether the student’s GPA > 3.0 (G)<br />
* whether the student had a strong math background (M)<br />
* whether the student is a hard worker (H)<br />
* whether the student passed or failed the course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0; that is, we predict that he will fail the course.<br />
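The arithmetic of this example is easy to check. The numerator and evidence values below (0.025 and 0.125) are taken from the worked calculation above; the underlying likelihoods come from the lecture table, which is not reproduced here:<br />

```python
# With priors P(Y=1) = P(Y=0) = 0.5, the lecture table gives
# P(X=(0,1,0)|Y=1) * P(Y=1) = 0.025 and total evidence P(X=(0,1,0)) = 0.125.
num_pass = 0.025           # P(X|Y=1) * P(Y=1)
evidence = 0.125           # sum over both classes
r = num_pass / evidence
print(r)                                  # 0.2
print("pass" if r > 0.5 else "fail")      # fail, since 0.2 < 1/2
```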
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN), and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as having an objective existence. In fact, they represent two different schools of statistics.<br />
<br />
Throughout the history of statistics, there have been two major schools: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be tied to a frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g. a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In Bayesian method, at first, one can see this man (object), and then judge whether his name is Jack (label). On the other hand, in Frequentist method, one doesn’t see the man (object), but can see the photos (label) of this man to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
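For <math>\,k</math> classes, the optimal rule is just an argmax over posteriors; since the evidence is common to all classes, it suffices to compare <math>\,f_k(x)\pi_k</math>. A Python sketch with hypothetical one-dimensional densities (Gaussian-shaped, unnormalized, which is fine since only the argmax matters):<br />

```python
from math import exp

def bayes_classify(x, class_densities, priors):
    """Return the class k maximizing f_k(x) * pi_k (the evidence cancels)."""
    scores = {k: f(x) * priors[k] for k, f in class_densities.items()}
    return max(scores, key=scores.get)

# Hypothetical unnormalized Gaussian-shaped densities centred at -1, 0, 3.
densities = {k: (lambda x, m=m: exp(-(x - m) ** 2 / 2))
             for k, m in {1: -1.0, 2: 0.0, 3: 3.0}.items()}
priors = {1: 1/3, 2: 1/3, 3: 1/3}

# With equal priors and equal spreads this reduces to nearest-centre.
print(bayes_classify(2.4, densities, priors))   # nearest centre is 3 -> class 3
print(bayes_classify(-0.8, densities, priors))  # nearest centre is -1 -> class 1
```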
<br />
====Approaches====<br />
<br />
Although it represents the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2 Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density Estimation: estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math>. <br />
<br />
The third approach, in this form, is not popular because density estimation doesn't work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal, but unfortunately the prior and conditional densities of most data are unknown. Some estimate of these must be made if we want to classify new data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the <math>\Sigma_k, \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
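The derived boundary can be evaluated numerically, and the halfway-point property checked directly. Below is a numpy sketch (the means and shared covariance are hypothetical; the course examples use Matlab, Python is used here for illustration):<br />

```python
import numpy as np

def lda_boundary(x, mu_k, mu_l, Sigma, pi_k=0.5, pi_l=0.5):
    """Evaluate log(pi_k/pi_l) - 1/2 (mu_k' S^-1 mu_k - mu_l' S^-1 mu_l)
    + x' S^-1 (mu_k - mu_l); zero exactly on the decision boundary."""
    Si = np.linalg.inv(Sigma)
    return (np.log(pi_k / pi_l)
            - 0.5 * (mu_k @ Si @ mu_k - mu_l @ Si @ mu_l)
            + x @ Si @ (mu_k - mu_l))

# Hypothetical class means and a shared covariance matrix.
mu_k = np.array([1.0, 2.0])
mu_l = np.array([3.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

# With equal priors the midpoint of the means lies on the boundary.
midpoint = (mu_k + mu_l) / 2
print(lda_boundary(midpoint, mu_k, mu_l, Sigma))  # ~0
```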
<br />
===QDA===<br />
The concept is the same: find a boundary where the error rates for classification between classes are equal, except that the assumption that each cluster shares the same covariance matrix <math>\,\Sigma</math> (equal to the mean of the <math>\Sigma_k, \forall k</math>) is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because the quadratic terms in <math>x</math> no longer cancel, since the covariance matrices differ between classes.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
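The two discriminant functions in the theorem translate directly to code. A numpy sketch with hypothetical two-class parameters (shared identity covariance, equal priors):<br />

```python
import numpy as np

def delta_quadratic(x, mu, Sigma, pi):
    """QDA: -1/2 log|Sigma| - 1/2 (x-mu)' Sigma^-1 (x-mu) + log pi."""
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(pi))

def delta_linear(x, mu, Sigma, pi):
    """LDA: x' Sigma^-1 mu - 1/2 mu' Sigma^-1 mu + log pi."""
    Si = np.linalg.inv(Sigma)
    return x @ Si @ mu - 0.5 * mu @ Si @ mu + np.log(pi)

def classify(x, params, delta):
    """argmax_k delta_k(x) over a dict of per-class (mu, Sigma, pi)."""
    return max(params, key=lambda k: delta(x, *params[k]))

# Hypothetical two-class parameters with a shared covariance.
params = {0: (np.array([0.0, 0.0]), np.eye(2), 0.5),
          1: (np.array([3.0, 3.0]), np.eye(2), 0.5)}
print(classify(np.array([2.5, 2.9]), params, delta_linear))      # -> 1
print(classify(np.array([0.2, -0.4]), params, delta_quadratic))  # -> 0
```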
<br />
===In practice===<br />
We need to estimate the unknown parameters, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{l=1}^{k}n_l} </math><br />
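The sample estimates above are a few lines of numpy. The data below is hypothetical (two synthetic Gaussian clusters), purely to exercise the formulas:<br />

```python
import numpy as np

def estimate_parameters(X, y):
    """Per-class estimates pi_k = n_k/n, mu_k = class mean,
    Sigma_k = class scatter / n_k, plus the pooled common covariance."""
    n = len(y)
    classes = np.unique(y)
    pi, mu, Sigma = {}, {}, {}
    for k in classes:
        Xk = X[y == k]
        nk = len(Xk)
        pi[k] = nk / n
        mu[k] = Xk.mean(axis=0)
        diff = Xk - mu[k]
        Sigma[k] = diff.T @ diff / nk
    # Common covariance: class covariances weighted by class sizes.
    pooled = sum(len(X[y == k]) * Sigma[k] for k in classes) / n
    return pi, mu, Sigma, pooled

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (70, 2))])
y = np.array([0] * 50 + [1] * 70)
pi, mu, Sigma, pooled = estimate_parameters(X, y)
print(pi[0], pi[1])  # 50/120 and 70/120
```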
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
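The Case 1 procedure described above (nearest centre, adjusted by the log prior) fits in a few lines. The centres and priors below are hypothetical; note how an unequal prior shifts the decision toward the more probable class:<br />

```python
import numpy as np

def classify_identity_cov(x, means, priors):
    """Case Sigma_k = I: pick k maximizing -||x - mu_k||^2 / 2 + log(pi_k)."""
    scores = {k: -0.5 * np.sum((x - mu) ** 2) + np.log(priors[k])
              for k, mu in means.items()}
    return max(scores, key=scores.get)

# Hypothetical centres; the test point sits slightly past the midpoint.
means = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 0.0])}
x = np.array([1.1, 0.0])
print(classify_identity_cov(x, means, {0: 0.5, 1: 0.5}))  # nearer centre 1 -> 1
print(classify_identity_cov(x, means, {0: 0.9, 1: 0.1}))  # strong prior wins -> 0
```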
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
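The transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be sketched with an eigendecomposition; after applying it, the covariance becomes the identity, reducing to Case 1. A numpy sketch with a hypothetical covariance matrix:<br />

```python
import numpy as np

# Hypothetical shared covariance matrix (symmetric positive definite).
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])

# Sigma = U S U^T; the whitening matrix is W = S^{-1/2} U^T.
S, U = np.linalg.eigh(Sigma)        # eigenvalues S, orthonormal eigenvectors U
W = np.diag(S ** -0.5) @ U.T

# In the transformed coordinates x* = W x, the covariance is the identity:
# W Sigma W^T = S^{-1/2} U^T (U S U^T) U S^{-1/2} = I.
print(np.round(W @ Sigma @ W.T, 10))
```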
<br />
Note that when we have multiple classes, they must all use the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for classification to be applicable using this method, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose you have two classes with different shapes, and consider transforming them to the same shape. Given a data point, which transformation would you use to decide its class? If, for example, you use the transformation of class A, then you have already assumed that the point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
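These counts are easy to tabulate; a small Python sketch implementing the two formulas above:<br />

```python
def lda_param_count(K, d):
    """(K-1) boundaries, each a linear function a'x + b with d + 1 parameters."""
    return (K - 1) * (d + 1)

def qda_param_count(K, d):
    """(K-1) boundaries, each quadratic with d(d+3)/2 + 1 parameters."""
    return (K - 1) * (d * (d + 3) // 2 + 1)

print(lda_param_count(3, 10))  # 2 * 11 = 22
print(qda_param_count(3, 10))  # 2 * 66 = 132, far more than LDA in high dimension
```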
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function; here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data from Assignment 1, we therefore take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using princomp and SVD respectively, yielding the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can verify that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code>.<br />
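The same equivalence can be checked outside Matlab. The following is a sketch in NumPy (not the Matlab code above) on randomly generated data instead of 2_3: it reproduces princomp's recipe, centering the columns and then taking the SVD of the scaled data, and confirms that the squared singular values match the eigenvalues of the sample covariance matrix.<br />

```python
import numpy as np

# Sketch in NumPy (not Matlab): reproduce princomp's recipe -- center the
# columns, then take the SVD of the scaled data -- on randomly generated
# data, and check the results against the sample covariance matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 observations (rows), 3 variables

# princomp centers X by subtracting off column means.
Xc = X - X.mean(axis=0)

# "Economy size" SVD of the scaled, centered data; the columns of V are the
# principal component coefficients, exactly what princomp returns in pc.
U, s, Vt = np.linalg.svd(Xc / np.sqrt(len(X) - 1), full_matrices=False)
pc = Vt.T
score = Xc @ pc                         # representation in the PC space
latent = s ** 2                         # eigenvalues of the covariance matrix

# The squared singular values equal the eigenvalues of the sample covariance.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
print(np.allclose(latent, eigvals))     # True
```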
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus a second <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimates (a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free parameters) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector, and <math>x \in \mathbb{R}^{d}</math>.<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly; here <math>\,v</math> is taken to be diagonal, so the quadratic part contributes only the squared terms <math>\,v_i {x_i}^2</math>.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix containing the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
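As a quick numerical sanity check of this trick, here is a NumPy sketch with arbitrary made-up data and weights (not the 2_3 set): appending squared features makes a linear rule in the augmented space coincide with a quadratic rule in the original space.<br />

```python
import numpy as np

# Sketch of the augmentation trick (made-up data and weights, not 2_3):
# appending squared features makes a linear rule in the augmented space
# equal a quadratic rule in the original space.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))             # n = 5 points, d = 2 dimensions

X_star = np.hstack([X, X ** 2])         # x* = [x1, x2, x1^2, x2^2]

w_star = np.array([1.0, -2.0, 0.5, 0.25])   # [w1, w2, v1, v2], arbitrary
y_linear = X_star @ w_star              # linear in the augmented space

# Equivalent quadratic function in the original space: w^T x + v^T x^2
y_quad = X @ w_star[:2] + (X ** 2) @ w_star[2:]
print(np.allclose(y_linear, y_quad))    # True
```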
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA contrasts with that of our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the pre-knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, and each class may have a different size. To separate the two classes, we must determine which class mean is closest to a given point, while also accounting for the different sizes of the classes, represented by their covariances.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance between class 1 and class 2 after projection. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points onto a one-dimensional space, all points are projected onto a single line; the line we seek is the one whose direction achieves the maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points are <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two quantities. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points into a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, so it is itself positive definite and hence invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So the vector <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
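This closed form can be verified numerically. The following NumPy sketch uses two synthetic Gaussian clouds (illustrative data only, with the same means and covariance as the Matlab example in this section): it computes <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> and checks that it matches, up to sign, the leading eigenvector of <math>S_{W}^{-1}S_{B}</math>.<br />

```python
import numpy as np

# Sketch: two synthetic Gaussian clouds (illustrative data only). Compute
# the Fisher direction from the closed form S_W^{-1}(mu1 - mu2) and check
# that it matches, up to sign, the leading eigenvector of S_W^{-1} S_B.
rng = np.random.default_rng(2)
cov = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1, 1], cov, size=300)
X2 = rng.multivariate_normal([5, 3], cov, size=300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)  # within class
Sb = np.outer(mu1 - mu2, mu1 - mu2)                       # between class

# Closed form: w is proportional to S_W^{-1}(mu1 - mu2).
w = np.linalg.solve(Sw, mu1 - mu2)
w /= np.linalg.norm(w)

# Eigenvector route: the eigenvector of S_W^{-1} S_B with largest eigenvalue.
vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
v = np.real(vecs[:, np.argmax(np.real(vals))])
v /= np.linalg.norm(v)

# The two directions agree up to sign.
print(min(np.linalg.norm(w - v), np.linalg.norm(w + v)) < 1e-6)   # True
```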
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400, and each column represents an 8*8 image of a "2" or a "3". Here X1 holds all the "2"s and X2 holds all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(With more than two classes, it is more informative to project onto at least two directions.)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> is the (unnormalized) scatter matrix of class <math>i</math>, so that the decomposition of <math>\mathbf{S}_{T}</math> below holds exactly, and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
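The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can also be checked numerically. Here is a NumPy sketch on made-up three-class data, using unnormalized scatter matrices throughout, as in the derivation above.<br />

```python
import numpy as np

# Sketch: check S_T = S_W + S_B numerically on made-up three-class data,
# using unnormalized scatter matrices throughout, as in the derivation.
rng = np.random.default_rng(3)
classes = [rng.normal(loc=m, size=(50, 2))
           for m in ([0.0, 0.0], [4.0, 1.0], [2.0, 5.0])]

Xall = np.vstack(classes)
mu = Xall.mean(axis=0)                                   # total mean

# Within class: sum of per-class scatter matrices.
Sw = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in classes)

# Between class: sum of n_i (mu_i - mu)(mu_i - mu)^T.
Sb = sum(len(Xi) * np.outer(Xi.mean(axis=0) - mu, Xi.mean(axis=0) - mu)
         for Xi in classes)

St = (Xall - mu).T @ (Xall - mu)                         # total scatter
print(np.allclose(St, Sw + Sb))                          # True
```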
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Apparently, they are very similar.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, for every data point we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
\quad i=1,2,\dots,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues in<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that the <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, thus set the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B})=k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case: the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues in<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
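A NumPy sketch of this multi-class recipe (with arbitrary synthetic data for <math>\,k=3</math> classes in <math>\,d=4</math> dimensions): build <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math>, take the <math>\,k-1</math> leading eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> as the columns of <math>\mathbf{W}</math>, and note that only <math>\,k-1</math> eigenvalues are nonzero, as claimed above.<br />

```python
import numpy as np

# Sketch for k = 3 classes in d = 4 dimensions (arbitrary synthetic data):
# build S_W and S_B, then take the k-1 = 2 leading eigenvectors of
# S_W^{-1} S_B as the columns of the projection matrix W.
rng = np.random.default_rng(4)
means = ([0.0, 0.0, 0.0, 0.0], [3.0, 1.0, 0.0, 2.0], [1.0, 4.0, 2.0, 0.0])
classes = [rng.normal(loc=m, size=(60, 4)) for m in means]

mu = np.vstack(classes).mean(axis=0)
Sw = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in classes)
Sb = sum(len(Xi) * np.outer(Xi.mean(axis=0) - mu, Xi.mean(axis=0) - mu)
         for Xi in classes)

vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(np.real(vals))[::-1]
W = np.real(vecs[:, order[:2]])          # d x (k-1) transformation matrix

Z = np.vstack(classes) @ W               # projected data, one row per point
print(Z.shape)                           # (180, 2)

# rank(S_W^{-1} S_B) = k-1, so only two eigenvalues are (numerically) nonzero.
print(np.sum(np.abs(np.real(vals)) > 1e-8))   # 2
```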
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Setting the first derivative to zero,<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
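The closed-form solution and the hat matrix are easy to check numerically. The following is a minimal sketch in Python/NumPy (the course examples use Matlab; the toy data and names below are purely illustrative):<br />

```python
import numpy as np

# Toy one-dimensional data (illustrative): y = 2x + 1 exactly.
X_raw = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Augment with a leading column of ones so beta_0 is absorbed into beta.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])   # n x (d+1)

# beta_hat = (X^T X)^{-1} X^T y; lstsq computes it in a numerically stable way.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values via the hat matrix H = X (X^T X)^{-1} X^T.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
```

Since the hat matrix is an orthogonal projection onto the column space of <math>\mathbf{X}</math>, it is idempotent (<math>\mathbf{H}^2=\mathbf{H}</math>), which the sketch above also lets you verify.<br />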
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
 >>x=[sample'; ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, so that x is a 3-by-400 matrix and the intercept is absorbed into b.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== Logistic Function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y\, dx = \ln(1 + e^{x}) + C</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for large negative <math>\,x</math>, which slows to linear growth of slope 1/4 near <math>\,x = 0</math>, then approaches <math>\,y = 1</math> with an exponentially decaying gap.<br />
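The first two properties above can be verified numerically. A small Python sketch (the evaluation points and step size are illustrative):<br />

```python
import numpy as np

def logistic(x):
    # y = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

# Property 1: dy/dx = y(1 - y), checked with a central finite difference.
x, h = 0.7, 1e-6
numeric = (logistic(x + h) - logistic(x - h)) / (2 * h)
analytic = logistic(x) * (1 - logistic(x))

# Property 2: y(0) = 1/2.
half = logistic(0.0)
```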
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression fits the conditional distribution <math>\,P(Y|X)</math>. The fitting of logistic regression models is usually accomplished by maximum likelihood: the maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)-(1-y_{i})\log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta}\, \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; see the [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html Matrix Reference Manual] for this and other properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math> to write the gradient with a single occurrence of <math>\underline{\beta}</math>,<br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left(1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right)\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares solves the following minimization (with <math>\,X</math> the <math>d\times n</math> input matrix as above): <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
which gives <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Performing a weighted linear regression of the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
on the inputs, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:'''Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model like <math>\underline{\beta}^T\underline{x}</math>. If we construct the model like <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then similar to linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> seems to be a suitable starting value for the Newton-Raphson iteration procedure in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. If it does not, only local convergence of the method can be proven, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: it is uncommon for an initial value to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,W_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
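The pseudo code above can be written as a short program. The following is a minimal sketch in Python/NumPy (the course examples use Matlab), keeping this section's convention that <math>X</math> is <math>d\times n</math> with one data point per column and the first row all ones for the intercept; the function name and toy data are illustrative:<br />

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=50):
    # X: d x n, one data point per column (first row all ones for the
    # intercept); y: n labels in {0, 1}. Follows the pseudo code above.
    d, n = X.shape
    beta = np.zeros(d)                            # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-beta @ X))       # step 3: P(x_i; beta)
        w = p * (1.0 - p)                         # step 4: diagonal of W
        W = np.diag(w)
        z = X.T @ beta + (y - p) / w              # step 5: adjusted response Z
        beta_new = np.linalg.solve(X @ W @ X.T, X @ W @ z)   # step 6
        if np.linalg.norm(beta_new - beta) < tol: # step 7: stop when stable
            return beta_new
        beta = beta_new
    return beta

# Toy, non-separable one-dimensional data (illustrative values).
X = np.array([[1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
              [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta = irls_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-beta @ X))           # fitted posteriors
```

At the maximum likelihood solution the gradient <math>X(\underline{Y}-\underline{P})</math> is zero, which gives a quick check that the iteration has converged. (Note that for perfectly separable data the likelihood has no finite maximizer and the iteration would diverge, which is why the toy data overlap.)<br />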
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to fit a logistic regression model and classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
As in the two-class case, the coefficients can be fit by maximum likelihood; viewing the resulting equations as a weighted least squares problem makes them easier to derive.<br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K&gt;2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.<br />
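A small numerical sketch of these K-class posteriors (in Python/NumPy; the coefficient values below are made up purely for illustration) makes the normalization explicit:<br />

```python
import numpy as np

def multiclass_posteriors(betas, x):
    # betas: (K-1) x d coefficient vectors beta_1, ..., beta_{K-1};
    # class K is the reference class (the "1" term in the denominator).
    scores = betas @ x                               # beta_i^T x, i = 1..K-1
    denom = 1.0 + np.sum(np.exp(scores))
    return np.append(np.exp(scores), 1.0) / denom    # length-K posterior

# Made-up coefficients for K = 3 classes and d = 2 features.
betas = np.array([[1.0, -0.5],
                  [-0.3, 0.8]])
x = np.array([0.4, 1.2])
p = multiclass_posteriors(betas, x)
```

Whatever the coefficients, the returned posteriors are strictly positive and sum to 1, as required.<br />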
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
Separating hyperplane classifiers try to separate the data using linear decision boundaries. When the classes overlap, the approach can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged and transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>\mathrm{sign}(\underline{\beta}^T \underline{x} + {\beta}_0) = \mathrm{sign}(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math><br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the [http://en.wikipedia.org/wiki/Artificial_neural_network Artificial Neural Network] models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Because of the iterative nature of the solution, the algorithm does not converge to a unique hyperplane: the solution found depends on the starting values and on the size of the gap between the classes. If the classes are linearly separable, the algorithm is guaranteed to converge to a separating hyperplane in a finite number of steps; the proof of this is known as the ''perceptron convergence theorem''. For overlapping classes, however, the algorithm does not converge, and cycles can develop.<br />
<br />
<br />
If a separating hyperplane between the 2 classes exists, it is generally not unique, and infinitely many solutions can be obtained from the perceptron algorithm.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
 x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
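The same training loop, sketched in Python/NumPy for comparison (as in the Matlab code above, a point lying exactly on the boundary is treated as correctly classified):<br />

```python
import numpy as np

# Training examples from the table above: rows of X are feature vectors.
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)

b0, b = 0.0, np.ones(3)   # initial guess for the boundary
rho = 0.5                 # learning rate

for epoch in range(100):
    changed = False
    for xi, yi in zip(X, y):
        if yi * (b @ xi + b0) < 0:      # misclassified point
            b += rho * yi * xi          # move the boundary toward it
            b0 += rho * yi
            changed = True
    if not changed:                     # a full pass with no updates: done
        break
```

On termination every training point satisfies <math>y_i(\underline{\beta}^T\underline{x}_i+\beta_0)\ge 0</math>, i.e. no point lies on the wrong side of the boundary.<br />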
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0=1</math> corresponds to the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> takes the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by<math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to a factor of <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is, up to the same factor, the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step of predetermined size in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
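The update rule above can be sketched as code. This is an illustrative implementation, not the notes' own; the toy data set and the learning rate <math>\,\rho = 0.1</math> are assumptions:<br />

```python
import numpy as np

def perceptron(X, y, rho=0.1, max_iter=1000):
    """Perceptron algorithm: update (beta, beta0) using one misclassified
    point at a time, stepping opposite the gradient of phi."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(max_iter):
        n_wrong = 0
        for x_i, y_i in zip(X, y):
            # y_i (beta^T x_i + beta0) <= 0 means x_i is misclassified
            if y_i * (x_i @ beta + beta0) <= 0:
                beta = beta + rho * y_i * x_i
                beta0 = beta0 + rho * y_i
                n_wrong += 1
        if n_wrong == 0:       # converged: no misclassified points remain
            break
    return beta, beta0

# Toy linearly separable data (labels in {-1, +1})
X = np.array([[2.0, 2.0], [1.5, 2.5], [2.5, 1.0],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])
beta, beta0 = perceptron(X, y)
print(np.all(y * (X @ beta + beta0) > 0))  # True: a separating hyperplane was found
```

Because the classes here are separated by a wide gap, the algorithm converges in a handful of passes, as the convergence discussion below predicts.<br />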
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, the algorithm converges quickly; if the gap is small, it converges slowly. This problem can be mitigated by using a basis expansion: we look for a separating hyperplane not in the original space, but in an enlarged space obtained by applying basis functions to the inputs.<br />
#If the classes are separable, there exist infinitely many solutions to the perceptron problem, all of which are separating hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if it is too large the algorithm may “skip over” the minimum it is trying to find and oscillate forever between points on either side of it.<br />
#A perfect separation is not always attainable, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is likely overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Imagine standing on a mountain peak and wanting to reach the ground as quickly as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the surface has several valleys and you start in the wrong place, you may end up at the bottom of a shallow valley (a local minimum) and get stuck there.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over <math>\,i</math> (all data points). This is a variant of the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, where we approximate the true gradient by evaluating it on a single training example. This means that <math>\,{\beta}</math> is improved by the computation on only one sample. When there is a large data set, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent, we can treat the problem sample by sample and still get decent results in practice.<br />
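To make the batch/stochastic distinction concrete, here is a small sketch (the data are made up for illustration): the batch gradient of <math>\,\phi</math> sums over all misclassified points, while the stochastic variant uses one point's contribution at a time.<br />

```python
import numpy as np

def batch_gradient(X, y, beta, beta0):
    """Gradient of phi(beta, beta0) = -sum_{i in M} y_i (beta^T x_i + beta0),
    where M is the set of misclassified points."""
    M = y * (X @ beta + beta0) <= 0          # mask of misclassified points
    return -np.sum(y[M, None] * X[M], axis=0), -np.sum(y[M])

def single_sample_gradient(x_i, y_i):
    # Stochastic approximation: contribution of one misclassified point
    return -y_i * x_i, -y_i

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, -1, -1])
beta, beta0 = np.zeros(2), 0.0            # every point is misclassified at the start

g_beta, g_beta0 = batch_gradient(X, y, beta, beta0)
# The batch gradient equals the sum of single-sample gradients over M:
parts = [single_sample_gradient(x_i, y_i) for x_i, y_i in zip(X, y)]
print(np.allclose(g_beta, sum(p[0] for p in parts)))  # True
print(g_beta0 == sum(p[1] for p in parts))            # True
```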
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <br />
<br />
<ref> Haykin, Simon (2009). Neural Networks and Learning Machines. Pearson Education, Inc. </ref><br />
A neural network resembles the brain in two respects:<br />
# Knowledge is acquired by the network from its environment through a learning process.<br />
# Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.<br />
<br />
<ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network, but it can take many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer, with the kth unit representing the probability of class '''k'''; each <math>\displaystyle y_k</math> is coded as 0 or 1.<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates, meaning its output has a maximum and a minimum value. This property ensures that the weights remain bounded and therefore that the search time is limited. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; for example, RBF networks, which use non-monotonic activation functions, are also powerful models. <br />
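These properties can be checked numerically for the logistic function. A minimal sketch follows; the identity <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math> is a standard fact about the logistic function that becomes useful in back-propagation:<br />

```python
import numpy as np

def sigmoid(a):
    """Logistic (inverse-logit) activation: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    # smooth, strictly positive derivative: sigma(a) * (1 - sigma(a))
    s = sigmoid(a)
    return s * (1.0 - s)

a = np.linspace(-10, 10, 201)
print(sigmoid(0.0))                                       # 0.5
print(bool(np.all((sigmoid(a) > 0) & (sigmoid(a) < 1))))  # True: saturates in (0, 1)
print(bool(np.all(np.diff(sigmoid(a)) > 0)))              # True: monotonic
# finite-difference check of the derivative identity at a = 1
h = 1e-6
print(abs((sigmoid(1 + h) - sigmoid(1 - h)) / (2 * h) - sigmoid_prime(1.0)) < 1e-8)  # True
```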
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the last output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output units, turning it into a classification problem.<br />
<br />
For simplicity, there is only 1 unit at the end and assume for the moment we are doing regression.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math>'s are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find the derivative of the model error with respect to the output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: in this figure, the indices i and j are reversed relative to the text.)''<br />
<br />
Notice that the weighted sum on the output of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math> and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the target output and <math>\hat{y_k}</math> is the network's actual output.<br />
#By propagating to the previous layers evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math> where <math>i</math> is associated to the previous layer.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
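The steps above can be sketched for a network with one hidden layer and a single linear output unit (the regression setting from the previous lecture). This is an illustrative NumPy implementation; the layer sizes, learning rate, and toy target function are assumptions:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, p = 2, 5                      # input dimension, number of hidden units
u = rng.uniform(-1, 1, (p, d))   # hidden weights, random initial values in [-1, 1]
w = rng.uniform(-1, 1, p)        # output weights
rho = 0.05                       # learning rate

def forward(x):
    a = u @ x                    # a_j = sum_l u_{jl} x_l
    z = sigmoid(a)               # z_j = sigma(a_j)
    return a, z, w @ z           # y_hat = sum_j w_j z_j

def backprop_step(x, y):
    global u, w
    a, z, y_hat = forward(x)
    d_yhat = -2.0 * (y - y_hat)                          # d err / d y_hat, err = (y - y_hat)^2
    delta = sigmoid(a) * (1 - sigmoid(a)) * w * d_yhat   # delta_j = sigma'(a_j) w_j d_yhat
    w = w - rho * d_yhat * z                             # d err / d w_j = d_yhat * z_j
    u = u - rho * np.outer(delta, x)                     # d err / d u_{jl} = delta_j * z_l (here z_l = x_l)

def mse(X, Y):
    return np.mean([(y - forward(x)[2]) ** 2 for x, y in zip(X, Y)])

# Toy regression task: fit y = sin(x1) + 0.5 x2 on random inputs
X = rng.uniform(-1, 1, (200, d))
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
err_before = mse(X, Y)
for _ in range(200):              # feed each training point repeatedly
    for x, y in zip(X, Y):
        backprop_step(x, y)
err_after = mse(X, Y)
print(err_after < err_before)     # True: back-propagation reduced the training error
```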
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method of choosing weights mentioned in class was to randomize the weights before the first step. This is not likely to be near the optimal solution in every case, but it is simple to implement. To be more specific, random values near zero (usually from [-1,1]) are a good choice for the initial weights. In this case, the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output using a linear approximation of <math>\,\sigma(a_i)</math>, which finds optimal weights in the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation process, <math>\,\rho</math> should decrease as the number of iterations increases.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate may lead to a very slow convergence rate (a very long learning phase). The advantage of a small learning rate, however, is that it can guarantee convergence. Thus it is generally better to choose a relatively small learning rate to ensure stability; <math>\,\rho</math> is usually chosen between 0.01 and 0.7.<br />
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we mainly discuss how to estimate the number of hidden units at the very beginning. Obviously, we should then refine the estimate using CV, LOO, or other complexity-control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from highly complicated mixture models, more hidden units are needed. <br />
<br />
In fact, the number of hidden units determines the size of the model, and therefore the total number of weights in the model. Generally speaking, the number of weights should not be larger than the number of training data points, say N. Thus N/10 is sometimes a good choice. However, in practice, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this neural network, the number of nodes is reduced until we reach a layer whose number of nodes equals the desired dimensionality. (In the first few layers the number of nodes need not be strictly decreasing, as long as a layer with fewer nodes is eventually reached.) From this middle layer onward,<br />
the previous layers are mirrored, so the output layer has the same number of units as the input layer. Now note that if we feed the network with each point and get an output approximately equal to the fed input, this means the input has been reconstructed at the output from the middle-layer units alone. The output of the middle-layer units can therefore represent the input in fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
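A minimal linear sketch of this bottleneck idea follows. A real application would use nonlinear units and the back-propagation training described above; the toy data, the one-unit bottleneck, and the learning rate are assumptions:<br />

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data lying near a 1-d subspace of R^3
t = rng.uniform(-1, 1, (100, 1))
X = t @ np.array([[1.0, 2.0, -1.0]]) + 0.01 * rng.normal(size=(100, 3))

W_enc = rng.normal(scale=0.1, size=(3, 1))   # input layer -> 1-unit bottleneck
W_dec = rng.normal(scale=0.1, size=(1, 3))   # bottleneck -> mirrored output layer
rho = 0.05

def reconstruction_error(X):
    return np.mean((X - (X @ W_enc) @ W_dec) ** 2)

err_before = reconstruction_error(X)
for _ in range(500):
    Z = X @ W_enc        # low-dimensional codes: the middle-layer outputs
    E = Z @ W_dec - X    # reconstruction error at the output layer
    # gradient steps on the mean squared reconstruction error
    # (the constant factor of 2 is absorbed into rho)
    W_dec -= rho * (Z.T @ E) / len(X)
    W_enc -= rho * (X.T @ (E @ W_dec.T)) / len(X)
err_after = reconstruction_error(X)
print(err_after < err_before)  # True: the network learns to reconstruct its input
```

After training, `X @ W_enc` is the low-dimensional mapping, and multiplying by `W_dec` (the mirrored half) reconstructs the data, exactly as described above.<br />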
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math> may become negligible and the errors vanish. This is a numerical problem, where it is difficult to estimate the errors. So in practice configuring a<br />
Neural Network with Back-propagation faces some subtleties.<br />
Deep neural networks became popular two or three years ago, with Geoffrey Hinton's work on deep belief networks. A deep neural network training algorithm deals with the training of a neural network with a large number of layers.<br />
<br />
The approach of training the deep network is to assume the network has only two layers first and train these two layers. After that we train the next two layers, so on and so forth.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, an idea inspired by statistical physics, where systems settle into their most stable, minimum-energy states; or somehow finding the output of the second layer that is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case as an example. A feedforward neural network was trained using back-propagation to assist with the marketing control of airline seat allocations, adapting its recommendations as booking patterns changed. The system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When neural networks were first introduced, they were thought to model human brains, hence the fancy name "neural network". But now we know that they are just layers of logistic regressions stacked on top of each other, and have little to do with how the brain actually works.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic human brains, but this is unfounded. As a result of such claims, it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, neural networks lack a strong learning theory to back up their "success", so it is hard to apply and tune them wisely. Partly for this reason, they are not an active research area in machine learning, although NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In contrast, the line model has some error on the training points, but it has extracted the main characteristics of the training points and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model will have very high accuracy on the training set but very poor ability to predict outcomes for new instances, especially outside the domain of the training set.<br />
<br />
In a neural network, if there are too many layers, the network will have many degrees of freedom and will learn every characteristic of the training data set. It will then fit the training set very precisely, but will not be able to generalize from the commonality of the training set to predict the outcomes of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model that fits it best. We can find a polynomial of high degree that passes through almost all the points in the training set. But in fact the training set comes from a line model. Although the complex model has less error on the training set, it diverges from the line in ranges where we have no training points. Because of that, the high-degree polynomial has very poor predictive performance on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier: just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we keep making our model more and more complex to improve the classifier, we will eventually reach a point where its quality no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas, but if presented with a new banana of slightly different shape than the existing ones, the classifier fails to detect it. This is the tradeoff: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough, neither of which is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data. For example, we may assume the model is a polynomial from a given family, or a neural network with a given architecture, among other choices.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average, obtaining the empirical error rate, which estimates the probability that <math>h(x_{i}) \neq y_{i}</math>.<br />
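As a tiny illustration of this estimate (the classifier and the four labelled points are made up for the example):<br />

```python
import numpy as np

# Empirical error rate: the fraction of points the classifier h gets wrong
def empirical_error(h, X, y):
    return np.mean([h(x) != yi for x, yi in zip(X, y)])

h = lambda x: 1 if x[0] + x[1] > 0 else -1   # a hypothetical fixed classifier
X = np.array([[1.0, 1.0], [-2.0, 0.5], [0.5, -1.0], [2.0, -1.0]])
y = np.array([1, -1, 1, 1])
print(empirical_error(h, X, y))  # 0.25: one of the four points is misclassified
```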
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
There is a downward bias to this estimate when it is computed on the training data, meaning that it is typically less than the true error rate. <br />
<br />
As we increase complexity from low to high, the training error rate always decreases. When we apply the model to test data, however, the error rate decreases to a point and then starts to increase, since the overly complex model does not generalize to data it has not seen. This is because training error decreases as we fit the model better by increasing its complexity, but, as we have seen, the complex model does not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is where the error rate on the test data is at its minimum; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in R</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a more important property for an estimator than just being unbiased: the mean squared error. In statistics, there are problems for which it may be good to use an estimator with a small bias. In some cases, an estimator with a small bias may have a smaller mean squared error, or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under transformations, while the property of mean-unbiasedness may be lost under nonlinear transformations. For example, using an unbiased estimator with large mean squared error to estimate the parameter, we risk a big error; in contrast, a biased estimator with small mean squared error can improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a fixed mean squared error (MSE), a low bias implies a high variance and vice versa.<br />
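The decomposition can be verified by simulation. Here a deliberately biased shrinkage estimator <math>\,c\bar{X}</math> of a normal mean is used (the particular numbers are assumptions for illustration); with sample moments (ddof = 0), the identity holds exactly:<br />

```python
import numpy as np

rng = np.random.default_rng(2)

f = 5.0                    # true parameter: the mean of the distribution
c, n, reps = 0.9, 20, 10000
# f_hat = c * (sample mean), computed over many repeated samples of size n
estimates = c * rng.normal(f, 2.0, size=(reps, n)).mean(axis=1)

bias = estimates.mean() - f           # approximately c*f - f = -0.5
variance = estimates.var()            # population convention (ddof = 0)
mse = np.mean((estimates - f) ** 2)

print(np.isclose(mse, variance + bias ** 2))  # True: MSE = Variance + Bias^2
```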
<br />
Test error is a good estimate of the MSE. We want a somewhat balanced bias and variance (neither too high), even though this means accepting some bias.<br />
<br />
<br />
Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To build further intuition about over- and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
>> x <- rnorm(200,0,1)<br />
>> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
>> xtest <- rnorm(50,1,1)<br />
>> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
>> p1 <- lm(y~x)<br />
>> p2 <- lm(y ~ poly(x,2))<br />
>> pn <- lm(y ~ poly(x,10))<br />
>> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are drawn from a <math>\,N(0,1)</math> distribution, while the test set has a <math>\,N(1,1)</math> distribution. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random variation. Polynomial least-squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
>> # calculate the mean squared error of degree 1 poly<br />
>> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
>> [1] 1.576042<br />
>> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 7.727615<br />
: Training and test mean squared errors for the linear fit. Both are quite high - and since the data is non-linear, the different mean of the test data increases the error considerably.<br />
>> # calculate the mean squared error of degree 2 poly<br />
>> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
>> [1] 0.08608467<br />
>> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 0.08407432<br />
: This fit is far better - there is little difference between the training and test error.<br />
>> # calculate the mean squared error of degree 10 poly<br />
>> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
>> [1] 0.07967558<br />
>> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but only slightly - and the test error has risen sharply. The overfitting makes it a poor predictor. As the degree rises further, floating-point accuracy becomes an issue - and a good fit is not even consistently produced for the training data.<br />
>> # calculate mse of sin/cos fit<br />
>> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
>> [1] 0.1105446<br />
>> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 1.320404<br />
: Fitting a function of the form sin(x)+cos(x) works fairly well on the training set, but because it is not the true underlying function, it fails on test data drawn from a different region.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate the empirical error rate on V: the model has never seen V, and since we know the proper label of every element in V, we can count how many are misclassified.<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
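As a sketch of this procedure, the following Python snippet (the nearest-mean classifier and all constants are made up for illustration) trains on T and computes the empirical error rate on the held-out V:<br />

```python
import random

# Minimal sketch of a single train/validation split.  The "classifier"
# is a nearest-mean rule trained on T and scored on V, mirroring the
# error-rate idea above.  All names and constants are illustrative.
random.seed(1)
# Two 1-D classes: class 0 centered near -1, class 1 near +1.
data = [(random.gauss(-1, 0.5), 0) for _ in range(100)] + \
       [(random.gauss(+1, 0.5), 1) for _ in range(100)]
random.shuffle(data)
T, V = data[:150], data[150:]          # training / validation split

# "Train": compute each class mean using T only.
mean0 = sum(x for x, y in T if y == 0) / sum(1 for _, y in T if y == 0)
mean1 = sum(x for x, y in T if y == 1) / sum(1 for _, y in T if y == 1)
h = lambda x: 0 if abs(x - mean0) < abs(x - mean1) else 1

# Empirical error rate on the untouched validation set V.
err = sum(h(x) != y for x, y in V) / len(V)
print(err)
```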
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although training error always decreases with the increasing complexity of the model, the test error starts to increase from a certain point, which is known as overfitting (see [[#prediction-error|figure 2]] above). Since the test error is the best estimate of the MSE (mean squared error), the idea is to divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and select the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{(x_i, y_i) \in \nu} I(h(x_i) \neq y_i)</math>, where <math>\,I(\cdot)</math> is the indicator function and <math>\,|\nu|</math> is the cardinality of the validation set.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in reality, data may be hard to collect (and high-dimensional data also suffer from the curse of dimensionality), so a larger data set may be hard to come by. Consequently, we may not be able to afford to sacrifice part of our limited data as a separate validation set. In this case we use another method that addresses this problem, K-fold cross-validation. We divide the data set into <math>\,K</math> subsets roughly equal in size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
if <math>\,K=n</math> (leave-one-out): low bias, high variance. Each subset contains a single element, so the model is trained on all points except one and then validated using that point.<br />
<br />
if <math>\,K</math> is small (say 2-fold or 5-fold): higher bias, lower variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th <math>( \,k \in [ 1, K ] )</math> part, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>. The overall estimate is the average<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets, as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. Next we use the first three subsets as the training set and the last as the validation set; then the 1st, 2nd and 4th subsets as the training set and the 3rd as the validation set, and so forth until each subset has served as the validation set once (so all observations are used for both training and validation). After we get <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for an n-degree model and trace out a curve of error against degree. Now we are able to choose the degree corresponding to the minimum error. We can also use this method to find the optimal number of hidden units in a neural network: begin with 1 unit, then 2, 3, and so on, and pick the number of hidden units with the lowest average error.<br />
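The mechanics of K-fold cross-validation can be sketched as follows. This hypothetical Python example (the constant-versus-line comparison and all constants are illustrative) splits the data into 4 folds, trains on 3, validates on the remaining one, and averages the fold errors to compare two models:<br />

```python
import random

random.seed(2)
# Noisy linear data; 4-fold CV compares a constant model against a
# straight-line fit (both have closed-form training).  Illustrative.
n, K = 80, 4
xs = [random.uniform(-2, 2) for _ in range(n)]
ys = [1.5 * x + random.gauss(0, 0.3) for x in xs]
folds = [list(range(k, n, K)) for k in range(K)]   # 4 disjoint index sets

def fit_const(idx):
    m = sum(ys[i] for i in idx) / len(idx)
    return lambda x: m

def fit_line(idx):
    mx = sum(xs[i] for i in idx) / len(idx)
    my = sum(ys[i] for i in idx) / len(idx)
    b = sum((xs[i] - mx) * (ys[i] - my) for i in idx) / \
        sum((xs[i] - mx) ** 2 for i in idx)
    return lambda x: my + b * (x - mx)

def cv_error(fit):
    total = 0.0
    for k in range(K):                   # k-th fold is the validation part
        train = [i for i in range(n) if i not in folds[k]]
        f = fit(train)
        total += sum((ys[i] - f(xs[i])) ** 2 for i in folds[k]) / len(folds[k])
    return total / K                     # average of the K fold errors

const_cv = cv_error(fit_const)
line_cv = cv_error(fit_line)
print(line_cv < const_cv)   # the line should win on linear data
```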
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be denoted by <math>\mathbf{y}</math> and the vector of fitted values by <math>\hat{\mathbf{y}}</math>. Then<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
The leave-one-out cross-validation score then satisfies<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}\right]^{2}</math>,<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that the trace of <math>\mathbf{H}</math> is easier to compute than the individual diagonal entries <math>\mathbf{H}_{ii}</math>.<br />
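As a hedged sketch of the GCV formula for ordinary least squares: for a full-rank design matrix with <math>p</math> columns, <math>trace(\mathbf{H}) = p</math>, so the GCV score can be computed without ever forming <math>\mathbf{H}</math>. The data and model below are made up for illustration:<br />

```python
import random

random.seed(3)
# GCV for ordinary least squares.  For a full-rank design matrix with p
# columns, trace(H) = trace(X (X^T X)^{-1} X^T) = p, so the denominator
# 1 - trace(H)/N is known in advance.  Data are illustrative.
N, p = 50, 2
X = [[1.0, random.uniform(0, 1)] for _ in range(N)]          # intercept + slope
y = [2.0 + 3.0 * row[1] + random.gauss(0, 0.1) for row in X]

# Solve the 2x2 normal equations directly.
s11 = sum(r[0] * r[0] for r in X); s12 = sum(r[0] * r[1] for r in X)
s22 = sum(r[1] * r[1] for r in X)
t1 = sum(X[i][0] * y[i] for i in range(N))
t2 = sum(X[i][1] * y[i] for i in range(N))
det = s11 * s22 - s12 * s12
b0 = (s22 * t1 - s12 * t2) / det
b1 = (s11 * t2 - s12 * t1) / det

yhat = [b0 * r[0] + b1 * r[1] for r in X]
# GCV formula with trace(H)/N = p/N.
gcv = sum(((y[i] - yhat[i]) / (1 - p / N)) ** 2 for i in range(N)) / N
print(gcv)
```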
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point in the original training data set to train the model, then using the data point that was left out to estimate the true error. By repeating this process for every data point in the original data set, we can obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every possible value of the parameter, we must train the model <math>\,K</math> times. This deficiency is even more pronounced in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, <math>\,n</math> being the number of data points in the data set.<br />
<br />
Fortunately, when adding data points to the classifier is reversible, computing the difference between two classifiers is cheaper than computing the two classifiers separately. So, if the classifier trained on all the data points is known, we can simply undo the changes contributed by each data point in turn, <math>\,n</math> times, to calculate the leave-one-out cross-validation error rate.<br />
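For ordinary least squares this "undo" idea is exact: the leave-one-out residual equals the ordinary residual divided by <math>1-\mathbf{H}_{ii}</math>, as in the GCV section above. The Python sketch below (simple linear regression with illustrative data) checks the shortcut against explicit refits:<br />

```python
import random

random.seed(4)
# For linear smoothers, the leave-one-out residual can be read off the
# full fit alone: (y_i - yhat_i) / (1 - H_ii).  We check this shortcut
# against an explicit refit for simple linear regression, where
# H_ii = 1/n + (x_i - xbar)^2 / Sxx.  Data are illustrative.
n = 30
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [0.5 * x + 1 + random.gauss(0, 0.4) for x in xs]

def line_fit(px, py):
    mx, my = sum(px) / len(px), sum(py) / len(py)
    b = sum((a - mx) * (c - my) for a, c in zip(px, py)) / \
        sum((a - mx) ** 2 for a in px)
    return (my - b * mx), b                     # intercept, slope

a, b = line_fit(xs, ys)
mx = sum(xs) / n
sxx = sum((x - mx) ** 2 for x in xs)

for i in range(3):                              # spot-check a few points
    hii = 1 / n + (xs[i] - mx) ** 2 / sxx
    shortcut = (ys[i] - (a + b * xs[i])) / (1 - hii)
    # Explicit refit leaving point i out:
    a_i, b_i = line_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    explicit = ys[i] - (a_i + b_i * xs[i])
    assert abs(shortcut - explicit) < 1e-8
print("shortcut matches explicit refits")
```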
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as a way to obtain a robust neural network that is insensitive to noise. Since the architecture of a NN (e.g. the number of hidden units) is usually decided by domain knowledge, the network can easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function is approximately linear, and the NN then collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights close to zero.<br />
<br />
Formally, we penalize large weights by adding a penalty term to the error function. The regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
If <math>\,\lambda</math> is too large, the weights <math>\,w_i</math> and <math>\,u_{jk}</math> will be driven too close to zero. We can use cross-validation to choose <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivatives with respect to the weights and update by gradient descent:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the linearity. Actually, we may also set <math>\,\lambda</math> by cross-validation. The tuning parameter is important since weights of zero will lead to zero derivatives and the algorithm will not change. On the other hand, starting with weights that are too large means starting with a nonlinear model which can often lead to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
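To see the shrinking effect of the decay term concretely, here is a minimal Python sketch of the update rule on a one-weight toy problem (the loss, learning rate and <math>\,\lambda</math> are all illustrative, not from the lecture):<br />

```python
# Sketch of the weight-decay update above on a toy single-weight model.
# With squared-error loss err = (w*x - t)^2, the penalized step is
#   w <- w - rho * (d err / d w + 2 * lambda * w).
# All constants are illustrative.
rho, lam = 0.1, 0.05
x, t = 1.0, 2.0            # one training pair
w = 5.0                    # deliberately large initial weight

for _ in range(200):
    grad_err = 2 * (w * x - t) * x        # d err / d w
    w -= rho * (grad_err + 2 * lam * w)   # decay term pulls w toward 0

# Unpenalized optimum is w = 2; the decayed solution is shrunk below it:
# the fixed point solves 2(w - 2) + 0.1 w = 0, i.e. w = 4 / 2.1.
print(w)
```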
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with an output layer and a single hidden layer, with weights from the hidden layer to the output layer, and can be trained without back propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data. The radii of the RBF functions may differ, and the centers and radii can be determined through clustering or an EM algorithm. When the vector x is given from the input layer, each hidden neuron computes the radial distance from the neuron’s center point and then applies the RBF function to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This is a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
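The closed-form solve can be sketched end to end. In this hypothetical Python example (one-dimensional data, two hand-picked Gaussian centers rather than centers found by clustering or EM), <math>W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math> recovers the weights used to generate the data:<br />

```python
import math, random

random.seed(5)
# Closed-form RBF training, W = (Phi^T Phi)^{-1} Phi^T Y, sketched in
# one dimension with two hand-picked Gaussian centers.  (A real network
# would choose centers by clustering or EM.)  All constants illustrative.
centers, sigma = [-1.0, 1.0], 1.0

def phi_row(x):
    return [math.exp(-(x - c) ** 2 / (2 * sigma ** 2)) for c in centers]

xs = [random.uniform(-2, 2) for _ in range(40)]
Phi = [phi_row(x) for x in xs]
# Target: a fixed mixture of the two basis bumps plus small noise.
Y = [3 * r[0] - 2 * r[1] + random.gauss(0, 0.05) for r in Phi]

# Phi^T Phi is 2x2 here; invert it directly.
a = sum(r[0] * r[0] for r in Phi); b = sum(r[0] * r[1] for r in Phi)
d = sum(r[1] * r[1] for r in Phi)
t0 = sum(Phi[i][0] * Y[i] for i in range(len(xs)))
t1 = sum(Phi[i][1] * Y[i] for i in range(len(xs)))
det = a * d - b * b
W = [(d * t0 - b * t1) / det, (a * t1 - b * t0) / det]
print(W)   # should be close to the true weights [3, -2]
```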
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the matrix (n by k) of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the matrix (n by M+1) of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the matrix (M+1 by k) of weights.<br />
<br />
where the extra basis function <math>\phi_{0}</math> is set to 1.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function, giving the familiar form,<br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
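A quick numeric check (centers, widths and weights below are all illustrative) shows why normalization matters: the <math>\Phi^{\ast}_{j}</math> values sum to one at every input, so the output is a weighted average of the <math>w_{jk}</math> and stays within their range:<br />

```python
import math

# The normalized basis functions phi*_j(x) sum to one at every input,
# so the normalized network output is a weighted average of the weights
# rather than a raw sum.  Centers, width and weights are illustrative.
centers, sigma = [0.0, 2.0, 4.0], 1.0
w = [1.0, 5.0, 3.0]

def normalized_output(x):
    raw = [math.exp(-(x - c) ** 2 / (2 * sigma ** 2)) for c in centers]
    total = sum(raw)
    norm = [r / total for r in raw]
    assert abs(sum(norm) - 1.0) < 1e-12      # phi* values sum to 1
    return sum(wj * nj for wj, nj in zip(w, norm))

y = normalized_output(1.3)
print(min(w) <= y <= max(w))   # output stays inside the range of weights
```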
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or of some other fixed form. In RBF networks, as in neural networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit can then be thought of as representing a feature; if there are more hidden units than input units, we can essentially project to a higher-dimensional space, as we did in our earlier trick. However, this does not mean that an RBF network will actually do this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Precisely because of this power, however, the problem of overfitting becomes more pressing: we have to control the model's complexity so that it captures general structure rather than fitting the training data arbitrarily well.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can expand the class-conditional density in the above expression by marginalizing over <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When using an RBF network to do classification, we usually treat it as a regression problem and set a threshold to decide the class membership of each data point. However, to gain some insight into what we are doing in terms of the RBF network, we often think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as shown in the graph on the right-hand side, that we have three random variables, <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that each class can trigger a different hidden variable <math>\displaystyle j</math>, and each value of <math>\displaystyle j</math> in turn generates data from its own distribution. For instance, assume that, conditional on <math>\displaystyle j</math>, the data are Gaussian (they could follow any other distribution as well), with the same family of distributions but different parameters for each <math>\displaystyle j</math>. From each Gaussian triggered by each class, we sample some data points. In the end, therefore, we get a set of data which is not strictly Gaussian, but is actually a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not involve the index <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
Within the summation, we multiply each term by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then it becomes,<br />
<br />
<math>\Pr(y_k | x) = Pr(y_{k})\sum_{j} \frac{Pr(x | j)*Pr(j | y_{k})}{Pr(x)} * \frac {Pr(j)}{Pr(j)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, the posterior is just a sum over <math>\displaystyle j</math> of the product of two simpler posteriors.<br />
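This identity can be checked numerically. The small Python sketch below uses made-up probability tables for a discrete model with the assumed chain structure <math>y \rightarrow j \rightarrow x</math>:<br />

```python
# Numeric check of Pr(y_k | x) = sum_j Pr(j | x) Pr(y_k | j) on a small
# discrete model with the chain structure y -> j -> x assumed above.
# All probability tables are made up for illustration.
p_y = [0.6, 0.4]                          # Pr(y)
p_j_given_y = [[0.7, 0.3], [0.2, 0.8]]    # Pr(j | y)
p_x_given_j = [[0.9, 0.1], [0.4, 0.6]]    # Pr(x | j), x binary

x = 0
# Marginals needed on both sides.
p_j = [sum(p_y[y] * p_j_given_y[y][j] for y in range(2)) for j in range(2)]
p_x = sum(p_j[j] * p_x_given_j[j][x] for j in range(2))

def posterior_y(k):
    # Left side: Bayes with the mixture class-conditional Pr(x | y).
    p_x_given_y = sum(p_x_given_j[j][x] * p_j_given_y[k][j] for j in range(2))
    return p_x_given_y * p_y[k] / p_x

def posterior_y_via_j(k):
    # Right side: sum_j Pr(j | x) * Pr(y_k | j).
    return sum((p_x_given_j[j][x] * p_j[j] / p_x) *
               (p_j_given_y[k][j] * p_y[k] / p_j[j]) for j in range(2))

for k in range(2):
    assert abs(posterior_y(k) - posterior_y_via_j(k)) < 1e-12
print("identity holds")
```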
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results derived above to our RBF network. In an RBF network, as we can see on the right-hand side, we have a set of inputs, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, the hidden basis functions, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and some outputs, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. Also, we have weights from the hidden layer to the output layer. The output is just a linear combination of the <math>\displaystyle \phi</math>’s. <br />
<br />
Now, if we interpret the probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> as <math>\displaystyle \phi_j(x)</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> as the weight <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in one dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It's as if we put some Gaussian over data. And for each Gaussian, we consider the center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
We can see the graph on the left which plots the density of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance, if the point gets far from the center <math>\displaystyle \mu_{1}</math>, then it will reduce <math>\displaystyle \phi_{1}</math> to become nearly zero. Remember that, we can usually find a non-linear regression or classification of input space by doing a linear one in some extended space or some feature space (more details in Aside). Here, the <math>\displaystyle \phi</math>’s actually produce that feature space. <br />
<br />
So, one way to look at this is that this <math>\displaystyle \phi</math> is telling us that given an input, how likely the probability of presence of a particular feature is. Say, for example, we define the features as the centers of these Gaussian distributions. Then, this <math>\displaystyle \phi</math> function somehow computes the possibility given certain data points, of this kind of feature appearing. If the data point is right at the center, then the value of that <math>\displaystyle \phi</math> would be one, i.e. the probability is 1. If the point is far from the center, then the probability (<math>\displaystyle \phi</math> function value) will be close to zero, that is, it’s less likely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given data. <br />
<br />
When we have those features, then <math>\displaystyle y</math> is the linear combination of the features. Hence, any of the weights <math>\displaystyle w</math>, which is equal to <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> will appear given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> shows the probability of class membership given feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose, we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>. Each data point has d features. And these n data points consist of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we do not perform least squares on this <math>\displaystyle X</math> matrix; we perform least squares on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point with its center. Hence, the new set of features are actually representing M centers in our data set, and for each data point, its new features check how this point is similar to the first center; how it is similar to the second center; and how it is similar to the <math>\displaystyle M^{th}</math> center. And this checking process will apply to all data points. Therefore, feature space gives another representation of our data set. <br />
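Constructing the feature matrix <math>\displaystyle \Phi^{T}</math> can be sketched in a few lines of pure Python (the data points, centers, and width below are made-up illustrative values, not from the lecture):<br />

```python
import math

def phi(x, mu, sigma):
    # Gaussian basis function: similarity of x to the center mu
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# n = 4 one-dimensional data points and M = 2 centers (illustrative values)
data = [0.0, 1.0, 2.5, 4.0]
centers = [1.0, 3.0]
sigma = 1.0

# Phi^T is M x n: row j holds phi_j applied to every data point
Phi_T = [[phi(x, mu, sigma) for x in data] for mu in centers]

for row in Phi_T:
    print([round(v, 3) for v in row])
```

A point lying exactly at a center gets feature value 1 for that center, and values decay toward 0 with distance, matching the "similarity check" interpretation above.<br />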
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit for the training data, we often want to increase the complexity of our RBF Network. By its construction, the only way to change the complexity of an RBF Network is to add or remove basis functions. A larger number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF Network can fit any training set exactly; however, this does not mean the model will generalize well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is estimate the training error. After working through the training error, we will see that it can in fact be decomposed, and one component of it is the Mean Squared Error (MSE). In the notes that follow, our final goal is to obtain a good estimate of the MSE. Moreover, in order to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn’t always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left-hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>~N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model only works for the training data, and is useless for prediction when new data points are introduced.<br />
<br />
> n<-20;                                   # sample size<br />
> x<-seq(1,10,length=n);<br />
> alpha<-2.5;                              # true intercept<br />
> beta<-1.75;                              # true slope<br />
> y<-alpha+beta*x+rnorm(n);                # linear model plus Gaussian noise<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');         # the true linear model<br />
> lines(spline(x, y), col = 2);            # interpolating spline: over-fits every point<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have the parameters associated with the best observed performance on the training data, although our real goal is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
However, training error and testing error do not have a monotone relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error often increases dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to the training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, the model does not represent the features of the true underlying data source, which is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model typically adapts to the training data, and hence the training error is an overly optimistic estimate of the testing error. An obvious way to estimate the testing error well is to add a penalty term to the training error to compensate. SURE is developed based on this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the squared loss summed over the training sample.<br />
*<math>\displaystyle Err=\sum_{i=1}^m (\hat y_i-y_i)^2 </math> denote the ''test error'', the squared prediction error summed over an independent test sample of size <math>\,m</math>.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations are <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}^N_{i=1}</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>, then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, where <math>\displaystyle y_i</math> and <math>\hat f_i</math> both have the same mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and if <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E(g'(Z))</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>: Let <math>\,Z = \epsilon</math>. Then <math>g(Z) = \hat f-f</math>, since <math>\,y = f + \epsilon</math>, and <math>\,f</math> is a constant. So <math>\,\mu = 0</math> and <math>\,\sigma^2</math> is the variance of <math>\,\epsilon</math>.<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model, not the function of the observations <math>\displaystyle y_i</math>, then <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
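Stein's lemma itself can be illustrated with a quick Monte Carlo check. The sketch below uses the arbitrary test function <math>\displaystyle g(z)=z^2</math>, for which both sides of the lemma equal <math>\displaystyle 2\mu\sigma^2</math> (the parameter values are made up):<br />

```python
import random

random.seed(0)
mu, sigma = 1.5, 2.0
N = 200_000

def g(z):          # a weakly differentiable test function
    return z * z

def g_prime(z):
    return 2 * z

zs = [random.gauss(mu, sigma) for _ in range(N)]
lhs = sum(g(z) * (z - mu) for z in zs) / N          # E[g(Z)(Z - mu)]
rhs = sigma ** 2 * sum(g_prime(z) for z in zs) / N  # sigma^2 E[g'(Z)]

print(lhs, rhs)  # both sides approximate 2*mu*sigma^2 = 12
```

The two Monte Carlo averages agree up to sampling noise, as the lemma predicts.<br />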
<br />
====Two Different Cases====<br />
SURE in RBF,<br />
[http://www.cs.ualberta.ca/~papersdb/uploaded_files/801/paper_automatic-basis-selection-for.pdf Automatic basis selection for RBF networks using Stein’s unbiased risk estimator,Ali Ghodsi Dale Schuurmans]<br />
<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is used to assess the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent. Therefore <math>\displaystyle cov(y_i,\hat f)=0</math> (or think about <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, it has no influence on <math>\hat f</math>, because the estimate <math>\hat f</math> comes from the training data only, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>), and <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Based on the notation we denote above, then we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross validation, to avoid overfitting or underfitting, the validation data set is kept independent of the estimated model.<br />
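Case 1 can be illustrated with a small simulation: for validation points that played no role in fitting, the expected squared prediction error is the squared estimation error plus <math>\displaystyle \sigma^2</math>. This is a pure-Python sketch with a made-up true model and a fixed, slightly-off estimate standing in for a trained network:<br />

```python
import random

random.seed(1)
sigma = 0.5
def f(x):                      # true model (made up)
    return 2.0 * x + 1.0
def f_hat(x):                  # a fixed estimate, as if from some training run
    return 2.1 * x + 0.9

xs = [random.uniform(0, 3) for _ in range(100_000)]  # fresh validation inputs
ys = [f(x) + random.gauss(0, sigma) for x in xs]     # new noisy observations

pred_err = sum((f_hat(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
est_err = sum((f_hat(x) - f(x)) ** 2 for x in xs) / len(xs)

print(pred_err, est_err + sigma ** 2)  # the two nearly agree
```

Because the validation noise is independent of the fitted model, the cross term vanishes and only the <math>\displaystyle \sigma^2</math> offset remains.<br />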
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing the model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. However, the cross term can be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator, in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the optimal number of basis functions is the one that minimizes the estimate of the generalization error. For the Radial Basis Function Network, setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, we can tell that <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Thus, we obtain the further simplification <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=M</math>, where <math>\displaystyle M</math> is the number of basis functions, i.e. the column dimension of <math>\displaystyle \Phi</math>, since <math>\displaystyle \Phi</math> projects the input matrix <math>\,X</math> onto the basis set spanned by the <math>\,M</math> basis functions. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
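The trace identity <math>\,Trace(H)=M</math> can be verified numerically. The sketch below (pure Python, with arbitrary made-up entries for <math>\displaystyle \Phi</math>) uses M = 2 basis functions so that the inverse of <math>\displaystyle \Phi^{T}\Phi</math> has a simple closed form:<br />

```python
import random

random.seed(2)
n, M = 6, 2  # n data points, M basis functions

# Phi is n x M with arbitrary positive entries (any full-rank Phi works)
Phi = [[random.uniform(0.1, 1.0) for _ in range(M)] for _ in range(n)]

# A = Phi^T Phi is 2 x 2, so its inverse has a closed form
a = sum(r[0] * r[0] for r in Phi)
b = sum(r[0] * r[1] for r in Phi)
d = sum(r[1] * r[1] for r in Phi)
det = a * d - b * b
Ainv = [[d / det, -b / det], [-b / det, a / det]]

# trace of H = Phi Ainv Phi^T: only the diagonal entries H_ii are needed
trace_H = 0.0
for i in range(n):
    v = [Phi[i][0] * Ainv[0][k] + Phi[i][1] * Ainv[1][k] for k in range(M)]
    trace_H += v[0] * Phi[i][0] + v[1] * Phi[i][1]

print(trace_H)  # equals M = 2 up to rounding
```

Whatever the entries of <math>\displaystyle \Phi</math>, the trace comes out to the number of basis functions, which is what makes the SURE penalty term so simple.<br />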
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimal number of basis functions by choosing the model with the smallest MSE over the set of models considered. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\,M</math>, let <math>\displaystyle err(M)</math> denote the corresponding training error. <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples and the noise,<math>\sigma^2</math>, can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Please see Figure 27.1 on the right, which shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math>.<br />
<br />
err = sum((output - expected_output) .^ 2);   % training error: sum of squared residuals<br />
sure_Err = err - num_data_point * sigma^2 + 2 * sigma^2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated from the training error.<br />
<br />
err = sum((output - expected_output) .^ 2);<br />
sigma2 = err / (num_data_point - 1);          % estimate of the noise variance<br />
sure_Err = err - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) = <math>\sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) = <math>\sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units, including the intercept) = <math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then <math>\displaystyle N\sigma^2 </math> is a constant with no impact on the choice of model, so we can ignore it. We only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and the estimate <math>\,\hat \sigma</math> changes when <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math> as follows.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
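The simplification from (28.1) to (28.3) is easy to sanity-check numerically (the values of err, N, and M below are arbitrary):<br />

```python
err, N, M = 3.7, 50, 4  # arbitrary illustrative values

sigma2 = err / (N - 1)                              # (28.2): plug-in noise estimate
mse_281 = err - N * sigma2 + 2 * sigma2 * (M + 1)   # (28.1) with the plug-in
mse_283 = err * (2 * M + 1) / (N - 1)               # (28.3)

print(mse_281, mse_283)  # identical up to floating-point rounding
```

This also makes the pathology discussed below visible: with the plug-in estimate, the whole expression is proportional to err, so it goes to 0 whenever err does.<br />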
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error will decrease and the MSE will increase when increasing the number of hidden units (i.e. the model is more complex).<br />
<br />
<br />
When the number of hidden units gets larger and larger, the training error will decrease until it approaches <math>\displaystyle 0 </math>. If the training error equals <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) the estimate of MSE approaches <math>\displaystyle 0 </math> as well. However, this does not happen in fact: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] happens, and the MSE should increase instead of being close to <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
Note that the estimate of <math>\displaystyle \sigma^2 </math> in (28.2) is proportional to <math>\displaystyle err </math>, so it changes with the number of hidden units. To deal with this problem, we can average the estimates of <math>\displaystyle \sigma^2</math> over models with different numbers of hidden units: for example, fit models with 1 hidden unit up to 10 hidden units and average the resulting estimates. Since in reality <math>\, \sigma^2</math> is a constant property of the data and does not depend on <math>\,M+1</math>, using the averaged <math>\,\sigma^2</math> value has a firm theoretical basis.<br />
<br />
We can also see that, unlike the classical Cross Validation (CV) or Leave-One-Out (LOO) techniques, the SURE technique does not need a validation step to find the optimal model. Hence, the SURE technique uses less data than CV or LOO, and is suitable for cases where there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering Kmeans clustering] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units is the same as the number of clusters; each cluster defines one basis function <math>\displaystyle \phi_j </math>.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>; we use the same form of basis function for all clusters.<br />
<br />
The basic details of <math>K</math>-means clustering are as follows:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. For each existing center, re-identify its cluster (every point in the cluster should be closer to this center than to any other center).<br />
<br />
2. Compute the mean of each cluster and make it the new center of that cluster.<br />
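The two alternating steps can be sketched in a few lines of pure Python (1-D data for brevity; the two well-separated groups below are made-up illustrative data):<br />

```python
import random

random.seed(3)
# two well-separated 1-D groups (made-up data)
data = [random.gauss(0, 0.5) for _ in range(20)] + \
       [random.gauss(10, 0.5) for _ in range(20)]

def kmeans(data, k, iters=20):
    centers = random.sample(data, k)  # initial centers drawn from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # step 1: assign every point to the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[j].append(x)
        # step 2: the mean of each cluster becomes its new center
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans(data, 2)
print(sorted(centers))  # one center near 0, the other near 10
```

In an RBF network the returned centers would serve as the <math>\displaystyle \mu_j</math> of the basis functions, and the spread within each cluster would give the <math>\displaystyle \sigma_j</math>.<br />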
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden values)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
ans = 30 1<br />
>> size(C) <br />
ans = 2 80<br />
>> size(sumD) <br />
ans = 2 1<br />
>> c1=sum(IDX==1)<br />
c1 = 14<br />
>> c2=sum(IDX==2)<br />
c2 = 16<br />
>> sumD<br />
sumD = 85.6643<br />
       101.0419<br />
>> v1=sumD(1,1)/c1 <br />
v1 = 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
v2 = 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points, each with 80 dimensions (in MATLAB's kmeans, rows are observations), and then apply the “kmeans” method to separate X into 2 clusters. IDX is a vector of 1s and 2s indicating the cluster of each point, and its size is 30*1. <math>\displaystyle C </math> contains the center (mean) of each cluster, with size 2*80; sumD is the sum of squared distances between the data points and the center of their cluster. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the numbers of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the average within-cluster squared distance of the first cluster (an estimate of <math>\displaystyle \sigma_1^2</math>); <math>\displaystyle v2 </math> is that of the second cluster (an estimate of <math>\displaystyle \sigma_2^2</math>). Now we can get <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> by the following equations. Finally, we will get the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
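Putting those equations together in pure Python (a sketch with 1-D data, M = 2 hand-picked centers standing in for k-means output, and the 2×2 inverse written out; the target weights and noise level are made up):<br />

```python
import math, random

random.seed(4)

centers, widths = [2.0, 7.0], [1.5, 1.5]  # e.g. found by k-means (hand-picked here)

def features(x):
    # Gaussian basis functions phi_j(x)
    return [math.exp(-(x - mu) ** 2 / (2 * s ** 2))
            for mu, s in zip(centers, widths)]

# made-up 1-D training data: a combination of the two bumps plus noise
X = [i * 0.5 for i in range(20)]
true_W = [1.2, -0.8]
Y = [sum(w * p for w, p in zip(true_W, features(x))) + random.gauss(0, 0.1)
     for x in X]

Phi = [features(x) for x in X]  # N x M design matrix in feature space

# W = (Phi^T Phi)^(-1) Phi^T Y, with the 2 x 2 inverse written out
a = sum(r[0] * r[0] for r in Phi)
b = sum(r[0] * r[1] for r in Phi)
d = sum(r[1] * r[1] for r in Phi)
t0 = sum(r[0] * y for r, y in zip(Phi, Y))  # Phi^T Y
t1 = sum(r[1] * y for r, y in zip(Phi, Y))
det = a * d - b * b
W = [(d * t0 - b * t1) / det, (-b * t0 + a * t1) / det]

Y_hat = [r[0] * W[0] + r[1] * W[1] for r in Phi]  # Y_hat = Phi W = H Y
err = sum((yh - y) ** 2 for yh, y in zip(Y_hat, Y))
print(W, err)
```

The least squares weights recover the true combination up to noise, and err is the training error that the SURE formula would then penalize.<br />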
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is EM algorithm with respect to Gaussian mixture model. Generally speaking, the Gaussian mixture model is referred to as a soft clustering while <math>K</math>-means is hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
E-step: each point is assigned a weight for each cluster, based on the likelihood of the point under the corresponding Gaussian. In contrast to <math>K</math>-means, which would assign a point 1 for its closest cluster and 0 for all others, these weights are soft assignments. <br />
<br />
M-step: compute the weighted means and covariances and make them the new means and covariances of each cluster.<br />
<br />
>>[P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
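Since `mdgEM` is a course-provided routine, here is an illustrative pure-Python E-M loop for a 1-D mixture of two Gaussians (the data, initial guesses, and iteration count are all made up). Note how the E-step produces soft weights rather than K-means' hard 0/1 assignments:<br />

```python
import math, random

random.seed(5)
# made-up 1-D data from two well-separated Gaussians
data = [random.gauss(0, 1) for _ in range(200)] + \
       [random.gauss(8, 1) for _ in range(200)]

mu = [1.0, 6.0]    # initial guesses (arbitrary)
var = [1.0, 1.0]
mix = [0.5, 0.5]   # mixing proportions

def normal_pdf(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

for _ in range(30):
    # E-step: soft responsibility of each component for each point
    r = []
    for x in data:
        p = [mix[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
        s = p[0] + p[1]
        r.append([p[0] / s, p[1] / s])
    # M-step: weighted means, variances, and mixing proportions
    for k in range(2):
        nk = sum(ri[k] for ri in r)
        mu[k] = sum(ri[k] * x for ri, x in zip(r, data)) / nk
        var[k] = sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, data)) / nk
        mix[k] = nk / len(data)

print(sorted(mu))  # close to the true means 0 and 8
```

With well-separated components the soft and hard assignments nearly coincide, which is why the Gaussian mixture is often described as a soft version of <math>K</math>-means.<br />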
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier: they construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized into what is known as the support vector machine. It produces nonlinear boundaries by constructing a linear boundary in a high-dimensional, transformed version of the feature space.<br />
<br />
The original basis for SVM was published in the 1960s by [http://en.wikipedia.org/wiki/Vapnik Vapnik], Chervonenkis and colleagues; however, the ideas did not gain much attention until strong results were shown in the early 1990s.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a maximum-margin hyperplane, or set of hyperplanes, in a high- or infinite-dimensional space. The set of points near the class boundaries, the support vectors, define the model, which can be used for classification, regression, or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separated by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
We can see that the data points are in two classes in <math>\displaystyle \mathbb{R}^{2} </math> and can be separated by a linear boundary. If a dataset is indeed linearly separable, then there exist infinitely many possible separating hyperplanes, the black lines in the figure being two of them. But which solution is the best when new data are introduced? <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem,obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle \{X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0\}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: the signed value <math>\displaystyle d_i</math> is positive if the point is on the <math>\displaystyle +1 </math> side and negative if the point is on the <math>\displaystyle -1 </math> side, so <math>\displaystyle y_id_i</math> is positive for correctly classified points.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>. We can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers in the Linearly separable case====<br />
Choose the line farthest from both classes, i.e. the line with the maximum distance from the closest point (maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since the points are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Let us rewrite <math>\displaystyle Margin=min\{y_id_i\}</math> by using the following properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Take two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence,<math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and<math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
For any point on the hyperplane, <math>\displaystyle \beta^{T}x_0</math> gives the negative of the intercept of the hyperplane. <br/><br />
<br />
<br />
3. The signed distance of any point <math>\displaystyle x_i </math> to the hyperplane is the projection of <math>\displaystyle (x_i-x_0)</math> onto the normal direction. <br/>Since the length of <math>\displaystyle \beta </math> is arbitrary, we normalize it to a unit vector:<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
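Properties 2 and 3 can be checked numerically. A small NumPy sketch (the hyperplane and the points below are made-up values for illustration):

```python
import numpy as np

# A made-up hyperplane in R^2: beta^T x + beta_0 = 0
beta = np.array([3.0, 4.0])
beta0 = -5.0

# x0 lies on the hyperplane, so beta^T x0 = -beta_0 (property 2)
x0 = np.array([3.0, -1.0])

def signed_distance(x):
    # Simplified formula: d = (beta^T x + beta_0) / ||beta||
    return (beta @ x + beta0) / np.linalg.norm(beta)

x = np.array([4.0, 3.0])
d1 = signed_distance(x)
d2 = beta @ (x - x0) / np.linalg.norm(beta)  # projection form, property 3
print(d1, d2)  # the two forms agree
```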
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
We had <math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math>, and since we now know how to compute <math>\displaystyle d_i \Rightarrow</math> <br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane and is correctly classified; then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta^{T} </math> only the direction is important, and <math>\displaystyle \frac{\beta^{T}}{c} </math> does not change that direction, so the hyperplane is the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
<br />
<br />
<br />
<br />
Reference:<br />
Hastie, T., Tibshirani, R., Friedman, J. (2008). The Elements of Statistical Learning: 129-130<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
We continue deriving the Support Vector Machine for the case where the two classes are separable in the given feature space. The margin can be written as <math>\,min\{y_id_i\}</math>, the smallest signed distance of any point from the hyperplane, where <math>\,d_i</math> is the distance and <math>\,y_i</math> supplies the sign.<br />
===Margin Maximizing Problem for the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, but <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane.<br />
<br />
This implies <math>\,\exists C>0</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> define the hyperplane, and only their direction matters; dividing through by a constant does not change the direction of the hyperplane. Thus, by assuming scaled values for <math>\,\beta, \beta_0</math> we eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>, implying that the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math>.<br />
<br />
Now in order to maximize the margin, we simply need to maximize <math>\,\frac{1}{|\beta|}</math>, i.e. minimize <math>\,|\beta|</math>. <br />
<br />
In other words, our optimization problem is now to find the minimum <math>\,|\beta|</math>, under the constraint that <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. There are many possible choices of norm, in general the [http://en.wikipedia.org/wiki/P-norm#p-norm p-norm]. The 1-norm of a vector is simply the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance); it is sometimes preferred, but has a discontinuity in its derivative. The 2-norm, or Euclidean norm (the intuitive measure of the length of a vector), is easier to work with: <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>, where the constant 1/2 has been added for simplification; minimizing the squared norm is equivalent to minimizing the norm itself.<br />
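For a concrete vector, the norms above work out as follows (a NumPy sketch with a made-up <math>\,\beta</math>):

```python
import numpy as np

beta = np.array([3.0, -4.0])

one_norm = np.abs(beta).sum()    # taxicab / Manhattan: |3| + |-4| = 7
two_norm = np.sqrt(beta @ beta)  # Euclidean: sqrt(9 + 16) = 5
objective = 0.5 * (beta @ beta)  # the (1/2)||beta||_2^2 objective: 12.5

print(one_norm, two_norm, objective)
```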
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the constraints are satisfied while finding an optimal solution (the optimal saddle point of the Lagrangian for this quadratic optimization). The problem will be solved in dual space by introducing <math>\,\alpha_i</math> as dual variables; this is in contrast to solving the problem in primal space as a function of the betas. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left[y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right]}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is the dual representation of the maximum-margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers, <math>\,\alpha_i \geq 0 \,\forall\, i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the dual objective that we need to maximize, as well as two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are solving for <math>\,\alpha</math>, our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the dual problem, to be maximized, in matrix form:<br />
<br />
<math>L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> has entries <math>S_{ij} = y_iy_jx_i^Tx_j = (y_ix_i)^T(y_jx_j)</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's built in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use <code>quadprog</code> to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's <code>quadprog</code> function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. Since <code>quadprog</code> minimizes but our dual is maximized, we pass it the negated objective: minimize <math>\,\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{1}^T\underline{\alpha}</math>, i.e. <math>\,H = S</math> and <math>\,f = -\underline{1}</math>.<br />
<br />
We'll use a simple one-dimensional data set, which is essentially y = -1 or 1 + Gaussian noise. (Note: you could easily put the values straight into the quadprog call; they are separated for clarity.)<br />
<br />
 x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]';
 y = [-ones(100,1); ones(100,1)];
 S = (y .* x') * (y .* x')';  % S(i,j) = y_i y_j x_i' x_j
 f = -ones(200,1);            % negated because quadprog minimizes
 A = [];                      % no extra inequality constraints
 b = [];
 Aeq = y';                    % equality constraint: sum of alpha_i y_i = 0
 beq = 0;
 lb = zeros(200,1);           % alpha >= 0, one bound per component
 ub = [];                     % there is no upper bound
 alpha = quadprog(S,f,A,b,Aeq,beq,lb,ub);
<br />
This gives us the optimal <math>\,\alpha</math>: most entries come out (numerically) zero, and only the support vectors take positive values.<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{x}</math>. If <math>\,f</math> and <math>\,g</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial \hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum_i{\alpha_ig_i'(\hat{x})}=0</math> (Stationarity)<br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> cannot be a local minimum.<br />
<br />
These are all fairly direct except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that determine the optimal separating hyperplane we are looking for. They are also the most difficult points to classify, and the most informative for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br />In order for this condition to be satisfied, either <br/><math>\,\alpha_i= 0</math> or <br/><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
In the canonical representation, every point <math>\,x_i</math> satisfies <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1</math>; that is, each point lies on the margin or farther from the hyperplane.<br />
<br />
'''Case 1: a point with <math>\displaystyle y_i(\beta^Tx_i+\beta_0) > 1</math>, i.e. off the margin'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point with <math>\displaystyle y_i(\beta^Tx_i+\beta_0) = 1</math>, i.e. on the margin'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
<br />If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
Points on the margin, with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
<br />
===Using support vectors===<br />
<br />
Support vectors are important because they make the support vector machine insensitive to points far from the boundary. If <math>\,\alpha_i = 0</math>, the corresponding term contributes nothing to the solution of the SVM problem; only points on the margin - the support vectors - contribute. Hence the model given by SVM is entirely defined by the set of support vectors, a subset of the entire training set. This is interesting because in the neural network methods that preceded this (and more generally in classical statistical learning), the configuration of the network needed to be specified in advance. Here we have a data-driven, 'nonparametric' model in which the training set and the algorithm determine the support vectors.<br />
<br />
References:<br />
Wang, L. (2005). Support Vector Machines: Theory and Applications. Springer: 3<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem:<math>\min_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
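The three steps above can be run end-to-end on a toy data set. Since <code>quadprog</code> is a Matlab routine, this sketch uses Python with SciPy's general-purpose SLSQP solver in its place; the one-dimensional data set is invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable 1-D data: class -1 at {-2, -1}, class +1 at {1, 2}
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
n = len(x)

# S(i,j) = y_i y_j x_i^T x_j (a plain product in 1-D)
z = y * x
S = np.outer(z, z)

# Step 1: solve the dual QP
#   maximize sum(a) - (1/2) a^T S a  <=>  minimize (1/2) a^T S a - sum(a)
#   s.t. a >= 0 and sum(a_i y_i) = 0
obj = lambda a: 0.5 * a @ S @ a - a.sum()
jac = lambda a: S @ a - np.ones(n)
cons = [{'type': 'eq', 'fun': lambda a: a @ y}]
res = minimize(obj, np.zeros(n), jac=jac, method='SLSQP',
               bounds=[(0, None)] * n, constraints=cons)
alpha = res.x

# Step 2: beta = sum(alpha_i y_i x_i)
beta = np.sum(alpha * y * x)

# Step 3: beta_0 from a support vector (alpha_i > 0): y_i (beta x_i + beta_0) = 1
sv = int(np.argmax(alpha))
beta0 = y[sv] - beta * x[sv]   # 1 / y_i = y_i for labels in {-1, +1}

print(alpha, beta, beta0)  # support vectors are the two points at -1 and +1
```

Here the optimal hyperplane is <math>\,x = 0</math> with <math>\,\beta = 1, \beta_0 = 0</math>, and only the two points nearest the boundary receive nonzero <math>\,\alpha_i</math>. Any QP solver would serve; SLSQP is used only because it ships with SciPy.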
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab built-in SVM routines (found in the Bioinformatics toolkit) to do classification through support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]<br />
=='''Non-linear hypersurfaces and Non-Separable classes - November 20, 2009'''==<br />
==='''Kernel Trick'''===<br />
We talked about the curse of dimensionality at the beginning of this course; now we turn to the power of high dimensions in order to find a hyperplane that linearly separates two classes of data points. To understand this, imagine a two-dimensional prison constraining a two-dimensional person. If we magically give the person a third dimension, he can escape from the prison. In other words, the prison and the person become linearly separable with respect to the third dimension. The intuition behind the "kernel trick" is to map data to a higher dimension so that they become separable by a hyperplane.<br />
<br />
We have seen SVM as a linear classification problem finding the max-margin hyperplane in the given input space. However, many real-world problems require a more complex decision boundary. The following simple method was devised in order to solve the same linear classification problem, but in a (usually higher-dimensional) 'feature space' in which the max-margin hyperplane is better suited.<br />
<br />
Let <math>\,\phi</math> be a mapping,<br />
<br />
<math>\phi:\Re^d \rightarrow \Re^D </math><br /><br /><br />
<br />
We wish to find a <math>\,\phi</math> such that our data will be suited for separation by a hyperplane. Given this function, we are lead to solving the previous constrained quadratic optimization on the transformed dataset,<br /><br /><br />
<br />
<math>\min_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br /><br /><br />
<br />
The solution to this optimization problem is now well known; however, a workable <math>\,\phi</math> must be determined. Possibly the largest drawback of this method is that we must compute the inner product of two vectors in the high-dimensional space. As the dimension of the feature space increases, the inner product becomes computationally intensive or impossible.<br />
<br />
However, we have a very useful result that says that there exists a class of functions, <math>\,\Phi</math>, which satisfy the above requirements and that for any function <math>\,\phi \in \Phi</math>,<br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j) </math><br /><br /><br />
<br />
Where K is the kernel function in the input space satisfying [http://en.wikipedia.org/wiki/Mercer%27s_condition Mercer's condition] (which guarantees that it indeed corresponds to some mapping function <math>\,\phi</math>). As a result, if the objective function depends on inner products but not on coordinates, we can always use the kernel function to work implicitly in the feature space without storing the high-dimensional data. Not only does this solve the computational problem, it no longer requires us to explicitly determine a specific mapping function in order to use this method. In fact, it is now possible to use an infinite-dimensional feature space in SVM without even knowing the function <math>\,\phi</math>.<br />
<br />
==='''Mercer's Theorem in detail'''===<br />
Let <math>\,\phi</math> be a mapping to a high dimensional [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] <math>\,H</math><br /><br />
<br />
<br />
<math>\phi:x \in \Re^d \rightarrow H </math><br /><br /><br />
<br />
The transformed coordinates can be defined as,<br /><br />
<br />
<math>\phi_1(x)\dots\phi_d(x)\dots </math><br /><br /><br />
<br />
By Hilbert-Schmidt theory we can represent an inner product in Hilbert space as,<br /><br /><br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = \sum_{r=1}^{\infty}a_r\phi_r(x_i)\phi_r(x_j) \Leftrightarrow K(x_i,x_j), \ a_r \ge 0 </math><br /><br /><br />
where K is symmetric, then Mercer's theorem gives necessary and sufficient conditions on K for it to satisfy the above relation.<br><br><br />
<br />
'''Mercer's Theorem'''<br />
<br />
Let C be a compact subset of <math>\Re^d</math> and K a function <math> \in L^2(C) </math>, if<br /><br /><br />
<br />
<math>\, \int_C\int_C K(u,v)g(u)g(v)dudv \ge 0, \ \forall g \in L^2(C)</math> <br /><br /><br />
<br />
then,<br /><br /><br />
<br />
<math>\sum_{r=1}^{\infty}a_r\phi_r(u)\phi_r(v)</math> converges absolutely and uniformly to a symmetric function <math>\,K(u,v)</math><br />
<br />
References:<br />
Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons: 423<br />
<br />
==='''Kernel Functions'''===<br />
There are various kernel functions, for example:<br />
<br />
* Linear kernel: <math>\,k(x,y)=x \cdot y</math><br />
* Polynomial kernel: <math>\,k(x,y)=(x \cdot y)^d</math><br />
* Gaussian kernel: <math>\,k(x,y)=e^{-\frac{\|x-y\|^2}{2\sigma^2}}</math><br />
<br />
If <math>\,X</math> is a <math>\,d \times n</math> matrix in the original space, and <math>\,\phi(X)</math> is a <math>\,D \times n</math> matrix in the [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] (good explanation video: [http://www.youtube.com/watch?v=V2pBdH7YzX0 part 1] [http://www.youtube.com/watch?v=YRY5xlk3TC0 part 2]), then <math>\,\phi^T(X) \cdot \phi(X)</math> is an <math>\,n \times n</math> matrix. <br />
The inner product is also illustrated as correlation, which measures the similarity between data points. This gives us some insight in how to choose the kernel. The choice depends on certain prior knowledge of the problem and on how we believe the similarity of our data should be measured. In practice, the Gaussian (RBF) kernel usually works best. Besides the most common kernel functions mentioned above, many novel kernels are also suggested for different problem domains like text classification, gene classification and so on.<br />
<br />
These kernel functions can be applied to many algorithms to derive the "kernel version". For example, kernel PCA, kernel LDA, etc..<br />
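The identity <math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j)</math> can be verified directly for the degree-2 polynomial kernel on <math>\Re^2</math>, whose explicit feature map is <math>\,\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)</math> (a standard fact; a NumPy sketch with made-up points):

```python
import numpy as np

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel on R^2
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def k_poly(u, v):
    return (u @ v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

lhs = phi(u) @ phi(v)   # inner product in the 3-D feature space
rhs = k_poly(u, v)      # kernel evaluated directly in the 2-D input space
print(lhs, rhs)         # both equal (1*3 + 2*(-1))^2 = 1
```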
<br />
==='''SVM: non-separable case'''===<br />
We have seen how SVMs are able to find an optimal separating hyperplane for two separable classes of data, in which case the margin contains no data points. In the real world, however, data from different classes are usually mixed together near the boundary, and a perfect separating boundary rarely exists. To address this problem, we relax the classification rule to allow data to cross the margin. Mathematically the problem becomes,<br />
:<math>\min_{\beta, \beta_0} \frac{1}{2}|\beta|^2</math><br />
:<math>\,y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math><br />
:<math>\xi_i \geq 0</math><br />
<br />
Now each data point can have some error <math>\,\xi_i</math>. However, we only want data to cross the boundary when they have to and make the minimum sacrifice; thus, a penalty term is added correspondingly in the objective function to constrain the number of points that cross the margin. The optimization problem now becomes:<br />
<br />
:<math>\min_{\alpha} \frac{1}{2}|\beta|^2+\gamma\sum_{i=1}^n{\xi_i}</math><br />
:<math>\,s.t.</math> <math>y_i(\beta^Tx+\beta_0) \geq 1-\xi_i</math> <br />
:<math>\xi_i \geq 0</math><br />
<br />
[[File:non-separable.JPG|350px|thumb|right|Figure non-separable case]]<br />
<br />
<br />Note that <math>\,\xi_i</math> is not necessarily smaller than one, which means data can not only enter the margin but can also cross the separating hyperplane.<br />
<br />
References:<br />
<br />
Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209: 415-446.<br />
<br />
==Support Vector Machine algorithm for non-separable cases - November 23, 2009==<br />
<br />
With the program formulation above, we can form the lagrangian, apply KKT conditions, and come up with a new function to optimize. As we will see, the equation that we will attempt to optimize in the SVM algorithm for non-separable data sets is the same as the optimization for the separable case, with slightly different conditions.<br />
<br />
===Forming the Lagrangian===<br />
<br />
:<math>L: \frac{1}{2} |\beta|^2 + \gamma \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]-\sum_{i=1}^n \lambda_i \xi_i</math><br />
:<math>\alpha_i \geq 0, \lambda_i \geq 0</math><br />
<br />
===Applying KKT conditions===<br />
# <math>\frac{\partial L}{\partial \beta}=\beta - \sum_{i=1}^n \alpha_i y_i x_i = 0 \Rightarrow \beta=\sum_{i=1}^n\alpha_i y_i x_i</math> <br /><math>\frac{\partial L}{\partial \beta_0}=-\sum_{i=1}^n \alpha_i y_i =0 \Rightarrow \sum_{i=1}^n \alpha_i y_i =0</math> since the sign does not make a difference<br />
#<math>\frac{\partial L}{\partial \xi_i}=\gamma - \alpha_i - \lambda_i = 0 \Rightarrow \gamma = \alpha_i+\lambda_i</math><br />
#<math>\,\alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]=0</math> and <math>\lambda_i \xi_i=0</math><br />
<br />As in the separable case, after applying the KKT conditions we substitute the primal variables, expressed in terms of the dual variables, back into the Lagrangian and simplify.<br />
<br />
===Putting it all together===<br />
<br />
With our KKT conditions and the Lagrangian equation, we can now use quadratic programming to find <math>\,\alpha</math>.<br />
<br />
In matrix form, we want to solve the following optimization:<br />
:<math>\max_{\alpha}\, L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:<math>\,s.t.</math> <math>\underline{0} \leq \underline{\alpha} \leq \gamma\underline{1}</math>, <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Solving this gives us <math>\,\underline{\alpha}</math>, which we can use to find <math>\,\underline{\beta}</math> as before:<br />
:<math>\,\underline{\beta} = \sum{\alpha_i y_i \underline{x_i}}</math><br />
<br />
However, we cannot find <math>\,\beta_0</math> in the same way as before, even if we choose a point with <math>\,\alpha_i > 0</math>, because we do not know the value of <math>\,\xi_i</math> in the equation<br />
:<math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 + \xi_i = 0</math><br />
<br />
From our discussion on the KKT conditions, we know that <math>\,\lambda_i \xi_i = 0</math> and <math>\,\gamma = \alpha_i + \lambda_i</math>.<br />
<br />
So, if <math>\,\alpha_i < \gamma</math> then <math>\,\lambda_i > 0</math> and consequently <math>\,\xi_i = 0</math>.<br />
<br />
Therefore, we can solve for <math>\,\beta_0</math> if we choose a point where:<br />
:<math>\,0 < \alpha_i < \gamma</math><br />
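A sketch of this non-separable solve, using Python with SciPy's SLSQP solver in place of <code>quadprog</code> (the overlapping one-dimensional data set, with one point of each class on the wrong side, and the choice <math>\,\gamma = 1</math> are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Overlapping 1-D data: one point of each class sits on the wrong side
x = np.array([-2.0, -1.0, 1.5, -1.5, 1.0, 2.0])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
n, gamma = len(x), 1.0            # gamma is an arbitrary penalty choice

z = y * x
S = np.outer(z, z)                # S(i,j) = y_i y_j x_i x_j

# Dual QP: same objective as the separable case, but with 0 <= a_i <= gamma
obj = lambda a: 0.5 * a @ S @ a - a.sum()
jac = lambda a: S @ a - np.ones(n)
cons = [{'type': 'eq', 'fun': lambda a: a @ y}]
res = minimize(obj, np.zeros(n), jac=jac, method='SLSQP',
               bounds=[(0, gamma)] * n, constraints=cons)
alpha = res.x

beta = np.sum(alpha * y * x)

# beta_0 from a point with 0 < alpha_i < gamma (on the margin, xi_i = 0);
# the 0.05 tolerance just guards against solver round-off
interior = np.where((alpha > 0.05) & (alpha < gamma - 0.05))[0]
i = int(interior[0])
beta0 = y[i] - beta * x[i]

pred = np.sign(beta * x + beta0)
print(alpha, beta, beta0, pred)   # the two wrong-side points stay misclassified
```

The points that cross the margin get <math>\,\alpha_i = \gamma</math>, while margin points end up strictly between the bounds and can be used for <math>\,\beta_0</math>.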
<br />
====The SVM algorithm for non-separable data sets====<br />
<br />
The algorithm, then, for non-separable data sets is:<br />
<br />
# Use <code>quadprog</code> (or another quadratic programming technique) to solve the above optimization and find <math>\,\alpha</math><br />
# Find <math>\,\underline{\beta}</math> by solving <math>\,\underline{\beta} = \sum{\alpha_i y_i x_i}</math><br />
# Find <math>\,\beta_0</math> by choosing a point where <math>\,0 < \alpha_i < \gamma</math> and then solving <math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 = 0</math></div>
<hr />
<div>
<br />
<br />
Thus, given a new input <math>\,X \in \mathcal{X} </math>,<br />
the classification rule lets us predict the corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented with its features <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> is the predicted fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
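The empirical error rate is a one-line computation (the labels and classifier outputs below are made up for illustration):

```python
import numpy as np

# Made-up labels Y_i and classifier outputs h(X_i) for n = 10 training points
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# Empirical error rate: average of the indicator I(h(X_i) != Y_i)
L_hat = np.mean(y_true != y_pred)
print(L_hat)  # 3 of the 10 points disagree, so 0.3
```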
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two classes, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the probability <math>\,r(x)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math>\,\hat r</math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimensional cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal with respect to true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(x)= \left\{\begin{matrix} <br />
1 & P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remark:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since in practice the prior probability cannot usually be determined.<br />
<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(x)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(x)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
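The same computation can be written out in code. Note that the class-conditional probabilities <code>lik_pass</code> and <code>lik_fail</code> below are illustrative values, chosen only to be consistent with the numbers 0.025 and 0.125 quoted above:

```python
# Worked Bayes-formula computation for the student example.
# lik_pass and lik_fail are assumed values consistent with 0.025/0.125 = 0.2.
prior_pass, prior_fail = 0.5, 0.5   # P(Y=1), P(Y=0)
lik_pass = 0.05                     # assumed P(X=(0,1,0) | Y=1)
lik_fail = 0.20                     # assumed P(X=(0,1,0) | Y=0)

numerator = lik_pass * prior_pass                          # 0.025
evidence = lik_pass * prior_pass + lik_fail * prior_fail   # 0.125
r = numerator / evidence            # posterior P(Y=1 | X=(0,1,0))

print(r)                    # 0.2
print(1 if r > 0.5 else 0)  # Bayes rule classifies the student as 0 (fail)
```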
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes Classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN) and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as having an objective existence. In fact, they represent two different schools of statistics.<br />
<br />
Historically, statistics has had two major schools of thought: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In Bayesian method, at first, one can see this man (object), and then judge whether his name is Jack (label). On the other hand, in Frequentist method, one doesn’t see the man (object), but can see the photos (label) of this man to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier is optimal, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2 Regression: Find an estimate <math>\,\hat r</math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density estimation, estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the covariances <math>\Sigma_k \ \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math><br />
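As a quick numerical check of this special case (with made-up means, a shared variance, and equal priors), a one-dimensional version of the linear discriminant can be evaluated at the midpoint between the means:

```python
import math

# One-dimensional linear discriminant, following the LDA derivation:
# delta_k(x) = x*mu_k/sigma2 - mu_k^2/(2*sigma2) + log(pi_k),
# where sigma2 is the shared variance.
def lda_score(x, mu, sigma2, pi):
    return x * mu / sigma2 - mu * mu / (2 * sigma2) + math.log(pi)

# Illustrative two-class setup (all numbers made up).
mu_k, mu_l, sigma2 = 0.0, 2.0, 1.0
pi_k = pi_l = 0.5

# With equal priors the boundary lies halfway between the means:
midpoint = (mu_k + mu_l) / 2
tie = abs(lda_score(midpoint, mu_k, sigma2, pi_k)
          - lda_score(midpoint, mu_l, sigma2, pi_l)) < 1e-12
print(tie)  # True: the two scores are equal on the boundary
```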
<br />
===QDA===<br />
The concept is the same idea of finding a boundary where the error rates for classification between classes are equal, except that the assumption that each cluster has the same covariance matrix <math>\,\Sigma</math> is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
It is quadratic because the term <math>\,x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x</math> no longer cancels when the covariances differ.<br />
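To see the effect of differing covariances, consider a one-dimensional sketch with two classes sharing a centre but having different variances (all numbers made up). The decision region for the narrow class is bounded on both sides, which a linear boundary could not produce:

```python
import math

# One-dimensional quadratic discriminant, following the QDA derivation:
# delta_k(x) = -(1/2)*log(sigma2_k) - (x - mu_k)^2/(2*sigma2_k) + log(pi_k)
def qda_score(x, mu, sigma2, pi):
    return -0.5 * math.log(sigma2) - (x - mu) ** 2 / (2 * sigma2) + math.log(pi)

# Illustrative classes: both centred at 0, one narrow and one wide.
params = {"narrow": (0.0, 1.0, 0.5), "wide": (0.0, 16.0, 0.5)}

def classify(x):
    return max(params, key=lambda k: qda_score(x, *params[k]))

# Near the centre the narrow class wins; far away the wide one does.
print(classify(0.0), classify(10.0))  # narrow wide
```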
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we use their sample estimates in their place, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
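These estimates can be sketched for a one-dimensional toy data set (all numbers made up); the pooled variance implements the common-covariance ML estimate above:

```python
# Sample estimates of the prior, class mean, and (1-D) class variance,
# plus the pooled variance used when a common covariance is assumed.
# Illustrative toy data: label -> observations.
data = {0: [1.0, 2.0, 3.0], 1: [10.0, 12.0]}
n = sum(len(xs) for xs in data.values())

pi_hat = {k: len(xs) / n for k, xs in data.items()}
mu_hat = {k: sum(xs) / len(xs) for k, xs in data.items()}
# ML (1/n_k) variance estimate, matching the formula above.
var_hat = {k: sum((x - mu_hat[k]) ** 2 for x in xs) / len(xs)
           for k, xs in data.items()}

# Pooled variance: class variances weighted by class sizes.
pooled = sum(len(data[k]) * var_hat[k] for k in data) / n

print(pi_hat)   # {0: 0.6, 1: 0.4}
print(mu_hat)   # {0: 2.0, 1: 11.0}
print(pooled)   # 0.8
```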
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
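The transformation can be sketched in the special case of a diagonal covariance matrix, where <math>\,U=I</math> and <math>\,S</math> holds the variances, so <math>\, x^* = S^{-\frac{1}{2}}x</math> elementwise (the numbers are illustrative):

```python
import math

# Sphering transform for a diagonal covariance: x* = S^{-1/2} x.
S = [4.0, 1.0]                      # diagonal of Sigma (eigenvalues)
def sphere(x):
    return [xi / math.sqrt(si) for xi, si in zip(x, S)]

mu_k = [2.0, 0.0]
x = [4.0, 1.0]

# The Mahalanobis distance (x-mu)^T Sigma^{-1} (x-mu) equals the squared
# Euclidean distance between the transformed points, as derived above.
maha = sum((xi - mi) ** 2 / si for xi, mi, si in zip(x, mu_k, S))
xs, ms = sphere(x), sphere(mu_k)
eucl = sum((a - b) ** 2 for a, b in zip(xs, ms))
print(abs(maha - eucl) < 1e-12)  # True
```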
<br />
Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose you have two classes with different shapes, and you transform them to the same shape. Given a data point, which transformation should you use to decide which class it belongs to? If you use the transformation of class A, then you have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. Each difference, <math>\,a^{T}x+b</math>, requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
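The two counts can be compared directly; this short sketch simply evaluates the formulas above for a hypothetical 3-class problem in growing dimension:

```python
# Parameter counts for the pairwise decision functions, as derived above.
def lda_params(K, d):
    return (K - 1) * (d + 1)

def qda_params(K, d):
    return (K - 1) * (d * (d + 3) // 2 + 1)

for d in (2, 10, 64):
    print(d, lda_params(3, d), qda_params(3, d))
# QDA's count grows quadratically in d, LDA's only linearly.
```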
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here we will analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> covariance parameters to estimate make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix whose entries are the squares of the original entries, to a cubic dimension with the entries cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
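The trick can be checked with a small sketch. The following Python/NumPy illustration (our own toy data, not from the course) builds a one-dimensional "inside vs. outside" pattern that no linear rule on <math>x</math> alone can separate, appends the squared feature, and computes the Fisher direction <math>\underline{w} \propto S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> in the augmented space; the resulting rule <math>w_{1}x + w_{2}x^{2} = c</math> is quadratic in the original <math>x</math> and separates the classes:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
# Class 0 lives inside the interval [-1, 1]; class 1 lives outside it.
x0 = rng.uniform(-1, 1, size=100)
x1 = np.concatenate([rng.uniform(2, 3, 50), rng.uniform(-3, -2, 50)])

def augment(x):
    """Map x -> (x, x^2): the extra quadratic dimension of the trick."""
    return np.column_stack([x, x ** 2])

X = np.vstack([augment(x0), augment(x1)])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Fisher/LDA direction in the augmented space: w ∝ Sw^{-1} (mu0 - mu1)
mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)
w = np.linalg.solve(Sw, mu0 - mu1)

# Classify each point by the nearest projected class mean.
p = X @ w
m0, m1 = p[y == 0].mean(), p[y == 1].mean()
pred = (np.abs(p - m1) < np.abs(p - m0)).astype(float)
acc = (pred == y).mean()   # near-perfect: the augmented classes are separable
```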
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the prior knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class having a possibly different size. To be able to separate the two classes we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive definite matrices, and so it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
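This closed form can be verified numerically. The following NumPy sketch (our own illustration, using the same class parameters as the Matlab example) generates two Gaussian classes, computes <math>\underline{w} = S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>, and checks that it satisfies <math>S_{W}^{-1}S_{B}\underline{w} = \lambda \underline{w}</math> with <math>\lambda = (\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
# Two Gaussian classes (illustrative parameters, matching the Matlab example).
X1 = rng.multivariate_normal([1, 1], [[1, 1.5], [1.5, 3]], size=300)
X2 = rng.multivariate_normal([5, 3], [[1, 1.5], [1.5, 3]], size=300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1.T) + np.cov(X2.T)       # within class covariance
Sb = np.outer(mu1 - mu2, mu1 - mu2)    # between class covariance

# Closed form: w is proportional to Sw^{-1} (mu1 - mu2)
w = np.linalg.solve(Sw, mu1 - mu2)

# Check the eigen relation Sw^{-1} Sb w = lambda w,
# where lambda = (mu1 - mu2)^T w is a scalar.
lhs = np.linalg.solve(Sw, Sb @ w)
lam = (mu1 - mu2) @ w
assert np.allclose(lhs, lam * w)
```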
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced by Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(For more than two classes it is reasonable to have at least two projection directions.)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (Here <math>\mathbf{S}_{W,i}</math> is the unnormalized scatter of class <math>i</math>, which is the form used in the derivation of <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> below.)<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification is to use the total covariance <math>\mathbf{S}_{T}</math> of the data: since <math>\mathbf{S}_{W}</math> is easy to compute, we can get <math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
There is also a more direct generalization of <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
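The identity <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be checked numerically. A small NumPy sketch (our own illustrative data; the scatter matrices are the unnormalized sums, matching the derivation above):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
# Three toy classes of different sizes (illustrative data).
Xs = [rng.normal(loc=c, size=(n, 2))
      for c, n in [(0.0, 50), (3.0, 60), (6.0, 70)]]
X = np.vstack(Xs)
mu = X.mean(axis=0)                  # total mean

# Within-class scatter: sum of the (unnormalized) class scatter matrices.
Sw = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in Xs)
# Between-class scatter: n_i-weighted outer products of (mu_i - mu).
Sb = sum(len(Xi) * np.outer(Xi.mean(axis=0) - mu, Xi.mean(axis=0) - mu)
         for Xi in Xs)
# Total scatter.
St = (X - mu).T @ (X - mu)

assert np.allclose(St, Sw + Sb)      # S_T = S_W + S_B
```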
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\end{align}<br />
</math><br />
<br />
In the special case <math>\,n_{1}=n_{2}</math> the total mean is <math>\mathbf{\mu}=\frac{1}{2}(\mathbf{\mu}_{1}+\mathbf{\mu}_{2})</math>, so <math>\mathbf{\mu}_{1}-\mathbf{\mu}=-(\mathbf{\mu}_{2}-\mathbf{\mu})</math> and the cross terms combine to give<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
2\left[(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}\right]<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
The two definitions agree up to a constant factor, so they yield the same discriminant directions.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},\qquad<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this optimization problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues in<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has <math>\,k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B})=k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case: the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues in<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
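The whole multi-class procedure can be sketched in a few lines of NumPy (our own illustrative data and names; NumPy's general eigensolver takes the place of Matlab's <code>eigs</code>):<br />

```python
import numpy as np

rng = np.random.default_rng(4)
k, d = 3, 5
# Three classes with distinct (non-collinear) means, 100 points each.
centers = np.zeros((k, d))
centers[1, 0] = 4.0
centers[2, 1] = 4.0
Xs = [rng.normal(loc=c, size=(100, d)) for c in centers]
X = np.vstack(Xs)
mu = X.mean(axis=0)

Sw = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in Xs)
Sb = sum(len(Xi) * np.outer(Xi.mean(axis=0) - mu, Xi.mean(axis=0) - mu)
         for Xi in Xs)

# Eigen-decompose Sw^{-1} Sb and keep the k-1 leading eigenvectors.
evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(evals.real)[::-1]
W = evecs.real[:, order[:k - 1]]    # d x (k-1) transformation matrix

Z = X @ W                           # data projected to k-1 = 2 dimensions
# rank(Sw^{-1} Sb) = k-1, so only k-1 eigenvalues are (numerically) nonzero.
nonzero = int(np.sum(np.abs(evals) > 1e-6))
```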
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the least squares method.<br />
<br />
Note that the vectors <math>\mathbf{x}_{i}</math> could be numerical inputs, transformations of the original data, e.g. <math>\log \mathbf{x}_{i}</math> or <math>\sin \mathbf{x}_{i}</math>, or basis expansions, e.g. <math>\mathbf{x}_{i}^{2}</math> or <math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
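A minimal numpy sketch of these formulas (the data values are made up for illustration; Python is used here in place of the course's Matlab):

```python
import numpy as np

# toy data: n = 5 observations of a single feature
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.9, 5.1, 7.0, 9.1])

# X is n x (d+1), with 1 in the first position of each row
X = np.column_stack([np.ones_like(x), x])

# beta_hat = (X^T X)^{-1} X^T y; solve() avoids forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# hat matrix H = X (X^T X)^{-1} X^T maps y to the fitted values y_hat
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
assert np.allclose(y_hat, X @ beta_hat)
```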
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of the Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
 >>x=[sample'; ones(1,400)];<br />
Construct x by transposing the data and appending a row of ones (the intercept term), giving a 3-by-400 matrix.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== Logistic Function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y\, dx = \ln(1 + e^{x}) + C</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>\,x</math>, which slows to linear growth of slope 1/4 near <math>\,x = 0</math>, then approaches <math>\,y = 1</math> with an exponentially decaying gap.<br />
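Properties 1 and 2 are easy to verify numerically; a small sketch (in Python rather than the course's Matlab):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# property 2: y(0) = 1/2
assert logistic(0.0) == 0.5

# property 1: dy/dx = y(1 - y), checked with a centered finite difference
x, h = 0.7, 1e-6
numeric = (logistic(x + h) - logistic(x - h)) / (2.0 * h)
y = logistic(x)
assert abs(numeric - y * (1.0 - y)) < 1e-8
```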
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood, using the conditional probability Pr(Y|X). The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, taking the log of both sides gives<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website with a Matrix Reference Manual, where you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> to one using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then computing <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares (with <math>X</math> the <math>{d}\times{n}</math> input matrix as above) solves <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
giving <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
In general, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Each Newton-Raphson update is therefore a weighted linear regression on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
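A sketch of the WLS estimator in numpy (data and weights are hypothetical; observations are stored as rows here, so the formula reads as the transpose of the <math>d\times n</math> convention above):

```python
import numpy as np

# hypothetical data and positive weights w_i
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.2, 1.9, 3.2])
w = np.array([1.0, 0.5, 2.0, 1.0])

X = np.column_stack([np.ones_like(x), x])   # rows are observations here
W = np.diag(w)

# WLS estimator: (X^T W X)^{-1} X^T W y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# the weighted normal equations hold at the solution
assert np.allclose(X.T @ W @ (y - X @ beta_wls), 0.0)
```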
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case, although it does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. When it does not, only local convergence of the method can be proven, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem, since an initial point too far from the exact solution for the iteration to work is uncommon. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_i,i</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
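The pseudo code above can be sketched directly in numpy (Python rather than the course's Matlab; the test data is made up, chosen to be non-separable so the likelihood has a finite maximizer). As in the notes, <math>X</math> is <math>{d}\times{n}</math> with the input vectors as columns:

```python
import numpy as np

def logistic_irls(X, y, tol=1e-8, max_iter=50):
    """Fit two-class logistic regression by Newton-Raphson / IRLS.
    X: d x n matrix whose columns are input vectors; y: 0/1 labels."""
    d, n = X.shape
    beta = np.zeros(d)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(beta @ X)))     # P(x_i; beta), i = 1..n
        w = p * (1.0 - p)                         # diagonal of W
        H = (X * w) @ X.T                         # X W X^T without forming W
        beta_new = beta + np.linalg.solve(H, X @ (y - p))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# hypothetical 1-D data with an intercept row of ones
X = np.array([[1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
              [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta = logistic_irls(X, y)
```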
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>; no assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to perform logistic regression and classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2\geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
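The fitted rule above is trivial to apply to new points; a minimal sketch (in Python, using the coefficients reported by mnrfit above):

```python
# coefficients reported by mnrfit above
b0, b1, b2 = 0.1861, -5.5917, -3.0547

def classify(x1, x2):
    """Return label 1 if b0 + b1*x1 + b2*x2 >= 0, else label 2."""
    return 1 if b0 + b1 * x1 + b2 * x2 >= 0 else 2

print(classify(0.0, 0.0))  # 1, since 0.1861 >= 0
```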
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Viewing these equations as a weighted least squares problem makes their derivation easier.<br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general, however, the posteriors are no longer complements of each other as is true in the 2 class problem, where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we lose this simplification.<br />
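A sketch of these posterior formulas in numpy (the coefficient vectors below are hypothetical, purely for illustration):

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """Posteriors P(Y=i | X=x) for a K-class logistic model.
    betas: K-1 coefficient vectors (class K is the reference class)."""
    scores = np.exp(np.array([b @ x for b in betas]))   # exp(beta_i^T x)
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)       # classes 1..K

# K = 3 classes, 2-dimensional x (hypothetical coefficients)
betas = [np.array([1.0, -0.5]), np.array([-0.3, 0.8])]
p = multiclass_posteriors(betas, np.array([0.2, 0.4]))
assert np.isclose(p.sum(), 1.0)
```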
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
Separating hyperplane classifiers try to separate the data using linear decision boundaries. When the classes overlap, the approach can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged, transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>\mathrm{sign}(\underline{\beta}^T \underline{x} + \beta_0) = \mathrm{sign}(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math> (shown here for <math>\,d=2</math> features)<br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the [http://en.wikipedia.org/wiki/Artificial_neural_network Artificial Neural Network] models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. While all linear methods determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem is not convex and has no global minimum. The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable, then the algorithm is shown to converge to a local minimum; the proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br />
<br />
<br />
If the separating hyperplane between the 2 classes is not unique, the perceptron algorithm may return any one of infinitely many solutions.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line ourselves; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1, but the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will classify all the examples correctly.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
 x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (an initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to the constant factor <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step in the direction opposite to the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
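The update rule above can be sketched in code. The following Python implementation is illustrative; the data, the learning rate <code>rho</code>, and the iteration cap are assumptions of this sketch, not part of the notes.<br />

```python
import numpy as np

def perceptron(X, y, rho=0.1, max_iter=1000):
    """Perceptron algorithm for labels y in {-1, +1}.
    Returns (beta, beta0) defining the boundary beta^T x + beta0 = 0."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(max_iter):
        misclassified = 0
        for xi, yi in zip(X, y):
            # x_i is misclassified when y_i (beta^T x_i + beta_0) <= 0
            if yi * (beta @ xi + beta0) <= 0:
                beta = beta + rho * yi * xi   # step opposite the gradient
                beta0 = beta0 + rho * yi
                misclassified += 1
        if misclassified == 0:   # converged: all points correctly classified
            break
    return beta, beta0

# Two linearly separable clusters (made-up data)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
beta, beta0 = perceptron(X, y)
```

Since the classes here are separable, the returned hyperplane classifies every training point correctly; the algorithm visits points one at a time and updates only on mistakes, exactly as in the update equation above.<br />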
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes: if the gap is large, the algorithm converges quickly, but if the gap is small, it converges slowly. This problem can be mitigated by using a basis expansion: specifically, we look for a separating hyperplane not in the original space but in an enlarged space obtained by applying some basis functions.<br />
#If the classes are separable, there exist infinitely many solutions to the perceptron problem, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if it is too large, the algorithm may "skip over" the minimum it is trying to find and oscillate forever between points on either side of it.<br />
#A perfect separation is not always achievable, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Imagine standing on a mountain peak and wanting to reach the ground as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and you initially stand in the middle, you may end up at the saddle point and get stuck there, never reaching the true minimum.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over <math>\,i</math> (all data points). This is a variant of the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, in which we approximate the true gradient by evaluating it on a single training example. This means that <math>\,{\beta}</math> is improved using the computation for only one sample. When the data set is large, say a population database, it is very time-consuming to sum over millions of samples at every step. With stochastic gradient descent, we can process the problem sample by sample and still get decent results in practice.<br />
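The difference between batch and stochastic updates can be illustrated on a toy least-squares problem. This Python sketch is illustrative; the data, step sizes, and number of passes are made up.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # true slope is 3

# Batch gradient descent: every step sums the gradient over all samples.
b = 0.0
for _ in range(50):
    grad = -2 * np.sum((y - b * x) * x) / len(x)
    b -= 0.1 * grad

# Stochastic gradient descent: each step uses a single sample.
b_sgd = 0.0
for _ in range(5):                  # a few passes over the data
    for xi, yi in zip(x, y):
        grad_i = -2 * (yi - b_sgd * xi) * xi
        b_sgd -= 0.05 * grad_i
```

Both versions should approach the true slope of 3; the stochastic version touches one sample per update, which is what makes it attractive for very large data sets.<br />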
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <br />
<br />
<ref> Haykin, Simon (2009). Neural Networks and Learning Machines. Pearson Education, Inc. </ref><br />
A neural network resembles the brain in two respects:<br />
1. Knowledge is acquired by the network from its environment through a learning process.<br />
2. Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.<br />
<br />
<ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network but it can have many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a ''k''-class classification problem, there are usually ''k'' units in the output layer, each representing the probability of one of the classes, and each <math>\displaystyle y_k</math> is coded as 0 or 1.<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of the input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
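For example, the logistic activation and its derivative (which appears later in back-propagation) can be written as follows; this Python sketch is illustrative.<br />

```python
import numpy as np

def sigma(a):
    """Logistic (inverse-logit) activation: 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + np.exp(-a))

def sigma_prime(a):
    """Derivative of the logistic function: sigma(a) * (1 - sigma(a))."""
    s = sigma(a)
    return s * (1.0 - s)

assert sigma(0.0) == 0.5
assert abs(sigma_prime(0.0) - 0.25) < 1e-12
# Saturation: the output stays in (0, 1) and the gradient vanishes for large |a|
assert sigma(50.0) > 0.999 and sigma_prime(50.0) < 1e-20
```

The last two assertions illustrate the saturation property listed below: for large inputs the output flattens out and the gradient becomes negligible.<br />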
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates: it has a maximum and a minimum output value. This property ensures that the weights remain bounded, which limits the search time. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; RBF networks, for example, are also a powerful class of models without it. <br />
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the neural network model was just an idea; there were no practical algorithms for training it until 1986, when Geoffrey Hinton and his colleagues <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output units and thus turn into a classification problem.<br />
<br />
For simplicity, there is only 1 unit at the end and assume for the moment we are doing regression.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math>'s are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: in this figure the indices i and j are reversed.)''<br />
<br />
Notice that the weighted sum on the output of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math> and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the target output and <math>\hat{y_k}</math> is the actual output of the network.<br />
#By propagating back through the layers, evaluate all <math>\,\delta_j</math>s for the hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math>, where <math>i</math> ranges over the units of the following layer (closer to the output, processed in the previous step of the back-propagation).<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
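The steps above can be sketched for a tiny network: one hidden layer with two logistic units and a single linear output unit, trained on a single point. This is an illustrative Python sketch; the architecture, data, and learning rate are assumptions of the example, not from the notes.<br />

```python
import numpy as np

def sigma(a):                            # logistic activation
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
U = rng.uniform(-0.5, 0.5, size=(2, 3))  # hidden weights u_{jl}: 2 units, 3 inputs
w = rng.uniform(-0.5, 0.5, size=2)       # output weights w_i
x = np.array([1.0, 0.5, -0.5])           # a single training point
y = 1.0                                  # its target
rho = 0.2                                # learning rate

for _ in range(1000):
    # Step 1: forward pass -- outputs of all nodes
    a = U @ x                    # inputs to the hidden units
    z = sigma(a)                 # hidden unit outputs
    y_hat = w @ z                # linear output unit
    # Step 2: delta at the output (squared error)
    delta_out = -(y - y_hat)
    # Step 3: propagate back: delta_j = sigma'(a_j) * sum_i delta_i u_{ij}
    delta = z * (1 - z) * (delta_out * w)
    # Steps 4-5: gradients d err / d u_{jl} = delta_j * z_l, then descent updates
    w = w - rho * delta_out * z
    U = U - rho * np.outer(delta, x)
```

Each pass through the loop performs steps 1 to 5 of the algorithm; with this setup the output <code>y_hat</code> approaches the target.<br />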
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method mentioned in class is to randomize the weights before the first step. This is not likely to be near the optimal solution, but it is simple to implement. To be more specific, random values near zero (usually from <math>[-1,1]</math>) are a good choice for the initial weights. In this case, the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output by using a linear approximation of <math>\,\sigma(a_i)</math>, which finds the optimal weights in the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the iterations increase.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate may lead to a very slow convergence rate (a very long learning phase). However, the advantage of a small learning rate is that it can guarantee convergence. Thus, it is generally better to choose a relatively small learning rate to ensure stability; usually, <math>\,\rho</math> is chosen between 0.01 and 0.7.<br />
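The effect of the learning rate can be seen on a one-dimensional quadratic; the function <math>\,f(x)=x^2</math> and the step sizes below are illustrative choices.<br />

```python
def descend(rho, steps=100, x0=10.0):
    """Gradient descent on f(x) = x^2 (gradient 2x), minimum at x = 0."""
    x = x0
    for _ in range(steps):
        x -= rho * 2 * x
    return x

assert abs(descend(0.1)) < 1e-6    # moderate rate: converges to the minimum
assert abs(descend(0.001)) > 1.0   # too small: still far away after 100 steps
assert abs(descend(1.2)) > 1e6     # too large: overshoots, oscillates, diverges
```

With <math>\,\rho=1.2</math> each step multiplies <math>\,x</math> by <math>\,1-2\rho=-1.4</math>, so the iterate jumps from one side of the minimum to the other with growing magnitude, which is exactly the "skipping over" behaviour described above.<br />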
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we mainly discuss how to estimate the number of hidden units at the very beginning. Obviously, we should adjust it later to be more precise using cross-validation (CV), leave-one-out (LOO), or other complexity control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from a highly complicated mixture model, more hidden units are needed. <br />
<br />
The number of hidden units determines the size of the model, and therefore the total number of weights in the model. Typically, the number of weights should not be larger than the number of training points, say N; thus N/10 is sometimes a good choice. However, in practice, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this neural network, the number of nodes is reduced, until we reach a layer whose number of nodes equals the desired dimensionality. (The first few layers need not be strictly decreasing in size, as long as the network eventually reaches a layer with fewer nodes.) From this bottleneck layer on, <br />
the previous layers are mirrored, so the output layer has the same number of units as the input layer. If we feed the network a point and get an output approximately equal to the input, then the input has been reconstructed from the middle-layer units alone; the output of the middle layer can therefore represent the input in fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
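A minimal linear sketch of this bottleneck idea in Python, using an SVD in place of back-propagation training purely for illustration (a linear autoencoder trained by back-propagation converges to the same principal subspace; the data here are made up):<br />

```python
import numpy as np

rng = np.random.default_rng(2)
# 100 three-dimensional points lying close to a one-dimensional subspace
t = rng.normal(size=(100, 1))
X = t @ np.array([[1.0, 2.0, -1.0]]) + 0.01 * rng.normal(size=(100, 3))
mu = X.mean(axis=0)

# The optimal linear bottleneck of width k is spanned by the top-k
# principal directions, obtained here directly from the SVD.
k = 1
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
Z = (X - mu) @ Vt[:k].T        # middle-layer output: k-dimensional codes
X_hat = Z @ Vt[:k] + mu        # mirrored second half: reconstruction
```

Because the data really are (nearly) one-dimensional, reconstructing from the 1-dimensional middle layer loses almost nothing; the codes <code>Z</code> are the low-dimensional representation.<br />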
<br />
=== Deep Neural Network ===<br />
Back-propagation may not work well in practice when there are too many hidden layers, since the <math>\,\delta</math> values may become negligible and the propagated errors vanish. This is a numerical problem in which it is difficult to estimate the errors, so configuring a neural network with back-propagation involves some subtleties.<br />
Deep neural networks became popular a few years ago, when effective training procedures for them were introduced by Geoffrey Hinton and his collaborators. A deep neural network training algorithm deals with training a neural network that has a large number of layers.<br />
<br />
The approach to training a deep network is greedy and layer-wise: first assume the network has only two layers and train them; after that, train the next two layers, and so on.<br />
<br />
Although we know the input and the expected output, we do not know the correct outputs of the hidden layers; this is the main issue the algorithm deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, an idea inspired by statistical physics, where the most stable configuration has the lowest energy; or somehow finding the output of the second layer that is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case for example. A feedforward neural network was trained using back-propagation to assist with the marketing control of airline seat allocations. The neural approach adapted to the booking rules, and the system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When neural networks were first introduced, they were thought to model the human brain, hence the fancy name "Neural Network". But now we know that they are just layers of logistic regressions stacked on top of each other, and have nothing to do with how the brain actually functions.<br />
<br />
We do not know why deep networks work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. Because of such claims, it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, neural networks lack a strong learning theory to back up their "success", so it is hard to apply and tune them in a principled way. For this reason, they are no longer an active research area in machine learning, although NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model will have very high accuracy on the training set but very poor ability to predict the outcomes of new instances, especially those outside the domain of the training set.<br />
<br />
In a neural network, if the network is too deep it will have many degrees of freedom and will learn every characteristic of the training data set. It will then give very accurate outcomes on the training set but will not be able to generalize the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and so it has a high error rate even on the training set.<br />
There is always a trade-off: if our model is too simple, underfitting can occur, and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model which fits it best. We can find a polynomial of high degree which passes through almost all the points in the training set. But in fact the training set comes from a line model. The problem is that although the complex model has less error on the training set, it diverges from the line in ranges where we have no training points. Because of this, the high-degree polynomial gives very poor predictive results on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a polynomial of degree two. If we model this training set with a polynomial of degree one, our model will have a high error rate on the training set and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier: just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make the model more and more complex, we will eventually reach a point where the quality of the classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas, but a new banana of slightly different shape, for example, would not be detected. This is the tradeoff: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data. For example, we can assume the model comes from a family of polynomials or from a class of neural networks, among other possibilities.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average, obtaining the empirical error rate, an estimate of the probability that <math>h(x_{i}) \neq y_{i}</math>.<br />
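In code, the empirical error rate is just the fraction of misclassified points. The toy classifier and data below are made-up examples for this sketch.<br />

```python
import numpy as np

def empirical_error_rate(h, X, y):
    """L_h = (1/n) * sum of I(h(x_i) != y_i)."""
    return np.mean([h(x) != yi for x, yi in zip(X, y)])

# Toy classifier: the sign of the first coordinate
h = lambda x: 1 if x[0] > 0 else -1
X = np.array([[1.0, 0.0], [-2.0, 1.0], [0.5, -1.0], [-0.1, 3.0]])
y = np.array([1, -1, -1, -1])
# The third point is the only one misclassified, so the rate is 1/4
rate = empirical_error_rate(h, X, y)
```
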
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
This estimate is downward biased, meaning that on average it is less than the true error rate. <br />
<br />
As we increase the complexity of the model from low to high, the training error rate always decreases. On test data, the error rate decreases up to a point but then increases, since the model has not seen this data before. This can be explained as follows: the training error decreases as we fit the model better by increasing its complexity, but, as we have seen, such a complex model will not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is defined as the point where the error rate on the test data is minimal; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in R</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a more important property for an estimator than just being unbiased: a small mean squared error. In statistics, there are problems for which it may be good to use an estimator with a small bias. In some cases, an estimator with a small bias may have a smaller mean squared error, or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under transformations, while the property of mean-unbiasedness may be lost under nonlinear transformations. For example, using an unbiased estimator with a large mean squared error to estimate the parameter puts us at high risk of a big error; in contrast, a biased estimator with a small mean squared error can improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus given the Mean Squared Error (MSE), if we have a low bias, then we will have a high variance and vice versa.<br />
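The identity <math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f})</math> can be checked numerically. The shrinkage estimator below is a made-up example of a biased estimator; the data and constants are assumptions of this sketch.<br />

```python
import numpy as np

rng = np.random.default_rng(3)
f = 2.0                                   # true parameter (the mean)
n, trials = 10, 100_000

# A deliberately biased "shrinkage" estimator of the mean: 0.8 * sample mean
samples = rng.normal(loc=f, scale=1.0, size=(trials, n))
f_hat = 0.8 * samples.mean(axis=1)

mse = np.mean((f_hat - f) ** 2)           # MSE = E[(f_hat - f)^2]
bias = np.mean(f_hat) - f                 # Bias = E[f_hat] - f
var = np.var(f_hat)                       # Variance = E[(f_hat - E[f_hat])^2]

# The decomposition MSE = Variance + Bias^2 holds (here, exactly for the
# empirical moments, up to floating-point rounding)
assert abs(mse - (var + bias ** 2)) < 1e-8
```

Here the bias is roughly <math>\,0.8 f - f = -0.4</math>, and the decomposition shows exactly how that squared bias and the estimator's variance add up to its mean squared error.<br />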
<br />
Test error is a good estimate of the MSE. We want somewhat balanced bias and variance (neither too high), even though the estimator will then have some bias.<br />
<br />
<br />
Referring to Figure 2, overfitting happens past the point where the training error (training sample line) continues to decrease while the test error (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
> x <- rnorm(200,0,1)<br />
> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
> xtest <- rnorm(50,1,1)<br />
> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
> p1 <- lm(y~x)<br />
> p2 <- lm(y ~ poly(x,2))<br />
> pn <- lm(y ~ poly(x,10))<br />
> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are drawn from a <math>\,N(0,1)</math> distribution, while the test set uses <math>\,N(1,1)</math>. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random variation. Polynomial least-squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
> # calculate the mean squared error of degree 1 poly<br />
> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
[1] 1.576042<br />
> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 7.727615<br />
: Training and test mean squared errors for the linear fit. These are both quite high - and since the data is non-linear, the different mean value of the test data increases the error quite a bit.<br />
> # calculate the mean squared error of degree 2 poly<br />
> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
[1] 0.08608467<br />
> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 0.08407432<br />
: This fit is far better - and there is not much difference between the training and test error, either.<br />
> # calculate the mean squared error of degree 10 poly<br />
> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
[1] 0.07967558<br />
> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but not by much - and the test error has risen sharply. The overfitting makes it a poor predictor. As the degree rises further, floating-point precision becomes an issue - and a good fit is not even consistently produced for the training data.<br />
> # calculate mse of sin/cos fit<br />
> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
[1] 0.1105446<br />
> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 1.320404<br />
: Fitting a function of the form <math>\,\sin(x)+\cos(x)</math> works reasonably well on the training set, but because it is not the true underlying function, it fails on test data drawn from a different range of <math>\,x</math> values.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We then estimate the empirical error rate on V: the model has never seen V, and we know the true labels of its elements, so we can count how many are misclassified<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
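The hold-out estimate described above can be sketched in a few lines of Python (an illustrative sketch; the toy classifier and data are made up):<br />

```python
def holdout_error(h, validation):
    """Empirical error rate of classifier h on a held-out validation set V.

    validation: list of (x, y) pairs whose true labels y are known.
    """
    mistakes = sum(1 for x, y in validation if h(x) != y)
    return mistakes / len(validation)

# Toy example: a threshold classifier on 1-D inputs (hypothetical data).
h = lambda x: 1 if x > 0.5 else 0
V = [(0.2, 0), (0.9, 1), (0.4, 0), (0.7, 0), (0.1, 0)]
err = holdout_error(h, V)  # one point (x = 0.7) is misclassified
```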
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although training error always decreases with increasing model complexity, the test error starts to increase from a certain point, which signals overfitting (see [[#prediction-error|figure 2]] above). Since test error is the best estimate of MSE (mean squared error), the idea is to divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and select the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{(x_i,y_i) \in \nu}I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set and <math>\,I(\cdot)</math> is the indicator function.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in reality it may be hard to collect data (and high-dimensional problems suffer from the curse of dimensionality), so a larger data set may be hard to come by. Consequently, we may not be able to afford to sacrifice part of our limited data as a held-out validation set. In this case we use another method that addresses this problem, K-fold cross-validation. We divide the data set into <math>\,K</math> subsets roughly equal in size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
If <math>\,K=n</math> (leave-one-out): low bias, high variance. Each subset contains a single element, so the model is trained with all except one point, and then validated using that point.<br />
<br />
If <math>\,K</math> is small (say 2-fold or 5-fold): higher bias, lower variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th <math>( \,k \in [ 1, K ] )</math> part, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>; the overall estimate is the average<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. Next we use the first three subsets as the training set and the last as the validation set; then the 1st, 2nd, and 4th subsets as the training set and the 3rd as the validation set, and so on, until every subset has served as the validation set once (so all observations are used for both training and validation). After we obtain <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we calculate the average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for a degree-n model and plot the estimated error against the degree. We then choose the degree corresponding to the minimum error. The same method can be used to find the optimal number of hidden units in a neural network: try 1 unit, then 2, 3, and so on, and pick the number with the lowest average error.<br />
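The degree-selection procedure above can be sketched in Python/NumPy (a sketch, not from the original notes; the quadratic data mirror the earlier R example):<br />

```python
import numpy as np

def kfold_mse(x, y, degree, K=4):
    """Average validation MSE of a degree-`degree` polynomial fit, estimated
    by K-fold cross-validation (folds taken in order for simplicity; the
    data here are already in random order)."""
    folds = np.array_split(np.arange(len(x)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[train], y[train], degree)   # fit on K-1 parts
        pred = np.polyval(coef, x[val])                 # predict on the k-th
        errs.append(np.mean((y[val] - pred) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 200)
y = x**2 - 0.5 * x + rng.normal(0, 0.3, 200)

# Pick the degree with the smallest cross-validated error.
best = min(range(1, 8), key=lambda d: kfold_mse(x, y, d))
```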
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be <math>\mathbf{y}</math> and the vector of fitted values be <math>\hat{\mathbf{y}}</math>. For a linear smoother,<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where for least-squares regression the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>.<br />
<br />
The leave-one-out prediction error then satisfies the exact identity<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}\right]^{2}</math>,<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that the trace of <math>\mathbf{H}</math> is often easier to compute than its individual diagonal entries.<br />
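Both formulas are easy to check numerically for ordinary least squares (a sketch with simulated data; the brute-force loop confirms that the shortcut with <math>\mathbf{H}_{ii}</math> reproduces exact leave-one-out refitting):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept + one feature
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=N)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
resid = y - H @ y

# Leave-one-out error via the H_ii shortcut, and its GCV approximation.
loocv = np.mean((resid / (1 - np.diag(H))) ** 2)
gcv = np.mean((resid / (1 - np.trace(H) / N)) ** 2)

# Brute force: actually refit N times, leaving one point out each time.
brute = np.mean([
    (y[i] - X[i] @ np.linalg.lstsq(X[np.arange(N) != i],
                                   y[np.arange(N) != i], rcond=None)[0]) ** 2
    for i in range(N)
])
```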
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point in the original training data set to train the model, then using the data point that was left out to estimate the true error. By repeating this process for every data point in the original data set, we obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
From the example above, we can see that k-fold cross-validation can be computationally expensive: for every candidate value of the parameter, we must train the model <math>\,K</math> times. This cost is even more obvious in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, with <math>\,n</math> the number of data points in the data set.<br />
<br />
Fortunately, when adding data points to the classifier is reversible, computing the difference between two classifiers is cheaper than computing the two classifiers separately. So, if the classifier trained on all the data points is known, we can simply undo the contribution of one data point at a time, <math>\,n</math> times, to calculate the leave-one-out cross-validation error rate.<br />
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as a way to obtain a robust neural network that is insensitive to noise. Since the number of hidden units in a neural network is usually decided by domain knowledge rather than by the data, the network can easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function is approximately linear, and the network then collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights close to zero.<br />
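The near-linearity claim is easy to verify numerically, using tanh as a stand-in for the sigmoidal activation in Figure 1 (a quick check, not part of the original notes):<br />

```python
import math

# Near zero, tanh(a) = a - a^3/3 + ... is almost the identity map, so a
# unit with tiny weights behaves linearly; far from zero it saturates
# and the unit is strongly nonlinear.
small = [0.01, 0.05, 0.1]
max_rel_dev_small = max(abs(math.tanh(a) - a) / a for a in small)
dev_large = abs(math.tanh(2.0) - 2.0)  # saturation: tanh(2) is about 0.96
```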
<br />
Formally, we penalize large weights by adding a penalty term to the error function. The regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
If <math>\,\lambda</math> is too large, the weights <math>\,w_i</math> and <math>\,u_{jk}</math> are shrunk too aggressively. We can choose <math>\,\lambda</math> by cross-validation.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the linearity; it can be set by cross-validation. Initialization also matters: starting with weights of exactly zero gives zero derivatives, so the algorithm does not move, while starting with weights that are too large means starting with a highly nonlinear model, which often leads to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning (Springer 2009) pp. 398</ref><br />
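The update rules above can be sketched directly (Python/NumPy; a sketch in which the back-propagation gradient is zeroed out so that only the shrinkage effect of the penalty is visible):<br />

```python
import numpy as np

def decay_step(w, grad_err, lam, rho):
    """One gradient step with weight decay: w <- w - rho*(dErr/dw + 2*lambda*w)."""
    return w - rho * (grad_err + 2 * lam * w)

# With the data gradient set to zero, the penalty alone shrinks every
# weight geometrically toward zero (factor 1 - 2*rho*lambda per step).
w = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    w = decay_step(w, np.zeros_like(w), lam=0.1, rho=0.1)
```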
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with an output layer and a single hidden layer, with weights from the hidden layer to the output layer, and can be trained without back propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensionality as the input data. The radii of the RBF functions may differ. The centers and radii can be determined by clustering or by an EM algorithm. When the vector x arrives from the input layer, each hidden neuron computes the radial distance from its center point and applies the RBF function to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
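These identities can be verified by finite differences (a numerical sanity check, not part of the original notes):<br />

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    """Finite-difference gradient of a scalar-valued function of a matrix X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(3, 3))

g1 = num_grad(lambda M: np.trace(A @ M), X)        # should equal A.T
g2 = num_grad(lambda M: np.trace(M.T @ A), X)      # should equal A
g3 = num_grad(lambda M: np.trace(M.T @ A @ M), X)  # should equal (A.T + A) @ X
```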
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
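A minimal sketch of this closed-form training in Python/NumPy (the one-dimensional sine target, the centers, and the width are made up for illustration; <code>np.linalg.lstsq</code> solves the same normal equations as <math>(\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math> but in a numerically safer way):<br />

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """n-by-m design matrix with Phi[i, j] = exp(-||x_i - mu_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.05, size=100)

centers = np.linspace(-2, 2, 8).reshape(-1, 1)   # 8 hidden units
Phi = rbf_design(X, centers, sigma=0.7)
W = np.linalg.lstsq(Phi, y, rcond=None)[0]       # closed-form weights
mse = np.mean((y - Phi @ W) ** 2)                # training error
```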
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the (n × k) matrix of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the (n × (M+1)) matrix of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the ((M+1) × k) matrix of weights.<br />
<br />
where the extra basis function <math>\Phi_{0}</math> is set to 1.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function. Giving the familiar form,<br /><br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit, then, can be thought of as representing a feature; if there are more hidden units than input units, we are essentially projecting into a higher-dimensional space, as we did in our earlier trick. This does not mean that an RBF network will actually do this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. With such expressive power, however, the problem of overfitting becomes all the more important: we have to control the complexity so that the network fits a general pattern rather than an arbitrary training set.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can replace the class conditional density in the above conditional probability expression by marginalizing <math>\,x</math> over <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When using an RBF network for classification, we usually treat it as a regression problem and set a threshold on the output to decide class membership. However, to gain insight into what an RBF network is doing on a classification problem, it helps to think in terms of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as shown in the graph on the right-hand side, that we have three random variables: <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that each class can trigger a different hidden random variable <math>\displaystyle j</math>. To make this concrete, assume that each <math>\displaystyle j</math> indexes a Gaussian distribution (it could be any other distribution as well), so all the <math>\displaystyle j</math>'s share the same family of distribution (Gaussian) but have different parameters. From each Gaussian distribution triggered by each class, we sample some data points. In the end, we obtain a data set that is not strictly Gaussian, but is actually a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
We multiply this by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)} * \frac {Pr(j)}{Pr(j)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, the posterior is just a sum over <math>\displaystyle j</math> of the product of two simpler posteriors, <math>\displaystyle Pr(j | x)</math> and <math>\displaystyle Pr(y_k | j)</math>.<br />
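The identity <math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math> holds exactly whenever <math>\displaystyle x</math> is conditionally independent of <math>\displaystyle y_k</math> given <math>\displaystyle j</math>, which is precisely the mixture-model assumption. A small numerical check with discrete variables (a sketch; the probability tables are random):<br />

```python
import numpy as np

rng = np.random.default_rng(4)
K, J, Xv = 3, 4, 5                               # sizes of y, j, x (all discrete)

p_y = rng.dirichlet(np.ones(K))                   # Pr(y)
p_j_given_y = rng.dirichlet(np.ones(J), size=K)   # row k: Pr(j | y_k)
p_x_given_j = rng.dirichlet(np.ones(Xv), size=J)  # row j: Pr(x | j)

# Joint under the generative model y -> j -> x (x independent of y given j).
joint = p_y[:, None, None] * p_j_given_y[:, :, None] * p_x_given_j[None, :, :]

p_x = joint.sum(axis=(0, 1))                      # Pr(x)
p_y_given_x = joint.sum(axis=1).T / p_x[:, None]  # direct Bayes: Pr(y | x)

p_j_and_x = joint.sum(axis=0)                     # Pr(j, x)
p_j_given_x = (p_j_and_x / p_x[None, :]).T        # Pr(j | x), shape (Xv, J)
p_j = p_j_and_x.sum(axis=1)                       # Pr(j)
p_y_given_j = (p_y[:, None] * p_j_given_y).T / p_j[:, None]  # Pr(y | j), (J, K)

rhs = p_j_given_x @ p_y_given_j                   # sum_j Pr(j|x) Pr(y_k|j)
```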
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results that we derived above to our RBF Network. In a RBF Network, as we can see on the right hand side, we have a set of data, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, and the hidden basis function, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and then we have some output, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. Also, we have weights from the hidden layer to output layer. The output is just the linear sum of <math>\displaystyle \phi</math>’s. <br />
<br />
Now consider probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weights <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in one dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It's as if we put some Gaussian over data. And for each Gaussian, we consider the center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
We can see the graph on the left which plots the density of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance, if the point gets far from the center <math>\displaystyle \mu_{1}</math>, then it will reduce <math>\displaystyle \phi_{1}</math> to become nearly zero. Remember that, we can usually find a non-linear regression or classification of input space by doing a linear one in some extended space or some feature space (more details in Aside). Here, the <math>\displaystyle \phi</math>’s actually produce that feature space. <br />
<br />
So, one way to look at this is that this <math>\displaystyle \phi</math> is telling us that given an input, how likely the probability of presence of a particular feature is. Say, for example, we define the features as the centers of these Gaussian distributions. Then, this <math>\displaystyle \phi</math> function somehow computes the possibility given certain data points, of this kind of feature appearing. If the data point is right at the center, then the value of that <math>\displaystyle \phi</math> would be one, i.e. the probability is 1. If the point is far from the center, then the probability (<math>\displaystyle \phi</math> function value) will be close to zero, that is, it’s less likely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given data. <br />
<br />
When we have those features, then <math>\displaystyle y</math> is the linear combination of the features. Hence, any of the weights <math>\displaystyle w</math>, which is equal to <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> will appear given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> shows the probability of class membership given feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose, we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>. Each data point has d features. And these n data points consist of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we don’t perform Least Square on this <math>\displaystyle X</math> matrix, we do Least Square on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then each <math>\displaystyle \phi_j</math> measures the similarity of a data point to its center <math>\mu_j</math>. Hence, the new features actually represent M centers in our data set: for each data point, the new features record how similar that point is to the first center, to the second center, and so on up to the <math>\displaystyle M^{th}</math> center. This applies to every data point, so the feature space gives another representation of our data set. <br />
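As a concrete illustration of this mapping, here is a minimal sketch (in Python, with made-up toy data) that builds the <math>\Phi^{T}</math> matrix from Gaussian basis functions with a shared width <math>\sigma</math>:

```python
import numpy as np

def rbf_feature_space(X, centers, sigma):
    """Return the M x n matrix Phi^T whose (j, i) entry is
    phi_j(x_i) = exp(-||x_i - mu_j||^2 / (2 sigma^2)).
    X is d x n (columns are data points); centers is d x M."""
    diff = centers[:, :, None] - X[:, None, :]   # shape (d, M, n)
    sq_dist = (diff ** 2).sum(axis=0)            # (M, n): ||x_i - mu_j||^2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# toy example: 3 points in R^2, with the first two points used as the centers
X = np.array([[0.0, 1.0, 2.0],
              [0.0, 0.0, 0.0]])
PhiT = rbf_feature_space(X, X[:, :2], sigma=1.0)
# a point that coincides with its own center has similarity exactly 1
```

Each column of `PhiT` is the new M-dimensional representation of one data point.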
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit for the training data, we often want to increase the complexity of our RBF Network. By its construction, the only way to change the complexity of an RBF Network is to add or remove basis functions: a larger number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF Network can fit any training set exactly; however, this does not mean the model generalizes well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is estimate the training error. Working through the training error, we will see that it can in fact be decomposed, and one component of the decomposition is the Mean Squared Error (MSE). In the notes that follow, our final goal is to obtain a good estimate of the MSE, and to select the model with the smallest MSE as the optimal model for our data.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn’t always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left-hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>~N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model only works for the training data, and is useless for prediction when new data points are introduced.<br />
<br />
> n<-20;<br />
> x<-seq(1,10,length=n);<br />
> alpha<-2.5;<br />
> beta<-1.75;<br />
> y<-alpha+beta*x+rnorm(n);<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');<br />
> lines(spline(x, y), col = 2);<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on training data, although our goal really is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
But training error and testing error do not have a simple monotone relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error can increase dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to the training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, the model does not represent the features of the true underlying data source, and this is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model adapts to the training data, and hence the training error is an overly optimistic estimate of the testing error. An obvious way to estimate the testing error well is to add a penalty term to the training error to compensate. SURE is developed based on this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the sum of squared errors over the <math>\displaystyle N</math> training samples.<br />
*<math>\displaystyle Err=\sum_{i=1}^m (\hat y_i-y_i)^2 </math> denote the ''test error'', the sum of squared errors over an independent test sample of size <math>\displaystyle m</math>.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations are <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}^N_{i=1}</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>, then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, where <math>\displaystyle y_i</math> and <math>\hat f_i</math> both have the same mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E(g'(Z))</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>: Let <math>\,Z = \epsilon</math>. Then <math>g(Z) = \hat f-f</math>, since <math>\,y = f + \epsilon</math>, and <math>\,f</math> is a constant. So <math>\,\mu = 0</math> and <math>\,\sigma^2</math> is the variance of <math>\,\epsilon</math>.<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model, not the function of the observations <math>\displaystyle y_i</math>, then <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
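Stein's lemma can be sanity-checked by simulation. The sketch below (Python; the choices <math>g(z)=z^2</math>, <math>\mu=1</math>, <math>\sigma=2</math> are arbitrary) compares Monte Carlo estimates of both sides of <math>E[g(Z)(Z-\mu)]=\sigma^2E(g'(Z))</math>:

```python
import numpy as np

# Monte Carlo check of Stein's lemma with g(z) = z^2, so g'(z) = 2z.
# For Z ~ N(mu, sigma^2) the lemma says E[g(Z)(Z - mu)] = sigma^2 E[g'(Z)].
rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 2.0, 1_000_000
Z = rng.normal(mu, sigma, size=n)

lhs = np.mean(Z**2 * (Z - mu))      # sample estimate of E[g(Z)(Z - mu)]
rhs = sigma**2 * np.mean(2 * Z)     # sample estimate of sigma^2 E[g'(Z)]
# for these parameters the exact common value is 8
```

Both estimates land close to the exact value <math>\sigma^2 \cdot 2\mu = 8</math>.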
<br />
====Two Different Cases====<br />
SURE in RBF,<br />
[http://www.cs.ualberta.ca/~papersdb/uploaded_files/801/paper_automatic-basis-selection-for.pdf Automatic basis selection for RBF networks using Stein’s unbiased risk estimator,Ali Ghodsi Dale Schuurmans]<br />
<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point has been introduced to the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent. Therefore <math>\displaystyle cov(y_i,\hat f)=0</math> (or think about <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, it has no influence on <math>\hat f</math>, because <math>\hat f</math> is estimated from the training data alone, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>), and <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Based on the notation we denote above, then we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross validation: since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross validation, to avoid overfitting or underfitting, the validation data set is kept independent of the estimated model.<br />
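The Case 1 identity <math>\displaystyle MSE=Err-m\sigma^2</math> can itself be checked by simulation. In the sketch below (Python; the true model and the fixed estimate <math>\hat f</math> are made up), the validation error averaged over many fresh validation sets comes out close to <math>MSE + m\sigma^2</math>:

```python
import numpy as np

# Hold the fitted model fixed and average the validation error Err over many
# fresh validation sets drawn from y = f(x) + epsilon.
rng = np.random.default_rng(2)
m, sigma = 50, 0.5
x = np.linspace(0, 1, m)
f = 2.5 + 1.75 * x                    # true model at the validation inputs
fhat = 2.3 + 1.9 * x                  # a fixed, slightly wrong estimate
mse = np.sum((fhat - f) ** 2)         # sum-form MSE used in these notes

Y = f + rng.normal(0, sigma, size=(20_000, m))   # 20000 fresh validation sets
errs = ((fhat - Y) ** 2).sum(axis=1)             # Err on each validation set
avg_err = errs.mean()
# avg_err should be close to mse + m * sigma**2
```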
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing a model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can instead be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator, in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the number of basis functions should be chosen to minimize the estimated <math>\displaystyle MSE</math>. For the Radial Basis Function Network, by setting <math>\frac{\partial err}{\partial W}</math> equal to zero, we get the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, we can tell that <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Thus, we obtain the further simplification <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=M</math>, the number of columns of <math>\displaystyle \Phi</math>, i.e. the number of basis functions. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
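The simplification <math>\,Trace(H)=M</math> used above can be verified numerically; a minimal sketch with a random (made-up) design matrix:

```python
import numpy as np

# trace(H) for the hat matrix H = Phi (Phi^T Phi)^{-1} Phi^T equals the number
# of columns of Phi, i.e. the number of basis functions M.
rng = np.random.default_rng(3)
N, M = 40, 6
Phi = rng.normal(size=(N, M))                    # N samples, M basis functions
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T     # hat matrix
trace_H = np.trace(H)
# trace_H equals M up to floating-point error
```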
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimum number of basis functions by choosing the model with the smallest MSE over the set of models considered. We are given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions, with corresponding training errors <math>\displaystyle err(M)</math>. <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples and the noise,<math>\sigma^2</math>, can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y-y)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Please see Figure 27.1 on the right, which shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math>:<br />
<br />
 err = sum((output - expected_output) .^ 2);   % training error (sum of squared residuals)<br />
 sure_Err = err - num_data_point * sigma ^ 2 + 2 * sigma ^ 2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, its square can be estimated from the training error:<br />
<br />
 err = sum((output - expected_output) .^ 2);   % training error<br />
 sigma2 = err / (num_data_point - 1);          % estimated noise variance<br />
 sure_Err = err - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) <math>= \sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) <math>= \sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units) <math>= \sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then <math>\displaystyle N\sigma^2 </math> is a constant with no impact on the comparison, and we can ignore it; we only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and the estimate <math>\,\hat \sigma</math> changes as <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math> as follows.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \hat\sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
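The algebra from (28.1) to (28.3) can be double-checked numerically; the values of <math>\displaystyle err</math>, <math>\displaystyle N</math> and <math>\displaystyle M</math> below are arbitrary:

```python
# Substituting sigma^2 = err/(N-1) into (28.1) should agree with the
# simplified form (28.3).
err, N, M = 3.7, 100, 5
sigma2 = err / (N - 1)
mse_full = err - N * sigma2 + 2 * sigma2 * (M + 1)   # form (28.1)
mse_simplified = err * (2 * M + 1) / (N - 1)         # form (28.3)
```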
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error will decrease and the MSE will increase when increasing the number of hidden units (i.e. the model is more complex).<br />
<br />
<br />
When the number of hidden units gets larger and larger, the training error will decrease until it approaches <math>\displaystyle 0 </math>. If the training error equals <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) the estimate of the MSE will approach <math>\displaystyle 0 </math> as well. In fact this should not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the MSE should increase instead of going to <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
The problem is that the estimate <math>\displaystyle \hat\sigma^2 </math> is computed from <math>\displaystyle err </math>, so it shrinks along with the training error. To deal with this, we can average the estimates of <math>\, \sigma^2</math> across models with different numbers of hidden units: for example, fit models with 1 up to 10 hidden units and average their variance estimates. Since in reality <math>\, \sigma^2</math> is a constant property of the data and does not depend on <math>\,M+1</math>, using the averaged <math>\,\sigma^2</math> value has a firm theoretical basis.<br />
<br />
We can also see that, unlike the classical Cross Validation (CV) or Leave One Out (LOO) techniques, the SURE technique does not need a validation set to find the optimal model. Hence, SURE uses less data than CV or LOO, which makes it suitable when there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering Kmeans clustering] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units is the same as the number of clusters, i.e. the number of basis functions <math>\displaystyle \phi_j </math><br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, with the same form used for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. for each existing center, we re-identify its cluster (every point in this cluster should be closer to this center than to any other).<br />
<br />
2. compute the mean of each cluster and make it the new center of that cluster.<br />
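The two steps above can be sketched directly; this is a minimal Python version of the iteration (toy data with two well-separated, made-up blobs; rows are treated as data points, matching MATLAB's kmeans rather than the <math>\Phi^T</math> column convention used earlier):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Iterate the two K-means steps on the rows of X (n points, d features).
    Returns cluster labels and the k centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # step 1: assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # step 2: move each center to the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):               # converged
            break
        centers = new_centers
    return labels, centers

# two well-separated blobs; k-means should recover them
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, centers = kmeans(X, 2)
```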
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden values)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
   30   1<br />
>> size(C) <br />
   2   80<br />
>> size(sumD) <br />
   2   1<br />
>> c1=sum(IDX==1)<br />
   14<br />
>> c2=sum(IDX==2)<br />
   16<br />
>> sumD<br />
   85.6643<br />
   101.0419<br />
>> v1=sumD(1,1)/c1 <br />
   6.1189<br />
>> v2=sumD(2,1)/c2 <br />
   6.3151<br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points in 80 dimensions (MATLAB's kmeans treats each row as one observation), and then apply the “kmeans” method to separate <math>X</math> into 2 clusters. IDX is a 30*1 vector containing 1 or 2, indicating the cluster of each point. <math>\displaystyle C </math> is the center (mean) of each cluster, with size 2*80; sumD is the sum of the squared distances between the data points and the center of their cluster. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the number of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the variance estimate for the first cluster <math>\displaystyle (v1 \approx \sigma_1^2)</math>; <math>\displaystyle v2 </math> is the variance estimate for the second cluster <math>\displaystyle (v2 \approx \sigma_2^2)</math>. Now we can get <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> from the following equations. Finally, we obtain the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is EM algorithm with respect to Gaussian mixture model. Generally speaking, the Gaussian mixture model is referred to as a soft clustering while <math>K</math>-means is hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
E-step: each point is assigned a weight (responsibility) for each cluster, based on the likelihood under the corresponding Gaussian. Unlike <math>K</math>-means, these weights are values between 0 and 1 rather than hard 0/1 assignments; a point closer to the center of a cluster simply receives a larger weight for that cluster. <br />
<br />
M-step, compute the weighted means and covariances and make them as the new means and covariances for every cluster.<br />
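The two EM steps above can be sketched as follows; this is a minimal Python version for a spherical Gaussian mixture (the function names and toy inputs are made up for illustration, and are not the mdgEM routine shown below):

```python
import numpy as np

def e_step(X, means, variances, weights):
    """Soft E-step: return the n x k matrix of responsibilities
    r[i, j] = P(cluster j | x_i); each row sums to 1."""
    n, d = X.shape
    log_r = np.empty((n, len(weights)))
    for j in range(len(weights)):
        sq = ((X - means[j]) ** 2).sum(axis=1)
        log_r[:, j] = (np.log(weights[j])
                       - 0.5 * d * np.log(2 * np.pi * variances[j])
                       - sq / (2 * variances[j]))
    log_r -= log_r.max(axis=1, keepdims=True)   # subtract max for stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

def m_step(X, r):
    """Weighted M-step: recompute means, spherical variances and mixing weights."""
    nk = r.sum(axis=0)                          # effective cluster sizes
    means = (r.T @ X) / nk[:, None]
    variances = np.array([(r[:, j] * ((X - means[j]) ** 2).sum(axis=1)).sum()
                          / (nk[j] * X.shape[1]) for j in range(len(nk))])
    return means, variances, nk / len(X)

# two points sitting exactly on the two component means
X = np.array([[0.0, 0.0], [5.0, 5.0]])
r = e_step(X, means=np.array([[0.0, 0.0], [5.0, 5.0]]),
           variances=np.array([1.0, 1.0]), weights=np.array([0.5, 0.5]))
```

Each row of `r` is a soft assignment; a point sitting on a component mean gets a responsibility close to 1 for that component.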
<br />
>> [P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier, which constructs linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized into what is known as the support vector machine. It produces nonlinear boundaries by constructing a linear boundary in a high-dimensional, transformed version of the feature space.<br />
<br />
The original basis for the SVM was published in the 1960s by [http://en.wikipedia.org/wiki/Vapnik Vapnik], Chervonenkis and colleagues; however, the ideas did not gain much attention until strong results were shown in the early 1990s.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a maximum-margin hyperplane or set of hyperplanes in a high- or infinite-dimensional space. The set of points near the class boundaries, the support vectors, define the model, which can be used for classification, regression, or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separated by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
We can see in the figure that the data points belong to two classes in <math>\displaystyle \mathbb{R}^{2} </math> and can be separated by a linear boundary. Suppose a dataset is indeed linearly separable; then there exist infinitely many possible separating hyperplanes for the training data, including the two black lines in the figure. However, which solution is best when new data are introduced? <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: since distance is positive, if the data is on <math>\displaystyle +1 </math> side the distance is <math>\displaystyle d_i(+1)</math>. If the data is on the <math>\displaystyle -1 </math> side the distance is <math>\displaystyle d_i(-1)</math>.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>; we can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers in the Linearly separable case====<br />
Choose the line farthest from both classes, i.e. the line with the maximum distance from the closest point (maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since we are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Let us rewrite <math>\displaystyle Margin=min\{y_id_i\}</math> by using the following properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Take two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle (\beta^{T}x_1+\beta_0)-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence,<math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and<math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
For any point <math>\displaystyle x_0 </math> on the hyperplane, <math>\displaystyle \beta^{T}x_0</math> gives the negative of the intercept of the hyperplane. <br/><br />
<br />
<br />
3. The signed distance of any point <math>\displaystyle x_i </math> to the hyperplane is the projection of <math>\displaystyle x_i-x_0 </math> onto the unit normal <math>\displaystyle \frac{\beta}{\|\beta\|} </math>:<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
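The distance formula just derived can be written directly in code; a small sketch (Python, with a made-up hyperplane in <math>\mathbb{R}^{2} </math>):

```python
import numpy as np

def signed_distance(X, beta, beta0):
    """Signed distance d_i = (beta^T x_i + beta_0) / ||beta||
    from each row of X to the hyperplane beta^T x + beta_0 = 0."""
    return (X @ beta + beta0) / np.linalg.norm(beta)

# hyperplane x1 + x2 - 1 = 0
beta, beta0 = np.array([1.0, 1.0]), -1.0
X = np.array([[0.5, 0.5],    # on the hyperplane
              [1.0, 1.0],    # on the +1 side
              [0.0, 0.0]])   # on the -1 side
d = signed_distance(X, beta, beta0)
```

The sign of each entry of `d` tells us which side of the hyperplane the point lies on.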
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
We had <math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math>, and since we now know how to compute <math>\displaystyle d_i \Rightarrow</math> <br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane; then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta^{T} </math> only the direction is important, so dividing by <math>\displaystyle c </math> does not change the direction of <math>\displaystyle \frac{\beta^{T}}{c} </math>, and the hyperplane will be the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math><br />
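As a quick numeric check of the canonical scaling, the following Python sketch (toy 1-D data, hypothetical values) rescales <math>\displaystyle \beta, \beta_0</math> so the closest points sit at <math>\displaystyle y_i(\beta^{T}x_i+\beta_0)=1</math>, after which the margin is <math>\displaystyle 1/\|\beta\|</math>:

```python
# 1-D toy data (x_i, y_i); hypothetical hyperplane beta*x + beta0 = 0.
points = [(-2.0, -1), (-1.0, -1), (1.0, 1), (3.0, 1)]
beta, beta0 = 2.0, 0.0

# Canonical scaling: divide by c = min_i y_i*(beta*x_i + beta0).
c = min(y * (beta * x + beta0) for x, y in points)  # = 2.0 here
beta, beta0 = beta / c, beta0 / c

# Now the closest points satisfy y_i*(beta*x_i + beta0) = 1 exactly,
# and the margin is 1/|beta|.
margin = 1.0 / abs(beta)
print(margin)  # 1.0: the distance from x = -1 and x = +1 to the hyperplane x = 0
```

Note that rescaling changed <math>\displaystyle \beta</math> but not the hyperplane itself (still <math>\displaystyle x=0</math>), which is the point of the canonical representation.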
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie, T., Tibshirani, R., Friedman, J. (2008). The Elements of Statistical Learning, pp. 129-130.<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
We now derive the Support Vector Machine for the case where the two classes are separable in the given feature space. The margin can be written as <math>\,min\{y_id_i\}</math>, where <math>\,d_i</math> is the distance of point <math>\,x_i</math> from the hyperplane and <math>\,y_i</math> supplies the sign.<br />
===Margin Maximizing Problem for the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, but <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane.<br />
<br />
This implies <math>\,\exists C>0</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
The pair <math>\,\beta, \beta_0</math> defines the hyperplane only up to scaling; since we only care about the direction, dividing through by a constant does not change the hyperplane. Thus, by rescaling <math>\,\beta, \beta_0</math> we absorb C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>, implying that the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math><br />
<br />
Now, in order to maximize the margin, we need to maximize <math>\,\frac{1}{\|\beta\|}</math>, i.e., minimize <math>\,\|\beta\|</math>. <br />
<br />
In other words, our optimization problem is now to minimize <math>\,\|\beta\|</math>, under the constraint that <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. There are many possible choices of norm, in general the [http://en.wikipedia.org/wiki/P-norm#p-norm p-norm]. The 1-norm of a vector is simply the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance); it is sometimes preferred, but has a discontinuity in its derivative. The 2-norm, or Euclidean norm (the intuitive measure of the length of a vector), is easier to work with: <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>, where the constant 1/2 has been added to simplify the derivative; minimizing the squared norm is equivalent to minimizing the norm itself.<br />
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the constraints are satisfied while finding an optimal solution (the optimal saddle point of the Lagrangian for this classic quadratic optimization). The problem will be solved in dual space by introducing the dual variables <math>\,\alpha_i</math>, in contrast to solving it in primal space as a function of <math>\,\beta</math>. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left[y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right]}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is the dual representation of the maximum-margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \ \forall\, i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
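To make the dual objective concrete, the following Python sketch evaluates <math>\,L(\alpha)</math> on a toy data set of two one-dimensional points; the <math>\,\alpha</math> values are hypothetical but satisfy both constraints (and in fact are optimal for this tiny set):

```python
# Two 1-D points, one per class; alpha = (0.5, 0.5) satisfies both
# constraints: alpha_i >= 0 and sum_i alpha_i * y_i = 0.
x = [-1.0, 1.0]
y = [-1, 1]
alpha = [0.5, 0.5]
n = len(x)

# L(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i x_j
L = sum(alpha) - 0.5 * sum(
    alpha[i] * alpha[j] * y[i] * y[j] * x[i] * x[j]
    for i in range(n) for j in range(n)
)
print(L)  # 0.5
```

For these values <math>\,\beta = \sum_i \alpha_i y_i x_i = 1</math>, giving margin <math>\,1/\|\beta\| = 1</math>, consistent with the geometry of the two points.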
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the equation that we need to minimize, as well as two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We solve this optimization for <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> has entries <math>\,S_{ij} = y_iy_jx_i^Tx_j = (y_ix_i)^T(y_jx_j)</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's built in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use <code>quadprog</code> to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's <code>quadprog</code> function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. Note that <code>quadprog</code> minimizes, while our dual is a maximization; negating the objective, maximizing <math>\,L(\alpha)</math> is the same as minimizing <math>\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{1}^T\underline{\alpha}</math>, so we pass <math>\,H = S</math> and <math>\,f = -\underline{1}</math>.<br />
<br />
We'll use a simple one-dimensional data set: each input <math>\,x_i</math> is drawn near -1 or +1 with Gaussian noise, and <math>\,y_i</math> is the corresponding class label. (Note: you could put the values straight into the <code>quadprog</code> call; they are separated for clarity.)<br />
<br />
 x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]'; % 1 x 200 inputs<br />
 y = [-ones(100,1); ones(100,1)];                        % 200 x 1 labels<br />
 z = x' .* y;          % z_i = y_i * x_i<br />
 S = z * z';           % S(i,j) = y_i * y_j * x_i * x_j<br />
 f = -ones(200,1);     % quadprog minimizes, so the linear term is negated<br />
 Aeq = y';             % equality constraint: sum(alpha_i * y_i) = 0<br />
 beq = 0;<br />
 lb = zeros(200,1);    % alpha_i >= 0, one bound per variable<br />
 ub = [];              % there is no upper bound<br />
 alpha = quadprog(S,f,[],[],Aeq,beq,lb,ub); % no inequality constraints needed<br />
<br />
This gives us the optimal <math>\,\alpha</math>. At the optimum most entries of <math>\,\alpha</math> are (numerically) zero; the nonzero entries correspond to the support vectors. If negative values appear, check that <code>f</code> was negated and that <code>lb</code> was supplied as a vector of zeros rather than a scalar.<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{i}</math>. If <math>\,f</math> and <math>\,g_i</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial x}\Big|_{\hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum{\alpha_ig_i'(\hat{x})}=0</math> (Stationarity)<br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> cannot be a local minimum.<br />
<br />
These are all straightforward except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that actually determine the optimal separating hyperplane we are looking for; they are the points that are most difficult to classify, and the most informative ones for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br />In order for this condition to be satisfied either <br/><math>\,\alpha_i= 0</math> or <br/><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
In the canonical representation, every point satisfies <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1</math>: each point is either exactly on the margin or strictly beyond it.<br />
<br />
'''Case 1: a point with <math>\,y_i(\beta^Tx_i+\beta_0) > 1</math> (off the margin)'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point with <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math> (on the margin)'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
<br />If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
Points on the margin, with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
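The role of the support vectors can be illustrated with a tiny Python sketch. The <math>\,\alpha_i</math> values below are the ones a QP solver would return for this hypothetical 2-D toy set (the two margin points carry all the weight):

```python
# (x_i, y_i, alpha_i) triples for a 2-D toy set; alphas are the
# hypothetical solved values (the third point is far from the margin).
data = [((-1.0, 0.0), -1, 0.5), ((1.0, 0.0), 1, 0.5), ((3.0, 3.0), 1, 0.0)]

# beta = sum_i alpha_i * y_i * x_i depends only on points with alpha_i > 0.
beta = [sum(a * y * x[d] for x, y, a in data) for d in range(2)]
support = [x for x, y, a in data if a > 0]
print(beta)     # [1.0, 0.0]
print(support)  # [(-1.0, 0.0), (1.0, 0.0)], the margin points
```

Deleting the point at (3, 3) would leave <math>\,\beta</math> unchanged, which is exactly what it means for the solution to be defined by the support vectors alone.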
<br />
===Using support vectors===<br />
<br />
Support vectors are important because they make the solution sparse: if <math>\,\alpha_i = 0</math>, the point <math>\,x_i</math> contributes nothing to <math>\,\beta</math>; only points on the margin, the support vectors, contribute to the solution of the SVM problem. Hence the model given by SVM is entirely defined by the set of support vectors, a subset of the entire training set. This is interesting because in neural network methods (and classical statistical learning more generally) the configuration of the network must be specified in advance, whereas here we have a data-driven or 'nonparametric' model in which the training set and the algorithm determine the support vectors.<br />
<br />
References:<br />
Wang, L. (2005). Support Vector Machines: Theory and Applications. Springer, p. 3.<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem:<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
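As a sanity check, the three steps above can be carried out by hand for two one-dimensional points, one per class, in a Python sketch (here the dual is solved analytically rather than with <code>quadprog</code>; the data values are hypothetical). With <math>\,y_1=-1, y_2=+1</math>, the equality constraint forces <math>\,\alpha_1=\alpha_2=a</math>, so <math>\,L(a)=2a-\frac{1}{2}a^2(x_1-x_2)^2</math>, maximized at <math>\,a=2/(x_1-x_2)^2</math>:

```python
# Step 1: solve the QP; for one point per class it has a closed form.
x1, x2 = -1.0, 1.0            # x1 in class -1, x2 in class +1
a = 2.0 / (x1 - x2) ** 2      # maximizer of L(a) = 2a - 0.5*a^2*(x1-x2)^2

# Step 2: beta = sum_i alpha_i * y_i * x_i
beta = a * (-1) * x1 + a * (+1) * x2

# Step 3: beta_0 from the support-vector condition y_2*(beta*x2 + beta0) = 1
beta0 = 1.0 - beta * x2
print(beta, beta0)  # 1.0 0.0
```

The resulting hyperplane <math>\,x=0</math> sits midway between the two points, as expected for the maximum-margin solution.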
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab built-in SVM routines (found in the Bioinformatics toolkit) to do classification through support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]<br />
=='''Non-linear hypersurfaces and Non-Separable classes - November 20, 2009'''==<br />
==='''Kernel Trick'''===<br />
We have talked about the curse of dimensionality at the beginning of this course; now we turn to the power of high dimensions in order to find a linearly separating hyperplane between two classes of data points. To understand this, imagine a two-dimensional prison confining a two-dimensional person. If we magically give the person a third dimension, they can escape from the prison: the prison and the person are now linearly separable with respect to the third dimension. The intuition behind the "kernel trick" is to map data to a higher dimension so that they become linearly separable by a hyperplane.<br />
<br />
We have seen SVM as a linear classification problem finding the max margin hyperplane in the given input space. However, many real world problems require a more complex decision boundary. The following simple method was devised in order to solve the same linear classification problem, but in a (usually higher-dimensional) 'feature space' in which the max margin hyperplane is better suited.<br />
<br />
Let <math>\,\phi</math> be a mapping,<br />
<br />
<math>\phi:\Re^d \rightarrow \Re^D </math><br /><br /><br />
<br />
We wish to find a <math>\,\phi</math> such that our data will be suited for separation by a hyperplane. Given this function, we are led to solving the previous constrained quadratic optimization on the transformed dataset,<br /><br /><br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br /><br /><br />
<br />
The solution to this optimization problem is now well known; however a workable <math>\,\phi</math> must be determined. Possibly the largest drawback of this method is that we must compute the inner product of two vectors in the high dimensional feature space. As the dimension of the feature space increases, the inner product becomes computationally intensive or impossible to compute directly.<br />
<br />
However, we have a very useful result that says that there exists a class of functions, <math>\,\Phi</math>, which satisfy the above requirements and that for any function <math>\,\phi \in \Phi</math>,<br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j) </math><br /><br /><br />
<br />
Where K is the kernel function in the input space satisfying [http://en.wikipedia.org/wiki/Mercer%27s_condition Mercer's condition] (to guarantee that it indeed corresponds to certain mapping function <math>\,\phi</math>). As a result, if the objective function depends on inner products but not on coordinates, we can always use the kernel function to implicitly calculate in the feature space without storing the huge data. Not only does this solve the computation problems but it no longer requires us to explicitly determine a specific mapping function in order to use this method. In fact, it is now possible to use an infinite dimensional feature space in SVM without even knowing the function <math>\,\phi</math>.<br />
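This identity can be verified directly for a small kernel. The Python sketch below checks that the polynomial kernel <math>\,k(x,y)=(x \cdot y)^2</math> on <math>\Re^2</math> agrees with the inner product under the explicit feature map <math>\,\phi(x)=(x_1^2, \sqrt{2}x_1x_2, x_2^2)</math> (the test points are arbitrary):

```python
import math

# Polynomial kernel k(x,y) = (x.y)^2, computed entirely in the input space.
def k(x, y):
    return (x[0] * y[0] + x[1] * y[1]) ** 2

# The corresponding explicit feature map into R^3.
def phi(x):
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, 0.5)
lhs = k(x, y)                                      # input-space computation
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))   # feature-space computation
print(lhs, rhs)  # both 16 (up to rounding)
```

The kernel side never materializes <math>\,\phi(x)</math>, which is the whole computational point: the same trick works even when the feature space is infinite dimensional.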
<br />
==='''Mercer's Theorem in detail'''===<br />
Let <math>\,\phi</math> be a mapping to a high dimensional [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] <math>\,H</math><br /><br />
<br />
<br />
<math>\phi:x \in \Re^d \rightarrow H </math><br /><br /><br />
<br />
The transformed coordinates can be defined as,<br /><br />
<br />
<math>\phi_1(x)\dots\phi_d(x)\dots </math><br /><br /><br />
<br />
By Hilbert - Schmidt theory we can represent an inner product in Hilbert space as,<br /><br /><br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = \sum_{r=1}^{\infty}a_r\phi_r(x_i)\phi_r(x_j) = K(x_i,x_j), \ a_r \ge 0 </math><br /><br /><br />
where K is symmetric. Mercer's theorem gives necessary and sufficient conditions on K for it to satisfy the above relation.<br><br><br />
<br />
'''Mercer's Theorem'''<br />
<br />
Let C be a compact subset of <math>\Re^d</math> and <math> K \in L^2(C \times C) </math> a symmetric function; if<br /><br /><br />
<br />
<math>\, \int_C\int_C K(u,v)g(u)g(v)dudv \ge 0, \ \forall g \in L^2(C)</math> <br /><br /><br />
<br />
then,<br /><br /><br />
<br />
<math>\sum_{r=1}^{\infty}a_r\phi_r(u)\phi_r(v)</math> converges absolutely and uniformly to a symmetric function <math>\,K(u,v)</math><br />
<br />
References:<br />
Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons, p. 423.<br />
<br />
==='''Kernel Functions'''===<br />
There are various kernel functions, for example:<br />
<br />
* Linear kernel: <math>\,k(x,y)=x \cdot y</math><br />
* Polynomial kernel: <math>\,k(x,y)=(x \cdot y)^d</math><br />
* Gaussian kernel: <math>\,k(x,y)=e^{-\frac{|x-y|^2}{2\sigma^2}}</math><br />
<br />
If <math>\,X</math> is a <math>\,d \times n</math> matrix in the original space, and <math>\,\phi(X)</math> is a <math>\,D \times n</math> matrix in the [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] (good explanation video: [http://www.youtube.com/watch?v=V2pBdH7YzX0 part 1] [http://www.youtube.com/watch?v=YRY5xlk3TC0 part 2]), then <math>\,\phi^T(X) \cdot \phi(X)</math> is an <math>\,n \times n</math> matrix. <br />
The inner product can be interpreted as a measure of correlation, or similarity, between data points. This gives us some insight into how to choose the kernel: the choice depends on prior knowledge of the problem and on how we believe the similarity of our data should be measured. In practice, the Gaussian (RBF) kernel usually works best. Besides the most common kernel functions mentioned above, many novel kernels have been suggested for particular problem domains such as text classification, gene classification and so on.<br />
<br />
These kernel functions can be applied to many algorithms to derive the "kernel version". For example, kernel PCA, kernel LDA, etc..<br />
<br />
==='''SVM: non-separable case'''===<br />
We have seen how SVMs find an optimal separating hyperplane between two separable classes of data, in which case the margin contains no data points. However, in the real world, data from different classes are usually mixed together near the boundary and it is hard to find a perfect boundary that totally separates them. To address this problem, we relax the classification rule to allow data points to cross the margin. Mathematically the problem becomes,<br />
:<math>\min_{\beta, \beta_0} \frac{1}{2}|\beta|^2</math><br />
:<math>\,y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math><br />
:<math>\xi_i \geq 0</math><br />
<br />
Now each data point can have some error <math>\,\xi_i</math>. However, we only want data to cross the boundary when they have to and make the minimum sacrifice; thus, a penalty term is added correspondingly in the objective function to constrain the number of points that cross the margin. The optimization problem now becomes:<br />
<br />
:<math>\min_{\beta,\beta_0,\xi} \frac{1}{2}|\beta|^2+\gamma\sum_{i=1}^n{\xi_i}</math><br />
:<math>\,s.t.</math> <math>y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math> <br />
:<math>\xi_i \geq 0</math><br />
<br />
[[File:non-separable.JPG|350px|thumb|right|Figure non-separable case]]<br />
<br />
<br />Note that <math>\,\xi_i</math> is not necessarily smaller than one, which means data points can not only enter the margin but can also cross the separating hyperplane.<br />
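The slack variables have a simple closed form once the hyperplane is fixed: <math>\,\xi_i = \max(0,\ 1 - y_i(\beta^Tx_i+\beta_0))</math>. A small Python sketch with a hypothetical 1-D hyperplane shows the three regimes:

```python
# Hypothetical 1-D hyperplane beta*x + beta0 = 0, with beta = 1, beta0 = 0.
beta, beta0 = 1.0, 0.0
pts = [(2.0, 1), (0.5, 1), (-0.3, 1)]  # (x_i, y_i), all labelled +1

# xi_i = max(0, 1 - y_i*(beta*x_i + beta0))
xi = [max(0.0, 1.0 - y * (beta * x + beta0)) for x, y in pts]
print(xi)  # [0.0, 0.5, 1.3]
# xi = 0: outside or on the margin (x = 2.0)
# 0 < xi <= 1: inside the margin but correctly classified (x = 0.5)
# xi > 1: the point has crossed the separating hyperplane (x = -0.3)
```

The penalty term <math>\,\gamma\sum_i\xi_i</math> charges the objective for exactly these violations.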
<br />
References:<br />
<br />
Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209: 415-446.<br />
<br />
==Support Vector Machine algorithm for non-separable cases - November 23, 2009==<br />
<br />
With the program formulation above, we can form the lagrangian, apply KKT conditions, and come up with a new function to optimize. As we will see, the equation that we will attempt to optimize in the SVM algorithm for non-separable data sets is the same as the optimization for the separable case, with slightly different conditions.<br />
<br />
===Forming the Lagrangian===<br />
<br />
:<math>L = \frac{1}{2} |\beta|^2 + \gamma \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]-\sum_{i=1}^n \lambda_i \xi_i</math><br />
:<math>\alpha_i \geq 0, \lambda_i \geq 0</math><br />
<br />
===Applying KKT conditions===<br />
# <math>\frac{\partial L}{\partial \beta}=\beta - \sum_{i=1}^n \alpha_i y_i x_i = 0 \Rightarrow \beta=\sum_{i=1}^n\alpha_i y_i x_i</math> <br /><math>\frac{\partial L}{\partial \beta_0}=-\sum_{i=1}^n \alpha_i y_i =0 \Rightarrow \sum_{i=1}^n \alpha_i y_i =0</math> since the sign does not make a difference<br />
#<math>\frac{\partial L}{\partial \xi_i}=\gamma - \alpha_i - \lambda_i \Rightarrow \gamma = \alpha_i+\lambda_i</math><br />
#<math>\,\alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]=0</math> and <math>\,\lambda_i \xi_i=0</math><br />
<br /> Similar to the separable case, after applying the KKT conditions we substitute the primal variables, expressed in terms of the dual variables, back into the Lagrangian and simplify.<br />
<br />
===Putting it all together===<br />
<br />
With our KKT conditions and the Lagrangian equation, we can now use quadratic programming to find <math>\,\alpha</math>.<br />
<br />
In matrix form, we want to solve the following optimization:<br />
:<math>\max_{\alpha} L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:<math>\,s.t.</math> <math>\underline{0} \leq \underline{\alpha} \leq \gamma</math>, <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Solving this gives us <math>\,\underline{\alpha}</math>, which we can use to find <math>\,\underline{\beta}</math> as before:<br />
:<math>\,\underline{\beta} = \sum{\alpha_i y_i \underline{x_i}}</math><br />
<br />
However, we cannot find <math>\,\beta_0</math> in the same way as before, even if we choose a point with <math>\,\alpha_i > 0</math>, because we do not know the value of <math>\,\xi_i</math> in the equation<br />
:<math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 + \xi_i = 0</math><br />
<br />
From our discussion on the KKT conditions, we know that <math>\,\lambda_i \xi_i = 0</math> and <math>\,\gamma = \alpha_i + \lambda_i</math>.<br />
<br />
So, if <math>\,\alpha_i < \gamma</math> then <math>\,\lambda_i > 0</math> and consequently <math>\,\xi_i = 0</math>.<br />
<br />
Therefore, we can solve for <math>\,\beta_0</math> if we choose a point where:<br />
:<math>\,0 < \alpha_i < \gamma</math><br />
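The selection rule can be sketched in Python with hypothetical solved <math>\,\alpha</math> values for a 1-D toy set (they satisfy <math>\,\sum_i\alpha_iy_i=0</math>; here no <math>\,\alpha_i</math> reaches the bound <math>\,\gamma</math>, which would mark a margin violator):

```python
# Hypothetical solved alphas for 1-D data: (x_i, y_i, alpha_i), gamma = 10.
gamma = 10.0
sols = [(-1.0, -1, 0.5), (1.0, 1, 0.5), (3.0, 1, 0.0)]

# beta = sum_i alpha_i * y_i * x_i
beta = sum(a * y * x for x, y, a in sols)

# Pick any point with 0 < alpha_i < gamma (then xi_i = 0) and solve
# y_i*(beta*x_i + beta_0) = 1 for beta_0.
x_m, y_m = next((x, y) for x, y, a in sols if 0.0 < a < gamma)
beta0 = 1.0 / y_m - beta * x_m
print(beta, beta0)  # 1.0 0.0
```

Points with <math>\,\alpha_i = \gamma</math> must be avoided here because their slack <math>\,\xi_i</math> may be positive, making the margin equation unusable for recovering <math>\,\beta_0</math>.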
<br />
====The SVM algorithm for non-separable data sets====<br />
<br />
The algorithm, then, for non-separable data sets is:<br />
<br />
# Use <code>quadprog</code> (or another quadratic programming technique) to solve the above optimization and find <math>\,\alpha</math><br />
# Find <math>\,\underline{\beta}</math> by solving <math>\,\underline{\beta} = \sum{\alpha_i y_i x_i}</math><br />
# Find <math>\,\beta_0</math> by choosing a point where <math>\,0 < \alpha_i < \gamma</math> and then solving <math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 = 0</math></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=5509stat8412009-11-23T23:55:18Z<p>Ipargaru: /* Support Vector Machine algorithm for non-separable cases - November 23, 2009 */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to learn from data. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
'''Definition''': The problem of predicting a discrete random variable <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math> is called classification.<br />
<br />
In classification, we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, drawn independently from identical distributions (i.i.d.), <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>, <math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X{_\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> be the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
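The empirical error rate is straightforward to compute; here is a short Python sketch on a made-up rule and toy training set (both hypothetical):

```python
# Empirical error rate: the fraction of training points that h misclassifies.
def empirical_error(h, data):
    return sum(1 for x, y in data if h(x) != y) / len(data)

# A hypothetical classification rule on 1-D inputs, and five labelled points.
h = lambda x: 1 if x > 0 else 0
data = [(-2.0, 0), (-1.0, 0), (0.5, 1), (1.0, 1), (-0.5, 1)]
print(empirical_error(h, data))  # 0.2, since one of five points is misclassified
```

Note that this is only an estimate of the true error rate <math>\, L(h)</math>, computed on the training sample rather than over the whole distribution.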
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the probability <math>\,r(x)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<sub></sub><br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimension cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal with respect to the true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remark:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since in practice the prior probability is usually unknown and cannot be determined.<br />
<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
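The arithmetic above can be checked mechanically. In this Python sketch (illustrative, not part of the notes), the class-conditional probabilities <code>likelihood_1 = 0.05</code> and <code>likelihood_0 = 0.2</code> are inferred from the displayed computation (<math>\,0.05 \times 0.5 = 0.025</math> and <math>\,0.025 + 0.2 \times 0.5 = 0.125</math>), since the table itself is only available as an image:

```python
def posterior(likelihood_1, likelihood_0, prior_1=0.5, prior_0=0.5):
    """Bayes formula for the two-class posterior r(x) = P(Y=1 | X=x)."""
    numerator = likelihood_1 * prior_1
    evidence = likelihood_1 * prior_1 + likelihood_0 * prior_0
    return numerator / evidence

# Likelihoods implied by the computation above (an assumption; the table is an image).
r = posterior(likelihood_1=0.05, likelihood_0=0.2)
print(r)  # 0.2 < 1/2, so the student is classified as "fail" (class 0)
```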
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN) and the general Bayesian network (GBN).<br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first one considers probability as changing based on observation, while the second one considers probability as having an objective existence. In fact, they represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics, there have been two major schools of thought: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a unique event that cannot be associated with a frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables with a given distribution, and probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can make a statement about tomorrow's weather, such as assigning a <math>\,50\%</math> probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In Bayesian method, at first, one can see this man (object), and then judge whether his name is Jack (label). On the other hand, in Frequentist method, one doesn’t see the man (object), but can see the photos (label) of this man to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>; then the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier is the optimal method, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2 Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density estimation, estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of this third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, the decision boundary is always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math><br />
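The coefficients of the linear boundary <math>\,ax+b=0</math> can be read off the derivation above. The following Python sketch (illustrative, not part of the lecture) computes <math>\,a = \Sigma^{-1}(\mu_k-\mu_l)</math> and <math>\,b = \log(\pi_k/\pi_l) - \tfrac{1}{2}(\mu_k^\top\Sigma^{-1}\mu_k - \mu_l^\top\Sigma^{-1}\mu_l)</math> for a pair of classes and verifies the midpoint property for equal priors:

```python
import numpy as np

def lda_boundary(mu_k, mu_l, sigma, pi_k, pi_l):
    """Coefficients (a, b) of the LDA decision boundary a.x + b = 0."""
    sigma_inv = np.linalg.inv(sigma)
    a = sigma_inv @ (mu_k - mu_l)
    b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ sigma_inv @ mu_k
                                     - mu_l @ sigma_inv @ mu_l)
    return a, b

# Equal priors: the boundary passes through the midpoint of the two means.
mu_k, mu_l = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
a, b = lda_boundary(mu_k, mu_l, np.eye(2), 0.5, 0.5)
midpoint = (mu_k + mu_l) / 2
print(a @ midpoint + b)  # 0.0 -- the midpoint lies on the boundary
```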
<br />
===QDA===<br />
The concept is the same idea of finding a boundary where the error rate for classification between classes are equal, except the assumption that each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math> is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because, with unequal covariance matrices, the quadratic terms <math>\,x^\top\Sigma_k^{-1}x</math> and <math>\,x^\top\Sigma_l^{-1}x</math> no longer cancel.<br />
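As a concrete illustration (a Python sketch, not from the notes), we can classify a point by evaluating each class's Gaussian discriminant <math>\,\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log\pi_k</math>, which is equivalent to comparing <math>\,f_k(x)\pi_k</math> as above, and picking the largest:

```python
import numpy as np

def qda_discriminant(x, mu, sigma, pi):
    """delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x-mu_k)' Sigma_k^{-1} (x-mu_k) + log pi_k."""
    diff = x - mu
    sign, logdet = np.linalg.slogdet(sigma)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.inv(sigma) @ diff + np.log(pi)

def classify(x, mus, sigmas, pis):
    """Assign x to the class whose discriminant is largest."""
    scores = [qda_discriminant(x, m, s, p) for m, s, p in zip(mus, sigmas, pis)]
    return int(np.argmax(scores))

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 4.0 * np.eye(2)]       # unequal covariances -> QDA
pis = [0.5, 0.5]
print(classify(np.array([0.5, 0.5]), mus, sigmas, pis))  # 0: closer to the first mean
```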
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
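A Python sketch of these sample estimates (illustrative; the notes use Matlab elsewhere):

```python
import numpy as np

def estimate_parameters(X, y):
    """Sample estimates of pi_k, mu_k, Sigma_k for each class k."""
    n = len(y)
    params = {}
    for k in np.unique(y):
        X_k = X[y == k]
        n_k = len(X_k)
        pi_hat = n_k / n
        mu_hat = X_k.mean(axis=0)
        centered = X_k - mu_hat
        sigma_hat = centered.T @ centered / n_k   # ML estimate, divides by n_k
        params[int(k)] = (pi_hat, mu_hat, sigma_hat)
    return params

X = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [7.0, 5.0]])
y = np.array([0, 0, 1, 1])
params = estimate_parameters(X, y)
print(params[0][0])   # pi_hat for class 0 -> 0.5
print(params[0][1])   # mu_hat for class 0 -> [1. 0.]
```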
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\,\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
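With <math>\, \Sigma_k = I </math> the rule therefore reduces to comparing squared Euclidean distances to the class centers, adjusted by the log priors. A small Python sketch (illustrative, not from the notes) shows how the prior can move the decision:

```python
import numpy as np

def classify_identity_cov(x, mus, pis):
    """With Sigma_k = I, maximize delta_k = -1/2 ||x - mu_k||^2 + log(pi_k)."""
    scores = [-0.5 * np.sum((x - mu) ** 2) + np.log(pi)
              for mu, pi in zip(mus, pis)]
    return int(np.argmax(scores))

mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]

# With equal priors, a point slightly nearer the first mean goes to class 0 ...
print(classify_identity_cov(np.array([1.9, 0.0]), mus, [0.5, 0.5]))   # 0
# ... but a strong prior on class 1 moves the boundary past it.
print(classify_identity_cov(np.array([1.9, 0.0]), mus, [0.1, 0.9]))   # 1
```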
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
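The transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be sketched in Python (illustrative, not part of the lecture); it whitens the data so that its covariance becomes the identity, reducing the problem to Case 1:

```python
import numpy as np

def sphering_transform(sigma):
    """Return the matrix W = S^{-1/2} U^T that whitens data with covariance sigma."""
    # For symmetric sigma, the SVD U S U^T coincides with the eigendecomposition.
    s, U = np.linalg.eigh(sigma)
    W = np.diag(1.0 / np.sqrt(s)) @ U.T
    return W

sigma = np.array([[4.0, 0.0], [0.0, 1.0]])
W = sphering_transform(sigma)
# After the transform the covariance becomes the identity:
print(W @ sigma @ W.T)   # ~ [[1, 0], [0, 1]]
```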
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for this method of classification to be applicable. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Consider two classes with different shapes, and consider transforming them to the same shape. Given a data point, which transformation should be used to decide which class the point belongs to? For example, if you use the transformation of class A, then you have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class against the remaining <math>\,K-1</math> classes, in total there are <math>\,K-1</math> differences. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
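The two counts (per pairwise difference, i.e. in units of <math>\,K-1</math>) can be compared directly; a small Python sketch:

```python
def lda_params(d):
    """Parameters per class-pair boundary for LDA: a'x + b."""
    return d + 1

def qda_params(d):
    """Parameters per class-pair boundary for QDA: x'ax + b'x + c."""
    return d * (d + 3) // 2 + 1

for d in (2, 10, 64):
    print(d, lda_params(d), qda_params(d))
# d=2: 3 vs 6;  d=10: 11 vs 66;  d=64: 65 vs 2145
```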
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function, but here we will analyze its code to see how it differs from the SVD method. The following is the code of <code>princomp</code> with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in<br />
% SCORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
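The same equivalence can be checked in Python (a numpy sketch mirroring the Matlab comparison above; the two score matrices agree up to the arbitrary sign of each component):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))        # rows = observations, columns = variables

Xc = X - X.mean(axis=0)                 # princomp centers by column means

# Method 1: SVD of the centered data (what princomp does internally).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = Xc @ Vt.T

# Method 2: eigenvectors of the sample covariance matrix.
evals, evecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
order = np.argsort(evals)[::-1]         # eigh returns ascending eigenvalues
scores_eig = Xc @ evecs[:, order]

# The two agree up to the arbitrary sign of each component.
print(np.allclose(np.abs(scores_svd), np.abs(scores_eig)))   # True
```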
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>v</math> is a diagonal matrix, that we cannot estimate with a linear method directly.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to quadratic features by appending another <math>D \times n</math> matrix containing the squared entries of the original, to cubic features with the cubed entries, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
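As a minimal sketch of this augmentation in Python/numpy (the helper name <code>augment_quadratic</code> is illustrative, not from the course code):<br />

```python
import numpy as np

def augment_quadratic(X):
    """Append each feature's square as a new column: (n, d) -> (n, 2d)."""
    return np.hstack([X, X**2])

X = np.array([[1.0, 2.0],
              [3.0, -1.0]])
X_star = augment_quadratic(X)
print(X_star)
# [[ 1.  2.  1.  4.]
#  [ 3. -1.  9.  1.]]
```

Any linear boundary learned in the augmented space is then a quadratic boundary in the original space.<br />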
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. extending <code>X_star</code> to six columns and setting <code>X_star(i,j+4) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
Note the terminology: LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA stands in contrast to that of our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Add a legend so the two lines can be identified on the graph.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below.<br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the squared Euclidean distance between the projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\,\underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, and is therefore invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So the vector <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>, and we can take it as our discriminant direction.<br />
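This proportionality is easy to verify numerically. The following Python/numpy sketch (with made-up class statistics, for illustration only) compares the leading eigenvector of <math>S_{W}^{-1}S_{B}</math> with the closed form <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>:<br />

```python
import numpy as np

mu1 = np.array([1.0, 1.0])
mu2 = np.array([5.0, 3.0])
Sw = 2 * np.array([[1.0, 1.5], [1.5, 3.0]])   # Sigma_1 + Sigma_2 (made-up)
Sb = np.outer(mu1 - mu2, mu1 - mu2)           # between class covariance

# Leading eigenvector of Sw^{-1} Sb ...
vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = vecs[:, np.argmax(vals.real)].real

# ... versus the closed form Sw^{-1}(mu1 - mu2)
w_closed = np.linalg.solve(Sw, mu1 - mu2)

# The two directions agree up to scale: the cosine of the angle is +-1
cos = w_eig @ w_closed / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
print(abs(cos) > 1 - 1e-8)  # True
```

Since <math>S_{W}^{-1}S_{B}</math> is rank one here, it has a single nonzero eigenvalue, whose eigenvector is exactly the closed-form direction.<br />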
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the first principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the most discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of a "2" or a "3". Here X1 holds all the "2"s and X2 all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathworks.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(With <math>k</math> classes, <math>\mathbf{S}_{B}</math> has rank at most <math>k-1</math>, so there are at most <math>k-1</math> useful discriminant directions.)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \frac{1}{n_{i}}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not easy to obtain. One of the simplifications<br />
is that we may assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant, since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is another generalization of <math>\mathbf{S}_{B}</math>. Denote a<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
Recall that in the two class problem, we used<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since <math>\mathbf{\mu} = \frac{1}{n}(n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2})</math>, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2}), \qquad<br />
\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})<br />
\end{align}<br />
</math><br />
<br />
Substituting these into the general form gives<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \left(\frac{n_{1}n_{2}^{2}}{n^{2}}+\frac{n_{2}n_{1}^{2}}{n^{2}}\right)(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B^{\ast}}<br />
\end{align}<br />
</math><br />
Thus the general between class covariance is proportional to the two-class <math>\mathbf{S}_{B^{\ast}}</math>, so both lead to the same discriminant direction.<br />
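For two classes, the general <math>\mathbf{S}_{B}</math> is in fact exactly proportional to <math>(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}</math>, with constant of proportionality <math>n_{1}n_{2}/n</math>. A quick numerical check in Python/numpy (arbitrary sample counts and means, purely illustrative) confirms this:<br />

```python
import numpy as np

n1, n2 = 120, 80
mu1 = np.array([1.0, 1.0])
mu2 = np.array([5.0, 3.0])
n = n1 + n2
mu = (n1 * mu1 + n2 * mu2) / n   # total mean

# General between class covariance vs. the two-class form
Sb_general = n1 * np.outer(mu1 - mu, mu1 - mu) + n2 * np.outer(mu2 - mu, mu2 - mu)
Sb_twoclass = np.outer(mu1 - mu2, mu1 - mu2)

print(np.allclose(Sb_general, (n1 * n2 / n) * Sb_twoclass))  # True
```

Since the two matrices differ only by a positive scalar, they yield the same discriminant directions.<br />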
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which is actually a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that the <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, thus set the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \leq k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
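The whole multi-class recipe can be sketched in Python/numpy on synthetic three-class data (all sample sizes, means and names here are illustrative assumptions):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n_i = 3, 4, 100   # 3 classes, 4 dimensions, 100 points per class
means = [np.zeros(d), np.full(d, 3.0), np.array([0.0, 3.0, 0.0, 3.0])]
X = np.vstack([rng.standard_normal((n_i, d)) + m for m in means])
y = np.repeat(np.arange(k), n_i)

# Build the within and between class covariance matrices
mu = X.mean(axis=0)
Sw = np.zeros((d, d))
Sb = np.zeros((d, d))
for c in range(k):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    Sw += (Xc - mu_c).T @ (Xc - mu_c)
    Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)

# Columns of W are the eigenvectors of Sw^{-1} Sb with the k-1 largest eigenvalues
vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(vals.real)[::-1]
W = vecs[:, order[:k - 1]].real   # d x (k-1) projection matrix

Z = X @ W   # data projected to k-1 = 2 dimensions
print(Z.shape)  # (300, 2)
```

Each row of <code>Z</code> is a projected point <math>\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i}</math>.<br />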
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(x) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the least squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
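The closed-form solution and the hat matrix can be checked in a few lines of Python/numpy (random data, purely illustrative): <math>\hat\beta</math> solves the normal equations, and <math>\mathbf{H}</math> is idempotent because it is an orthogonal projection onto the column space of <math>\mathbf{X}</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 3
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])  # first column of 1s
y = rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
y_hat = H @ y                                  # fitted values

print(np.allclose(H @ H, H))                   # True: H is a projection
print(np.allclose(X.T @ (y - y_hat), 0))       # True: residuals orthogonal to columns of X
```

The second check is exactly the first-order condition <math>\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0</math> derived above.<br />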
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
 >>x=[sample';ones(1,400)];<br />
Construct x by transposing the data and appending a row of ones, so that each column of x is one data point augmented with a constant 1 for the intercept term.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame|The figure shows the classification of the data points in 2_3.m by the linear regression model.]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== Logistic Function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y dx = ln(1 + e^{x})</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative x, which slows to linear growth of slope 1/4 near x = 0, then approaches y = 1 with an exponentially decaying gap.<br />
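These properties can be checked numerically. The following standard-library Python sketch (illustrative values only, not from the notes) verifies properties 1-3 with finite differences:<br />

```python
# Numerical checks of the logistic-curve properties listed above,
# using only the Python standard library.
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Property 2: y(0) = 1/2
assert logistic(0.0) == 0.5

# Property 1: dy/dx = y(1 - y); compare against a central finite difference
x, h = 0.7, 1e-6
numeric = (logistic(x + h) - logistic(x - h)) / (2 * h)
y = logistic(x)
assert abs(numeric - y * (1 - y)) < 1e-8

# Property 3: the antiderivative is ln(1 + e^x); check d/dx ln(1 + e^x) = y
F = lambda t: math.log(1 + math.exp(t))
assert abs((F(x + h) - F(x - h)) / (2 * h) - y) < 1e-8
```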
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression fits a distribution to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood, using <math>\,P(Y|X)</math>. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x_i)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math><br />
* <math>P(Y=0 | X=x_i)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T exp(\underline{\beta}^T\underline{x}_i)}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this identity [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> to one using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
for which the solution is <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
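In the one-dimensional case the WLS estimator above reduces to a single ratio. The following Python sketch illustrates this with made-up data and weights:<br />

```python
# One-dimensional weighted least squares: with scalar x_i, the estimator
# beta = [sum w x x^T]^{-1} [sum w x y] reduces to sum(w*x*y) / sum(w*x^2).
# The data and weights below are made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]        # roughly y = 2x
ws = [1.0, 0.5, 2.0, 1.0]        # weights w_i > 0

beta_wls = sum(w * x * y for w, x, y in zip(ws, xs, ys)) / \
           sum(w * x * x for w, x in zip(ws, xs))
print(beta_wls)                  # close to 2
```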
<br />
Applying this estimator to a weighted linear regression of the iteratively computed response, with weights given by the diagonal entries of <math>\mathbf{W}</math>,<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, a <math>d\times{1}</math> vector, because we constructed the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, the full parameter vector will be <math>(d+1)\times{1}</math>.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure usually converges, since the log-likelihood function is concave, but in the case that it does not, only local convergence of the method can be proven, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, though, an initial value far enough from the exact solution to make the iteration fail is uncommon. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving will resolve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
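As a sketch of this pseudo code, the following pure-Python example fits a one-feature logistic regression (so <math>d=2</math> including the intercept) by Newton-Raphson. Everything here is illustrative: the tiny overlapping data set is made up, and the update is applied in the equivalent form <math>\underline{\beta}^{new}=\underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math> derived above (the course's own examples use Matlab instead).<br />

```python
# Newton-Raphson (IRLS) for logistic regression, pure Python.  Each column
# of X is x_i = (1, feature), so X W X^T is 2x2 and can be inverted directly.
import math

feats  = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # made-up inputs
labels = [0, 0, 1, 0, 1, 1]               # overlapping classes (no separation)

def p_of(beta, x):
    # P(x_i; beta); 1/(1+exp(-t)) equals exp(t)/(1+exp(t)) from the notes
    t = beta[0] + beta[1] * x
    return 1.0 / (1.0 + math.exp(-t))

beta = [0.0, 0.0]                         # step 1: start from beta = 0
for _ in range(25):
    P = [p_of(beta, x) for x in feats]    # step 3
    W = [p * (1.0 - p) for p in P]        # step 4: diagonal of W
    # Assemble X W X^T (2x2 with entries a, b, c) and the gradient X(Y - P).
    a = sum(W)
    b = sum(w * x for w, x in zip(W, feats))
    c = sum(w * x * x for w, x in zip(W, feats))
    g0 = sum(y - p for y, p in zip(labels, P))
    g1 = sum((y - p) * x for y, p, x in zip(labels, P, feats))
    det = a * c - b * b
    # Newton step: beta_new = beta_old + (X W X^T)^{-1} X (Y - P)
    d0 = ( c * g0 - b * g1) / det
    d1 = (-b * g0 + a * g1) / det
    beta = [beta[0] + d0, beta[1] + d1]
    if abs(d0) + abs(d1) < 1e-10:         # step 7: stop once beta stabilizes
        break

print(beta)   # fitted (intercept, slope); the slope is positive for this data
```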
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is <math>\,d</math>-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. the dimension.<br />
#If <math>\,\underline{x}</math> is <math>\,d</math>-dimensional, the number of adjustable parameters in LDA is <math>\,2d+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. the dimension.<br />
#Since logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to fit a logistic regression model and classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2\geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
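With the fitted B above, the posterior and the classification rule can be evaluated directly. The following Python sketch applies them at a made-up test point (the coefficients are the ones reported above; the point itself is hypothetical):<br />

```python
# Evaluating the fitted logistic posterior and classification rule above.
import math

b0, b1, b2 = 0.1861, -5.5917, -3.0547     # B reported by mnrfit above

def posterior_class1(x1, x2):
    z = b0 + b1 * x1 + b2 * x2
    return math.exp(z) / (1.0 + math.exp(z))

x1, x2 = 0.1, -0.2                        # a made-up PCA-projected test point
p1 = posterior_class1(x1, x2)
label = 1 if b0 + b1 * x1 + b2 * x2 >= 0 else 2
print(p1, label)                          # here z > 0, so p1 > 0.5 and label 1
```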
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Viewing these equations as a weighted least squares problem makes them easier to derive.<br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem since we don't have the same simplification.<br />
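A small numerical sketch of these posteriors (in Python, with made-up coefficient vectors <math>\beta_i</math> and input <math>x</math>) shows that they are positive and sum to 1:<br />

```python
# K-class logistic posteriors with class K as the reference class.
# The coefficient vectors and the input below are made-up numbers.
import math

x = [1.0, 0.5, -1.0]                       # input vector (intercept term first)
betas = [[0.2, 1.0, -0.5],                 # beta_1, ..., beta_{K-1}; here K = 3
         [-0.3, 0.4, 0.8]]

dot = lambda b, v: sum(bi * vi for bi, vi in zip(b, v))
denom = 1.0 + sum(math.exp(dot(b, x)) for b in betas)

posteriors = [math.exp(dot(b, x)) / denom for b in betas]  # classes 1..K-1
posteriors.append(1.0 / denom)                             # reference class K
print(posteriors)                          # positive values summing to 1
```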
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
Separating hyperplane classifiers try to separate the data using linear decision boundaries. When the classes overlap, this approach can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged and transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
sign(<math>(\underline{\beta}^T \underline{x} + {\beta}_0)) = sign(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math><br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the [http://en.wikipedia.org/wiki/Artificial_neural_network Artificial Neural Network] models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem is not convex and has no unique global solution. The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable, then the algorithm converges to a separating hyperplane in a finite number of steps; the proof of this fact is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br />
<br />
<br />
If a separating hyperplane between the two classes exists, it is generally not unique, and the perceptron algorithm may return any one of infinitely many solutions.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line ourselves; we just have to give it some labelled examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1, but the right answer is +1, so the perceptron adjusts its line and we try the next example. Eventually the perceptron gets all of the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
 x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> provides the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of the weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by<math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to the constant scaling factor <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points. <br />
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, the algorithm converges quickly; if the gap is small, it converges slowly. This problem can be mitigated by the basis expansion technique: instead of searching for a hyperplane in the original space, we search in an enlarged space obtained by applying some basis functions to the inputs.<br />
#If the classes are separable, there exist infinitely many solutions to the Perceptron problem, all of which are separating hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find, possibly oscillating forever between two points on either side of the minimum.<br />
#A perfect separation is not always achievable, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Imagine standing on a peak and wanting to reach the valley floor as fast as possible. Which direction should you step? Intuitively, the direction in which the height decreases fastest, which is the direction of the negative gradient. However, depending on the terrain and the starting point, following the gradient may leave you stuck at a local minimum (or even a saddle point, where the gradient is also zero) rather than the global minimum.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over <math>\,i</math> (all data points). This is a variant of the original (batch) gradient descent algorithm known as stochastic gradient descent, where we approximate the true gradient by evaluating it on a single training example, so <math>\,{\beta}</math> is updated using only one sample at a time. For a large data set, say a population database, summing over millions of samples at every step is very time-consuming. With stochastic gradient descent, we can process the data sample by sample and still get decent results in practice.<br />
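The contrast between batch and stochastic updates can be illustrated on a simple least-squares problem. This is a hedged Python/NumPy sketch (not course material); the toy data, step sizes, and iteration counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
beta_true = np.array([1.0, -2.0])
y = X @ beta_true + 0.1 * rng.normal(size=200)

# Batch gradient descent: each step sums the gradient over ALL samples.
beta = np.zeros(2)
for _ in range(100):
    grad = -2 * X.T @ (y - X @ beta) / len(y)
    beta -= 0.1 * grad

# Stochastic gradient descent: each step uses a SINGLE sample,
# i.e. a step proportional to the one-sample negative gradient.
beta_sgd = np.zeros(2)
rho = 0.05
for epoch in range(10):
    for xi, yi in zip(X, y):
        beta_sgd += rho * (yi - xi @ beta_sgd) * xi
```

Both variants end up close to <code>beta_true</code>; the stochastic version touches one sample per update, which is what makes it attractive for very large data sets.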
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <br />
<br />
<ref> Haykin, Simon (2009). Neural Networks and Learning Machines. Pearson Education, Inc. </ref><br />
A neural network resembles the brain in two respects:<br />
#Knowledge is acquired by the network from its environment through a learning process.<br />
#Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.<br />
<br />
<ref><br />
Theory of the Backpropagation Neural Network, R. Necht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network but it can have many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer, one per class; each output <math>\displaystyle y_k</math> represents the probability of class ''k'' and is coded as 0 or 1 in the training data.<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In the perceptron, we used a "sign" function that takes the sign of a weighted sum of the input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates: it has maximum and minimum output values. This property keeps the weights bounded and therefore limits the search time. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; for example, RBF networks use non-monotonic activations and are also powerful models. <br />
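As a quick illustration of these properties, the following Python/NumPy sketch (an illustrative addition, not from the notes) defines the logistic activation and its smooth derivative <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>, which back-propagation will need later.

```python
import numpy as np

def sigma(a):
    """Logistic (inverse-logit) activation: sigma(a) = 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + np.exp(-a))

def sigma_prime(a):
    """Derivative: sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = sigma(a)
    return s * (1.0 - s)

# Evaluate on a grid: the curve is smooth, strictly increasing (monotonic),
# and saturates toward 0 and 1 at the extremes.
a = np.linspace(-10, 10, 1001)
out = sigma(a)
```

Saturation bounds the unit outputs to (0, 1), and the simple closed-form derivative is one reason the logistic function is a popular choice.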
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the output layer has only one unit, so for the moment we are working with a regression problem. Later we will see how this can be extended to multiple output units and thus turned into a classification problem.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and produces output <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true response and the network output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: in the figure the indices i and j are reversed relative to the text.)''<br />
<br />
Notice that the weighted sum on the output of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math> and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
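The chain-rule derivation above can be checked numerically. The sketch below (illustrative Python/NumPy, not from the notes) builds a one-hidden-layer network with a linear output unit, computes the back-propagated gradients, and compares one of them against a finite-difference estimate; the network sizes and random seed are arbitrary assumptions.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, U, w):
    """One hidden layer: a = U x, z = sigma(a), y_hat = w . z (linear output)."""
    z = sigma(U @ x)
    return z, w @ z

def backprop_grads(x, y, U, w):
    """Gradients of err = (y - y_hat)^2 via the chain rule."""
    z, y_hat = forward(x, U, w)
    d_yhat = -2.0 * (y - y_hat)          # d err / d y_hat
    grad_w = d_yhat * z                  # d err / d w_j = -2 (y - y_hat) z_j
    delta = d_yhat * w * z * (1.0 - z)   # delta_j = sigma'(a_j) w_j * d err / d y_hat
    grad_U = np.outer(delta, x)          # d err / d u_{jl} = delta_j * x_l
    return grad_w, grad_U

rng = np.random.default_rng(1)
x = rng.normal(size=3)
y = 0.7
U = rng.normal(size=(4, 3))
w = rng.normal(size=4)
grad_w, grad_U = backprop_grads(x, y, U, w)

# Finite-difference check of the hidden-weight gradient d err / d u_{00}
def err(U_, w_):
    return (y - forward(x, U_, w_)[1]) ** 2

h = 1e-6
Up = U.copy()
Up[0, 0] += h
Um = U.copy()
Um[0, 0] -= h
num_grad = (err(Up, w) - err(Um, w)) / (2 * h)
```

The back-propagated value and the numerical estimate agree to several decimal places, confirming the <math>\delta_j = \sigma'(a_j)\sum_i \delta_i u_{ij}</math> recursion.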
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the expected output and <math>\hat{y_k}</math> is the actual output of the network.<br />
#By propagating to the previous layers evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math> where <math>i</math> is associated to the previous layer.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
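The steps above can be sketched as a training loop. The following Python/NumPy code is an illustrative, assumption-laden sketch (a toy regression target, five hidden units, no bias terms, <math>\rho = 0.1</math>), not the course's reference implementation.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]      # hypothetical regression target

U = rng.normal(scale=0.5, size=(5, 2))   # hidden weights, small random init
w = rng.normal(scale=0.5, size=5)        # output weights
rho = 0.1                                # learning rate

def mse(X, y):
    Z = sigma(X @ U.T)
    return np.mean((y - Z @ w) ** 2)

err_before = mse(X, y)
for epoch in range(200):
    for x_i, y_i in zip(X, y):
        z = sigma(U @ x_i)               # 1. forward pass through the network
        y_hat = w @ z
        d_out = y_i - y_hat              # 2. delta at the output unit
        delta = z * (1 - z) * w * d_out  # 3. propagate delta to hidden units
        w += rho * d_out * z             # 5. step along the negative gradient
        U += rho * np.outer(delta, x_i)  #    (constant factors absorbed into rho)
err_after = mse(X, y)
```

After a couple of hundred epochs of per-sample updates, the mean squared error on the training set drops substantially from its random-initialization value.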
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method mentioned in class is to randomize the weights before the first step. This is not likely to be near the optimal solution in every case, but it is simple to implement. More specifically, random values near zero (usually drawn from [-1,1]) are a good choice for the initial weights; in this case the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>: regression is performed on the weights and output using a linear approximation of <math>\,\sigma(a_i)</math>, which finds optimal weights in the linearized model. Back-propagation is then used to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the number of iterations increases.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate leads to a very slow convergence rate (a very long learning phase). The advantage of a small learning rate, however, is that it can guarantee convergence. Thus it is generally better to choose a relatively small learning rate to ensure stability; a common choice is <math>\,\rho</math> between 0.01 and 0.7.<br />
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we will mainly discuss how to estimate the number of hidden units at the very beginning. Obviously, we should then adjust it using cross-validation (CV), leave-one-out (LOO), or other complexity-control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from a highly complicated mixture model, more hidden units are needed. <br />
<br />
Actually, the number of hidden units determines the size of the model, and therefore the total number of weights in the model. Typically, the number of weights should not be larger than the number of training points, say N; thus N/10 is sometimes a good choice. In practice, however, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this Neural Network, the number of nodes is reduced until we reach a layer whose number of nodes equals the desired dimensionality. (In the first few layers the number of nodes need not be strictly decreasing, as long as the network eventually reaches a layer with fewer nodes.) From that middle layer on, the previous layers are mirrored, so the output layer has the same number of units as the input layer. If we feed the network with a point and get an output approximately equal to the input, then the same input has been reconstructed from the middle-layer units, so the output of the middle layer can represent the input in fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
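A minimal sketch of this bottleneck idea, assuming linear units for simplicity (so the network behaves like PCA), is given below in Python/NumPy; the data, dimensions, and learning rate are illustrative assumptions, not course material.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical 3-D data lying near a 2-D plane: a 2-unit bottleneck can capture it.
latent = rng.normal(size=(200, 2))
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])
X = latent @ A.T + 0.01 * rng.normal(size=(200, 3))

# Linear bottleneck autoencoder: encode 3-D -> 2-D, then mirror back 2-D -> 3-D.
W_enc = rng.normal(scale=0.1, size=(2, 3))
W_dec = rng.normal(scale=0.1, size=(3, 2))
rho = 0.01
for epoch in range(200):
    for x in X:
        z = W_enc @ x                          # low-dimensional code (middle layer)
        x_hat = W_dec @ z                      # mirrored reconstruction
        e = x - x_hat                          # reconstruction error drives the updates
        W_dec += rho * np.outer(e, z)          # gradient step for the decoder
        W_enc += rho * np.outer(W_dec.T @ e, x)  # back-propagated step for the encoder

recon_err = np.mean((X - (X @ W_enc.T) @ W_dec.T) ** 2)
```

Training drives the reconstruction error down to roughly the noise level, so the 2-D middle-layer codes represent the 3-D inputs with almost no loss.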
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math> may become negligible and the errors vanish. This is a numerical problem, where it is difficult to estimate the errors. So in practice configuring a<br />
Neural Network with Back-propagation faces some subtleties.<br />
Deep Neural Networks became popular a few years ago, following work by Geoffrey Hinton and collaborators on training deep belief networks. A deep Neural Network training algorithm deals with the training of a Neural Network with a large number of layers.<br />
<br />
The approach of training the deep network is to assume the network has only two layers first and train these two layers. After that we train the next two layers, so on and so forth.<br />
<br />
Although we know the input and the expected final output, we do not know the correct outputs of the hidden layers; this is the main issue that the algorithm must address.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, an idea inspired by statistical physics and the notion of a most stable (minimum-energy) state; or somehow determining which output of the second layer is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case as an example. A feedforward neural network was trained using back-propagation to assist in the marketing control of airline seat allocations. The neural approach adapted to the booking rules, and the system was used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to model the human brain, hence the fancy name "Neural Network". But now we know that they are essentially logistic regression layers stacked on top of each other, and have little to do with how the brain actually works.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. As a result of such claims, it is important to keep the right perspective on what this field of study is trying to accomplish: the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since the objective is not convex, we still face the problem of local minima, although other techniques have been devised to avoid this dilemma.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and tune them in a principled way. Partly for this reason, they are not currently an active research area in machine learning, although they still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model will have very high accuracy on the training set but very poor ability to predict outcomes for new instances, especially outside the domain of the training set.<br />
<br />
In a Neural Network, if the depth is too great, the network will have many degrees of freedom and will learn every characteristic of the training data set. It will then give very precise outcomes on the training set, but will not be able to generalize the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and want to find the model that fits it best. We can find a polynomial of high degree which passes through almost all of the points in the training set. But in fact the training set comes from a line model. Although the complex model has less error on the training set, it diverges from the line in ranges where we have no training points; because of that, the high-degree polynomial has very poor predictive performance on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier: just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas; but if presented with a new banana of slightly different shape than the existing ones, the classifier cannot detect it. This is the tradeoff: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data; for example, we may assume the model is a polynomial of bounded degree, or a neural network of fixed architecture. There are other ways as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
where <math>\,h</math> is a classifier whose error rate we want to minimize. We apply <math>\,h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average, obtaining the empirical error rate, an estimate of the probability that <math>h(x_{i}) \neq y_{i}</math>.<br />
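A direct Python/NumPy rendering of the empirical error rate formula (the classifier <code>h</code> and the data here are hypothetical illustrations):

```python
import numpy as np

def empirical_error(h, X, y):
    """L_h = (1/n) * sum over i of I(h(x_i) != y_i)."""
    preds = np.array([h(x) for x in X])
    return np.mean(preds != y)

# Hypothetical classifier: the sign of the first coordinate
h = lambda x: 1 if x[0] >= 0 else -1
X = np.array([[0.5, 1.0], [-0.2, 0.3], [1.5, -1.0], [-0.7, 0.2]])
y = np.array([1, 1, 1, -1])   # the second point is misclassified by h
```

With one of the four points misclassified, the empirical error rate is 1/4.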
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
There is a downward bias to this estimate, meaning that it tends to be less than the true error rate. <br />
<br />
As the complexity of our model increases, the training error rate always decreases. When we apply the model to test data, however, the error rate decreases only up to a point and then starts to increase, since the model has not seen the test data before. This is because the training error decreases as we fit the model better by increasing its complexity, but, as we have seen, such a complex model will not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is where the error rate on the test data is minimized; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in R</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a property more important for an estimator than unbiasedness: the mean squared error. There are problems for which it may be good to use an estimator with a small bias: such an estimator may have a smaller mean squared error, or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under transformations, while mean-unbiasedness may be lost under nonlinear transformations. For example, using an unbiased estimator with large mean squared error, we risk a big error; in contrast, a biased estimator with small mean squared error can improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a fixed MSE, a lower bias implies a higher variance, and vice versa.<br />
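The decomposition <math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f})</math> can be verified by simulation. Below is an illustrative Python/NumPy sketch using a deliberately biased (shrinkage) estimator of a mean; the shrinkage factor 0.9 and the sample sizes are assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(4)
f = 2.0                      # true parameter: the mean of the distribution
n, reps = 20, 20000          # sample size per replication, number of replications

# A deliberately biased estimator of f: shrink the sample mean toward zero.
estimates = np.array([0.9 * rng.normal(f, 1.0, size=n).mean()
                      for _ in range(reps)])

bias = estimates.mean() - f             # Bias(f_hat) = E(f_hat) - f
var = estimates.var()                   # Variance(f_hat) = E[(f_hat - E(f_hat))^2]
mse = np.mean((estimates - f) ** 2)     # MSE(f_hat) = E[(f_hat - f)^2]
```

The Monte Carlo estimates satisfy the identity up to floating-point rounding, and the bias comes out near its theoretical value <math>0.9f - f = -0.2</math>.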
<br />
The test error is a good estimate of the MSE. We want a balanced bias and variance (neither too high), even though the estimator will then have some bias.<br />
<br />
<br />
Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
>> x <- rnorm(200,0,1)<br />
>> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
>> xtest <- rnorm(50,1,1)<br />
>> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
>> p1 <- lm(y~x)<br />
>> p2 <- lm(y ~ poly(x,2))<br />
>> pn <- lm(y ~ poly(x,10))<br />
>> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are based on a <math>\,N(0,1)</math> distribution, while the test set has a <math>\,N(1,1)</math> distribution. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random noise. Polynomial least-squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
>> # calculate the mean squared error of degree 1 poly<br />
>> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
>> [1] 1.576042<br />
>> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 7.727615<br />
: Training and test mean squared errors for the linear fit. These are both quite high - and since the data is non-linear, the different mean value of the test data increases the error quite a bit.<br />
>> # calculate the mean squared error of degree 2 poly<br />
>> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
>> [1] 0.08608467<br />
>> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 0.08407432<br />
: This fit is far better - and there is not much difference between the training and test error, either.<br />
>> # calculate the mean squared error of degree 10 poly<br />
>> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
>> [1] 0.07967558<br />
>> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but not by much, while the test-set error has risen sharply: the overfitting makes it a poor predictor. As the degree of the polynomial rises further, numerical precision becomes an issue, and a good fit is not even consistently produced for the training data.<br />
>> > # calculate mse of sin/cos fit<br />
>> > sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
>> [1] 0.1105446<br />
>> > sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 1.320404<br />
: Fitting a function of the form <math>\,\sin(x)+\cos(x)</math> works fairly well on the training set, but because it is not the real underlying function, it fails on test data that lies in a different region of the domain.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V: since the model has not seen V, and we know the true labels of every element in V, we can count how many were misclassified<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
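A minimal holdout sketch of the procedure above in R (the simulated quadratic data and the 70/30 split are illustrative assumptions, not prescribed values):

```r
set.seed(1)
# simulated data with a known quadratic signal (illustrative assumption)
n <- 100
x <- rnorm(n)
y <- x^2 - 0.5*x + rnorm(n, 0, 0.3)

idx <- sample(n, 0.7 * n)        # indices of the training part T
xt <- x[idx];  yt <- y[idx]      # training set T
xv <- x[-idx]; yv <- y[-idx]     # validation set V, kept untouched

fit <- lm(yt ~ poly(xt, 2))      # train only on T
# empirical error rate (here: mean squared error) on the unseen V
mse <- mean((yv - predict(fit, data.frame(xt = xv)))^2)
```

Because the model never sees V during training, `mse` behaves like an estimate of the test error rather than the (optimistic) training error.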
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although training error always decreases with the increasing complexity of the model, the test error starts to increase from a certain point, which is noted as overfitting (see [[#prediction-error|figure 2]] above). Since test error estimates MSE (mean square error) best, people came up with the idea to divide the data set into three parts: training set, validation set, and test set. training set is used to build the model, validation set is used to deside the parameters and the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for training set, and 25% each for validation set and test set. All of them are randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{(x_i, y_i) \in \nu}I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set and <math>\,I(\cdot)</math> is the indicator function.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in reality, it may be hard to collect data ??and we usually suffer from the curse of dimensionality??, and a larger data set may be hard to come by. Consequently, we may not be able to afford to sacrifice part of the limited resources. In this case we use another method that addresses this problem, K-fold cross-validation. We divide the data set into <math>\,K</math> subsets roughly equal in size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
if <math>\,K=n</math> (leave-one-out): low bias, high variance. Each subset contains a single element, so the model is trained on all points except one, and then validated using that point.<br />
<br />
if <math>\,K</math> is small, say <math>\,K=2</math> or <math>\,K=5</math>: higher bias, lower variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th part <math>( \,k \in \{1, \dots, K\} )</math>, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to obtain the prediction error estimate <math>\hat L_k</math>. The overall estimate is then<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set, and we split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. We use the first three subsets as the training set and the last as the validation set; then the 1st, 2nd, and 4th subsets as the training set and the 3rd as the validation set, and so on, until each subset has served as the validation set once (so every observation is used for both training and validation). After we obtain <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for a degree-n model and plot the estimated error as a function of degree. We are then able to choose the degree which corresponds to the minimum error. The same method can be used, for example, to find the optimal number of hidden units in a neural network: start with 1 unit, then 2, 3, and so on, and pick the number with the lowest average error.<br />
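The degree-selection procedure described above can be sketched in R (the simulated quadratic data, 4 folds, and the degree range 1–5 are illustrative assumptions):

```r
# K-fold cross-validation estimate of the prediction error for a given degree
cv.error <- function(deg, x, y, fold, K) {
  errs <- sapply(1:K, function(k) {
    tr  <- data.frame(x = x[fold != k], y = y[fold != k])
    va  <- data.frame(x = x[fold == k], y = y[fold == k])
    fit <- lm(y ~ poly(x, deg), data = tr)     # fit on the other K-1 parts
    mean((va$y - predict(fit, va))^2)          # L_k on the held-out kth part
  })
  mean(errs)                                   # average L_k over the K folds
}

set.seed(2)
n <- 80
x <- rnorm(n)
y <- x^2 - 0.5*x + rnorm(n, 0, 0.3)            # true model is quadratic
K <- 4
fold <- sample(rep(1:K, length.out = n))       # random fold assignment

cv <- sapply(1:5, cv.error, x = x, y = y, fold = fold, K = K)
which.min(cv)   # degree with the smallest estimated error
```

Plotting `cv` against the degree reproduces the familiar U-shaped error curve whose minimum picks the model complexity.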
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be denoted by <math>\mathbf{y}</math> and the vector of fitted values by <math>\hat{\mathbf{y}}</math>. Then<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
For such a linear fit, the leave-one-out cross-validation error can be computed without refitting the model, via the identity<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}\right]^{2}</math>,<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that the trace of <math>\mathbf{H}</math> is often much easier to compute than its individual diagonal elements.<br />
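The identities above can be checked numerically in R for a small polynomial fit (the simulated data is an illustrative assumption; for this design matrix, trace(H) equals the number of columns):

```r
set.seed(3)
n <- 50
x <- rnorm(n)
y <- x^2 - 0.5*x + rnorm(n, 0, 0.3)

X <- model.matrix(~ poly(x, 2))            # n x 3 design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix H = X (X'X)^{-1} X'
yhat <- as.vector(H %*% y)                 # fitted values  y_hat = H y

# exact leave-one-out identity uses each diagonal element H_ii ...
loo <- mean(((y - yhat) / (1 - diag(H)))^2)
# ... while GCV replaces every H_ii by the average trace(H)/N
gcv <- mean(((y - yhat) / (1 - sum(diag(H)) / n))^2)
```

Here `sum(diag(H))` is exactly 3 (the rank of the projection), so the GCV denominator is the same for every observation, which is what makes it cheap.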
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point of the original training data set to train the model, then using the data point that was left out to estimate the true error. By repeating this process for every data point in the original data set, we can obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every candidate value of the parameter, we must train the model <math>\,K</math> times. This cost is even more pronounced in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, where <math>\,n</math> is the number of data points in the data set.<br />
<br />
Fortunately, when adding data points to the classifier is reversible, computing the difference between two classifiers is cheaper than computing the two classifiers separately. So, if the classifier trained on all the data points is known, we can simply undo the contribution of each data point in turn (<math>\,n</math> times) to calculate the leave-one-out cross-validation error rate.<br />
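The naive version, refitting the model once per held-out point, can be sketched in R (simulated data is an illustrative assumption; for linear smoothers the hat-matrix identity in the previous section avoids all these refits):

```r
set.seed(4)
n <- 30
x <- rnorm(n)
y <- x^2 - 0.5*x + rnorm(n, 0, 0.3)
d <- data.frame(x = x, y = y)

# naive leave-one-out: the model is refit n times, once per held-out point
loo <- mean(sapply(1:n, function(i) {
  fit <- lm(y ~ poly(x, 2), data = d[-i, ])          # train on n-1 points
  (y[i] - predict(fit, d[i, , drop = FALSE]))^2      # test on the left-out one
}))
```

With <math>\,n</math> points this costs <math>\,n</math> full fits, which is exactly the expense the reversible-update shortcut is designed to avoid.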
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as an implementation for achieving a robust neural network which is insensitive to noise. Since the number of hidden units in a NN is usually decided by domain knowledge, the network may easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function shows linear behavior, and the NN then collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights close to zero.<br />
<br />
Formally, we penalize large weights by adding a penalty term to the error function. The regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
If <math>\,\lambda</math> is too large, the weights <math>\,w_i</math> and <math>\,u_{jk}</math> will be shrunk too close to zero. We can use cross-validation to choose <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the linearity. Actually, we may also set <math>\,\lambda</math> by cross-validation. The tuning parameter is important since weights of zero will lead to zero derivatives and the algorithm will not change. On the other hand, starting with weights that are too large means starting with a nonlinear model which can often lead to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
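For a purely linear model, the same quadratic penalty reduces to ridge regression, whose closed form gives some intuition for what weight decay does to the weights; a sketch in R (the simulated data and the value of <math>\,\lambda</math> are illustrative assumptions):

```r
set.seed(5)
X <- matrix(rnorm(100 * 4), 100, 4)        # 100 observations, 4 features
w.true <- c(1, -2, 0.5, 0)
y <- X %*% w.true + rnorm(100, 0, 0.5)

lambda <- 1
# minimizing ||y - Xw||^2 + lambda * ||w||^2 gives the closed form below,
# the linear-regression analogue of the weight-decay penalty
w.ridge <- solve(t(X) %*% X + lambda * diag(4), t(X) %*% y)
w.ols   <- solve(t(X) %*% X, t(X) %*% y)   # unpenalized fit for comparison
```

For any <math>\,\lambda > 0</math> the penalized weights have a smaller squared norm than the unpenalized ones, mirroring how weight decay pulls network weights toward the linear regime.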
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with an output layer and a single hidden layer, with weights from the hidden layer to the output layer, and can be trained without back propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data. The radii of the RBF functions may differ, and the centers and radii can be determined through clustering or an EM algorithm. When the vector <math>x</math> is given from the input layer, each hidden neuron computes the radial distance from the neuron’s center point and then applies the RBF function to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
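The closed-form solution can be sketched end to end in R (the centres <math>\mu_j</math>, the common width <math>\sigma</math>, and the simulated sine-wave data are illustrative assumptions; in practice the centres might come from clustering or EM):

```r
set.seed(6)
n <- 60
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, 0, 0.1)               # 1-D target with noise

# Gaussian radial basis features: one bump per (assumed fixed) centre
mu <- seq(-3, 3, length.out = 7)
sigma <- 1
Phi <- exp(-outer(x, mu, function(a, b) (a - b)^2) / (2 * sigma^2))

# closed-form weights  W = (Phi' Phi)^{-1} Phi' Y
W <- solve(t(Phi) %*% Phi, t(Phi) %*% y)
yhat <- as.vector(Phi %*% W)                 # fitted values  Y_hat = Phi W
```

No iterative optimization (and hence no convexity worry) is needed once the basis functions are fixed: the weights follow from one linear solve.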
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the <math>n \times k</math> matrix of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the <math>n \times (M+1)</math> matrix of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the <math>(M+1) \times k</math> matrix of weights.<br />
<br />
where the extra basis function <math>\Phi_{0}</math> is set to 1.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function. Giving the familiar form,<br /><br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
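A quick R sketch of the normalization (the centres, width, and data are illustrative assumptions):

```r
set.seed(7)
x <- runif(20, -3, 3)
mu <- seq(-3, 3, length.out = 5)
sigma <- 1

# unnormalized Gaussian RBF features, one column per centre
Phi <- exp(-outer(x, mu, function(a, b) (a - b)^2) / (2 * sigma^2))

# normalized features: divide each row by its sum, so every row sums to one
Phi.star <- Phi / rowSums(Phi)
```

Each row of `Phi.star` is now a set of nonnegative responsibilities summing to one, which is what makes the normalized basis functions behave like soft membership weights.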
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit, then, can be thought of as representing a feature; if there are more hidden units than input units, we essentially project to a higher-dimensional space, as we did in our earlier trick. However, this does not mean that an RBF network will actually do this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Precisely because of this power, however, overfitting becomes a more pressing concern: we have to control the complexity so that the network fits a general model rather than an arbitrary training set.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can replace the class conditional density in the above conditional probability expression by marginalizing <math>\,x</math> over <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
Using an RBF network to do classification, we usually treat it as a regression problem and set a threshold to decide each point's class membership. However, to gain some insight into the classification problem in terms of the RBF network, we often think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as we can see in the graph on the right hand side, that we have three random variables, <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that there are different classes, and each class can trigger a different hidden random variable <math>\displaystyle j</math>. To understand this, assume, for instance, that each value of <math>\displaystyle j</math> indexes a Gaussian distribution (it could be any other distribution as well): all components have the same form (Gaussian), but different parameters. From each Gaussian distribution triggered by a class, we sample some data points. Therefore, in the end, we get a set of data which is not strictly Gaussian, but is actually a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
We multiply this by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)} * \frac {Pr(j)}{Pr(j)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, the posterior is just a sum, over <math>\displaystyle j</math>, of the product of two simpler posteriors.<br />
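The rearrangement <math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math> can be verified numerically on a small discrete example in R (the probability tables below are made up; the check relies on the assumed factorization <math>Pr(y)\,Pr(j|y)\,Pr(x|j)</math>, i.e. <math>x</math> independent of <math>y</math> given <math>j</math>):

```r
p.y  <- c(0.4, 0.6)                 # prior Pr(y_k), two classes
p.jy <- rbind(c(0.7, 0.3),          # Pr(j | y = 1), two hidden components
              c(0.2, 0.8))          # Pr(j | y = 2)
p.xj <- c(0.5, 0.9)                 # Pr(x | j) evaluated at one fixed x

p.j <- colSums(p.y * p.jy)          # marginal Pr(j)
p.x <- sum(p.xj * p.j)              # marginal Pr(x)

# direct route: Pr(y|x) = Pr(y) * sum_j Pr(x|j) Pr(j|y) / Pr(x)
direct <- as.vector(p.y * (p.jy %*% p.xj) / p.x)

# rearranged route: Pr(y|x) = sum_j Pr(j|x) Pr(y|j)
p.j.x <- p.xj * p.j / p.x           # Pr(j | x)
p.y.j <- t(p.y * p.jy) / p.j        # Pr(y | j), rows indexed by j
rearranged <- colSums(p.j.x * p.y.j)
```

Both routes give the same posterior, confirming the algebraic rearrangement in the derivation above.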
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results that we derived above to our RBF Network. In a RBF Network, as we can see on the right hand side, we have a set of data, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, and the hidden basis function, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and then we have some output, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. Also, we have weights from the hidden layer to output layer. The output is just the linear sum of <math>\displaystyle \phi</math>’s. <br />
<br />
Now consider probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weights <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in one dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It's as if we put some Gaussian over data. And for each Gaussian, we consider the center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
We can see the graph on the left which plots the density of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance, if the point gets far from the center <math>\displaystyle \mu_{1}</math>, then it will reduce <math>\displaystyle \phi_{1}</math> to become nearly zero. Remember that, we can usually find a non-linear regression or classification of input space by doing a linear one in some extended space or some feature space (more details in Aside). Here, the <math>\displaystyle \phi</math>’s actually produce that feature space. <br />
<br />
So, one way to look at this is that this <math>\displaystyle \phi</math> is telling us that given an input, how likely the probability of presence of a particular feature is. Say, for example, we define the features as the centers of these Gaussian distributions. Then, this <math>\displaystyle \phi</math> function somehow computes the possibility given certain data points, of this kind of feature appearing. If the data point is right at the center, then the value of that <math>\displaystyle \phi</math> would be one, i.e. the probability is 1. If the point is far from the center, then the probability (<math>\displaystyle \phi</math> function value) will be close to zero, that is, it’s less likely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given data. <br />
<br />
When we have those features, then <math>\displaystyle y</math> is the linear combination of the features. Hence, any of the weights <math>\displaystyle w</math>, which is equal to <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> will appear given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> shows the probability of class membership given feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose, we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>. Each data point has d features. And these n data points consist of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we don’t perform Least Square on this <math>\displaystyle X</math> matrix, we do Least Square on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point with its center. Hence, the new set of features are actually representing M centers in our data set, and for each data point, its new features check how this point is similar to the first center; how it is similar to the second center; and how it is similar to the <math>\displaystyle M^{th}</math> center. And this checking process will apply to all data points. Therefore, feature space gives another representation of our data set. <br />
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit for the training data, we often want to increase the complexity of our RBF network. By its construction, the only way to change the complexity of an RBF network is to add or remove basis functions: a larger number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF network can fit any training set; however, this does not mean the model will generalize well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, we usually start from the training error. Working through it, we will see that the training error can be decomposed, and one component of that decomposition is the Mean Squared Error (MSE). In the later notes, our final goal is to get a good estimate of the MSE; to find an optimal model for our data, we select the model with the smallest estimated MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn't always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left-hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>~N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model only works on the training data, and is useless for prediction when new data points are introduced.<br />
<br />
> n <- 20;                                  # number of training points<br />
> x <- seq(1, 10, length=n);<br />
> alpha <- 2.5;                             # true intercept<br />
> beta <- 1.75;                             # true slope<br />
> y <- alpha + beta*x + rnorm(n);           # linear model plus standard Gaussian noise<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');          # the true linear model<br />
> lines(spline(x, y), col = 2);             # interpolating spline: the over-fitted model<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on training data, although our goal really is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on the testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
But training error and testing error do not have a simple linear relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error often increases dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, the model does not represent the features of the true underlying data source, which is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model typically adapts to the training data, and hence the training error is an overly optimistic estimate of the testing error. An obvious way to estimate the testing error well is to add a penalty term to the training error to compensate; SURE is developed based on this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the squared loss summed over the <math>\displaystyle N</math> training samples.<br />
*<math>\displaystyle Err=\sum_{i=1}^M (\hat y_i-y_i)^2 </math> denote the ''test error'', the squared loss summed over an independent test sample of size <math>\displaystyle M</math>.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations are <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}_{i=1}^N</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>, then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, where <math>\displaystyle y_i</math> and <math>\hat f_i</math> both have the same mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E(g'(Z))</math>.<br />
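The lemma can be checked numerically with a simple Monte Carlo sketch (the choice <math>\,g(z)=z^2</math> and <math>\,Z \sim N(1, 0.5^2)</math> is arbitrary; for these choices both sides of the identity equal <math>\,2\sigma^2 = 0.5</math>):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.5
Z = rng.normal(mu, sigma, size=1_000_000)

g = lambda z: z ** 2                   # weakly differentiable, g'(z) = 2z
lhs = np.mean(g(Z) * (Z - mu))         # Monte Carlo estimate of E[g(Z)(Z - mu)]
rhs = sigma ** 2 * np.mean(2 * Z)      # Monte Carlo estimate of sigma^2 E[g'(Z)]
# Both sides approximate 2 * sigma^2 = 0.5
```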
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>: Let <math>\,Z = \epsilon</math>. Then <math>g(Z) = \hat f-f</math>, since <math>\,y = f + \epsilon</math>, and <math>\,f</math> is a constant. So <math>\,\mu = 0</math> and <math>\,\sigma^2</math> is the variance of <math>\,\epsilon</math>.<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model, not the function of the observations <math>\displaystyle y_i</math>, then <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
<br />
====Two Different Cases====<br />
SURE in RBF,<br />
[http://www.cs.ualberta.ca/~papersdb/uploaded_files/801/paper_automatic-basis-selection-for.pdf Automatic basis selection for RBF networks using Stein’s unbiased risk estimator,Ali Ghodsi Dale Schuurmans]<br />
<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is used to assess the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent. Therefore <math>\displaystyle cov(y_i,\hat f)=0</math> (equivalently, consider <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, it has no influence on <math>\hat f</math>, because <math>\hat f</math> is estimated from the training data alone, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>), and <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Using the notation defined above, we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing the <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross validation, to avoid overfitting or underfitting, the validation data set is kept independent of the estimated model.<br />
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing the model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored, because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can instead be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator, in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the optimal number of basis functions is the one that minimizes the estimated <math>\displaystyle MSE</math>. For the Radial Basis Function Network, by setting <math>\frac{\partial err}{\partial W}</math> equal to zero, we get the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math> <math>\displaystyle (4)</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, we can tell that <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Thus, we obtain the further simplification <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=d</math>, where <math>\displaystyle d</math> is the number of columns of <math>\displaystyle \Phi</math>, since <math>\displaystyle \Phi</math> is a projection of the input matrix <math>\,X</math> onto the basis set spanned by the <math>\,M</math> basis functions. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
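Putting the pieces together, here is a minimal numerical sketch of the hat matrix and the resulting SURE estimate (the one-dimensional data, the Gaussian centers, and their width are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, sigma = 50, 4, 0.3
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, N)  # synthetic noisy data

# Design matrix: M Gaussian basis functions plus an intercept column
centers = np.linspace(0, 1, M)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * 0.1 ** 2))
Phi = np.column_stack([np.ones(N), Phi])

H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T     # hat matrix
y_hat = H @ y
err = np.sum((y_hat - y) ** 2)                   # training error

trace_H = np.trace(H)                            # equals M + 1
mse_hat = err - N * sigma**2 + 2 * sigma**2 * trace_H   # SURE estimate of MSE
```

The trace of the hat matrix comes out as <math>\,M+1</math>, matching the simplification above.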
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimal number of basis functions by choosing the model with the smallest estimated MSE over the set of models considered. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\,M</math>, we compute the training error <math>\displaystyle err(M)</math> for each. <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples and the noise,<math>\sigma^2</math>, can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Please take a look at Figure 27.1 on the right, which shows that the <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math>.<br />
<br />
sure_Err = error - num_data_point * sigma .^ 2 + 2 * sigma .^2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated using the error.<br />
<br />
error = sum((output - expected_output) .^ 2);   % total training error<br />
sigma_sq = error / (num_data_point - 1);        % estimate of sigma^2 from the training error<br />
sure_Err = error - num_data_point * sigma_sq + 2 * sigma_sq * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) <math>= \sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) <math>= \sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units, including the intercept) <math>= \sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then the term <math>\displaystyle -N\sigma^2</math> is a constant and has no impact on the minimization,<br />
so we can ignore it and only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, so we must estimate it; note that the estimate <math>\,\hat \sigma</math> changes when <math>\displaystyle (M+1) </math> changes.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error will decrease and the MSE will increase when increasing the number of hidden units (i.e. the model is more complex).<br />
<br />
<br />
When the number of hidden units gets larger and larger, the training error will decrease until it approaches <math>\displaystyle 0 </math>. If the training error equals <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) we can see the estimate of the MSE approaches <math>\displaystyle 0 </math> as well. However, in fact this does not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the MSE should increase instead of approaching <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
We can see that <math>\displaystyle \hat\sigma^2 </math> is essentially the average of <math>\displaystyle err </math>, so it changes with the model. To deal with this problem, we can average the estimate of <math>\displaystyle \sigma^2</math> over models with different numbers of hidden units: for example, over models with 1 up to 10 hidden units. Since in reality <math>\, \sigma^2</math> is a constant property of the noise in the data and does not depend on <math>\,M+1</math>, using the average <math>\,\hat\sigma^2</math> value over 1 to 10 hidden units has a firm theoretical basis.<br />
<br />
We can also see that, unlike the classical Cross Validation (CV) or Leave One Out (LOO) techniques, the SURE technique does not need a validation set to find the optimal model. Hence, the SURE technique uses less data than CV or LOO, and it is suitable for cases where there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering K-means clustering] is a method of cluster analysis which aims to partition <math>\,n</math> observations into <math>\,k</math> clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units is the same as the number of clusters: each cluster <math>\displaystyle j </math> contributes one basis function <math>\displaystyle \phi_j </math>.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, with the same form for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. For each existing center, we re-identify its cluster: every point in the cluster should be closer to this center than to any other center.<br />
<br />
2. Compute the mean of each cluster and make it the new center of that cluster.<br />
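The two alternating steps can be sketched directly (a minimal version that ignores ties and empty clusters; the two-blob data set is made up for illustration):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # K initial centers are randomly chosen from the training data
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 1: assign every point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Step 2: recompute each center as the mean of its cluster
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated blobs around (0,0) and (5,5): the algorithm should
# recover the blob membership as the two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(5, 0.2, (20, 2))])
labels, centers = kmeans(X, 2)
```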
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden units)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
>> 30 1<br />
>> size(C) <br />
>> 2 80<br />
>> size(sumD) <br />
>> 2 1<br />
>> c1=sum(IDX==1)<br />
>> 14<br />
>> c2=sum(IDX==2)<br />
>> 16<br />
>> sumD<br />
>> 85.6643<br />
>> 101.0419<br />
>> v1=sumD(1,1)/c1 <br />
>> 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
>> 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points in 80 dimensions (MATLAB's kmeans treats rows as observations), and then apply the “kmeans” method to separate <math>X</math> into 2 clusters. IDX is a vector containing 1 or 2, indicating the 2 clusters, and its size is 30*1. <math>\displaystyle C </math> holds the center (mean) of each cluster, with size 2*80; sumD is the sum of squared distances between the data points and the center of their cluster. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the number of data points in clusters 1 and 2. <math>\displaystyle v1 </math> estimates the variance of the first cluster (used as <math>\displaystyle \sigma_1^2</math>); <math>\displaystyle v2 </math> estimates the variance of the second cluster (used as <math>\displaystyle \sigma_2^2</math>). Now we can get <math>\displaystyle \Phi </math>, <math>\displaystyle W </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> by the following equations. Finally, we get the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
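A Python sketch of the same pipeline (the centers, widths, and targets here are made up for illustration; in practice they come from the K-means run above):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((30, 2))                  # 30 training points in 2 dimensions
Y = np.sin(X[:, 0]) + X[:, 1]            # hypothetical targets

mu = np.array([[0.25, 0.25], [0.75, 0.75]])   # stand-in cluster centers
sigma = np.array([0.5, 0.5])                  # stand-in cluster widths

# N x 2 design matrix of Gaussian basis functions
sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
Phi = np.exp(-sq / (2 * sigma**2))

W = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)   # least-squares weights
Y_hat = Phi @ W                               # fitted values, equal to H Y
```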
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is EM algorithm with respect to Gaussian mixture model. Generally speaking, the Gaussian mixture model is referred to as a soft clustering while <math>K</math>-means is hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
E-step: each point is assigned a weight (responsibility) for each cluster, proportional to the likelihood of the point under the corresponding Gaussian. Unlike the hard 0/1 assignment of <math>K</math>-means, these weights are soft values between 0 and 1 that sum to 1 across clusters. <br />
<br />
M-step: compute the weighted means and covariances, and make them the new means and covariances of each cluster.<br />
<br />
>>[P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
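The E-step's soft weights can be sketched as follows (assuming spherical Gaussians with a shared width and equal mixing proportions, a simplification of the general EM update):

```python
import numpy as np

def responsibilities(X, mu, sigma):
    """Soft E-step: weight of each point for each cluster; rows sum to 1."""
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    lik = np.exp(-sq / (2 * sigma**2))   # unnormalized Gaussian likelihoods
    return lik / lik.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
mu = np.array([[0.0, 0.0], [1.0, 1.0]])   # two cluster centers
R = responsibilities(X, mu, sigma=0.5)
# The midpoint (0.5, 0.5) gets weight 0.5 for each cluster, unlike the
# all-or-nothing assignment K-means would make.
```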
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier: they construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized to what is known as the support vector machine. It produces nonlinear boundaries by constructing a linear boundary in a high-dimensional, transformed version of the feature space.<br />
<br />
The original basis for the SVM was published in the 1960s by [http://en.wikipedia.org/wiki/Vapnik Vapnik], Chervonenkis and colleagues; however, the ideas did not gain much attention until strong results were shown in the early 1990s.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a maximum-margin hyperplane, or set of hyperplanes, in a high- or infinite-dimensional space. The set of points near the class boundaries, the support vectors, defines the model, which can be used for classification, regression or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separated by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
We can see that the data points, which fall into two classes in <math>\displaystyle \mathbb{R}^{2} </math>, can be separated by a linear boundary. If a dataset is indeed linearly separable, then there exist infinitely many possible separating hyperplanes for the training data, the two black lines in the figure being two of them. The question is: which solution is best when new data are introduced? <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers such as (28.4) that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: <math>\displaystyle d_i </math> is a signed quantity: it is positive for points on the <math>\displaystyle +1 </math> side and negative for points on the <math>\displaystyle -1 </math> side.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>. We can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers in the Linearly separable case====<br />
Choose the line farthest from both classes, i.e. the line with the maximum distance from the closest point (in other words, maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since we are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Let us rewrite <math>\displaystyle Margin=min\{y_id_i\}</math> by using the following properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Take two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane.<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle (\beta^{T}x_1+\beta_0)-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence, <math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and <math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the unit vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
That is, for any point on the hyperplane, <math>\displaystyle \beta^{T}x_0</math> equals the negative of the intercept. <br/><br />
<br />
<br />
3. The signed distance from any point <math>\displaystyle x_i </math> to the hyperplane is <math>\displaystyle d_i=\beta^{*T}(x_i-x_0)</math>. <br/> Since only the direction of <math>\displaystyle \beta </math> matters, we normalize by its length:<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
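The distance formula can be verified numerically (the hyperplane <math>\displaystyle 3x_1 + 4x_2 - 5 = 0 </math> is a hypothetical example):

```python
import numpy as np

beta = np.array([3.0, 4.0])   # normal vector of the hyperplane
beta0 = -5.0                  # intercept term

def signed_distance(x, beta, beta0):
    # d = (beta^T x + beta0) / ||beta||
    return (beta @ x + beta0) / np.linalg.norm(beta)

# (1, 0.5) lies on the hyperplane 3*1 + 4*0.5 - 5 = 0: distance 0
d_on = signed_distance(np.array([1.0, 0.5]), beta, beta0)
# (3, 4) lies on the positive side, at distance (9 + 16 - 5)/5 = 4
d_pos = signed_distance(np.array([3.0, 4.0]), beta, beta0)
```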
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
We had <math>\displaystyle Margin=min(y_id_i)</math>, <math>\displaystyle i=1,2,....,N </math>, and since we now know how to compute <math>\displaystyle d_i \Rightarrow</math> <br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane. Then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta </math> only the direction is important, so dividing by <math>\displaystyle c </math> does not change its direction, and the hyperplane remains the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math><br />
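As a tiny numeric check (two hypothetical support points, one per class): with <math>\displaystyle \beta=(2,0) </math> and <math>\displaystyle \beta_0=0 </math>, the points <math>\displaystyle (\pm 0.5, 0) </math> satisfy the canonical constraint <math>\displaystyle y_i(\beta^{T}x_i+\beta_0)=1 </math>, and the margin <math>\displaystyle 1/\|\beta\|=0.5 </math> is exactly the distance from each point to the hyperplane <math>\displaystyle x_1=0 </math>:

```python
import numpy as np

beta = np.array([2.0, 0.0])
beta0 = 0.0
X = np.array([[0.5, 0.0], [-0.5, 0.0]])   # one support point per class
y = np.array([1.0, -1.0])

canonical = y * (X @ beta + beta0)        # should all equal 1
margin = 1.0 / np.linalg.norm(beta)       # 0.5
```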
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie,T.,Tibshirani,R., Friedman,J.,(2008).The Elements of Statistical Learning:129-130<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
We now derive the Support Vector Machine for the case where the two classes are separable in the given feature space. The margin can be written as <math>\,min\{y_id_i\}</math>, the distance of the closest point from the hyperplane, where <math>\,d_i</math> is the distance and <math>\,y_i</math> supplies the sign.<br />
===Margin Maximizing Problem for the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, but <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane.<br />
<br />
This implies <math>\,\exists C>0</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> define the hyperplane only up to a positive scale factor; since only the direction matters, dividing through by a constant does not change the hyperplane. Thus, by assuming scaled values for <math>\,\beta, \beta_0</math> we eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>, implying that the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math>.<br />
<br />
Now in order to maximize the margin, we need to maximize <math>\,\frac{1}{\|\beta\|}</math>, that is, minimize <math>\,\|\beta\|</math>. <br />
<br />
In other words, our optimization problem is now to minimize <math>\,\|\beta\|</math>, under the constraint that <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. There are many possible choices of norm, in general the [http://en.wikipedia.org/wiki/P-norm#p-norm p-norm]. The 1-norm of a vector is simply the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance); it tends to produce sparse solutions, but has a discontinuity in its derivative. The 2-norm, or Euclidean norm (the intuitive measure of the length of a vector), is easier to work with: <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>, where the constant 1/2 has been added for later simplification; since squaring is monotone, minimizing the squared norm is the same as minimizing the norm itself.<br />
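As a small illustration (a sketch with made-up numbers, not from the lecture), the 1-norm and 2-norm of the same vector, and the quantity <math>\,\frac{1}{2}\beta^T\beta</math> that appears in the objective:<br />

```python
import numpy as np

beta = np.array([3.0, -4.0])

one_norm = np.sum(np.abs(beta))    # taxicab / Manhattan length: 3 + 4 = 7
two_norm = np.sqrt(beta @ beta)    # Euclidean length: sqrt(9 + 16) = 5

print(one_norm, two_norm)
print(0.5 * two_norm**2 == 0.5 * (beta @ beta))   # (1/2)||beta||_2^2 = (1/2) beta^T beta
```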
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the constraints are satisfied while an optimal solution is found (the optimal saddle point of the Lagrangian for this classic quadratic optimization). The problem will be solved in the dual space by introducing the dual variables <math>\,\alpha_i</math>, in contrast to solving it in the primal space as a function of the betas. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left[y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right]}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is the dual representation of the maximum-margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers, <math>\,\alpha_i \geq 0 \,\, \forall i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the equation that we need to minimize, as well as two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are looking to solve for <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> is <math>\,n \times n</math> with entries <math>\,S_{ij} = y_iy_jx_i^Tx_j = (y_ix_i)^T(y_jx_j)</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's built in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
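The construction of <math>\,S</math> is easy to sanity-check in numpy (an illustrative sketch with made-up data, not from the lecture): stacking the rows <math>\,z_i = y_ix_i^T</math> into a matrix <math>\,Z</math> gives <math>\,S = ZZ^T</math>:<br />

```python
import numpy as np

# Tiny illustrative data set: three 2-D points with labels +/-1.
X = np.array([[1.0, 2.0], [3.0, -1.0], [0.0, 1.0]])   # one point per row
y = np.array([1.0, -1.0, 1.0])

Z = y[:, None] * X    # row i is y_i * x_i^T
S = Z @ Z.T           # S[i, j] = y_i y_j x_i^T x_j  (n x n, symmetric)

# Spot-check one entry against the definition.
i, j = 0, 1
print(np.isclose(S[i, j], y[i] * y[j] * (X[i] @ X[j])))
```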
<br />
===Quadprog example===<br />
<br />
Let's use <code>quadprog</code> to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's <code>quadprog</code> function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. Note that <code>quadprog</code> minimizes while we want to maximize <math>\,L(\alpha)</math>; this is handled by minimizing the negative, <math>\,\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{\alpha}^T\underline{1}</math>, i.e. passing <math>\,H = S</math> and <math>\,f = -\underline{1}</math>.<br />
<br />
We'll use a simple one-dimensional data set: x is essentially -1 or 1 plus Gaussian noise, with class labels y = -1 or 1. (Note: you could easily put the values straight into the quadprog call; they are separated for clarity.)<br />
<br />
 x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]'; % 1 x 200 row of inputs
 y = [-ones(100,1); ones(100,1)];                        % 200 x 1 labels
 v = x' .* y;             % v_i = y_i * x_i
 S = v * v';              % S(i,j) = y_i y_j x_i x_j
 f = -ones(200,1);        % quadprog minimizes (1/2)a'Sa + f'a, so f = -1 maximizes L(alpha)
 A = [];                  % no general inequality constraints;
 b = [];                  % the bound alpha >= 0 is handled by lb
 Aeq = y';                % equality constraint: sum(alpha_i y_i) = 0
 beq = 0;
 lb = zeros(200,1);       % alpha >= 0, element-wise
 ub = [];                 % there is no upper bound
 alpha = quadprog(S,f,A,b,Aeq,beq,lb,ub);
<br />
This gives us the optimal <math>\,\alpha</math>. One caution: if <code>lb</code> is supplied as a scalar rather than a vector, <code>quadprog</code> may pad the missing bounds with <code>-Inf</code>, so only the first variable is actually constrained and negative <math>\,\alpha_i</math> values can appear; supplying <code>lb = zeros(200,1)</code> constrains every component.<br />
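As a cross-check in another environment, the same dual problem can be sketched in Python with scipy (illustrative, not part of the lecture). Since general-purpose solvers minimize, we minimize the negative of <math>\,L(\alpha)</math>, i.e. <math>\,\frac{1}{2}\alpha^TS\alpha - \sum\alpha_i</math>, subject to <math>\,\alpha_i \geq 0</math> and <math>\,\sum\alpha_iy_i = 0</math>. The tiny separable 1-D data set below is chosen so the answer is known in closed form (<math>\,\beta = 1</math>, <math>\,\beta_0 = 0</math>):<br />

```python
import numpy as np
from scipy.optimize import minimize

# Tiny separable 1-D training set (illustrative): the optimal boundary is x = 0.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])

v = y * x                      # v_i = y_i x_i (scalars in 1-D)
S = np.outer(v, v)             # S[i, j] = y_i y_j x_i x_j
n = len(x)

# quadprog-style minimization of (1/2) a' S a - sum(a), the negative dual.
neg_dual = lambda a: 0.5 * a @ S @ a - a.sum()
res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

beta = np.sum(alpha * y * x)          # beta = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                 # pick a support vector (alpha_i > 0)
beta0 = y[sv] - beta * x[sv]          # solve y_i (beta x_i + beta0) = 1

print(np.round(alpha, 3), round(beta, 3), round(beta0, 3))
# expected roughly: alpha = [0, 0.5, 0.5, 0], beta = 1, beta0 = 0
```

Only the two points nearest the boundary end up with <math>\,\alpha_i > 0</math>; the outer points do not contribute.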
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> subject to <math>\,g_i(x) \geq 0,\ i=1,\dots,n</math>. If <math>\,f</math> and <math>\,g_i</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial \hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum_i{\alpha_ig_i'(\hat{x})}=0</math><br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> cannot be an optimal solution.<br />
<br />
These are all fairly direct except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that actually determine the optimal separating hyperplane which we are looking for. They are also the most difficult points to classify, and hence the most informative for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br />In order for this condition to be satisfied, either <br /><math>\,\alpha_i= 0</math> or <br /><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
In the canonical scaling, every point <math>\,x_i</math> satisfies <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1</math>: it lies either exactly on the margin or strictly outside it.<br />
<br />
'''Case 1: a point with <math>\displaystyle y_i(\beta^Tx_i+\beta_0) > 1</math>, strictly outside the margin'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point with <math>\displaystyle y_i(\beta^Tx_i+\beta_0) = 1</math>, on the margin'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
<br />If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
Points on the margin, with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
<br />
===Using support vectors===<br />
<br />
Support vectors are important because they make the support vector machine insensitive to points far from the decision boundary. If <math>\,\alpha_i = 0</math>, the corresponding term in the Lagrangian is zero and the point does not contribute to the solution of the SVM problem; only points on the margin -- support vectors -- contribute. Hence the model produced by the SVM is entirely defined by the set of support vectors, a subset of the training set. This is interesting because in earlier approaches such as neural networks (and classical statistical learning more generally) the configuration of the model had to be specified in advance. Here we have a data-driven, 'nonparametric' model: the training set and the algorithm together determine the support vectors.<br />
<br />
References:<br />
Wang, L. (2005). Support Vector Machines: Theory and Applications. Springer: 3<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab built-in SVM routines (found in the Bioinformatics toolkit) to do classification through support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]<br />
=='''Non-linear hypersurfaces and Non-Separable classes - November 20, 2009'''==<br />
==='''Kernel Trick'''===<br />
We talked about the curse of dimensionality at the beginning of this course; now, however, we turn to the power of high dimensions in order to find a separating hyperplane between two classes of data points. To understand this, imagine a two-dimensional prison where a two-dimensional person is confined. If we magically give the person a third dimension, he can escape from the prison. In other words, the prison and the person become linearly separable with respect to the third dimension. The intuition behind the "kernel trick" is to map data to a higher dimension so that the classes become separable by a hyperplane.<br />
<br />
We have seen SVM as a linear classification problem finding the max margin hyperplane in the given input space. However, for many real world problems a more complex decision boundary is required. The following simple method was devised in order to solve the same linear classification problem in a (usually higher-dimensional) 'feature space', in which the max margin hyperplane is better suited.<br />
<br />
Let <math>\,\phi</math> be a mapping,<br />
<br />
<math>\phi:\Re^d \rightarrow \Re^D </math><br /><br /><br />
<br />
We wish to find a <math>\,\phi</math> such that the transformed data are well suited for separation by a hyperplane. Given this function, we are led to solving the previous constrained quadratic optimization on the transformed dataset,<br /><br /><br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br /><br /><br />
<br />
The solution to this optimization problem is now well known; however, a workable <math>\,\phi</math> must still be determined. Possibly the largest drawback of this method is that we must compute the inner product of two vectors in the high-dimensional space. As the dimensionality of the feature space increases, this inner product becomes computationally intensive or impossible to evaluate directly.<br />
<br />
However, we have a very useful result that says that there exists a class of functions, <math>\,\Phi</math>, which satisfy the above requirements and that for any function <math>\,\phi \in \Phi</math>,<br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j) </math><br /><br /><br />
<br />
Where K is a kernel function in the input space satisfying [http://en.wikipedia.org/wiki/Mercer%27s_condition Mercer's condition] (which guarantees that it indeed corresponds to some mapping function <math>\,\phi</math>). As a result, if the objective function depends only on inner products and not on coordinates, we can use the kernel function to compute implicitly in the feature space without ever storing the high-dimensional representations. Not only does this solve the computational problem, it also no longer requires us to explicitly determine a specific mapping function in order to use this method. In fact, it is even possible to use an infinite-dimensional feature space in SVM without knowing the function <math>\,\phi</math>.<br />
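A standard concrete instance of this identity (sketched here as a numerical check, not from the lecture): in two dimensions, the degree-2 polynomial kernel <math>\,K(x,y)=(x^Ty)^2</math> corresponds to the explicit feature map <math>\,\phi(x)=(x_1^2,\ \sqrt{2}x_1x_2,\ x_2^2)</math>:<br />

```python
import numpy as np

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel in 2-D.
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def K(u, v):
    # The same quantity computed in the input space: O(d) work, no mapping.
    return (u @ v) ** 2

u = np.array([1.0, 2.0])
w = np.array([3.0, -1.0])

print(np.isclose(phi(u) @ phi(w), K(u, w)))   # True: phi(u)^T phi(w) = K(u, w)
```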
<br />
==='''Mercer's Theorem in detail'''===<br />
Let <math>\,\phi</math> be a mapping to a high dimensional [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] <math>\,H</math><br /><br />
<br />
<br />
<math>\phi:x \in \Re^d \rightarrow H </math><br /><br /><br />
<br />
The transformed coordinates can be defined as,<br /><br />
<br />
<math>\phi_1(x)\dots\phi_d(x)\dots </math><br /><br /><br />
<br />
By Hilbert - Schmidt theory we can represent an inner product in Hilbert space as,<br /><br /><br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = \sum_{r=1}^{\infty}a_r\phi_r(x_i)\phi_r(x_j) = K(x_i,x_j), \ a_r \ge 0 </math><br /><br /><br />
where K is symmetric, then Mercer's theorem gives necessary and sufficient conditions on K for it to satisfy the above relation.<br><br><br />
<br />
'''Mercer's Theorem'''<br />
<br />
Let C be a compact subset of <math>\Re^d</math> and K a function <math> \in L^2(C) </math>, if<br /><br /><br />
<br />
<math>\, \int_C\int_C K(u,v)g(u)g(v)dudv \ge 0, \ \forall g \in L^2(C)</math> <br /><br /><br />
<br />
then,<br /><br /><br />
<br />
<math>\sum_{r=1}^{\infty}a_r\phi_r(u)\phi_r(v)</math> converges absolutely and uniformly to a symmetric function <math>\,K(u,v)</math><br />
<br />
References:<br />
Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons: 423<br />
<br />
==='''Kernel Functions'''===<br />
There are various kernel functions, for example:<br />
<br />
* Linear kernel: <math>\,k(x,y)=x \cdot y</math><br />
* Polynomial kernel: <math>\,k(x,y)=(x \cdot y)^d</math><br />
* Gaussian kernel: <math>\,k(x,y)=e^{-\frac{\|x-y\|^2}{2\sigma^2}}</math><br />
<br />
If <math>\,X</math> is a <math>\,d \times n</math> matrix in the original space, and <math>\,\phi(X)</math> is a <math>\,D \times n</math> matrix in the [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] (good explanation video: [http://www.youtube.com/watch?v=V2pBdH7YzX0 part 1] [http://www.youtube.com/watch?v=YRY5xlk3TC0 part 2]), then <math>\,\phi^T(X) \cdot \phi(X)</math> is an <math>\,n \times n</math> matrix. <br />
The inner product can be interpreted as a measure of similarity between data points. This gives us some insight into how to choose the kernel: the choice depends on prior knowledge of the problem and on how we believe the similarity of our data should be measured. In practice, the Gaussian (RBF) kernel often works well. Besides the common kernel functions mentioned above, many specialized kernels have been proposed for particular problem domains such as text classification and gene classification.<br />
<br />
These kernel functions can be applied to many algorithms to derive the "kernel version". For example, kernel PCA, kernel LDA, etc..<br />
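Mercer's condition implies that the kernel matrix <math>\,K_{ij}=K(x_i,x_j)</math> built from any data set is positive semi-definite. A quick numerical check for the Gaussian kernel (an illustrative sketch with random data, not from the lecture):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 random 3-D points
sigma = 1.5

# Pairwise squared distances, then the Gaussian (RBF) kernel matrix.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))

eigs = np.linalg.eigvalsh(K)
print(eigs.min() > -1e-8)   # all eigenvalues (numerically) non-negative
```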
<br />
==='''SVM: non-separable case'''===<br />
We have seen how SVMs find an optimal separating hyperplane between two separable classes of data, in which case the margin contains no data points. In the real world, however, data from different classes are usually mixed together near the boundary and it is hard to find a perfect boundary that totally separates them. To address this problem, we relax the classification rule to allow some data to cross the margin. Mathematically the problem becomes,<br />
:<math>\min_{\beta, \beta_0} \frac{1}{2}|\beta|^2</math><br />
:<math>\,y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math><br />
:<math>\xi_i \geq 0</math><br />
<br />
Now each data point is allowed some error <math>\,\xi_i</math>. However, we only want points to cross the margin when they have to, at minimum total sacrifice; thus, a penalty term on the total slack is added to the objective function. The optimization problem now becomes:<br />
<br />
:<math>\min_{\beta,\beta_0,\xi} \frac{1}{2}|\beta|^2+\gamma\sum_{i=1}^n{\xi_i}</math><br />
:<math>\,s.t.</math> <math>y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math> <br />
:<math>\xi_i \geq 0</math><br />
<br />
[[File:non-separable.JPG|350px|thumb|right|Figure non-separable case]]<br />
<br />
<br />Note that <math>\,\xi_i</math> is not necessarily smaller than one, which means data can not only enter the margin but can also cross the separating hyperplane.<br />
<br />
References:<br />
<br />
Mercer, J. (1909). Functions of positive and negative type and their connection<br />
with the theory of integral equations. Philos. Trans. Roy. Soc. London, A<br />
209: 415-446<br />
<br />
==Support Vector Machine algorithm for non-separable cases - November 23, 2009==<br />
<br />
With the formulation above, we can form the Lagrangian, apply the KKT conditions, and arrive at a new function to optimize. As we will see, the function we optimize in the SVM algorithm for non-separable data sets is the same as in the separable case, with slightly different constraints.<br />
<br />
===Forming the Lagrangian===<br />
<br />
:<math>L: \frac{1}{2} |\beta|^2 + \gamma \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]-\sum_{i=1}^n \lambda_i \xi_i</math><br />
:<math>\alpha_i \geq 0, \lambda_i \geq 0</math><br />
<br />
===Applying KKT conditions===<br />
<br />
Setting the derivatives of the Lagrangian to zero gives the stationarity conditions:<br />
:<math>\,\frac{\partial L}{\partial \beta} = 0 \Rightarrow \beta = \sum_{i=1}^n \alpha_i y_i x_i</math><br />
:<math>\,\frac{\partial L}{\partial \beta_0} = 0 \Rightarrow \sum_{i=1}^n \alpha_i y_i = 0</math><br />
:<math>\,\frac{\partial L}{\partial \xi_i} = 0 \Rightarrow \gamma - \alpha_i - \lambda_i = 0</math><br />
<br />
Complementary slackness requires<br />
:<math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1+\xi_i] = 0</math> and <math>\,\lambda_i\xi_i = 0</math><br />
together with dual feasibility <math>\,\alpha_i \geq 0, \lambda_i \geq 0</math> and primal feasibility <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math>, <math>\,\xi_i \geq 0</math>. Substituting the stationarity conditions back into the Lagrangian eliminates <math>\,\beta</math>, <math>\,\beta_0</math> and <math>\,\xi</math>, leaving the same dual objective as in the separable case, now with the box constraint <math>\,0 \leq \alpha_i \leq \gamma</math>.<br />
<br />
===Putting it all together===<br />
<br />
With our KKT conditions and the Lagrangian equation, we can now use quadratic programming to find <math>\,\alpha</math>.<br />
<br />
In matrix form, we want to solve the following optimization:<br />
:<math>\max_{\alpha} L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:<math>\,s.t.</math> <math>\underline{0} \leq \underline{\alpha} \leq \gamma</math>, <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Solving this gives us <math>\,\underline{\alpha}</math>, which we can use to find <math>\,\underline{\beta}</math> as before:<br />
:<math>\,\underline{\beta} = \sum{\alpha_i y_i \underline{x_i}}</math><br />
<br />
However, we cannot find <math>\,\beta_0</math> in the same way as before, even if we choose a point with <math>\,\alpha_i > 0</math>, because we do not know the value of <math>\,\xi_i</math> in the equation<br />
:<math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 + \xi_i = 0</math><br />
<br />
From our discussion on the KKT conditions, we know that <math>\,\lambda_i \xi_i = 0</math> and <math>\,\gamma = \alpha_i + \lambda_i</math>.<br />
<br />
So, if <math>\,\alpha_i < \gamma</math> then <math>\,\lambda_i > 0</math> and consequently <math>\,\xi_i = 0</math>.<br />
<br />
Therefore, we can solve for <math>\,\beta_0</math> if we choose a point where:<br />
:<math>\,0 < \alpha_i < \gamma</math><br />
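A Python sketch with scipy (illustrative, not part of the lecture): the dual is solved exactly as in the separable case except for the box constraint <math>\,0 \leq \alpha_i \leq \gamma</math>, and <math>\,\beta_0</math> is recovered from a point strictly inside the box. The tiny 1-D data set below contains one mislabelled point:<br />

```python
import numpy as np
from scipy.optimize import minimize

# 1-D training set with one mislabelled point at x = -0.5 (illustrative).
x = np.array([-2.0, -1.0, 1.0, 2.0, -0.5])
y = np.array([-1.0, -1.0, 1.0, 1.0, 1.0])
gamma = 1.0                                  # penalty weight on the slacks

v = y * x
S = np.outer(v, v)                           # S[i, j] = y_i y_j x_i x_j
n = len(x)

# Minimize the negative dual, now with the box constraint 0 <= alpha <= gamma.
neg_dual = lambda a: 0.5 * a @ S @ a - a.sum()
res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0.0, gamma)] * n,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

beta = np.sum(alpha * y * x)
# beta_0 from a point strictly inside the box (0 < alpha_i < gamma => xi_i = 0).
inside = np.where((alpha > 1e-4) & (alpha < gamma - 1e-4))[0][0]
beta0 = y[inside] - beta * x[inside]

print(round(beta, 3), round(beta0, 3))       # roughly 0.667 and 0.333 here
```

The mislabelled point hits the box (<math>\,\alpha_i = \gamma</math>) and is absorbed as slack rather than dominating the solution.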
<br />
====The SVM algorithm for non-separable data sets====<br />
<br />
The algorithm, then, for non-separable data sets is:<br />
<br />
# Use <code>quadprog</code> (or another quadratic programming technique) to solve the above optimization and find <math>\,\alpha</math><br />
# Find <math>\,\underline{\beta}</math> by solving <math>\,\underline{\beta} = \sum{\alpha_i y_i x_i}</math><br />
# Find <math>\,\beta_0</math> by choosing a point where <math>\,0 < \alpha_i < \gamma</math> and then solving <math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 = 0</math></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=5508stat8412009-11-23T23:52:48Z<p>Ipargaru: /* Forming the Lagrangian */</p>
<hr />
<div>
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{color}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> gives the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
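For instance (a small sketch with made-up labels, not from the notes), the empirical error rate is simply the fraction of training points the rule misclassifies:<br />

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])   # hypothetical labels Y_i
y_pred = np.array([1, 0, 0, 1, 0, 0])   # hypothetical classifier outputs h(X_i)

# L_h = (1/n) * sum of indicators I(h(X_i) != Y_i)
empirical_error = np.mean(y_pred != y_true)
print(empirical_error)   # 2 mistakes out of 6 points
```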
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two classes, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the probability <math>\,r(x)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> (\hat r) </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimensional cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate, that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & \mathrm{if}\ P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remarks:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since in practice the prior probabilities generally cannot be specified.<br />
<br />
<br />
'''Example''':<br /><br />
We're going to predict whether a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
Whether the student's GPA > 3.0 (G)<br />
Whether the student had a strong math background (M)<br />
Whether the student is a hard worker (H)<br />
Whether the student passed or failed the course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
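The posterior calculation above can be reproduced in a few lines. The sketch below is in Python (the course uses Matlab, but the arithmetic is the same); the class-conditional values <math>\,P(X=(0,1,0)|Y=1)=0.05</math> and <math>\,P(X=(0,1,0)|Y=0)=0.2</math> are deduced from the numerator 0.025 and denominator 0.125 in the computation above, given the equal priors of 0.5.<br />

```python
# Bayes rule for the student example: classify X = (G, M, H) = (0, 1, 0).
prior_pass, prior_fail = 0.5, 0.5   # P(Y=1), P(Y=0), as assumed above
p_x_given_pass = 0.05               # P(X=(0,1,0)|Y=1), implied by the text
p_x_given_fail = 0.20               # P(X=(0,1,0)|Y=0), implied by the text

numerator = p_x_given_pass * prior_pass
denominator = numerator + p_x_given_fail * prior_fail
r = numerator / denominator         # r(X) = P(Y=1|X=(0,1,0))

h = 1 if r > 0.5 else 0             # Bayes rule: threshold the posterior at 1/2
print(r, h)                         # 0.2 0 -> predict "fail"
```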
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN) and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as having an objective existence. In fact, they represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics there have been two major schools of thought: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between Bayesians and frequentists.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables that have a given distribution, and other probability statements can be made about them. <br />
#Can be applied to single events, based on degree of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g. a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In Bayesian method, at first, one can see this man (object), and then judge whether his name is Jack (label). On the other hand, in Frequentist method, one doesn’t see the man (object), but can see the photos (label) of this man to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier is optimal, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math>. <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in dimensions greater than 2.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the <math>\Sigma_k \ \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(G=k|X=x)=Pr(G=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>p</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math><br />
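The derivation above can be checked numerically. The following Python/numpy sketch (the course code is in Matlab, but the arithmetic is identical) uses made-up values for <math>\,\Sigma</math> and the two means, and verifies that with equal priors the linear discriminants agree exactly at the midpoint of the means, as claimed.<br />

```python
import numpy as np

# Linear discriminant for LDA with a shared covariance Sigma:
# delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log(pi_k).
# All numeric values below are illustrative.
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
mu_k = np.array([0.0, 0.0])
mu_l = np.array([2.0, 1.0])
pi_k = pi_l = 0.5

Sigma_inv = np.linalg.inv(Sigma)

def delta(x, mu, pi):
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

# With equal priors the boundary lies halfway between the two means:
midpoint = (mu_k + mu_l) / 2
print(np.isclose(delta(midpoint, mu_k, pi_k), delta(midpoint, mu_l, pi_l)))  # True
```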
<br />
===QDA===<br />
The concept is the same: find a boundary where the error rates for classification between the classes are equal, except that the assumption that each cluster has the same covariance matrix <math>\,\Sigma</math> (equal to the mean of the <math>\Sigma_k \ \forall k</math>) is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because the <math>\,x^\top\Sigma_k^{-1}x</math> and <math>\,x^\top\Sigma_l^{-1}x</math> terms no longer cancel when <math>\,\Sigma_k \ne \Sigma_l</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
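As a sketch of the rule <math>\,h(X) = \arg\max_{k} \delta_k(x)</math>, the Python/numpy snippet below evaluates the quadratic form of <math>\,\delta_k</math> for two illustrative (made-up) Gaussian classes and picks the class with the larger score.<br />

```python
import numpy as np

# delta_k(x) = -(1/2) log|Sigma_k| - (1/2)(x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log(pi_k)
def delta_quadratic(x, mu, Sigma, pi):
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)   # numerically stable log|Sigma|
    return -0.5 * logdet - 0.5 * (x - mu) @ Sigma_inv @ (x - mu) + np.log(pi)

# Two illustrative classes with different covariances (QDA setting).
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
pis = [0.5, 0.5]

def h(x):
    scores = [delta_quadratic(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
    return int(np.argmax(scores))          # the Bayes rule from the theorem

print(h(np.array([0.2, -0.1])), h(np.array([2.9, 3.4])))  # 0 1
```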
<br />
===In practice===<br />
We need to estimate the priors and class densities; to do this, we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
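The estimates above are straightforward to compute. Below is a Python/numpy sketch (Matlab would be analogous) on a small synthetic two-class data set, including the pooled covariance used by LDA.<br />

```python
import numpy as np

# Tiny synthetic data set: 30 points from class 0, 20 from class 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 30 + [1] * 20)
n = len(y)

pi_hat, mu_hat, Sigma_hat, n_k = {}, {}, {}, {}
for k in (0, 1):
    Xk = X[y == k]
    n_k[k] = len(Xk)
    pi_hat[k] = n_k[k] / n              # prior estimate: n_k / n
    mu_hat[k] = Xk.mean(axis=0)         # class mean
    D = Xk - mu_hat[k]
    Sigma_hat[k] = D.T @ D / n_k[k]     # ML covariance (divide by n_k)

# Pooled covariance for LDA: weighted average of the class covariances.
Sigma_pooled = sum(n_k[k] * Sigma_hat[k] for k in (0, 1)) / n
print(pi_hat[0], pi_hat[1])             # 0.6 0.4
```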
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
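In this case the rule reduces to a nearest-centroid classifier adjusted by the log prior. A minimal Python/numpy sketch with illustrative centers:<br />

```python
import numpy as np

# Case Sigma_k = I: delta_k(x) = -(1/2)||x - mu_k||^2 + log(pi_k),
# so we pick the class whose (prior-adjusted) center is closest.
mus = np.array([[0.0, 0.0], [4.0, 0.0]])   # illustrative class centers
log_pis = np.log([0.5, 0.5])               # equal priors

def classify(x):
    deltas = [-0.5 * np.sum((x - mu) ** 2) + lp for mu, lp in zip(mus, log_pis)]
    return int(np.argmax(deltas))

print(classify(np.array([1.0, 1.0])), classify(np.array([3.5, -0.5])))  # 0 1
```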
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
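The transformation can be verified numerically: after mapping through <math> \, S^{-\frac{1}{2}}U^\top </math>, the Mahalanobis distance under <math>\,\Sigma</math> becomes an ordinary Euclidean distance. A Python/numpy sketch with an illustrative <math>\,\Sigma</math>:<br />

```python
import numpy as np

# With Sigma = U S U^T (eigendecomposition), the map x* = S^{-1/2} U^T x
# turns (x-mu)^T Sigma^{-1} (x-mu) into the squared Euclidean distance.
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])  # illustrative covariance
S_vals, U = np.linalg.eigh(Sigma)           # Sigma = U diag(S_vals) U^T
W = np.diag(S_vals ** -0.5) @ U.T           # W = S^{-1/2} U^T

x = np.array([1.0, 2.0])
mu = np.array([0.5, -1.0])

mahalanobis_sq = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
euclid_sq = np.sum((W @ x - W @ mu) ** 2)   # distance after the transform
print(np.isclose(mahalanobis_sq, euclid_sq))  # True
```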
<br />
Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose you have two classes with different shapes, and consider transforming them to the same shape. Given a data point, which transformation should you use to decide which class the point belongs to? If you use the transformation of class A, then you have already assumed that the point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare the differences between one given class and remaining <math>\,K-1</math> classes, totally, there are <math>\,K-1</math> differences. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
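These counts are easy to encode as small helper functions; a Python sketch:<br />

```python
# Parameter counts from the text.
def lda_params(K, d):
    # Each of the K-1 pairwise differences a^T x + b needs d+1 parameters.
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # Each difference x^T a x + b^T x + c needs d(d+3)/2 + 1 parameters.
    return (K - 1) * (d * (d + 3) // 2 + 1)

print(lda_params(3, 10), qda_params(3, 10))  # 22 132
```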
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function. Here we will analyze the code of <code>princomp</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
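The same comparison can be made outside Matlab. The Python/numpy sketch below (on random data, since 2_3.mat is not assumed available here) shows that projecting the centered data onto <math>\,V</math> gives the same scores as <math>\,U d</math>, which is exactly the relationship <code>princomp</code> relies on.<br />

```python
import numpy as np

# Python/numpy analogue of the princomp-vs-SVD comparison above.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))            # rows = observations, columns = variables

Xc = X - X.mean(axis=0)                  # center by subtracting column means
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U diag(d) V^T

score_svd = Xc @ Vt.T                    # project onto the columns of V
score_alt = U * d                        # equivalently U @ diag(d)

print(np.allclose(score_svd, score_alt))  # True
```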
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the entries of a symmetric <math>\,d \times d</math> covariance matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
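The algebra behind the trick is easy to check numerically. The Python/numpy sketch below (with randomly chosen illustrative weights, and <math>v</math> taken as the diagonal entries <math>v_1,\dots,v_d</math> as in the definition of <math>\underline{w}^*</math> above) shows that the quadratic function equals a linear function of the augmented vector <math>\,x^*</math>.<br />

```python
import numpy as np

# The quadratic g(x) = x^T diag(v) x + w^T x is *linear* in the augmented
# vector x* = [x, x^2] with weights w* = [w, v].
d = 3
rng = np.random.default_rng(2)
w = rng.normal(size=d)
v = rng.normal(size=d)                   # diagonal of the quadratic term
x = rng.normal(size=d)

g_quadratic = x @ np.diag(v) @ x + w @ x  # non-linear in x

x_star = np.concatenate([x, x ** 2])      # augmented features
w_star = np.concatenate([w, v])
g_linear = w_star @ x_star                # linear in x*

print(np.isclose(g_quadratic, g_linear))  # True
```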
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this idea by using fourth powers instead of squares (i.e. setting <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA stands in contrast to that of our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the prior knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, and each class may have a different size. To separate the two classes we must determine the class whose mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the squared Euclidean distance between the projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\partial L}{\partial \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices (each positive semi-definite); assuming it is positive definite, it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br /><br />
:<math>\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br /><br />
So the quantity <math>S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>; that is, we can take <math>\underline{w} \propto S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />
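This closed form is easy to check numerically. The sketch below (pure Python; the 2&times;2 inverse helper is purely illustrative) uses the same class statistics as the Matlab example in the next subsection:<br />
<br />
```python
# Sketch: the two-class FDA direction is w proportional to S_W^{-1}(mu1 - mu2).
# Class means and within-class scatter match the FDA vs. PCA Matlab example.
def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

mu1, mu2 = [1.0, 1.0], [5.0, 3.0]
S_W = [[2.0, 3.0], [3.0, 6.0]]  # Sigma1 + Sigma2, each [[1, 1.5], [1.5, 3]]
diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
Si = inv2(S_W)
w = [Si[0][0] * diff[0] + Si[0][1] * diff[1],
     Si[1][0] * diff[0] + Si[1][1] * diff[1]]
print(w)  # [-6.0, 2.666...]; only the direction matters
```
<br />
Any nonzero scalar multiple of <code>w</code> gives the same discriminant.<br />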
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the most discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64&times;400 and each column represents an 8&times;8 image of a "2" or a "3". Here X1 gets all the "2"s and X2 gets all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(With more than two classes, a single direction is generally not enough to separate all classes, so it is more reasonable to project onto at least two directions.)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (The <math>\frac{1}{n_{i}}</math> normalization is omitted from <math>\mathbf{S}_{W,i}</math> so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is constant; since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get <math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general definition of <math>\mathbf{S}_{B}</math>. Denote the total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Comparing this with the decomposition of the total covariance <math>\mathbf{S}_{T}</math> into the within class covariance <math>\mathbf{S}_{W}</math> and a between class part, we can identify the second term as the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
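This decomposition can be verified numerically. Below is a one-dimensional, two-class sketch in Python (made-up data; in one dimension the scatter matrices reduce to scalars):<br />
<br />
```python
# Sketch: check the identity S_T = S_W + S_B on a tiny 1-D, two-class
# data set (made-up numbers, so the scatter "matrices" are scalars).
classes = {1: [0.0, 2.0], 2: [4.0, 6.0]}
all_x = [x for xs in classes.values() for x in xs]
mu = sum(all_x) / len(all_x)               # total mean

S_T = sum((x - mu) ** 2 for x in all_x)    # total scatter

S_W = 0.0
S_B = 0.0
for xs in classes.values():
    mu_i = sum(xs) / len(xs)                       # class mean
    S_W += sum((x - mu_i) ** 2 for x in xs)        # unnormalized within scatter
    S_B += len(xs) * (mu_i - mu) ** 2              # n_i-weighted between scatter
print(S_T, S_W + S_B)  # 20.0 20.0
```
<br />
The same identity holds term by term for the matrix-valued scatter in higher dimensions.<br />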
<br />
Recall that in the two-class problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Since <math>\mathbf{\mu}_{1}-\mathbf{\mu}=\frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu}=-\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>, this reduces to <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B^{\ast}}</math>, so the two definitions are proportional and lead to the same discriminant directions.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j: y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution is that the columns of the transformation matrix <math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the <math>k-1</math> largest eigenvalues in<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that the <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, thus set the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \leq k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix <math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the <math>k-1</math> largest eigenvalues in<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,x_{1}, ..., x_{d}</math>, the <math>d</math> components of the input vector.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(x) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the least squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
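The closed-form solution can be checked on a tiny example. The following pure-Python sketch fits made-up data that lies exactly on the line <math>y = 1 + 2x</math>, so the normal equations should recover <math>\hat\beta = (1, 2)^{T}</math>:<br />
<br />
```python
# Sketch: least-squares fit via the normal equations on a tiny data set.
# Design matrix has a leading column of ones (for beta_0); made-up data.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]  # lies exactly on y = 1 + 2x

# Form X^T X and X^T y
XtX = [[sum(X[k][i] * X[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
Xty = [sum(X[k][i] * y[k] for k in range(3)) for i in range(2)]

# Invert the 2x2 matrix X^T X and apply it to X^T y
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
beta = [(XtX[1][1] * Xty[0] - XtX[0][1] * Xty[1]) / det,
        (-XtX[1][0] * Xty[0] + XtX[0][0] * Xty[1]) / det]
print(beta)  # [1.0, 2.0]
```
<br />
In practice one would solve the normal equations with a numerically stable decomposition rather than an explicit inverse; the explicit 2&times;2 inverse here is only to mirror the formula above.<br />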
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{j}f_{j}(x)\pi_{j}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab, with an explanation of each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by adding a row of ones to the transposed data; this row corresponds to the intercept term. (Note the transpose: <code>sample</code> is 400&times;2, so <code>x</code> is 3&times;400.)<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| The figure shows the classification of the data points in 2_3.m by the linear regression model.]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== Logistic Function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y\, dx = \ln(1 + e^{x}) + C</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
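Properties 1 and 2 are easy to verify numerically; a small Python sketch (the evaluation point 0.7 is arbitrary):<br />
<br />
```python
import math

# Sketch: numerically check two properties of the logistic function.
def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

# Property 2: y(0) = 1/2
assert abs(sigma(0.0) - 0.5) < 1e-12

# Property 1: dy/dx = y(1 - y), checked by a central finite difference
h = 1e-6
numeric = (sigma(0.7 + h) - sigma(0.7 - h)) / (2 * h)
analytic = sigma(0.7) * (1 - sigma(0.7))
print(abs(numeric - analytic) < 1e-8)  # True
```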
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
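Concretely, exponentiating the log odds model and solving for the posterior probability gives the logistic form used in the next section:<br />
<br />
:<math>\frac{P(Y=1|X=x)}{1-P(Y=1|X=x)}=\exp(\beta^Tx) \Rightarrow P(Y=1|X=x)=\frac{\exp(\beta^Tx)}{1+\exp(\beta^Tx)}</math><br />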
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression fits a distribution to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood, using the conditional probability Pr(Y|X). The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
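The simplification above can be verified numerically on random data. A quick sketch (Python/NumPy rather than the course's MATLAB; the data here is arbitrary, chosen only to exercise the two formulas):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # rows are the x_i
y = rng.integers(0, 2, size=20)     # labels in {0, 1}
beta = np.array([0.5, -1.0, 2.0])   # an arbitrary coefficient vector

eta = X @ beta                      # beta^T x_i for each i
p = np.exp(eta) / (1 + np.exp(eta))

# log-likelihood from the Bernoulli form p^y (1-p)^(1-y)
ll_direct = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# simplified form derived above: sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]
ll_simplified = np.sum(y * eta - np.log(1 + np.exp(eta)))

assert np.isclose(ll_direct, ll_simplified)
```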
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained if we first reduce the occurrences of <math>\underline{\beta}</math> to one using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiate <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
A weighted linear regression is then applied to the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
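The closed-form WLS estimator can be sanity-checked against ordinary least squares on <math>\sqrt{w_i}</math>-scaled data, since minimizing <math>\sum_i w_i(z_i-\mathbf{x}_i^T\beta)^2</math> is the same as an unweighted fit of <math>\sqrt{w_i}z_i</math> on <math>\sqrt{w_i}\mathbf{x}_i</math>. A small sketch on random data (Python/NumPy, for illustration only; <math>X</math> is <math>d\times n</math> with observations as columns, as in the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 50
X = rng.normal(size=(d, n))        # columns are the x_i
z = rng.normal(size=n)             # response values z_i
w = rng.uniform(0.1, 2.0, size=n)  # positive weights w_i
W = np.diag(w)

# closed-form WLS estimator from above: (X W X^T)^{-1} X W z
beta_wls = np.linalg.solve(X @ W @ X.T, X @ W @ z)

# same answer by ordinary least squares on sqrt(w)-scaled data
sw = np.sqrt(w)
beta_ols = np.linalg.lstsq((X * sw).T, z * sw, rcond=None)[0]

assert np.allclose(beta_wls, beta_ols)
```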
<br />
<br />
'''note:'''Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model like <math>\underline{\beta}^T\underline{x}</math>. If we construct the model like <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then similar to linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. In the case that it does not, we can only prove local convergence of the method, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: it is uncommon for an initial point to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will resolve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
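The pseudo code can be transcribed almost line by line. A hedged sketch in Python (NumPy rather than the course's MATLAB; the function name `logistic_irls` and its arguments are choices made here, not from the lecture):

```python
import numpy as np

def logistic_irls(X, y, tol=1e-8, max_iter=50):
    """Newton-Raphson / IRLS for logistic regression.
    X is the d-by-n input matrix (columns are observations); y has entries in {0,1}."""
    d, n = X.shape
    beta = np.zeros(d)                                    # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X.T @ beta))             # step 3: P(x_i; beta)
        W = np.diag(p * (1.0 - p))                        # step 4: diagonal weight matrix
        z = X.T @ beta + (y - p) / (p * (1.0 - p))        # step 5: Z = X'b + W^{-1}(Y - P)
        beta_new = np.linalg.solve(X @ W @ X.T, X @ W @ z)  # step 6: (XWX')^{-1} XWZ
        if np.linalg.norm(beta_new - beta) < tol:         # step 7: stop when beta stabilizes
            return beta_new
        beta = beta_new
    return beta
```

Note that on perfectly separable data the maximum likelihood estimate does not exist and the iteration diverges, so any demonstration data should have overlapping classes.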
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>; it is guaranteed to range from 0 to 1 and to sum to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is <math>\,d</math>-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is <math>\,d</math>-dimensional, the number of adjustable parameters in LDA is <math>\,2d+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#Since logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to classify the data by logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
As in the two class case, the coefficients <math>\beta_i</math> can be fit by maximum likelihood; viewing the resulting Newton-Raphson step as a weighted least squares problem again makes the derivation easier.<br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.<br />
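The two posterior formulas above are easy to compute directly. A minimal sketch in Python (NumPy for illustration; the coefficient vectors below are hypothetical, chosen only to show the mechanics for a 3-class problem in 2 dimensions):

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """Posteriors for K classes given K-1 coefficient vectors beta_1..beta_{K-1};
    class K is the arbitrary reference class in the denominator."""
    scores = np.array([b @ x for b in betas])            # beta_i^T x, i = 1..K-1
    denom = 1.0 + np.sum(np.exp(scores))                 # 1 + sum_k exp(beta_k^T x)
    return np.append(np.exp(scores) / denom, 1.0 / denom)  # last entry is class K

# hypothetical coefficients for K = 3 classes, d = 2
betas = [np.array([1.0, -0.5]), np.array([-0.2, 0.8])]
p = multiclass_posteriors(betas, np.array([0.3, 0.7]))

# the K posteriors are positive and sum to 1
assert np.isclose(p.sum(), 1.0) and np.all(p > 0)
```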
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
A separating hyperplane classifier tries to separate the data using linear decision boundaries. When the classes overlap, it can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged and transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>\mathrm{sign}(\underline{\beta}^T \underline{x} + \beta_{0}) = \mathrm{sign}(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math> (here for <math>\,d=2</math>)<br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the [http://en.wikipedia.org/wiki/Artificial_neural_network Artificial Neural Network] models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Due to the iterative nature of the solution, the problem has no unique global optimum (the criterion is not convex). The algorithm does not converge to a unique hyperplane, and the solution depends on the size of the gap between classes. If the classes are separable, the algorithm is guaranteed to converge to some separating hyperplane; the proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br />
<br />
<br />
If the separating hyperplane between the 2 classes is not unique, the perceptron algorithm may return any of the infinitely many solutions.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line. We just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will get all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
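The same training loop, transcribed to Python for readers without MATLAB (a sketch only; the learning rate rho, the initial weights, and the 100-pass cap are the arbitrary choices from the code above):

```python
import numpy as np

y = np.array([1, 1, 1, -1, -1, -1])
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)  # rows are examples
b0, b = 0.0, np.ones(3)   # initial guess for the hyperplane
rho = 0.5                 # learning rate

for _ in range(100):
    changed = False
    for xi, yi in zip(X, y):
        if (b @ xi + b0) * yi < 0:     # misclassified: move the boundary toward xi
            b += rho * yi * xi
            b0 += rho * yi
            changed = True
    if not changed:                    # a full pass with no updates: converged
        break

# all margins are now non-negative; some points can end exactly on the
# boundary, since only strictly negative margins trigger an update
assert all((b @ xi + b0) * yi >= 0 for xi, yi in zip(X, y))
```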
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the intercept input, <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of these inputs with some weights, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by<math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\|\underline{\beta}\|=1</math>; in general the distances here are scaled by <math>1/\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. This problem can be eliminated by using the basis expansion technique. To be specific, we try to find a hyperplane not in the original space, but in the enlarged space obtained by applying some basis functions.<br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#A perfect separation is not always achievable, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Consider yourself on a mountain peak, wanting to reach the land below as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the gradient. However, depending on the shape of the mountain, you may descend into a basin partway down (a local minimum) and get stuck there.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over <math>\,i</math> (all data points). This is an alternative to the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, where we approximate the true gradient by evaluating it on a single training example. This means that <math>\,{\beta}</math> is improved using one sample at a time. When the data set is large, say a population database, it is very time-consuming to sum over millions of samples; with stochastic gradient descent we can treat the samples one by one and still get decent results in practice.<br />
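As an illustration, the following is a minimal sketch in Python/NumPy of the stochastic (sample-by-sample) perceptron update; the function and variable names are our own, not from any particular library:

```python
import numpy as np

def perceptron_sgd(X, y, rho=0.1, epochs=100, seed=0):
    """Stochastic gradient descent for the perceptron: the weights are
    updated after each individual misclassified sample rather than after
    summing the gradient over the whole training set."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    beta0 = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):                 # visit samples in random order
            if y[i] * (X[i] @ beta + beta0) <= 0:    # misclassified point
                beta += rho * y[i] * X[i]            # single-sample update
                beta0 += rho * y[i]
    return beta, beta0

# Linearly separable toy data with labels in {-1, +1}
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
beta, beta0 = perceptron_sgd(X, y)
print(np.sign(X @ beta + beta0))   # all training points classified correctly
```

For separable data such as this, the perceptron convergence theorem guarantees the updates stop after finitely many corrections.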
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <br />
<br />
<ref> Haykin, Simon (2009). Neural Networks and Learning Machines. Pearson Education, Inc. </ref><br />
A neural network resembles the brain in two respects:<br />
1. Knowledge is acquired by the network from its environment through a learning process.<br />
2. Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.<br />
<br />
<ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network, but many other forms are possible.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer; the k-th unit represents the probability of class '''k''', and each <math>\displaystyle y_k</math> is coded as 0 or 1.<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
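A quick numerical sketch of this activation function (plain Python/NumPy, names our own): the logistic function together with its derivative <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>, a property that makes back-propagation convenient:

```python
import numpy as np

def sigmoid(a):
    """Logistic (inverse-logit) activation: a smooth version of the sign/step function."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """Derivative sigma'(a) = sigma(a) * (1 - sigma(a)), used in back-propagation."""
    s = sigmoid(a)
    return s * (1.0 - s)

print(sigmoid(0.0))        # → 0.5
print(sigmoid_prime(0.0))  # → 0.25
print(sigmoid(10.0))       # close to 1: the function saturates
```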
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates, i.e., it has a maximum and a minimum output value. This property keeps the weights bounded and therefore limits the searching time. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; for example, RBF networks, which use non-monotonic activation functions, are also a popular model. <br />
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume for the moment that the output layer has only one unit, so we are dealing with a regression problem. Later we will see how this can be extended to multiple output units and thus turn into a classification problem.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find the derivative of the model error with respect to the output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: the indices i and j in this diagram are reversed relative to the derivation below.)''<br />
<br />
Notice that the weighted sum on the output of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math> and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y}_k</math> for all output units, where <math>y_k</math> is the target output and <math>\hat{y}_k</math> is the predicted output.<br />
#By propagating to the previous layers evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math> where <math>i</math> is associated to the previous layer.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking a step proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
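The steps above can be sketched as follows: a minimal NumPy implementation with one hidden layer and a single linear output unit, as in the regression setting above. All names are our own, and the toy target function is only for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_backprop(X, y, n_hidden=8, rho=0.05, epochs=3000, seed=0):
    """One-hidden-layer regression network trained by back-propagation:
    forward pass, output delta, hidden deltas, then a gradient step."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(-1, 1, (n_hidden, X.shape[1]))  # hidden weights u_{jl}, near zero
    w = rng.uniform(-1, 1, n_hidden)                # output weights w_i
    for _ in range(epochs):
        for x, t in zip(X, y):
            a = U @ x                                  # step 1: feed the point forward
            z = sigmoid(a)
            y_hat = w @ z
            delta_out = y_hat - t                      # step 2: error at the output
            delta_hid = z * (1 - z) * (w * delta_out)  # step 3: propagate deltas back
            w -= rho * delta_out * z                   # steps 4-5: gradient steps
            U -= rho * np.outer(delta_hid, x)
    return U, w

def predict(U, w, X):
    return sigmoid(X @ U.T) @ w

# Toy regression: learn y = x^2 on [-1, 1]; a constant input column acts as a bias.
x = np.linspace(-1, 1, 21)
X = np.column_stack([x, np.ones_like(x)])
y = x ** 2
U, w = train_backprop(X, y)
mse = np.mean((predict(U, w, X) - y) ** 2)
print(round(float(mse), 4))   # small training error once the weights have converged
```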
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method mentioned in class is to randomize the weights before the first step. This is not likely to start near the optimal solution, but it is simple to implement. More specifically, random values near zero (usually from [-1,1]) are a good choice for the initial weights; in this case the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output using a linear approximation of <math>\,\sigma(a_i)</math>, which finds the optimal weights in the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the iterations increase.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate leads to a very slow convergence rate (a very long learning phase). The advantage of a small learning rate, however, is that it can guarantee convergence. Thus it is generally better to choose a relatively small learning rate to ensure stability, usually between 0.01 and 0.7.<br />
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we will mainly discuss how to estimate the number of hidden units at the very beginning. Obviously, we should then adjust it using CV, LOO or other complexity control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from a highly complicated mixture model, more hidden units are needed. <br />
<br />
The number of hidden units determines the size of the model, and therefore the total number of weights in the model. Typically, the number of weights should not be larger than the number of training data points, say N; thus N/10 is sometimes a good choice. In practice, however, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this Neural Network, the number of nodes is reduced until we reach a layer whose number of nodes equals the desired dimensionality (in the first few layers the number of nodes need not be strictly decreasing, as long as it eventually reaches a layer with fewer nodes). From this bottleneck layer onward, the previous layers are mirrored, so the output layer has the same number of units as the input layer. Now note that if we feed the network with each point and get an output approximately equal to the input, this means the input is reconstructed at the output from the middle-layer units alone. So the output of the middle-layer units can represent the input in fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
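A minimal sketch of this bottleneck idea, using linear activations and a single mirrored pair of layers to keep it short (a real network would use nonlinear units and more layers; all names and the synthetic data are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
# 3-D data that really lies near a 2-D plane (plus a little noise)
latent = rng.normal(size=(200, 2))
W_true = rng.normal(size=(2, 3))
X = latent @ W_true + 0.01 * rng.normal(size=(200, 3))
X -= X.mean(axis=0)

# Bottleneck network 3 -> 2 -> 3, trained to reconstruct its own input.
W_enc = 0.1 * rng.normal(size=(3, 2))   # input layer -> middle (bottleneck) layer
W_dec = 0.1 * rng.normal(size=(2, 3))   # middle layer -> mirrored output layer
rho = 0.01
for _ in range(5000):
    Z = X @ W_enc                    # low-dimensional code (middle-layer output)
    X_hat = Z @ W_dec                # reconstruction at the output layer
    G = 2 * (X_hat - X) / len(X)     # gradient of mean squared reconstruction error
    grad_dec = Z.T @ G               # back-propagate to the decoder weights
    grad_enc = X.T @ (G @ W_dec.T)   # ...and to the encoder weights
    W_dec -= rho * grad_dec
    W_enc -= rho * grad_enc

recon_mse = np.mean((X @ W_enc @ W_dec - X) ** 2)
print(round(float(recon_mse), 4))    # small relative to the data variance
```

After training, `X @ W_enc` is the 2-dimensional representation of each 3-dimensional point; with linear units this recovers the same subspace PCA would find.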
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are many hidden layers, since the <math>\,\delta</math>s may become negligible and the error signal vanishes as it is propagated back. This is a numerical problem which makes it difficult to estimate the errors, so in practice configuring a Neural Network with Back-propagation involves some subtleties.<br />
Deep Neural Networks became popular a few years before this course, largely through Geoffrey Hinton's work on layer-wise training of deep belief networks. A deep neural network training algorithm deals with training a Neural Network with a large number of layers.<br />
<br />
The approach to training the deep network is to first assume the network has only two layers and train those; after that we train the next two layers, and so on.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, an idea inspired by statistical physics, where systems settle into their most stable (lowest-energy) state; or somehow finding the output of the second layer that is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case for example. A feed-forward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach adapted to the booking rules, and the system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to model human brains, hence the fancy name "Neural Network". But now we know that they are just layers of logistic regressions stacked on top of each other, and have little to do with how the brain actually functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. As a result of such claims it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and adjust them wisely. Partly for this reason, they are currently not an active research area in machine learning, though NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model will have very high precision on the training set but very poor ability to predict outcomes of new instances, especially outside the domain of the training set.<br />
<br />
In a Neural Network, if there are too many layers or hidden units, the network will have many degrees of freedom and will learn every characteristic of the training data set. It will then give very precise outputs on the training set, but will not be able to generalize the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and want to find the model which fits it best. We can find a polynomial of high degree which passes through almost all the points in the training set. But in fact the training set comes from a line model. Although the complex model has less error on the training set, it diverges from the line in ranges where we have no training points. Because of that, the high-degree polynomial has very poor predictive results on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a polynomial of degree two. If we model this training set with a polynomial of degree one, our model will have a high error rate on the training set, because it is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier: just because a fruit is yellow does not mean that it is a banana. We can make the model a better classifier by considering more features typical of bananas, such as size and shape. But if we continue to make our model more and more complex in order to improve the classifier, we eventually reach a point where its quality no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas, but a new banana of a slightly different shape, for example, will not be detected. This is the tradeoff: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data, for example restricting it to a family of polynomials or to a particular neural network architecture; there are other ways as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, \hat{L}(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average to get the empirical error rate, an estimate of the probability that <math>h(x_{i}) \neq y_{i}</math>.<br />
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
There is a downward bias to this estimate, meaning that it tends to be less than the true error rate. <br />
<br />
As we increase the complexity of the model, the training error rate always decreases. When we apply our model to test data, the error rate will decrease to a point, but then it will increase, since the model has not seen these data before. This can be explained as follows: training error decreases as we fit the model better by increasing its complexity, but as we have seen, this complex model does not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is defined as the point where the error rate on the test data is minimal; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in R</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a more important property for an estimator than just being unbiased: the mean squared error. In statistics, there are problems for which it may be good to use an estimator with a small bias. In some cases, an estimator with a small bias may have smaller mean squared error, or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under transformations, while the property of mean-unbiasedness may be lost under nonlinear transformations. For example, using an unbiased estimator with large mean squared error to estimate the parameter, we risk a big error on any given sample. In contrast, a biased estimator with small mean squared error can improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus given the Mean Squared Error (MSE), if we have a low bias, then we will have a high variance and vice versa.<br />
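This decomposition, illustrated in Figure 3, follows by adding and subtracting <math>E(\hat{f})</math> inside the square and noting that the cross term vanishes:

```latex
\begin{align*}
MSE(\hat{f}) &= E[(\hat{f} - f)^2]
              = E[(\hat{f} - E(\hat{f}) + E(\hat{f}) - f)^2] \\
             &= E[(\hat{f} - E(\hat{f}))^2]
              + 2\,(E(\hat{f}) - f)\,\underbrace{E[\hat{f} - E(\hat{f})]}_{=\,0}
              + (E(\hat{f}) - f)^2 \\
             &= Variance(\hat{f}) + Bias^2(\hat{f})
\end{align*}
```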
<br />
Test error is a good estimate of MSE. We want a model with balanced bias and variance (neither too high), even though it will then have some bias.<br />
<br />
<br />
Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimate<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimate<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
> x <- rnorm(200,0,1)<br />
> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
> xtest <- rnorm(50,1,1)<br />
> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
> p1 <- lm(y~x)<br />
> p2 <- lm(y ~ poly(x,2))<br />
> pn <- lm(y ~ poly(x,10))<br />
> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are drawn from a <math>\,N(0,1)</math> distribution, while the test set uses <math>\,N(1,1)</math>. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random variation. Polynomial least squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
> # calculate the mean squared error of the degree 1 polynomial<br />
> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
[1] 1.576042<br />
> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 7.727615<br />
: Training and test mean squared errors for the linear fit. These are both quite high - and since the data is non-linear, the different mean value of the test data increases the error quite a bit.<br />
> # calculate the mean squared error of the degree 2 polynomial<br />
> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
[1] 0.08608467<br />
> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 0.08407432<br />
: This fit is far better - and there is not much difference between the training and test error, either.<br />
> # calculate the mean squared error of the degree 10 polynomial<br />
> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
[1] 0.07967558<br />
> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but not by much, while the test error has risen dramatically. The overfitting makes it a poor predictor. As the degree of the polynomial rises further, numerical precision becomes an issue, and a good fit is not even consistently produced for the training data.<br />
> # calculate the mean squared error of the sin/cos fit<br />
> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
[1] 0.1105446<br />
> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 1.320404<br />
: Fitting a function of the form sin(x)+cos(x) works fairly well on the training set, but because it is not the real underlying function, it fails on test data that does not lie in the same range.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V: since the model has not seen V, and we know the proper labels of all elements in V, we can count how many were misclassified<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although training error always decreases with the increasing complexity of the model, the test error starts to increase from a certain point, which is overfitting (see [[#prediction-error|figure 2]] above). Since test error estimates the MSE (mean squared error) best, the idea is to divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{(x_i, y_i) \in \nu} I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set and <math>\,I(\cdot)</math> is the indicator function.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in practice data may be hard to collect (and we often suffer from the curse of dimensionality), so a larger data set may be hard to come by. Consequently, we may not be able to afford to sacrifice part of our limited data as a held-out validation set. In this case we use another method that addresses this problem, K-fold cross-validation. We divide the data set into <math>\,K</math> subsets of roughly equal size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
If <math>\,K=n</math> (leave-one-out), the estimate has low bias but high variance. Each subset contains a single element, so the model is trained on all but one point and then validated using that point.<br />
<br />
If <math>\,K</math> is small, say 2-fold or 5-fold, the estimate has higher bias but lower variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th <math>( \,k \in [ 1, K ] )</math> part, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>. The overall estimate is then<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. Next we use the first three subsets as the training set and the last as the validation set; then the 1st, 2nd and 4th subsets as the training set and the 3rd as the validation set; and so on until every subset has served as the validation set once (so all observations are used for both training and validation). After obtaining <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we compute the average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for each candidate degree and plot the estimated error against the degree. We then choose the degree corresponding to the minimum error. The same method can be used to find the optimal number of hidden units in a neural network: begin with 1 hidden unit, then 2, 3, and so on, and pick the number with the lowest average error.<br />
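The degree-selection procedure above can be sketched as follows. This is an illustration only: the synthetic cubic data, the candidate degrees, and the use of `numpy.polyfit` are assumptions made here, not part of the lecture.<br />

```python
import numpy as np

def kfold_mse(x, y, degree, K=4):
    """Average validation MSE of a degree-`degree` polynomial over K folds."""
    folds = np.array_split(np.arange(len(x)), K)
    losses = []
    for k in range(K):
        val = folds[k]
        tr = np.setdiff1d(np.arange(len(x)), val)
        coefs = np.polyfit(x[tr], y[tr], degree)   # fit on the other K-1 parts
        pred = np.polyval(coefs, x[val])           # test on the k-th part
        losses.append(np.mean((pred - y[val]) ** 2))
    return np.mean(losses)                          # average of the L-hat_k

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 80)
y = x ** 3 - x + rng.normal(0, 0.3, 80)            # true model is cubic

errors = {d: kfold_mse(x, y, d) for d in range(1, 8)}
best = min(errors, key=errors.get)                  # degree with lowest CV error
```

On this data the cross-validation error for degrees 1 and 2 is dominated by the unmodelled cubic term, so the selected degree lands at 3 or slightly above.<br />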
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be denoted by <math>\mathbf{y}</math> and the vector of fitted values by <math>\hat{\mathbf{y}}</math>. Then<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
For such a linear fit, the leave-one-out prediction errors can be obtained from a single fit on all the data:<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}\right]^{2}</math>,<br />
<br />
where <math>\hat f^{-i}</math> denotes the fit computed with the <math>i</math>th observation left out.<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{i}-\hat f(x_{i})}{1-\operatorname{trace}(\mathbf{H})/N}\right]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is computational: the trace of <math>\mathbf{H}</math> is often easier to compute than the individual diagonal elements <math>\mathbf{H}_{ii}</math>.<br />
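These shortcut formulas can be checked numerically. Below is a minimal sketch on synthetic linear-regression data (the data itself is an assumption): it verifies that the rescaled-residual form of leave-one-out matches brute-force refitting, and computes the GCV approximation.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
resid = y - H @ y                          # y_i - f-hat(x_i)

# Shortcut leave-one-out: no refitting, rescale residuals by (1 - H_ii).
loo = np.mean((resid / (1 - np.diag(H))) ** 2)

# GCV replaces each H_ii by the average trace(H)/N.
gcv = np.mean((resid / (1 - np.trace(H) / n)) ** 2)

# Brute-force leave-one-out by actually refitting N times, to confirm.
brute = []
for i in range(n):
    m = np.ones(n, dtype=bool); m[i] = False
    w = np.linalg.lstsq(X[m], y[m], rcond=None)[0]
    brute.append((y[i] - X[i] @ w) ** 2)
brute = np.mean(brute)
```

The identity is exact for ordinary least squares, so `loo` and `brute` agree to machine precision while requiring one fit instead of N.<br />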
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point of the original training set to train the model, and then using the held-out point to estimate the true error. By repeating this process for every data point in the original data set, we obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we saw that k-fold cross-validation can be computationally expensive: for every candidate value of the parameter, we must train the model <math>\,K</math> times. This cost is even more pronounced in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, <math>\,n</math> being the number of data points in the data set.<br />
<br />
Fortunately, when adding a data point to the classifier is reversible, computing the difference between two classifiers is computationally cheaper than computing the two classifiers separately. So, if the classifier trained on all the data points is known, we can simply undo the changes contributed by each data point in turn, <math>\,n</math> times, to calculate the leave-one-out cross-validation error rate.<br />
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as a way to achieve a robust neural network that is insensitive to noise. Since the number of hidden units in a neural network is usually decided by domain knowledge, the model can easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function is approximately linear, and the network collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights close to zero.<br />
<br />
Formally, we discourage nonlinearity by adding a penalty term on the weights to the error function. The regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
If <math>\,\lambda</math> is too large, the weights <math>\,w_i</math> and <math>\,u_{jk}</math> will be shrunk too close to zero. We can choose <math>\,\lambda</math> by cross-validation.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the linearity; as above, it may be set by cross-validation. The initialization of the weights is also important, since weights of exactly zero lead to zero derivatives and the algorithm will never move away from them. On the other hand, starting with weights that are too large means starting with a nonlinear model, which can often lead to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
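The penalized update above simply shrinks each weight toward zero on every step. A one-dimensional sketch, under assumptions made here for illustration (the quadratic toy error <math>err = (w-3)^2</math>, the learning rate <math>\rho</math>, and the value of <math>\lambda</math> are all invented):<br />

```python
import numpy as np

def decay_step(w, grad_err, rho=0.1, lam=0.01):
    """One gradient step on REG = err + lambda * w^2,
    i.e. w <- w - rho * (d err/d w + 2 * lambda * w)."""
    return w - rho * (grad_err + 2 * lam * w)

# Toy error err = (w - 3)^2, so d err/d w = 2(w - 3).  Without decay the
# minimizer is w = 3; the penalty pulls the optimum slightly toward zero.
w = 0.0
for _ in range(500):
    w = decay_step(w, 2 * (w - 3))

# Fixed point of REG' = 0:  2(w - 3) + 2*lam*w = 0  =>  w = 3 / (1 + lam)
```

The iteration converges to <math>w = 3/(1+\lambda)</math>, visibly smaller than the unpenalized optimum, which is exactly the shrinkage effect weight decay is meant to produce.<br />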
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with a single hidden layer and an output layer, with weights only from the hidden layer to the output layer; it can be trained without back-propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. A widely used choice is radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data. The radii of the RBF functions may differ, and the centers and radii can be determined through clustering or an EM algorithm. When the vector x arrives from the input layer, each hidden neuron computes the radial distance from the neuron's center point and applies the RBF function to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
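Computing the hidden-layer activations can be sketched as below, assuming the centers and radii are already fixed (in practice they would come from clustering or EM, as noted above; the toy points here are invented for illustration):<br />

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """Phi[i, j] = exp(-||x_i - mu_j||^2 / (2 sigma_j^2)): Gaussian basis
    without the normalization constant, one column per hidden unit."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0]])      # two input points
mu = np.array([[0.0, 0.0], [2.0, 2.0]])     # two hidden-unit centers
Phi = rbf_design(X, mu, sigma=np.array([1.0, 1.0]))
# A point sitting exactly on a center activates that unit with phi = 1.
```

Each row of `Phi` is the hidden-layer response for one input, so this is exactly the matrix <math>\Phi</math> used in the closed-form solution below.<br />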
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
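Putting the derivation together, training an RBF network by the closed-form solution can be sketched as follows. The synthetic data and the choice of every fifth data point as a center are assumptions made for illustration; centers could equally come from clustering or EM.<br />

```python
import numpy as np

def rbf_design(X, centers, sigma=1.0):
    """Gaussian radial basis design matrix Phi (n rows, one column per center)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
Y = np.sin(X[:, :1]) + 0.05 * rng.normal(size=(40, 1))

centers = X[::5]                                 # every 5th point as a center
Phi = rbf_design(X, centers)
W = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)      # W = (Phi'Phi)^{-1} Phi'Y
Y_hat = Phi @ W                                  # fitted outputs

H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T     # hat matrix for this model
```

Since the normal equations are solved directly, `Phi @ W` and `H @ Y` give the same fitted values, confirming that <math>\hat{Y} = HY</math>.<br />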
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the (n by k) matrix of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the (n by M+1) matrix of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the (M+1 by k) matrix of weights.<br />
<br />
where the extra basis function <math>\phi_{0}</math> is set to 1, so that the weights <math>w_{0k}</math> act as a bias.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function. Giving the familiar form,<br /><br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
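A short sketch of the normalization, assuming the unnormalized activations have already been computed: each row of <math>\Phi</math> is divided by its sum, so the activations for every input sum to one and the output becomes a convex-combination-like sum of the weights. The tiny matrices here are invented for illustration.<br />

```python
import numpy as np

Phi = np.array([[1.0, 3.0],          # unnormalized basis outputs,
                [0.5, 0.5]])         # one row per input point
Phi_star = Phi / Phi.sum(axis=1, keepdims=True)   # normalized basis functions

W = np.array([[2.0], [4.0]])         # hypothetical hidden-to-output weights
Y_hat = Phi_star @ W                 # normalized-RBF outputs
```

First row: activations (0.25, 0.75) give output 0.25*2 + 0.75*4 = 3.5, a weighted average of the two weights rather than an unbounded sum.<br />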
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit, then, can be thought of as representing a feature; if there are more hidden units than input units, we essentially project into a higher-dimensional space, as we did in our earlier trick. However, this does not mean that an RBF network actually does this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Nevertheless, precisely because of this power, the problem of overfitting becomes more important: we have to control the model's complexity so that it fits not an arbitrary training set but a general pattern.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can replace the class conditional density in the above conditional probability expression by marginalizing <math>\,x</math> over <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
Using an RBF network to do classification, we usually treat it as a regression problem: we set a threshold to decide the data's class membership. However, to gain insight into what we are doing in terms of the RBF network when we classify, we often think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as shown in the graph on the right-hand side, that we have three random variables, <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that each class can trigger a different hidden random variable <math>\displaystyle j</math>. To understand this, assume, for instance, that given <math>\displaystyle j</math> the data has a Gaussian distribution (it could be any other distribution as well), and that all the <math>\displaystyle j</math>'s correspond to the same family of distributions (Gaussian) but with different parameters. From each Gaussian distribution triggered by each class, we sample some data points. Therefore, in the end, we get a set of data which is not strictly Gaussian but is actually a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
Inside the summation, we multiply each term by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})}{Pr(x)}\sum_{j} Pr(x | j)*Pr(j | y_{k})*\frac {Pr(j)}{Pr(j)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, the posterior is just the sum over <math>\displaystyle j</math> of the product of two simpler posteriors.<br />
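The identity <math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math> can be verified numerically on a small discrete model. The probability tables below are invented purely for illustration; note that the check relies on the graphical-model assumption that <math>\displaystyle x</math> depends on <math>\displaystyle y_k</math> only through <math>\displaystyle j</math>.<br />

```python
import numpy as np

# Discrete toy model: 2 classes y, 3 hidden components j, 4 values of x.
p_y = np.array([0.4, 0.6])                       # Pr(y_k)
p_j_given_y = np.array([[0.7, 0.2, 0.1],         # Pr(j | y_k), rows sum to 1
                        [0.1, 0.3, 0.6]])
p_x_given_j = np.array([[0.4, 0.3, 0.2, 0.1],    # Pr(x | j), rows sum to 1
                        [0.1, 0.4, 0.4, 0.1],
                        [0.2, 0.2, 0.2, 0.4]])

joint_yj = p_j_given_y * p_y[:, None]            # Pr(y_k, j)
p_j = joint_yj.sum(axis=0)                       # Pr(j)
p_y_given_j = joint_yj / p_j[None, :]            # Pr(y_k | j)

p_x = p_j @ p_x_given_j                          # Pr(x)
p_j_given_x = p_x_given_j * p_j[:, None] / p_x[None, :]   # Pr(j | x)

# Left-hand side: Bayes rule with the marginalized class-conditional.
p_x_given_y = p_j_given_y @ p_x_given_j          # Pr(x | y_k)
lhs = p_x_given_y * p_y[:, None] / p_x[None, :]  # Pr(y_k | x)

# Right-hand side: sum over j of Pr(y_k | j) * Pr(j | x).
rhs = p_y_given_j @ p_j_given_x
```

Both sides agree entry for entry, and each column of the posterior sums to one over the classes, as it must.<br />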
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results that we derived above to our RBF Network. In a RBF Network, as we can see on the right hand side, we have a set of data, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, and the hidden basis function, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and then we have some output, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. Also, we have weights from the hidden layer to output layer. The output is just the linear sum of <math>\displaystyle \phi</math>’s. <br />
<br />
Now, if we take the probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi_{j}(x)</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weight <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in one dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It's as if we put some Gaussian over data. And for each Gaussian, we consider the center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
We can see the graph on the left, which plots the density of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance: as a point moves far from the center <math>\displaystyle \mu_{1}</math>, <math>\displaystyle \phi_{1}</math> shrinks toward zero. Remember that we can usually find a non-linear regression or classification of the input space by doing a linear one in some extended or feature space (more details in the Aside). Here, the <math>\displaystyle \phi</math>'s actually produce that feature space. <br />
<br />
So, one way to look at this is that <math>\displaystyle \phi</math> tells us, given an input, how likely a particular feature is to be present. Say, for example, we define the features as the centers of these Gaussian distributions. Then, the <math>\displaystyle \phi</math> function computes, for a given data point, the probability of that feature appearing. If the data point is right at the center, then the value of that <math>\displaystyle \phi</math> is one, i.e. the probability is 1. If the point is far from the center, then the probability (the <math>\displaystyle \phi</math> function value) will be close to zero, that is, it is less likely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given the data. <br />
<br />
When we have those features, then <math>\displaystyle y</math> is the linear combination of the features. Hence, any of the weights <math>\displaystyle w</math>, which is equal to <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> will appear given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> shows the probability of class membership given feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose, we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>. Each data point has d features. And these n data points consist of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we don’t perform Least Square on this <math>\displaystyle X</math> matrix, we do Least Square on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point with its center. Hence, the new set of features are actually representing M centers in our data set, and for each data point, its new features check how this point is similar to the first center; how it is similar to the second center; and how it is similar to the <math>\displaystyle M^{th}</math> center. And this checking process will apply to all data points. Therefore, feature space gives another representation of our data set. <br />
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit to the training data, we often want to increase the complexity of our RBF network. By construction, the only way to change the complexity of an RBF network is to add or remove basis functions; a larger number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF network can fit any training set exactly; however, this does not mean the model will generalize well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is estimate the training error. Working through the training error, we will see that it can be decomposed, and one component of the decomposition is the Mean Squared Error (MSE). In the later notes, we will find that our final goal is to get a good estimate of the MSE; to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn't always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left-hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model matches the training data almost perfectly but is useless for prediction when new data points are introduced.<br />
<br />
 n <- 20                                    # sample size<br />
 x <- seq(1, 10, length = n)<br />
 alpha <- 2.5                               # true intercept<br />
 beta <- 1.75                               # true slope<br />
 y <- alpha + beta * x + rnorm(n)           # linear model plus N(0,1) noise<br />
 plot(y ~ x, pch = 16, lwd = 3, cex = 0.5, main = 'Overfitting')<br />
 abline(alpha, beta, col = 'blue')          # the true linear model<br />
 lines(spline(x, y), col = 'red')           # interpolating spline: over-fitted<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on training data, although our goal really is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on the testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
But training error and testing error do not have a simple monotone relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error often increases dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, the model does not represent the features of the true underlying data source, which is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model typically adapts to the training data, so the training error is an overly optimistic estimate of the testing error. A natural way to estimate the testing error well is to add a penalty term to the training error to compensate; SURE is developed on this basis.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the sum of squared errors over the training sample.<br />
*<math>\displaystyle Err=\sum_{i=1}^M (\hat y_i-y_i)^2 </math> denote the ''test error'', the sum of squared errors over an independent test sample.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations are <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise, <math>N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}^N_{i=1}</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>; then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, where <math>\displaystyle y_i</math> and <math>\hat f_i</math> both have the same mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E[g'(Z)]</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>: Let <math>\,Z = \epsilon</math>. Then <math>g(Z) = \hat f-f</math>, since <math>\,y = f + \epsilon</math> and <math>\,f</math> is a constant. So <math>\,\mu = 0</math> and <math>\,\sigma^2</math> is the variance of <math>\,\epsilon</math>.<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model, not the function of the observations <math>\displaystyle y_i</math>, then <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
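Stein's lemma can be checked numerically (a sketch, not part of the original notes): take <math>\,g(z)=z^2</math> and <math>\,Z \sim N(1,\sigma^2)</math>, for which both sides of the lemma equal <math>\,2\sigma^2</math>.<br />

```python
import random

def stein_check(mu=1.0, sigma=0.5, n=200_000, seed=0):
    """Monte Carlo check of E[g(Z)(Z - mu)] = sigma^2 E[g'(Z)] for g(z) = z^2."""
    rng = random.Random(seed)
    lhs = rhs = 0.0
    for _ in range(n):
        z = rng.gauss(mu, sigma)
        lhs += z * z * (z - mu)       # g(Z)(Z - mu)
        rhs += sigma * sigma * 2 * z  # sigma^2 g'(Z), with g'(z) = 2z
    return lhs / n, rhs / n

lhs, rhs = stein_check()
print(lhs, rhs)  # both approximately 2 * 0.5^2 = 0.5
```

With a larger sample size the two averages agree ever more closely, as the lemma predicts.<br />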
<br />
====Two Different Cases====<br />
SURE in RBF,<br />
[http://www.cs.ualberta.ca/~papersdb/uploaded_files/801/paper_automatic-basis-selection-for.pdf Automatic basis selection for RBF networks using Stein’s unbiased risk estimator,Ali Ghodsi Dale Schuurmans]<br />
<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is introduced to the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent. Therefore <math>\displaystyle cov(y_i,\hat f)=0</math> (alternatively, consider <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point it has no influence on <math>\hat f</math>, because <math>\hat f</math> is estimated from the training data alone, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>), and <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Based on the notation we denote above, then we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross validation: since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross validation, to avoid overfitting or underfitting, the validation data set is kept independent of the estimated model.<br />
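As a quick numerical illustration of Case 1 (a sketch with a made-up model, not from the notes): for a fixed estimate <math>\hat f</math> and fresh validation points, the average of <math>(\hat y_i-y_i)^2</math> settles at <math>(\hat f-f)^2+\sigma^2</math>, so ranking models by validation error ranks them by MSE.<br />

```python
import random

# True model f(x) = 2x; a deliberately imperfect fixed estimate fhat(x) = 2.1x.
f = lambda x: 2.0 * x
fhat = lambda x: 2.1 * x
sigma = 0.3

rng = random.Random(1)
m = 100_000
val_err = 0.0   # average squared validation error (yhat - y)^2
bias = 0.0      # average (fhat - f)^2 over the same x's
for _ in range(m):
    x = rng.uniform(0, 1)
    y = f(x) + rng.gauss(0, sigma)
    val_err += (fhat(x) - y) ** 2
    bias += (fhat(x) - f(x)) ** 2
val_err /= m
bias /= m
print(val_err, bias + sigma ** 2)  # the two quantities nearly coincide
```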
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing the model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored, because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can instead be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean squared error of a given estimator in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important because, in deterministic estimation, the true mean squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
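To see the unbiasedness concretely (a sketch using the simplest possible smoother, not an RBF network): for the sample mean <math>\hat f_i=\bar y</math>, each <math>\frac{\partial \hat f_i}{\partial y_i}=\frac{1}{N}</math>, so <math>\sum_i \frac{\partial \hat f_i}{\partial y_i}=1</math>, and averaging the SURE value over many training draws recovers the true <math>\sum_i(\hat f_i-f_i)^2</math>.<br />

```python
import random

def one_draw(rng, n=50, sigma=0.4):
    """One training sample from y_i = f_i + eps, with constant true model f_i = 3."""
    f = 3.0
    y = [f + rng.gauss(0, sigma) for _ in range(n)]
    ybar = sum(y) / n                  # fhat_i = ybar for every i
    err = sum((ybar - yi) ** 2 for yi in y)
    true_mse = n * (ybar - f) ** 2     # sum_i (fhat_i - f_i)^2
    # SURE with sum_i d fhat_i / d y_i = n * (1/n) = 1:
    sure = err - n * sigma ** 2 + 2 * sigma ** 2 * 1
    return sure, true_mse

rng = random.Random(7)
reps = 20_000
s = t = 0.0
for _ in range(reps):
    a, b = one_draw(rng)
    s += a
    t += b
print(s / reps, t / reps)  # the two averages agree, both close to sigma^2 = 0.16
```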
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the number of basis functions should be chosen to minimize the estimated generalization error. For the Radial Basis Function Network, setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math> <math>\displaystyle (4)</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Thus we obtain the further simplification <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=d</math>, where <math>\displaystyle d</math> is the number of columns of <math>\displaystyle \Phi</math>, since <math>\displaystyle \Phi</math> is a projection of the input matrix <math>\,X</math> onto the set spanned by the <math>\,M</math> basis functions. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
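The identity <math>\,Trace(H)=d</math> can be checked without matrix algebra for simple linear regression (intercept plus one predictor, so <math>\,d=2</math>), using the standard closed-form leverages; the data values below are arbitrary.<br />

```python
# Leverages of simple linear regression (one predictor plus intercept):
# h_ii = 1/N + (x_i - xbar)^2 / sum_j (x_j - xbar)^2, and trace(H) equals d = 2.
x = [1.0, 2.0, 3.5, 4.2, 6.0, 7.7]
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
trace_H = sum(h)
print(trace_H)  # 2.0 (up to rounding): the number of fitted parameters
```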
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimal number of basis functions by choosing the model with the smallest estimated MSE over the set of models considered. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\,M</math>, compute the training error <math>\displaystyle err(M)</math> for each. <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples and the noise,<math>\sigma^2</math>, can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Figure 27.1 on the right shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math>.<br />
<br />
sure_Err = sum((output - expected_output) .^ 2) - num_data_point * sigma ^ 2 + 2 * sigma ^ 2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated using the error.<br />
<br />
err = sum((output - expected_output) .^ 2);<br />
sigma2 = err / (num_data_point - 1);    % estimate of the noise variance<br />
sure_Err = err - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) = <math>\sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) = <math>\sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units, including the intercept) = <math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then the term <math>\displaystyle N\sigma^2 </math> has no impact on the choice of model (it is a constant),<br />
and we can ignore it; we only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and the estimate <math>\,\hat \sigma</math> changes as <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math> from the data.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise, <math>N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \hat\sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: as the number of hidden units increases (i.e. the model becomes more complex), the training error decreases while the MSE increases.<br />
<br />
<br />
As the number of hidden units grows, the training error decreases until it approaches <math>\displaystyle 0 </math>. If the training error equals <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) the estimate of MSE approaches <math>\displaystyle 0 </math> as well. In fact this should not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the MSE should increase instead of approaching <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
The problem is that the estimate <math>\displaystyle \hat\sigma^2 </math> in (28.2) is proportional to <math>\displaystyle err </math>, so it shrinks along with the training error. To deal with this, we can average the estimates of <math>\displaystyle \sigma^2</math> over several model sizes: for example, fit models with 1 up to 10 hidden units and average their variance estimates. Since in reality <math>\, \sigma^2</math> is a constant property of the data and does not depend on <math>\,M+1</math>, using the averaged <math>\,\sigma^2</math> value over 1 to 10 hidden units has a firm theoretical basis.<br />
<br />
We can also see that, unlike the classical Cross Validation (CV) or Leave One Out (LOO) techniques, the SURE technique does not need a validation set to find the optimal model. Hence, the SURE technique uses less data than CV or LOO, which makes it suitable when there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering K-means clustering] is a method of cluster analysis which aims to partition <math>n</math> observations into <math>k</math> clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units is the same as the number of clusters; each cluster provides one basis function <math>\displaystyle \phi_j </math>.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, we set it same for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. Each point is assigned to the cluster whose current center is nearest (so every point in a cluster is closer to that cluster's center than to any other center).<br />
<br />
2. The mean of each cluster is computed and becomes that cluster's new center.<br />
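The two alternating steps can be sketched in a few lines of code (a toy 1-D version for illustration; the MATLAB example below uses the built-in kmeans):<br />

```python
def kmeans_1d(points, centers, iters=20):
    """Lloyd's algorithm in 1-D: alternate (1) assign each point to the
    nearest center, (2) move each center to the mean of its cluster."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:   # step 1: re-identify clusters
            j = min(range(len(centers)), key=lambda k: (p - centers[k]) ** 2)
            clusters[j].append(p)
        # step 2: each center becomes the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

pts = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0]
print(kmeans_1d(pts, [0.2, 9.0]))  # converges to the cluster means [0.5, 10.5]
```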
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden values)<br />
<br />
<br />
>> X=rand(30,80);<br />
>> [IDX,C,sumD,D]=kmeans(X,2);<br />
>> size(IDX)<br />
30 1<br />
>> size(C)<br />
2 80<br />
>> size(sumD)<br />
2 1<br />
>> c1=sum(IDX==1)<br />
14<br />
>> c2=sum(IDX==2)<br />
16<br />
>> sumD<br />
85.6643<br />
101.0419<br />
>> v1=sumD(1,1)/c1<br />
6.1189<br />
>> v2=sumD(2,1)/c2<br />
6.3151<br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points in 80 dimensions (MATLAB's kmeans treats rows as observations), and then apply the “kmeans” method to separate <math>X</math> into 2 clusters. IDX is a vector of 1s and 2s indicating the 2 clusters, and its size is 30*1. <math>\displaystyle C </math> contains the center (mean) of each cluster, with size 2*80; sumD is the sum of the squared distances between the data points and the centers of their clusters. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the numbers of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the variance estimate for the first cluster <math>\displaystyle (v1=\sigma_1^2)</math>; <math>\displaystyle v2 </math> is the variance estimate for the second cluster <math>\displaystyle (v2=\sigma_2^2)</math>. Now we can get <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> from the following equations. Finally, we obtain the <math>\displaystyle MSE </math> and predict the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
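A sketch of these equations in code (pure Python, assuming two centers and a common width; all names are illustrative): build <math>\,\Phi</math>, solve the normal equations <math>\,(\Phi^T\Phi)W=\Phi^TY</math> for <math>\,W</math>, and check that a target lying exactly in the span of <math>\,\Phi</math> is recovered.<br />

```python
import math

def phi(x, mu, s):
    """Gaussian radial basis function centred at mu with width s."""
    return math.exp(-(x - mu) ** 2 / (2 * s ** 2))

centers, width = [0.0, 5.0], 1.5
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
w_true = [2.0, -1.0]                  # y lies exactly in the span of Phi
Phi = [[phi(x, m, width) for m in centers] for x in xs]
y = [row[0] * w_true[0] + row[1] * w_true[1] for row in Phi]

# Normal equations (Phi^T Phi) W = Phi^T Y, solved by Cramer's rule (2x2 case).
a = sum(r[0] * r[0] for r in Phi)
b = sum(r[0] * r[1] for r in Phi)
d = sum(r[1] * r[1] for r in Phi)
p = sum(r[0] * yi for r, yi in zip(Phi, y))
q = sum(r[1] * yi for r, yi in zip(Phi, y))
det = a * d - b * b
W = [(p * d - q * b) / det, (a * q - b * p) / det]
print(W)  # recovers w_true = [2.0, -1.0]
```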
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is the EM algorithm for the Gaussian mixture model. Generally speaking, the Gaussian mixture model gives a soft clustering, while <math>K</math>-means gives a hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
In the E-step, each point is assigned a weight for each cluster based on the likelihood of the corresponding Gaussian. This is a soft assignment; in contrast, <math>K</math>-means assigns an observation 1 for the cluster whose center is closest and 0 for all other clusters. <br />
<br />
In the M-step, the weighted means and covariances are computed and become the new means and covariances of the clusters.<br />
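The E-step can be sketched as follows (a 1-D illustration, not the course's mdgEM code): each point receives a posterior weight for each component, a soft assignment rather than K-means' hard 0/1 assignment.<br />

```python
import math

def responsibilities(x, means, sds, weights):
    """E-step for a 1-D Gaussian mixture: posterior weight of each component
    for the point x (weights sum to 1 across components)."""
    dens = [w * math.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))
            for m, s, w in zip(means, sds, weights)]
    total = sum(dens)
    return [d / total for d in dens]

r = responsibilities(0.2, means=[0.0, 5.0], sds=[1.0, 1.0], weights=[0.5, 0.5])
print(r)  # heavily favours the first component; the entries sum to 1
```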
<br />
>>[P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier, which constructs linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized to what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a high-dimensional, transformed version of the feature space.<br />
<br />
The original basis for the SVM was published in the 1960s by [http://en.wikipedia.org/wiki/Vapnik Vapnik], Chervonenkis and colleagues; however, the ideas did not gain wide attention until strong results were shown in the early 1990s.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support vector machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a maximum-margin hyperplane or set of hyperplanes in a high- or infinite-dimensional space. The set of points near the class boundaries, the support vectors, define the model, which can be used for classification, regression or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separated by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
The data points, belonging to two classes in <math>\displaystyle \mathbb{R}^{2} </math>, can be separated by a linear boundary. If a dataset is indeed linearly separable, then there exist infinitely many possible separating hyperplanes for the training data, including the two black lines in the figure. The question is which solution is best when new data are introduced. <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: <math>\displaystyle d_i</math> is positive for points on the <math>\displaystyle +1 </math> side and negative for points on the <math>\displaystyle -1 </math> side, so <math>\displaystyle y_id_i</math> is positive for every correctly classified point.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>. We can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers in the Linearly separable case====<br />
Choose the line farthest from both classes, i.e. the line with the maximum distance from the closest point (maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since the points are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Let us rewrite <math>\displaystyle Margin=min\{y_id_i\}</math> by using the following properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Take two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle (\beta^{T}x_1+\beta_0)-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence, <math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and <math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the unit vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
That is, for any point on the hyperplane, <math>\displaystyle \beta^{T}x_0</math> gives the negative of the intercept of the hyperplane. <br/><br />
<br />
<br />
3. The signed distance from any point <math>\displaystyle x_i </math> to the hyperplane is the projection of <math>\displaystyle (x_i-x_0)</math> onto the normal. <br/>Since only the direction of <math>\displaystyle \beta </math> matters, we normalize it to unit length:<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
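The signed distance formula, in code (a minimal sketch):<br />

```python
import math

def signed_distance(beta, beta0, x):
    """Signed distance from x to the hyperplane beta^T x + beta0 = 0."""
    dot = sum(b * xi for b, xi in zip(beta, x))
    norm = math.sqrt(sum(b * b for b in beta))
    return (dot + beta0) / norm

# Hyperplane x2 = 1 in R^2: beta = (0, 1), beta0 = -1.
print(signed_distance([0.0, 1.0], -1.0, [3.0, 3.0]))   # 2.0  (above the line)
print(signed_distance([0.0, 1.0], -1.0, [3.0, -1.0]))  # -2.0 (below the line)
```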
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
We had <math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math>, and since we now know how to compute <math>\displaystyle d_i \Rightarrow</math> <br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane and is correctly classified. Then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta </math> only the direction is important, so rescaling to <math>\displaystyle \frac{\beta}{c} </math> does not change the direction, and the hyperplane stays the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie,T.,Tibshirani,R., Friedman,J.,(2008).The Elements of Statistical Learning:129-130<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
We now derive the support vector machine for the case where the two classes are separable in the given feature space. The margin can be written as <math>\,min\{y_id_i\}</math>, the distance from the hyperplane of the point closest to it, where <math>\,d_i</math> is the distance and <math>\,y_i</math> supplies the sign.<br />
===Margin Maximizing Problem for the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, and <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane.<br />
<br />
This implies <math>\,\exists C>0</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> define the hyperplane only up to scale: we care only about its direction, and dividing through by a constant does not change the hyperplane. Thus, by assuming scaled values for <math>\,\beta, \beta_0</math> we eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>, implying that the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math>.<br />
<br />
Now, in order to maximize the margin <math>\,\frac{1}{\|\beta\|}</math>, we simply need to minimize <math>\,\|\beta\|</math>. <br />
<br />
In other words, our optimization problem is now to find the minimum of <math>\,\|\beta\|</math>, under the constraint that <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. There are many possible choices of norm, in general the [http://en.wikipedia.org/wiki/P-norm#p-norm p-norm]. The 1-norm of a vector is simply the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance); it is sometimes preferred, but has a discontinuity in its derivative. The 2-norm or Euclidean norm (the intuitive measure of the length of a vector) is easier to work with: <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>, where the constant 1/2 has been added for simplification; minimizing the squared norm is equivalent to minimizing the norm itself.<br />
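A small illustration of the two norms mentioned above:<br />

```python
import math

beta = [3.0, -4.0]
one_norm = sum(abs(b) for b in beta)            # taxicab / Manhattan distance
two_norm = math.sqrt(sum(b * b for b in beta))  # Euclidean length
print(one_norm, two_norm)  # 7.0 5.0
```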
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the constraints are satisfied while finding an optimal solution (the optimal saddle point of the Lagrangian for this classic quadratic optimization). The problem will be solved in the dual space by introducing the <math>\,\alpha_i</math> as dual variables; this is in contrast to solving the problem in the primal space as a function of the betas. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making the SVM more usable. Note that this algorithm is intended to solve support vector machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving the SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that differentiating with respect to <math>\,\alpha_i</math> recovers the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,L = \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left[y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right]}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math>,<br />
which is the dual representation of the maximum-margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \,\forall\, i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
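The simplification above can be checked numerically. The Python sketch below (made-up data, with multipliers chosen so that <math>\,\sum_i{\alpha_iy_i} = 0</math>) substitutes <math>\,\beta = \sum_i{\alpha_iy_ix_i}</math> into the primal Lagrangian and confirms that it agrees with the dual <math>\,L(\alpha)</math> for an arbitrary <math>\,\beta_0</math>:<br />

```python
# Numeric sanity check: after substituting beta = sum_i alpha_i y_i x_i
# (and using sum_i alpha_i y_i = 0), the primal Lagrangian equals the
# dual L(alpha), for ANY beta_0.  Data and alpha below are made up;
# alpha was chosen so that sum(alpha_i y_i) = 0.
x = [-2.0, -1.0, 1.0, 2.0]
y = [-1, -1, 1, 1]
alpha = [0.3, 0.2, 0.1, 0.4]
assert abs(sum(a * yi for a, yi in zip(alpha, y))) < 1e-12

beta = sum(a * yi * xi for a, yi, xi in zip(alpha, y, x))
beta0 = 0.5  # arbitrary: the dual value must not depend on it

primal = 0.5 * beta ** 2 - sum(
    a * (yi * (beta * xi + beta0) - 1) for a, yi, xi in zip(alpha, y, x)
)
dual = sum(alpha) - 0.5 * sum(
    ai * aj * yi * yj * xi * xj
    for ai, yi, xi in zip(alpha, y, x)
    for aj, yj, xj in zip(alpha, y, x)
)
print(abs(primal - dual) < 1e-9)  # True
```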
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the objective that we need to optimize, together with two constraints.<br />
<br />
The support vector machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are solving for <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the support vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> has entries <math>\,S_{ij} = y_iy_jx_i^Tx_j</math>; if <math>\,Z</math> is the matrix whose <math>\,i</math>-th column is <math>\,y_ix_i</math>, then <math>\,S = Z^TZ</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
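A small sketch (Python, made-up 1-D data) of how <math>\,S</math> is assembled from the training set:<br />

```python
# Building the matrix S used in the matrix form of L(alpha):
# S_ij = y_i y_j x_i^T x_j.  Equivalently, with z_i = y_i x_i, S = Z^T Z.
# Tiny made-up 1-D data set.
x = [-1.0, -0.5, 0.5, 1.0]
y = [-1, -1, 1, 1]

z = [yi * xi for yi, xi in zip(y, x)]        # z_i = y_i x_i
S = [[zi * zj for zj in z] for zi in z]      # S = Z^T Z (scalar inputs here)

# Check one entry against the definition S_ij = y_i y_j x_i x_j
i, j = 0, 2
print(S[i][j] == y[i] * y[j] * x[i] * x[j])  # True
```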
<br />
Using this matrix notation, we can use Matlab's built in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use <code>quadprog</code> to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's <code>quadprog</code> function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. Note that <code>quadprog</code> ''minimizes'' its objective, while we want to ''maximize'' <math>\,L(\alpha)</math>; maximizing <math>\,\underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math> is the same as minimizing <math>\,\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{\alpha}^T\underline{1}</math>, so we pass <math>\,H = S</math> and <math>\,f = -\underline{1}</math>.<br />
<br />
We'll use a simple one-dimensional data set: the inputs are essentially -1 or 1 plus Gaussian noise, with labels -1 and 1 respectively. (Note: you could put the values straight into the quadprog call; they are separated for clarity.)<br />
<br />
 x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]'; % 1-by-200 row of inputs<br />
 y = [-ones(100,1); ones(100,1)];                        % 200-by-1 labels<br />
 z = x' .* y;                                            % z_i = y_i*x_i<br />
 S = z * z';                                             % S_ij = y_i*y_j*x_i*x_j<br />
 f = -ones(200,1);   % quadprog minimizes (1/2)a'Sa + f'a, so f must be -1<br />
 Aeq = y';           % equality constraint y'*alpha = 0<br />
 beq = 0;<br />
 lb = zeros(200,1);  % alpha_i >= 0; must be a vector, not a scalar<br />
 ub = [];            % there is no upper bound<br />
 alpha = quadprog(S,f,[],[],Aeq,beq,lb,ub);<br />
<br />
This gives us the optimal <math>\,\alpha</math>. Most entries of <math>\,\alpha</math> should come out (numerically) zero; the few points with <math>\,\alpha_i</math> noticeably greater than zero are the support vectors. Two details matter here: <code>quadprog</code> minimizes its objective, so <math>\,f</math> must be the negated vector of ones, and the lower bound must be supplied as a vector of zeros so that <math>\,\alpha_i \geq 0</math> is enforced elementwise. Since <math>\,S</math> is only positive semidefinite, <code>quadprog</code> may warn about the problem being non-convex to working precision; adding a small ridge, e.g. <code>S + 1e-8*eye(200)</code>, usually resolves this.<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{x}</math>. If <math>\,f</math> and <math>\,g</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial x}\Big|_{\hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum{\alpha_ig_i'(\hat{x})}=0</math> (Stationarity)<br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> cannot be a local minimum.<br />
<br />
These are all relatively straightforward except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that determine the optimal separating hyperplane we are looking for. They are also the most difficult points to classify, and the most informative ones for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br\>In order for this condition to be satisfied either <br/><math>\,\alpha_i= 0</math> or <br/><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
All points <math>\,x_i</math> satisfy <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1</math>; that is, every point lies either exactly on the margin or strictly beyond it.<br />
<br />
'''Case 1: a point beyond the margin, <math>\,y_i(\beta^Tx_i+\beta_0) > 1</math>'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point on the margin, <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math>'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
<br\>If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
Points on the margin, with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
<br />
===Using support vectors===<br />
<br />
Support vectors are important because they make the solution sparse: the classifier is insensitive to points far from the boundary. If <math>\,\alpha_i = 0</math>, the corresponding term contributes nothing to the solution of the SVM problem; only points on the margin -- the support vectors -- contribute. Hence the model given by an SVM is entirely defined by the set of support vectors, a subset of the entire training set. This is interesting because in the neural network methods that preceded it (and more generally in classical statistical learning), the configuration of the network had to be specified in advance. Here we have a data-driven, 'nonparametric' model: the training set and the algorithm determine the support vectors.<br />
<br />
References:<br />
Wang, L., 2005. Support Vector Machines: Theory and Applications. Springer, p. 3.<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
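A minimal hand-checkable instance of these three steps (a toy example of ours, not from the lecture, sketched in Python): two 1-D points <math>\,x=-1</math> with label <math>\,-1</math> and <math>\,x=+1</math> with label <math>\,+1</math>. By symmetry and the constraint <math>\,\sum{\alpha_iy_i}=0</math> both multipliers are equal, and the dual reduces to <math>\,L(a)=2a-2a^2</math>, maximized at <math>\,a=1/2</math>:<br />

```python
# Two 1-D points x = -1 (label -1) and x = +1 (label +1); the dual
# has the closed-form solution alpha1 = alpha2 = 1/2.
x = [-1.0, 1.0]
y = [-1, 1]
alpha = [0.5, 0.5]

# Step 2: beta = sum_i alpha_i y_i x_i
beta = sum(a * yi * xi for a, yi, xi in zip(alpha, y, x))

# Step 3: pick a support vector (alpha_i > 0) and solve y_i(beta*x_i + b0) = 1,
# i.e. b0 = 1/y_i - beta*x_i
i = 0
beta0 = 1.0 / y[i] - beta * x[i]

print(beta)   # 1.0
print(beta0)  # 0.0
```

The separating hyperplane is <math>\,x=0</math>, exactly halfway between the two points, as expected.<br />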
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab built-in SVM routines (found in the Bioinformatics toolkit) to do classification through support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]<br />
=='''Non-linear hypersurfaces and Non-Separable classes - November 20, 2009'''==<br />
==='''Kernel Trick'''===<br />
We talked about the curse of dimensionality at the beginning of this course; now, however, we turn to the power of high dimensions in order to find a linearly separating hyperplane between two classes of data points. To understand this, imagine a two-dimensional prison confining a two-dimensional person. If we magically give the person a third dimension, he can escape from the prison. In other words, the prison and the person become linearly separable with respect to the third dimension. The intuition behind the "kernel trick" is to map the data to a higher dimension in which they are linearly separable by a hyperplane.<br />
<br />
We have seen the SVM as a linear classification problem: finding the max-margin hyperplane in the given input space. For many real-world problems, however, a more complex decision boundary is required. The following simple method was devised in order to solve the same linear classification problem, but in a (usually higher-dimensional) 'feature space' in which the max-margin hyperplane is better suited.<br />
<br />
Let <math>\,\phi</math> be a mapping,<br />
<br />
<math>\phi:\Re^d \rightarrow \Re^D </math><br /><br /><br />
<br />
We wish to find a <math>\,\phi</math> such that our data will be suited for separation by a hyperplane. Given this function, we are led to solving the previous constrained quadratic optimization on the transformed dataset,<br /><br /><br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br /><br /><br />
<br />
The solution to this optimization problem is now well known; however, a workable <math>\,\phi</math> must still be determined. Possibly the largest drawback of this method is that we must compute inner products of vectors in the high-dimensional space. As the dimension of the feature space grows, the inner product becomes computationally intensive or impossible to compute explicitly.<br />
<br />
However, we have a very useful result that says that there exists a class of functions, <math>\,\Phi</math>, which satisfy the above requirements and that for any function <math>\,\phi \in \Phi</math>,<br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j) </math><br /><br /><br />
<br />
Where <math>\,K</math> is a kernel function on the input space satisfying [http://en.wikipedia.org/wiki/Mercer%27s_condition Mercer's condition] (which guarantees that it indeed corresponds to some mapping function <math>\,\phi</math>). As a result, if the objective function depends only on inner products and not on the coordinates themselves, we can always use the kernel function to compute implicitly in the feature space without storing the high-dimensional representations. Not only does this solve the computational problem, it also removes the need to determine a specific mapping function explicitly. In fact, it is now possible to use an infinite-dimensional feature space in an SVM without even knowing the function <math>\,\phi</math>.<br />
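A concrete instance of the identity <math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j)</math>: for the degree-2 polynomial kernel <math>\,K(x,z)=(x^Tz)^2</math> in two dimensions, one standard explicit feature map (our choice here, not fixed by the lecture) is <math>\,\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)</math>. The Python sketch below verifies the identity at a pair of points:<br />

```python
import math

# Check phi(x)^T phi(z) = K(x, z) for the degree-2 polynomial kernel
# K(x, z) = (x . z)^2 in 2 dimensions, using the explicit feature map
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
def phi(v):
    return [v[0] ** 2, math.sqrt(2) * v[0] * v[1], v[1] ** 2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = [1.0, 2.0], [3.0, -1.0]
lhs = dot(phi(x), phi(z))  # inner product in feature space
rhs = dot(x, z) ** 2       # kernel evaluated in input space
print(abs(lhs - rhs) < 1e-9)  # True
```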
<br />
==='''Mercer's Theorem in detail'''===<br />
Let <math>\,\phi</math> be a mapping to a high dimensional [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] <math>\,H</math><br /><br />
<br />
<br />
<math>\phi:x \in \Re^d \rightarrow H </math><br /><br /><br />
<br />
The transformed coordinates can be defined as,<br /><br />
<br />
<math>\phi_1(x)\dots\phi_d(x)\dots </math><br /><br /><br />
<br />
By Hilbert - Schmidt theory we can represent an inner product in Hilbert space as,<br /><br /><br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = \sum_{r=1}^{\infty}a_r\phi_r(x_i)\phi_r(x_j) = K(x_i,x_j), \ a_r \ge 0 </math><br /><br /><br />
where K is symmetric, then Mercer's theorem gives necessary and sufficient conditions on K for it to satisfy the above relation.<br><br><br />
<br />
'''Mercer's Theorem'''<br />
<br />
Let C be a compact subset of <math>\Re^d</math> and K a function <math> \in L^2(C) </math>, if<br /><br /><br />
<br />
<math>\, \int_C\int_C K(u,v)g(u)g(v)dudv \ge 0, \ \forall g \in L^2(C)</math> <br /><br /><br />
<br />
then,<br /><br /><br />
<br />
<math>\sum_{r=1}^{\infty}a_r\phi_r(u)\phi_r(v)</math> converges absolutely and uniformly to the symmetric function <math>\,K(u,v)</math><br />
<br />
References:<br />
Vapnik, V., 1998. Statistical Learning Theory. John Wiley & Sons, p. 423.<br />
<br />
==='''Kernel Functions'''===<br />
There are various kernel functions, for example:<br />
<br />
* Linear kernel: <math>\,k(x,y)=x \cdot y</math><br />
* Polynomial kernel: <math>\,k(x,y)=(x \cdot y)^d</math><br />
* Gaussian kernel: <math>\,k(x,y)=e^{-\frac{\|x-y\|^2}{2\sigma^2}}</math><br />
<br />
If <math>\,X</math> is a <math>\,d \times n</math> matrix in the original space, and <math>\,\phi(X)</math> is a <math>\,D \times n</math> matrix in the [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] (good explanation video: [http://www.youtube.com/watch?v=V2pBdH7YzX0 part 1] [http://www.youtube.com/watch?v=YRY5xlk3TC0 part 2]), then <math>\,\phi^T(X) \cdot \phi(X)</math> is an <math>\,n \times n</math> matrix. <br />
The inner product can also be interpreted as a correlation: it measures the similarity between data points. This gives us some insight into how to choose the kernel. The choice depends on prior knowledge of the problem and on how we believe the similarity of our data should be measured. In practice, the Gaussian (RBF) kernel often works best. Besides the common kernel functions listed above, many novel kernels have been proposed for particular problem domains such as text classification, gene classification and so on.<br />
<br />
These kernel functions can be applied to many algorithms to derive the "kernel version". For example, kernel PCA, kernel LDA, etc..<br />
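As a small numeric check of the Mercer property for the Gaussian kernel above, the Python sketch below builds the <math>\,n \times n</math> Gram matrix for a few made-up points and verifies its symmetry and the non-negativity of one quadratic form <math>\,g^TKg</math>:<br />

```python
import math

# Gaussian-kernel Gram matrix for a few made-up 1-D points; Mercer's
# condition implies symmetry and g^T K g >= 0 for any g (checked here
# for one test vector g).
def gauss_k(u, v, sigma=1.0):
    return math.exp(-((u - v) ** 2) / (2 * sigma ** 2))

pts = [-1.5, 0.0, 0.3, 2.0]
n = len(pts)
K = [[gauss_k(u, v) for v in pts] for u in pts]

symmetric = all(abs(K[i][j] - K[j][i]) < 1e-12
                for i in range(n) for j in range(n))
g = [1.0, -2.0, 0.5, 1.0]
quad_form = sum(g[i] * K[i][j] * g[j] for i in range(n) for j in range(n))

print(symmetric)       # True
print(quad_form >= 0)  # True
```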
<br />
==='''SVM: non-separable case'''===<br />
We have seen how SVMs find an optimal separating hyperplane for two separable classes of data, in which case the margin contains no data points. In the real world, however, data from different classes are usually mixed together near the boundary, and it is often impossible to find a perfect boundary that totally separates them. To address this problem, we relax the classification rule to allow some data points to cross the margin. Mathematically, the problem becomes:<br />
:<math>\min_{\beta, \beta_0} \frac{1}{2}|\beta|^2</math><br />
:<math>\,y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math><br />
:<math>\xi_i \geq 0</math><br />
<br />
Now each data point is allowed some error <math>\,\xi_i</math>. However, we only want points to cross the boundary when they must, with minimum sacrifice; thus, a penalty term is added to the objective function to limit the total amount by which points violate the margin. The optimization problem now becomes:<br />
<br />
:<math>\min_{\beta, \beta_0, \xi} \frac{1}{2}|\beta|^2+\gamma\sum_{i=1}^n{\xi_i}</math><br />
:<math>\,s.t.</math> <math>y_i(\beta^Tx+\beta_0) \geq 1-\xi_i</math> <br />
:<math>\xi_i \geq 0</math><br />
<br />
[[File:non-separable.JPG|350px|thumb|right|Figure non-separable case]]<br />
<br />
Note that <math>\,\xi_i</math> is not necessarily smaller than one, which means a point can not only enter the margin but can also cross the separating hyperplane.<br />
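Given a (hypothetical) classifier <math>\,\beta, \beta_0</math>, the smallest feasible slack for each point is <math>\,\xi_i = \max(0,\, 1 - y_i(\beta^Tx_i+\beta_0))</math>, since the constraints only require <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math> and <math>\,\xi_i \geq 0</math>. The Python sketch below shows the three regimes: outside the margin (<math>\,\xi_i=0</math>), inside the margin (<math>\,0<\xi_i<1</math>), and across the hyperplane (<math>\,\xi_i>1</math>):<br />

```python
# Slack values for a made-up 1-D soft-margin classifier beta*x + beta0.
beta, beta0 = 2.0, 0.0
data = [(-1.0, -1), (-0.2, -1), (0.1, -1), (0.8, 1)]  # (x_i, y_i), made up

# xi_i = max(0, 1 - y_i*(beta*x_i + beta0)), rounded for display
xis = [round(max(0.0, 1.0 - y * (beta * x + beta0)), 2) for x, y in data]
print(xis)  # [0.0, 0.6, 1.2, 0.0]
```

The third point has <math>\,\xi_i > 1</math>: it is misclassified, having crossed the separating hyperplane.<br />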
<br />
References:<br />
<br />
Mercer, J., 1909. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415-446.<br />
<br />
==Support Vector Machine algorithm for non-separable cases - November 23, 2009==<br />
<br />
With the equation above, we can form the lagrangian, apply KKT conditions, and come up with the equation to optimize. As we will see, the equation that we will attempt to optimize in the SVM algorithm for non-separable data sets is the same as the optimization for the separable case, with slightly different conditions.<br />
<br />
===Forming the Lagrangian===<br />
<br />
:<math>L = \frac{1}{2} |\beta|^2 + \gamma \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]-\sum_{i=1}^n \lambda_i \xi_i</math><br />
:<math>\alpha_i \geq 0, \lambda_i \geq 0</math><br />
<br />
===Applying KKT conditions===<br />
<br />
Setting the derivatives of the Lagrangian to zero gives the stationarity conditions:<br />
:<math>\,\frac{\partial L}{\partial \beta} = 0 \Rightarrow \beta = \sum_{i=1}^n \alpha_iy_ix_i</math><br />
:<math>\,\frac{\partial L}{\partial \beta_0} = 0 \Rightarrow \sum_{i=1}^n \alpha_iy_i = 0</math><br />
:<math>\,\frac{\partial L}{\partial \xi_i} = 0 \Rightarrow \gamma - \alpha_i - \lambda_i = 0</math>, i.e. <math>\,\gamma = \alpha_i + \lambda_i</math><br />
<br />
Complementary slackness requires<br />
:<math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1+\xi_i] = 0</math> and <math>\,\lambda_i\xi_i = 0</math><br />
<br />
and feasibility requires <math>\,\alpha_i \geq 0</math>, <math>\,\lambda_i \geq 0</math> and <math>\,\xi_i \geq 0</math>. Since <math>\,\gamma = \alpha_i + \lambda_i</math> with <math>\,\lambda_i \geq 0</math>, the multipliers are bounded: <math>\,0 \leq \alpha_i \leq \gamma</math>.<br />
<br />
===Putting it all together===<br />
<br />
With our KKT conditions and the Lagrangian equation, we can now use quadratic programming to find <math>\,\alpha</math>.<br />
<br />
In matrix form, we want to solve the following optimization:<br />
:<math>\max_{\alpha} L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:<math>\,s.t.</math> <math>\underline{0} \leq \underline{\alpha} \leq \gamma</math>, <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Solving this gives us <math>\,\underline{\alpha}</math>, which we can use to find <math>\,\underline{\beta}</math> as before:<br />
:<math>\,\underline{\beta} = \sum{\alpha_i y_i \underline{x_i}}</math><br />
<br />
However, we cannot find <math>\,\beta_0</math> in the same way as before, even if we choose a point with <math>\,\alpha_i > 0</math>, because we do not know the value of <math>\,\xi_i</math> in the equation<br />
:<math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 + \xi_i = 0</math><br />
<br />
From our discussion on the KKT conditions, we know that <math>\,\lambda_i \xi_i = 0</math> and <math>\,\gamma = \alpha_i + \lambda_i</math>.<br />
<br />
So, if <math>\,\alpha_i < \gamma</math> then <math>\,\lambda_i > 0</math> and consequently <math>\,\xi_i = 0</math>.<br />
<br />
Therefore, we can solve for <math>\,\beta_0</math> if we choose a point where:<br />
:<math>\,0 < \alpha_i < \gamma</math><br />
<br />
====The SVM algorithm for non-separable data sets====<br />
<br />
The algorithm, then, for non-separable data sets is:<br />
<br />
# Use <code>quadprog</code> (or another quadratic programming technique) to solve the above optimization and find <math>\,\alpha</math><br />
# Find <math>\,\underline{\beta}</math> by solving <math>\,\underline{\beta} = \sum{\alpha_i y_i x_i}</math><br />
# Find <math>\,\beta_0</math> by choosing a point where <math>\,0 < \alpha_i < \gamma</math> and then solving <math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 = 0</math></div>
<hr />
<div>
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X{_\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> be the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency at which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
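A short Python sketch of the empirical error rate (the rule <math>\,h</math> and the data below are toy values of ours):<br />

```python
# Empirical (training) error rate: the fraction of training points that
# the classification rule h labels incorrectly.
def h(x):
    return 1 if x >= 0 else 0  # a toy threshold rule

X = [-2.0, -0.5, 0.1, 1.3, -0.2]
Y = [0, 0, 1, 1, 1]            # the last point disagrees with h

errors = sum(1 for x, y in zip(X, Y) if h(x) != y)
L_hat = errors / len(X)
print(L_hat)  # 0.2
```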
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case that <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>. Consider the probability that <math>\,r(X)=P\{Y=1|X=x\}</math>. Given <math>\,X=x</math>, By ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<sub></sub><br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimensional cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate, that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of it being of any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & \mathrm{if}\ P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remark:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since in practice we generally cannot determine the prior probability.<br />
<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
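The arithmetic of this example can be reproduced directly from Bayes' formula. The likelihood values below are read off from the stated result (<math>\,P(X=(0,1,0)|Y=1)=0.05</math> and <math>\,P(X=(0,1,0)|Y=0)=0.2</math>, so that <math>\,r(X)=0.025/0.125</math>); the Python sketch just re-checks the computation:<br />

```python
# Posterior r(X) = P(Y=1|X) via Bayes' formula, with equal priors.
p_x_given_pass = 0.05   # P(X=(0,1,0) | Y=1), read off from the text
p_x_given_fail = 0.20   # P(X=(0,1,0) | Y=0), read off from the text
prior = 0.5             # P(Y=1) = P(Y=0) = 0.5

num = p_x_given_pass * prior        # 0.025
den = num + p_x_given_fail * prior  # 0.125
r = num / den

print(round(r, 3))  # 0.2
print(r > 0.5)      # False, so classify as class 0 (fail)
```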
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible to know the prior <math>\,P(Y=1)</math> and the class conditional density <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>; this makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN) and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first one treats probability as changing based on observation, while the second treats probability as having an objective existence. They represent two different schools of thought in statistics.<br />
<br />
During the history of statistics, there are two major classification methods : Bayesian and frequentist. The two methods represent two different ways of thoughts and hold different view to define probability. The followings are the main differences between Bayes and Frequentist.<br />
<br />
'''Frequentist'''<br />
#Probability is objective.<br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict the weather of tomorrow because tomorrow is only one unique event, and cannot be referred to a frequency in a large number of samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables; a distribution is assigned to them, and probability statements can be made about them. <br />
#Can be applied to single events based on degrees of belief. For example, a Bayesian can predict tomorrow's weather, such as assigning a probability of <math>\,50\%</math> to rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, one does not see the man himself, but judges from many photos (repeated samples) of this man whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier is optimal, it cannot be used in most practical situations, since the prior probability and class conditional densities are usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of the loss <math>\,L(h)</math>.<br />
<br />
2 Regression: Find an estimate <math> (\hat r) </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density Estimation: Estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math>, and then apply the Bayes classifier. <br />
<br />
The third approach, in this form, is not popular because density estimation doesn't work very well in dimensions greater than 2.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
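As a quick illustration of the regression approach (category 2 above), the following Python sketch thresholds an estimate of <math>\,r(x)</math> at <math>\,\frac{1}{2}</math>; the logistic <code>r_hat</code> is an arbitrary stand-in, not an estimate fitted in the lecture:<br />

```python
import math

# Regression approach: classify by thresholding an estimate r_hat(x) of
# r(x) = P(Y=1|X=x) at 1/2.
def h(x, r_hat):
    return 1 if r_hat(x) > 0.5 else 0

def r_hat(x):
    # Assumed stand-in estimate: a logistic curve centred at x = 2.
    return 1.0 / (1.0 + math.exp(-(x - 2.0)))

print(h(3.0, r_hat))  # 1: r_hat(3) ~ 0.73 > 1/2
print(h(1.0, r_hat))  # 0: r_hat(1) ~ 0.27 <= 1/2
```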
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(G=k|X=x)=Pr(G=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>p</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math><br />
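The linear discriminants can be checked numerically. Below is a Python sketch (the means, covariance, and priors are made-up values) confirming that with equal priors the midpoint between the class means lies exactly on the LDA boundary:<br />

```python
import numpy as np

# LDA discriminant for class k (linear form from the derivation above):
# delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log(pi_k)
def lda_delta(x, mu, Sigma_inv, pi):
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

mu_k = np.array([0.0, 0.0])
mu_l = np.array([4.0, 2.0])
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.3], [0.3, 1.0]]))

midpoint = 0.5 * (mu_k + mu_l)
d_k = lda_delta(midpoint, mu_k, Sigma_inv, 0.5)
d_l = lda_delta(midpoint, mu_l, Sigma_inv, 0.5)
print(np.isclose(d_k, d_l))  # True: the midpoint lies on the boundary
```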
<br />
===QDA===<br />
The concept is the same idea of finding a boundary where the error rate for classification between classes are equal, except the assumption that each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math> is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
It is quadratic because the term <math>\,x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x</math> does not vanish when <math>\,\Sigma_k \ne \Sigma_l</math>, so the boundary is a curved (quadratic) surface rather than a hyperplane.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice we do not know the true parameters, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, the ML estimate is the weighted average of the class covariance estimates:<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{r=1}^{k}n_r} </math><br />
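A Python sketch of these plug-in estimates on synthetic data (the sample sizes and Gaussian parameters are arbitrary choices for illustration):<br />

```python
import numpy as np

# Synthetic two-class data: 30 points near 0 and 70 points near 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (70, 2))])
y = np.array([0] * 30 + [1] * 70)
n, K = len(y), 2

pi_hat = np.array([np.mean(y == k) for k in range(K)])          # n_k / n
mu_hat = np.array([X[y == k].mean(axis=0) for k in range(K)])   # class means
Sigma_hat = [np.cov(X[y == k].T, bias=True) for k in range(K)]  # per-class MLE
# Pooled covariance: weighted average of the class covariance estimates.
Sigma_pooled = sum(np.sum(y == k) * Sigma_hat[k] for k in range(K)) / n

print(pi_hat)  # [0.3 0.7]
```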
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
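A minimal Python sketch of Case 1 (the centres and priors are made-up values): classification reduces to choosing the nearest centre after adjusting by the log prior.<br />

```python
import numpy as np

# With Sigma_k = I, delta_k reduces to
# -0.5 * ||x - mu_k||^2 + log(pi_k),
# i.e. the nearest-centre rule adjusted by the prior.
def classify_spherical(x, mus, pis):
    deltas = [-0.5 * np.sum((x - mu) ** 2) + np.log(pi)
              for mu, pi in zip(mus, pis)]
    return int(np.argmax(deltas))

mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
print(classify_spherical(np.array([1.0, 0.0]), mus, [0.5, 0.5]))  # 0: closer to mu_0
print(classify_spherical(np.array([3.0, 0.0]), mus, [0.5, 0.5]))  # 1: closer to mu_1
```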
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>; so if <math>\,X</math> is symmetric, we have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
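The transformation can be verified numerically. In this Python sketch (the covariance is chosen arbitrarily), we whiten the data with <math> \, x^* = S^{-\frac{1}{2}}U^\top x </math> computed from the sample covariance, and check that the transformed data have identity covariance:<br />

```python
import numpy as np

# Synthetic Gaussian data with an arbitrary covariance.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.8], [0.8, 1.0]], size=500)

# Sigma = U S U^T (Sigma is symmetric, so the SVD factors U and V coincide).
S, U = np.linalg.eigh(np.cov(X.T))          # eigenvalues S, eigenvectors U
X_star = X @ U @ np.diag(1.0 / np.sqrt(S))  # rows are x* = S^{-1/2} U^T x

# Whitening with the sample covariance makes the sample covariance of x*
# exactly the identity, so Case 1 applies to the transformed data.
print(np.allclose(np.cov(X_star.T), np.eye(2)))  # True
```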
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume, ahead of time, that a data point belongs to one class or the other. All classes therefore need to have the same shape (covariance) for this method to be applicable, which is exactly the assumption made by LDA.<br />
<br />
If the classes have different shapes, in other words different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose you have two classes with different shapes and want to transform them to a common shape before classifying a given data point. Which class's transformation should you use? If you use the transformation of class A, you have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class against the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> decision boundaries in total. Each boundary <math>\,a^{T}x+b=0</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: Each boundary <math>\,x^{T}ax + b^{T}x + c=0</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters, since the symmetric matrix <math>\,a</math> alone contributes <math>\frac{d(d+1)}{2}</math> of them. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
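The two counting formulas above can be written directly as functions (a small Python sketch):<br />

```python
# Parameter counts from the section above, as functions of the
# dimension d and the number of classes K.
def lda_params(d, K):
    return (K - 1) * (d + 1)

def qda_params(d, K):
    return (K - 1) * (d * (d + 3) // 2 + 1)

print(lda_params(10, 3))  # 22
print(qda_params(10, 3))  # 132
```

Even at a modest dimension such as <math>\,d=10</math>, QDA already requires several times as many parameters as LDA, which illustrates the robustness gap in the plot.<br />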
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct in only 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform principal component analysis using the SVD method. In fact, Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function; here we analyze its code to see how it differs from the SVD method. The following is the code of <code>princomp</code> with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables; this is why we take the transpose of <math>\,X</math> when using <code>princomp</code> on the 2_3 data in Assignment 1.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using <code>princomp</code> and SVD respectively, producing the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that <code>y=score</code> and <code>v=U</code> (up to a possible sign flip in each column, since singular vectors are only determined up to sign).<br />
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> parameters of a second symmetric covariance matrix make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector and <math>x \in \mathbb{R}^d</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate with a linear method; here <math>\,v</math> is taken to be diagonal, with entries <math>\,v_1,v_2,...,v_d</math>.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
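A tiny Python sketch of the construction (the weights are arbitrary illustrative values, not estimated from data): a function that is linear in <math>\,x^*</math> is quadratic in the original <math>\,x</math>.<br />

```python
import numpy as np

# Arbitrary illustrative weights: w = (1, -2), v = (0.5, 0.25) (diagonal v).
w_star = np.array([1.0, -2.0, 0.5, 0.25])  # [w_1, w_2, v_1, v_2]

def g_star(x):
    x_star = np.concatenate([x, x ** 2])   # x* = [x_1, x_2, x_1^2, x_2^2]
    return float(w_star @ x_star)          # linear in x*, quadratic in x

x = np.array([3.0, 2.0])
direct = 1.0*3.0 - 2.0*2.0 + 0.5*3.0**2 + 0.25*2.0**2  # w^T x + v^T x^2
print(abs(g_star(x) - direct) < 1e-12)  # True: the two forms agree
```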
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is for classification and FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
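For the two-class case this intuition has a standard closed form, <math>\,w \propto S_W^{-1}(\mu_1-\mu_0)</math>, where <math>\,S_W</math> is the within-class scatter matrix; this formula is not derived in the text above, and the data below are synthetic. A Python sketch:<br />

```python
import numpy as np

# Two synthetic Gaussian clouds (arbitrary means and common covariance).
rng = np.random.default_rng(2)
X0 = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=200)
X1 = rng.multivariate_normal([4, 2], [[1, 0.5], [0.5, 1]], size=200)

# Within-class scatter, then the two-class Fisher direction w = Sw^{-1}(mu1 - mu0).
Sw = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
w = np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))

# Projected onto w, the class means are far apart relative to the spread
# within each class, which is exactly the goal described above.
p0, p1 = X0 @ w, X1 @ w
gap = abs(p1.mean() - p0.mean())
spread = max(p0.std(), p1.std())
print(gap > 3 * spread)  # True: classes are well separated along w
```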
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> Xc <- scale(X, center=TRUE, scale=FALSE)<br />
>> s <- svd(Xc,nu=1,nv=1)<br />
: Centre the data (PCA assumes mean-centred data) and calculate its singular value decomposition. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function (from the <code>MASS</code> package), given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, and each class may have a different size. To separate the two classes we must determine the class whose mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, and so it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
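<br />
This closed-form solution can be sketched directly in code. Below is a minimal NumPy example (the Gaussian parameters are the same illustrative ones used in the R and MATLAB examples of this section, not part of the derivation):<br />
<br />
```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes; means and covariance mirror the examples in this section.
cov = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1, 1], cov, size=200)
X2 = rng.multivariate_normal([5, 3], cov, size=200)

# Class means and within-class covariance S_W = Sigma_1 + Sigma_2
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)

# w is proportional to S_W^{-1} (mu_1 - mu_2); normalize for convenience
w = np.linalg.solve(Sw, mu1 - mu2)
w = w / np.linalg.norm(w)

# Project the data onto w: the projected class means should be far apart
z1, z2 = X1 @ w, X2 @ w
print(abs(z1.mean() - z2.mean()))
```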
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA using a figure produced in MATLAB.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the most discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this MATLAB example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(It is more reasonable to have at least 2 directions)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \frac{1}{n_{i}}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant; since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\end{align}<br />
</math><br />
<br />
When the two classes have equal size (<math>\,n_{1}=n_{2}</math>), the total mean satisfies <math>\mathbf{\mu}_{1}-\mathbf{\mu} = -(\mathbf{\mu}_{2}-\mathbf{\mu})</math>, so this becomes<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
2\left[(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}\right]<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
so <math>\mathbf{S}_{B^{\ast}}</math> is proportional to <math>\mathbf{S}_{B}</math> in this case, and both lead to the same discriminant directions.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that the <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, thus set the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\leq k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
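<br />
As a sketch of the whole multi-class procedure, the following NumPy example builds <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math> as defined above and projects onto the top <math>k-1</math> eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> (the class count, dimension, and synthetic data are illustrative choices, not from the derivation):<br />
<br />
```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n_per = 3, 4, 100  # illustrative sizes

# Sample k Gaussian classes in d dimensions
means = rng.normal(scale=3.0, size=(k, d))
X = np.vstack([rng.normal(loc=m, size=(n_per, d)) for m in means])
y = np.repeat(np.arange(k), n_per)

mu = X.mean(axis=0)           # total mean
Sw = np.zeros((d, d))         # within-class covariance
Sb = np.zeros((d, d))         # between-class covariance
for i in range(k):
    Xi = X[y == i]
    mui = Xi.mean(axis=0)
    Sw += (Xi - mui).T @ (Xi - mui) / len(Xi)
    diff = (mui - mu).reshape(-1, 1)
    Sb += len(Xi) * (diff @ diff.T)

# Columns of W: eigenvectors of S_W^{-1} S_B for the k-1 largest eigenvalues
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs.real[:, order[:k - 1]]   # d x (k-1) projection matrix
Z = X @ W                            # data projected to k-1 dimensions
print(W.shape)  # (4, 2)
```

Note that, as stated above, <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> has at most <math>k-1</math> nonzero eigenvalues, so only <math>k-1</math> directions carry discriminant information.<br />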
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(x) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math> and <math>\,y_{1}, ..., y_{p}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the least squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
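<br />
The closed-form least squares solution and the hat matrix are easy to verify numerically. A minimal NumPy sketch (the toy data and true coefficients are made up for illustration):<br />
<br />
```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: y = 1 + 2x plus a little noise (illustrative values only)
n = 50
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=n)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), x])

# beta_hat = (X^T X)^{-1} X^T y; solving the normal equations is preferred
# to forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values y_hat = H y, where H = X (X^T X)^{-1} X^T is the hat matrix
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print(beta_hat)  # close to [1, 2]
```

The hat matrix is a projection: applying it twice gives the same result (<math>\mathbf{H}\mathbf{H} = \mathbf{H}</math>).<br />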
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by adding a row of ones to the transposed data, so that each column of x is a data point with a constant term.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== logistic function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y\, dx = \ln(1 + e^{x}) + C</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>\,x</math>, which slows to linear growth of slope 1/4 near <math>\,x = 0</math>, then approaches <math>\,y = 1</math> with an exponentially decaying gap.<br />
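<br />
Properties 1 and 2 above are easy to check numerically; a small Python sketch:<br />
<br />
```python
import math

def logistic(x):
    """The logistic (sigmoid) function y = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

# Property 2: y(0) = 1/2
assert logistic(0.0) == 0.5

# Property 1: dy/dx = y(1 - y), checked against a central finite difference
for x in (-2.0, 0.0, 1.5):
    h = 1e-6
    numeric = (logistic(x + h) - logistic(x - h)) / (2 * h)
    analytic = logistic(x) * (1 - logistic(x))
    assert abs(numeric - analytic) < 1e-8

print("logistic properties hold")
```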
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood, using Pr(Y|X). The maximum likelihood of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the known distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used, which requires the second derivative in addition to the first. This is demonstrated in the next section.<br />
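Before moving on, the gradient formula can be sanity-checked numerically. The following sketch is in Python rather than the Matlab used elsewhere in these notes, and the data and names are made up for illustration; it evaluates the log-likelihood and its gradient on a tiny data set so the analytic gradient can be compared against finite differences.<br />

```python
import math

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]"""
    total = 0.0
    for xi, yi in zip(X, y):
        a = sum(b * v for b, v in zip(beta, xi))
        total += yi * a - math.log(1.0 + math.exp(a))
    return total

def gradient(beta, X, y):
    """dl/dbeta = sum_i (y_i - p(x_i; beta)) x_i"""
    g = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        a = sum(b * v for b, v in zip(beta, xi))
        p = math.exp(a) / (1.0 + math.exp(a))
        for j, v in enumerate(xi):
            g[j] += (yi - p) * v
    return g

# Toy data: the first component plays the role of an intercept feature.
X = [(1.0, 0.5), (1.0, -1.2), (1.0, 2.0), (1.0, 0.1)]
y = [1, 0, 1, 0]
beta = [0.3, -0.7]
```

Each component of <code>gradient(beta, X, y)</code> should agree with a central-difference quotient of <code>log_likelihood</code>, confirming the derivation above.<br />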
<br />
== Logistic Regression (2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; see the [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html Matrix Reference Manual], a useful reference for linear algebra and for the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> to one via the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then evaluating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^T W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Now perform a weighted linear regression of the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
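For intuition, with a single scalar predictor and no intercept the WLS estimator above reduces to a ratio of weighted sums. A minimal sketch (Python for illustration; the data and weights are made up):<br />

```python
def wls_scalar(x, y, w):
    """Weighted least squares for one scalar predictor without intercept:
    beta = (sum w_i x_i^2)^(-1) (sum w_i x_i y_i)."""
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den

# Exact recovery: y = 2x, so the estimator returns 2 for any positive weights.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
w = [0.5, 1.0, 2.0]
print(wls_scalar(x, y, w))  # -> 2.0
```

Because the data lie exactly on the line <math>y=2x</math>, the weights do not change the answer; with noisy data they would pull the fit toward the heavily weighted points.<br />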
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, the augmented coefficient vector will be <math>(d+1)\times{1}</math>.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. When it does not, only local convergence of the method can be proven, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, it is rare for a starting value to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving will remedy non-convergence when it does occur. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
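The pseudo code above can be implemented directly. Below is a sketch in Python (the lecture's examples use Matlab; this is an illustration only, restricted to <math>d=2</math> so that the <math>2\times 2</math> system <math>(XWX^T)\underline{\beta}=XWZ</math> can be solved in closed form, and run on made-up, non-separable data):<br />

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def irls_logistic(X, y, tol=1e-10, max_iter=50):
    """Follow the pseudo code: X holds 2-d points x_i, y holds 0/1 labels."""
    beta = [0.0, 0.0]                                       # step 1: beta <- 0
    for _ in range(max_iter):
        p = [sigmoid(beta[0]*u + beta[1]*v) for u, v in X]  # step 3: p_i
        w = [pi * (1.0 - pi) for pi in p]                   # step 4: w_i
        z = [beta[0]*u + beta[1]*v + (yi - pi) / wi         # step 5: z_i
             for (u, v), yi, pi, wi in zip(X, y, p, w)]
        # step 6: weighted normal equations, solved in closed form (2x2)
        a = sum(wi * u*u for wi, (u, v) in zip(w, X))
        b = sum(wi * u*v for wi, (u, v) in zip(w, X))
        d = sum(wi * v*v for wi, (u, v) in zip(w, X))
        r0 = sum(wi * u*zi for wi, (u, v), zi in zip(w, X, z))
        r1 = sum(wi * v*zi for wi, (u, v), zi in zip(w, X, z))
        det = a*d - b*b
        new = [(d*r0 - b*r1) / det, (a*r1 - b*r0) / det]
        if max(abs(n - o) for n, o in zip(new, beta)) < tol:  # step 7
            return new
        beta = new
    return beta

# Intercept as a constant first feature; the classes overlap, so the MLE exists.
X = [(1.0, -2.0), (1.0, -1.0), (1.0, -0.5), (1.0, 0.5), (1.0, 1.0), (1.0, 2.0)]
y = [0, 0, 1, 0, 1, 1]
beta = irls_logistic(X, y)
```

At convergence the gradient <math>\sum_i (y_i - p(\underline{x}_i;\underline{\beta}))\underline{x}_i</math> is (numerically) zero, which is exactly the first-order condition derived earlier.<br />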
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_{0}=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is guaranteed neither to fall between 0 and 1 nor to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
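As a quick check of the parameter counts above (a small Python sketch, not part of the lecture), the LDA count breaks down into class means, the shared covariance, and class priors:<br />

```python
def logistic_params(d):
    # one coefficient per input dimension (no intercept, as in beta^T x)
    return d

def lda_params(d):
    # 2d mean components + d(d+1)/2 covariance entries + 2 class priors
    return 2*d + d*(d + 1)//2 + 2

# Linear vs. quadratic growth with dimension d.
for d in (2, 10, 100):
    print(d, logistic_params(d), lda_params(d))
```

For <math>d=10</math> this gives 10 parameters for logistic regression versus <math>(100+50+4)/2=77</math> for LDA, matching the formula above.<br />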
<br />
====By example====<br />
<br />
Now we compare LDA and logistic regression by an example, again using the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to classify the data by logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2\geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
As before, viewing these equations as a weighted least squares problem makes them easier to derive.<br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as they are in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem is therefore not as 'nice' as in the 2-class problem, since we lose this simplification.<br />
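The K-class posteriors above are easy to compute directly. A short sketch (Python for illustration, with made-up coefficient values) verifies that they always sum to 1:<br />

```python
import math

def posteriors(betas, x):
    """Posteriors for the K-class logistic model above; betas holds the
    K-1 coefficient vectors, and class K is the reference class in the
    denominator."""
    scores = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    return [math.exp(s) / denom for s in scores] + [1.0 / denom]

# K = 3 classes, two features; the coefficient values are made up.
betas = [(0.5, -1.0), (-0.2, 0.3)]
p = posteriors(betas, (1.0, 2.0))   # three probabilities summing to 1
```

Each posterior lies strictly in (0, 1), but unlike the 2-class case no single pair of them need be complementary.<br />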
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
Separating hyperplane classifiers try to separate the data using linear decision boundaries. When the classes overlap, the approach can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged, transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>sign(\underline{\beta}^T \underline{x} + \beta_0) = sign(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math><br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the [http://en.wikipedia.org/wiki/Artificial_neural_network Artificial Neural Network] models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Due to the iterative nature of the solution, the problem is not convex and has no unique global solution. The algorithm does not converge to a unique hyperplane, and the solution depends on the size of the gap between the classes. If the classes are separable, the algorithm can be shown to converge to a separating hyperplane; the proof of this convergence is known as the ''perceptron convergence theorem''. For overlapping classes, however, convergence cannot be guaranteed.<br />
<br />
<br />
If a separating hyperplane exists between the 2 classes, it is generally not unique, and the perceptron algorithm may return any of infinitely many solutions.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can find the decision boundary even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> takes the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0}=0 </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points, as illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, and it terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\|\underline{\beta}\|=1</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points. <br />
<br/><br />
<br/><br />
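The update rule above can be sketched in a few lines (Python for illustration, reusing the toy data from the table in the previous lecture; as a practical tweak not stated above, points with <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0})=0</math> are treated as misclassified so that the all-zero starting value can move):<br />

```python
def train_perceptron(X, y, rho=0.5, max_epochs=1000):
    """Stochastic perceptron updates: whenever a point satisfies
    y_i (beta^T x_i + beta_0) <= 0, move (beta, beta_0) by
    rho * (y_i x_i, y_i)."""
    beta = [0.0] * len(X[0])
    beta0 = 0.0
    for _ in range(max_epochs):
        changed = False
        for xi, yi in zip(X, y):
            if yi * (sum(b * v for b, v in zip(beta, xi)) + beta0) <= 0:
                beta = [b + rho * yi * v for b, v in zip(beta, xi)]
                beta0 += rho * yi
                changed = True
        if not changed:          # no misclassified points: converged
            break
    return beta, beta0

# Toy data from the table in the previous lecture (labels are +/-1).
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
y = [1, 1, 1, -1, -1, -1]
beta, beta0 = train_perceptron(X, y)
```

Since this data set is linearly separable, the perceptron convergence theorem guarantees the loop exits with every point on the correct side of the boundary.<br />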
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. This problem can be eliminated by using basis expansions technique. To be specific, we try to find a hyperplane not in the original space, but in the enlarged space obtained by using some basis functions.<br />
#If the classes are separable, there exist infinitely many solutions to the Perceptron problem, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#Perfect separation is not always attainable, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Consider yourself on a mountain peak, wanting to reach the lowland as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the gradient. However, if the terrain contains a basin that is not the lowest point and you start near it, you will eventually arrive at that basin (a local minimum) and get stuck there.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we get rid of the summation over <math>\,i</math> (all data points). Actually, this is an alternative of the original gradient descent algorithm (sometimes called batch gradient descent) known as Stochastic gradient descent, where we approximate the true gradient by only evaluating on a single training example. This means that <math>\,{\beta}</math> gets improved by computation of only one sample. When there is a large data set, say, population database, it's very time-consuming to do summation over millions of samples. By Stochastic gradient descent, we can treat the problem sample by sample and still get decent result in practice.<br />
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <br />
<br />
<ref> Haykin, Simon (2009). Neural Networks and Learning Machines. Pearson Education, Inc. </ref><br />
A neural network resembles the brain in two respects:<br />
# Knowledge is acquired by the network from its environment through a learning process.<br />
# Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.<br />
<br />
<ref><br />
Theory of the Backpropagation Neural Network, R. Necht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network but it can have many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer that each represent the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1).<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
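A two-line sketch (in Python; values shown are direct consequences of the formula) of this activation and of the derivative <math>\,\sigma'(a)=\sigma(a)(1-\sigma(a))</math> that back-propagation will need later:<br />

```python
import numpy as np

def sigma(a):
    """Logistic (inverse-logit) activation: a smooth, saturating replacement for sign."""
    return 1.0 / (1.0 + np.exp(-a))

def sigma_prime(a):
    """Its derivative, conveniently expressible through sigma itself."""
    s = sigma(a)
    return s * (1.0 - s)

# Saturation: outputs are squeezed into (0, 1), with maximum slope at a = 0.
print(sigma(0.0))        # 0.5
print(sigma(10.0))       # very close to 1
print(sigma_prime(0.0))  # 0.25
```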
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates, meaning there are maximum and minimum output values. This property ensures that the weights are bounded and therefore that the search time is limited. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; RBF networks, for example, use non-monotonic activations and are also powerful models. <br />
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to several output units and thus turned into a classification problem.<br />
<br />
For simplicity, assume for the moment that there is only one unit at the end and that we are doing regression.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math>'s are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: the indices i and j are reversed in this figure.)''<br />
<br />
Notice that the weighted sums of the outputs of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the expected output and <math>\hat{y_k}</math> is the output produced by the network.<br />
#By propagating to the previous layers evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math> where <math>i</math> is associated to the previous layer.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
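The steps above can be sketched for a one-hidden-layer regression network (Python with NumPy rather than R; the architecture, learning rate and target function are illustrative assumptions, with logistic hidden units and a linear output unit):<br />

```python
import numpy as np

rng = np.random.default_rng(1)

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

# One-hidden-layer regression network: x (d inputs) -> z (p hidden units) -> yhat.
d, p = 1, 8
U = rng.uniform(-1, 1, size=(p, d))      # hidden weights u_jl, initialized near zero
w = rng.uniform(-1, 1, size=p)           # output weights w_i

def forward(x):
    a = U @ x                            # step 1: inputs a_j to the hidden units
    z = sigma(a)                         # hidden outputs z_j = sigma(a_j)
    return z, w @ z                      # yhat = sum_i w_i z_i

def backprop_step(x, y, rho=0.05):
    """Steps 2-5 for a single training point (x, y), squared-error loss."""
    z, yhat = forward(x)
    delta_out = -(y - yhat)              # step 2: delta at the single output unit
    delta = z * (1 - z) * w * delta_out  # step 3: delta_j = sigma'(a_j) * w_j * delta_out
    w[:] -= rho * delta_out * z          # step 5: gradient step on output weights
    U[:] -= rho * np.outer(delta, x)     # steps 4-5: d err / d u_jl = delta_j * z_l

# Step 6: feed the points one at a time, repeatedly, and watch the error fall.
X = rng.uniform(-2, 2, size=(200, 1))
Y = np.sin(X[:, 0])

def train_mse():
    return float(np.mean([(forward(x)[1] - y) ** 2 for x, y in zip(X, Y)]))

before = train_mse()
for _ in range(50):                      # 50 passes over the training set
    for x, y in zip(X, Y):
        backprop_step(x, y)
after = train_mse()
print(before, after)                     # the training error should decrease
```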
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method of choosing weights mentioned in class was to randomize the weights before the first step. This is not likely to be near the optimal solution, but it is simple to implement. More specifically, random values near zero (usually in [-1,1]) are a good choice for the initial weights. In this case, the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output using a linear approximation of <math>\,\sigma(a_i)</math>, which finds optimal weights in the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation process, <math>\,\rho</math> should decrease as the number of iterations increases.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate may lead to a very slow convergence rate (a very long learning phase). However, the advantage of a small learning rate is that it can guarantee convergence. Thus, it is generally better to choose a relatively small learning rate to ensure stability; <math>\,\rho</math> is usually chosen between 0.01 and 0.7.<br />
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we will mainly discuss how to estimate the number of hidden units at the very beginning. Obviously, we should then adjust it to be more precise using CV, LOO or other complexity control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from a highly complicated mixture model, more hidden units are needed. <br />
<br />
Actually, the number of hidden units determines the size of the model, and therefore the total number of weights in the model. Generally speaking, the number of weights should not be larger than the number of training data points, say N. Thus N/10 is sometimes a good choice. However, in practice, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this Neural Network, the number of nodes is reduced until we reach a layer whose number of nodes matches the desired dimensionality. (In the very first few layers the number of nodes need not be strictly decreasing, as long as we eventually reach a layer with fewer nodes.) From this point on, the previous layers are mirrored, so the output layer has the same number of units as the input layer. Now note that if we feed the network a point and get an output approximately equal to that input, the input has been reconstructed at the output from the middle-layer units alone. So the outputs of the middle-layer units can represent the input in fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
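A minimal sketch of this bottleneck idea (Python with NumPy; a linear toy network with a hypothetical 3-2-3 architecture, whereas a real network would use nonlinear units and more layers):<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# 3-d data lying near a 2-d plane, so a bottleneck of width 2 can capture it.
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, -1.0]])          # the plane is spanned by these two rows
latent = rng.normal(size=(300, 2))
X = latent @ B + 0.01 * rng.normal(size=(300, 3))

# Mirrored network: 3 inputs -> 2 bottleneck units -> 3 outputs.
W1 = 0.1 * rng.normal(size=(2, 3))        # first half (encoder)
W2 = 0.1 * rng.normal(size=(3, 2))        # mirrored second half (decoder)

for _ in range(2000):                     # gradient descent on reconstruction error
    H = X @ W1.T                          # bottleneck activations: the low-dim mapping
    E = H @ W2.T - X                      # output minus input = reconstruction error
    W2 -= 0.02 * E.T @ H / len(X)
    W1 -= 0.02 * W2.T @ E.T @ X / len(X)

err = np.mean(((X @ W1.T) @ W2.T - X) ** 2)
print(err)                                # small: the 2-d code reconstructs the 3-d input
```

The 2-d bottleneck activations `H` play the role of the low-dimensional mapping described above, and the second half of the network performs the reconstruction.<br />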
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math>s may become negligible and the errors vanish. This is a numerical problem in which it is difficult to estimate the errors, so in practice configuring a Neural Network with Back-propagation involves some subtleties.<br />
Deep Neural Networks became popular a few years ago, building on work by Radford Neal in his PhD thesis. A Deep Neural Network training algorithm deals with training a Neural Network with a large number of layers.<br />
<br />
The approach of training the deep network is to assume the network has only two layers first and train these two layers. After that we train the next two layers, so on and so forth.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize the energy function, which is inspired by the theory in atomic physics concerning the most stable state; or somehow finding out which output of the second layer is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case as an example. A feedforward neural network was trained using back-propagation to assist with the marketing control of airline seat allocations. The neural approach adapted to the booking rules, and the system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to model human brains, hence the fancy name "Neural Network". But now we know that they are just layers of logistic regression stacked on top of each other and have little to do with how the brain really functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. As a result of these kinds of claims it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since the objective is not convex, we still face the problem of local minima, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and tune them wisely. Partly for this reason, they are no longer an active research area in machine learning, though NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is highly complex, with so many degrees of freedom that it can learn every detail of the training set. Such a model will have very high precision on the training set but will show very poor ability to predict outcomes for new instances, especially outside the domain of the training set.<br />
<br />
In a Neural Network, if there are too many layers, the network will have many degrees of freedom and will learn every characteristic of the training data set. That means it will give very precise outcomes on the training set but will not be able to generalize the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and so has a high error rate on the training set.<br />
There is always a trade-off: if our model is too simple, underfitting can occur, and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model which fits it best. We can find a polynomial of high degree which passes almost exactly through all the points in the training set. But in fact the training set comes from a line model. Although the complex model has less error on the training set, it diverges from the line in ranges where we have no training points. Because of this, the high-degree polynomial has very poor predictive power on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a polynomial model of degree two. If we model this training set with a polynomial of degree one, our model will have a high error rate on the training set and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier. The reason is that just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas, but a new banana of a slightly different shape, for example, cannot be detected. This is the tradeoff: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex, and underfitting occurs when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data. For example, we may assume the model belongs to a family of polynomials, or to a class of neural networks; there are other choices as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier whose error rate we want to minimize. We apply <math>\displaystyle h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average, giving an empirical estimate of the probability that <math>h(x_{i}) \neq y_{i}</math>.<br />
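As a quick illustration, the empirical error rate can be computed directly; the classifier <math>\,h</math> and the labelled points below are hypothetical, chosen only to make the formula concrete:<br />

```python
import numpy as np

# A hypothetical classifier h and labelled points, to illustrate the formula.
def h(x):
    return 1 if x > 0 else -1

x = np.array([-2.0, -0.5, 0.3, 1.0, -1.0])
y = np.array([-1,    1,   1,   1,   -1])   # only the second point is misclassified by h

# L_h = (1/n) * sum of indicators I(h(x_i) != y_i)
L_h = np.mean([h(xi) != yi for xi, yi in zip(x, y)])
print(L_h)   # 1/5 = 0.2
```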
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
This estimate is biased downward, meaning that on average it is less than the true error rate. <br />
<br />
As complexity increases from low to high, the training error always decreases. On test data, however, the error decreases only up to a point and then increases, since the model has never seen the test data before. This happens because training error falls as we fit the model better by increasing its complexity, but as we have seen, such a complex model does not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is where the error rate on the test data is at its minimum; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in \mathbb{R}</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a more important property for an estimator than just being unbiased: the mean squared error. In statistics, there are problems for which it may be good to use an estimator with a small bias. In some cases, an estimator with a small bias may have lesser mean squared error or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under transformations while the property of mean-unbiasedness may be lost under nonlinear transformations. For example, while using an unbiased estimator with large mean square error to estimate the parameter, we highly risk a big error. In contrast, a biased estimator with small mean square error will well improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a fixed MSE, if we have a low bias then we must have a high variance, and vice versa.<br />
<br />
Test error is a good estimate of MSE. We want somewhat balanced bias and variance (neither too high), even though the resulting estimator will have some bias.<br />
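The decomposition can be checked numerically. In this sketch (Python with NumPy; the distribution, shrinkage factor and sample sizes are illustrative assumptions), we estimate a known mean with a deliberately biased, low-variance estimator:<br />

```python
import numpy as np

rng = np.random.default_rng(3)

# Estimate f = E(X) = 2 from n = 10 draws of N(2, 1), using a shrunken sample
# mean c * xbar: c < 1 introduces bias (c-1)*f but reduces variance to c^2/n.
f, n, c = 2.0, 10, 0.8
samples = rng.normal(f, 1.0, size=(100000, n))
fhat = c * samples.mean(axis=1)       # 100000 realizations of the estimator

bias = fhat.mean() - f                # E(fhat) - f, close to (c-1)*f = -0.4
var = fhat.var()                      # close to c^2 / n = 0.064
mse = np.mean((fhat - f) ** 2)

# The identity MSE = Variance + Bias^2 holds exactly for the sample moments.
print(mse, var + bias ** 2)
```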
<br />
<br />
Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
> x <- rnorm(200,0,1)<br />
> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
> xtest <- rnorm(50,1,1)<br />
> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
> p1 <- lm(y~x)<br />
> p2 <- lm(y ~ poly(x,2))<br />
> pn <- lm(y ~ poly(x,10))<br />
> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are based on a <math>\,N(0,1)</math> distribution, while the test set has a <math>\,N(1,1)</math> distribution. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random variation. Polynomial least squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
> # calculate the mean squared error of degree 1 poly<br />
> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
[1] 1.576042<br />
> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 7.727615<br />
: Training and test mean squared errors for the linear fit. These are both quite high - and since the data is non-linear, the different mean value of the test data increases the error quite a bit.<br />
> # calculate the mean squared error of degree 2 poly<br />
> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
[1] 0.08608467<br />
> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 0.08407432<br />
: This fit is far better - and there is not much difference between the training and test error, either.<br />
> # calculate the mean squared error of degree 10 poly<br />
> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
[1] 0.07967558<br />
> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but not by much - and the test set error has risen again. The overfitting makes it a poor predictor. As the degree of the polynomial rises further, numerical accuracy becomes an issue - and a good fit is not even consistently produced for the training data.<br />
> # calculate mse of sin/cos fit<br />
> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
[1] 0.1105446<br />
> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 1.320404<br />
: Fitting a function of the form sin(x)+cos(x) works pretty well on the training set, but because it is not the real underlying function, it fails on test data that does not lie in the same domain.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V: since the model has not seen V yet, and we know the proper label of every element in V, we can count how many were misclassified.<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although the training error always decreases with increasing model complexity, the test error starts to increase at a certain point, which is known as overfitting (see [[#prediction-error|figure 2]] above). Since the test error estimates the MSE (mean squared error) best, the idea is to divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{(x_i,y_i) \in \nu}I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in reality it may be hard to collect data (we often also suffer from the curse of dimensionality), and a larger data set may be hard to come by. Consequently, we may not be able to afford to sacrifice part of our limited data. In this case we use another method that addresses this problem, K-fold cross-validation. We divide the data set into <math>\,K</math> subsets of roughly equal size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
If <math>\,K=n</math> (leave-one-out), the error estimate has low bias and high variance. Each subset contains a single element, so the model is trained on all points except one and then validated using that point.<br />
<br />
If <math>\,K</math> is small, say 2-fold or 5-fold, the estimate has high bias and low variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th <math>( \,k \in [ 1, K ] )</math> part, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>. The overall estimate is then<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. Next we use the first three subsets as the training set and the last as the validation set; then the 1st, 2nd and 4th subsets as the training set and the 3rd as the validation set; and so on, until every subset has been the validation set once (so all observations are used for both training and validation). After we get <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for the degree 1 model. Similarly, we can estimate the error for a degree n model and plot the resulting curve. Now we are able to choose the degree which corresponds to the minimum error. The same method can be used, for example, to find the optimal number of hidden units in a neural network: start with 1 unit, then 2, 3, and so on, and pick the number of units with the lowest average error.<br />
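The procedure just described can be sketched in a few lines (Python with NumPy here, mirroring the earlier R example; the data-generating model and <math>\,K = 4</math> are illustrative assumptions):<br />

```python
import numpy as np

rng = np.random.default_rng(4)

# Quadratic data with noise, as in the earlier R example: pick the degree by 4-fold CV.
x = rng.normal(0, 1, 200)
y = x ** 2 - 0.5 * x + rng.normal(0, 0.3, 200)

def cv_error(degree, K=4):
    """Average validation MSE over K folds: fit on K-1 parts, test on the k-th part."""
    folds = np.array_split(rng.permutation(len(x)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coef, x[val]) - y[val]) ** 2))
    return float(np.mean(errs))

scores = {d: cv_error(d) for d in (1, 2, 10)}
print(scores)    # degree 1 underfits badly; degree 10 typically does worse than degree 2
```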
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be <math>\mathbf{y}</math> and the vector of fitted values be <math>\hat{\mathbf{y}}</math>. For a linear smoother,<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>.<br />
<br />
The leave-one-out cross-validation error can then be computed from a single fit:<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}\right]^{2}</math>,<br />
<br />
where <math>\hat f^{-i}</math> denotes the model fitted with the <math>i</math>th observation left out.<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that <math>trace(\mathbf{H})</math> is often much cheaper to compute than the individual diagonal entries <math>\mathbf{H}_{ii}</math>.<br />
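The two formulas differ only in the denominator, which is easy to see in code. A sketch, assuming ordinary least squares so that <math>\mathbf{H}</math> is the hat matrix above (function names and the test design are illustrative):<br />

```python
import numpy as np

def loo_error(X, y):
    """Exact leave-one-out error using the diagonal entries H_ii."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    resid = y - H @ y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

def gcv_error(X, y):
    """GCV approximation: every H_ii is replaced by trace(H)/N."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    resid = y - H @ y
    N = len(y)
    return np.mean((resid / (1.0 - np.trace(H) / N)) ** 2)
```

For ordinary least squares, <math>trace(\mathbf{H})</math> equals the number of parameters, so GCV requires no explicit computation of the diagonal of <math>\mathbf{H}</math> at all.<br />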
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point in the original training data set to train our model, then using the data point that was left out to estimate the true error. By repeating this process for every data point in the original data set, we obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every possible value of the parameter, we must train the model <math>\,K</math> times. This cost is even more pronounced in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, where <math>\,n</math> is the number of data points in the data set.<br />
<br />
Fortunately, when adding a data point to the classifier is reversible, computing the difference between two classifiers is cheaper than computing both from scratch. So, if the classifier trained on all the data points is known, we can simply undo the contribution of each data point in turn (<math>\,n</math> times) to calculate the leave-one-out cross-validation error rate.<br />
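For linear smoothers this "undo one point" idea is exact: the hat-matrix identity from the generalized cross-validation section gives all <math>\,n</math> leave-one-out residuals from a single fit. A sketch comparing the two routes (least squares assumed, names illustrative):<br />

```python
import numpy as np

def loo_sq_errors_refit(X, y):
    """Brute force: n separate least-squares fits, each leaving one point out."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        w, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        out[i] = (y[i] - X[i] @ w) ** 2
    return out

def loo_sq_errors_single_fit(X, y):
    """One fit: squared residuals rescaled by (1 - H_ii)."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    resid = y - H @ y
    return (resid / (1.0 - np.diag(H))) ** 2
```

Both functions return the same <math>\,n</math> squared errors, but the second trains the model only once.<br />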
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as a way to obtain a robust neural network that is insensitive to noise. Since the number of hidden units and layers in a neural network is usually chosen from domain knowledge, the network may easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weight is in the vicinity of zero, the operative part of the activation function is approximately linear, and the network then collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights to values close to zero.<br />
<br />
Formally, we penalize nonlinear weights by adding a penalty term in the error function. Now the regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
Usually, a <math>\,\lambda</math> that is too large will shrink the weights <math>\,w_i</math> and <math>\,u_{jk}</math> too much. We can use cross-validation to estimate <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the linearity. Actually, we may also set <math>\,\lambda</math> by cross-validation. The tuning parameter is important since weights of zero will lead to zero derivatives and the algorithm will not change. On the other hand, starting with weights that are too large means starting with a nonlinear model which can often lead to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
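The penalized update rule above can be illustrated on the simplest possible case, a linear model, so that <math>\,err</math> is a plain squared error and there is a single weight vector (the data, step size <math>\,\rho</math>, and function name below are made up for illustration):<br />

```python
import numpy as np

def train_with_weight_decay(X, y, lam, rho=0.05, steps=2000):
    """Gradient descent with the update w <- w - rho*(d err/d w + 2*lam*w)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad_err = 2.0 * X.T @ (X @ w - y) / len(y)   # d err / d w
        w = w - rho * (grad_err + 2.0 * lam * w)      # weight-decay term
    return w
```

A larger <math>\,\lambda</math> shrinks the weights toward zero, trading training error for a more nearly linear (simpler) model, which is exactly the trade-off described above.<br />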
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with an output layer and a single hidden layer, with weights from the hidden layer to the output layer, and can be trained without back propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data. The radii of the RBF functions may differ, and the centers and radii can be determined through clustering or an EM algorithm. When the vector <math>x</math> arrives from the input layer, each hidden neuron computes the radial distance from the neuron's center point and applies the RBF function to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
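A minimal sketch of what each hidden neuron computes, using the Gaussian basis above (the centers, width and data are hypothetical):<br />

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """Phi[i, j] = exp(-||x_i - mu_j||^2 / (2 sigma^2)): the response of
    hidden neuron j (centered at mu_j) to input point i."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))
```

A point sitting exactly on a center gets response 1 from that neuron; responses decay toward 0 as the radial distance grows.<br />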
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
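Putting the pieces together, the closed-form solution can be sketched on a 1-d example. The target function, centers and width below are made-up illustrations, not from the lecture:<br />

```python
import numpy as np

def rbf_design(x, centers, sigma):
    """n-by-M matrix of Gaussian basis responses for 1-d inputs."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * sigma ** 2))

def fit_rbf(x, y, centers, sigma):
    """Closed-form least-squares weights W = (Phi^T Phi)^{-1} Phi^T Y."""
    Phi = rbf_design(x, centers, sigma)
    W = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
    return W, Phi @ W            # weights and fitted values Y_hat = Phi W
```

No iterative training (back-propagation) is needed: a single linear solve recovers the weights.<br />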
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the (<math>n \times k</math>) matrix of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the (<math>n \times (M+1)</math>) matrix of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the (<math>(M+1) \times k</math>) matrix of weights.<br />
<br />
where the extra basis function <math>\phi_{0}</math> (the first column of <math>\Phi</math>) is identically 1, providing the bias.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function. Giving the familiar form,<br /><br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
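The normalization step is a one-liner once the matrix of basis responses is available (the function name and test values are illustrative):<br />

```python
import numpy as np

def normalize_rbf(Phi):
    """Phi*_j(x) = Phi_j(x) / sum_r Phi_r(x): each row now sums to one."""
    return Phi / Phi.sum(axis=1, keepdims=True)
```

After normalization the hidden responses for each input behave like a discrete probability distribution over the <math>M</math> basis functions, which is what makes the probabilistic reading of the network in the next section natural.<br />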
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit can then be thought of as representing a feature; if there are more hidden units than input units, we essentially project to a higher-dimensional space, as we did in our earlier trick. This does not mean that an RBF network literally does this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Nevertheless, precisely because of this expressive power, the problem of overfitting becomes more important: we have to control the model's complexity so that it fits a general pattern rather than an arbitrary training set.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can replace the class conditional density in the above conditional probability expression by marginalizing <math>\,x</math> over <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
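The generative story in the note above can be sketched by sampling: first choose a component <math>m</math> with probability <math>\alpha_m</math>, then draw from its Gaussian. The particular mixture below is invented for illustration:<br />

```python
import numpy as np

def sample_gaussian_mixture(alphas, mus, sigmas, n, rng):
    """Draw n points from sum_m alpha_m * N(mu_m, sigma_m^2)."""
    comps = rng.choice(len(alphas), size=n, p=alphas)   # which Gaussian fired
    return mus[comps] + sigmas[comps] * rng.standard_normal(n)

rng = np.random.default_rng(0)
alphas = np.array([0.3, 0.7])
mus = np.array([0.0, 5.0])
sigmas = np.array([1.0, 1.0])
x = sample_gaussian_mixture(alphas, mus, sigmas, 100000, rng)
```

The resulting sample is not Gaussian as a whole; its mean is the mixture of the component means, <math>\sum_m \alpha_m \mu_m</math>.<br />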
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When using an RBF network for classification, we usually treat it as a regression problem and set a threshold to decide a data point's class membership. However, to gain insight into what the RBF network is doing as a classifier, we often think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as shown in the graph on the right hand side, that we have three random variables, <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that each class triggers a different hidden random variable <math>\displaystyle j</math>. To make this concrete, we can assume that each <math>\displaystyle j</math> corresponds to a Gaussian distribution (it could be any other distribution as well), all of the same form but with different parameters. From the Gaussian triggered by each class, we sample some data points. In the end, we obtain a data set that is not strictly Gaussian, but is actually a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
We multiply each term of the summation by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = Pr(y_{k})\sum_{j} \frac {Pr(x | j)*Pr(j | y_{k})}{Pr(x)} \cdot \frac {Pr(j)}{Pr(j)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, this expresses the class posterior as a sum over <math>\displaystyle j</math> of the product of two posteriors: the posterior of the hidden variable given the data, and the posterior of the class given the hidden variable.<br />
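The identity can be checked numerically on a small discrete model with the same graph structure <math>y \rightarrow j \rightarrow x</math> (the probability tables below are arbitrary illustrative numbers):<br />

```python
import numpy as np

# Chain y -> j -> x with arbitrary illustrative probability tables.
p_y = np.array([0.4, 0.6])                    # Pr(y)
p_j_given_y = np.array([[0.7, 0.3],           # Pr(j | y), rows indexed by y
                        [0.2, 0.8]])
p_x_given_j = np.array([[0.9, 0.1],           # Pr(x | j), rows indexed by j
                        [0.3, 0.7]])

# Full joint Pr(y, j, x) and the direct posterior Pr(y | x).
joint = p_y[:, None, None] * p_j_given_y[:, :, None] * p_x_given_j[None, :, :]
p_x = joint.sum(axis=(0, 1))
direct = joint.sum(axis=1) / p_x              # Pr(y | x)

# The decomposed form: sum_j Pr(y | j) Pr(j | x).
p_jx = joint.sum(axis=0)                      # Pr(j, x)
p_j = p_jx.sum(axis=1)                        # Pr(j)
p_j_given_x = p_jx / p_x                      # Pr(j | x)
p_y_given_j = joint.sum(axis=2) / p_j         # Pr(y | j)
decomposed = p_y_given_j @ p_j_given_x        # sum over j
```

The two posteriors agree exactly, because in this graph <math>y</math> and <math>x</math> are conditionally independent given <math>j</math>, which is the assumption the derivation above relies on.<br />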
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results that we derived above to our RBF Network. In a RBF Network, as we can see on the right hand side, we have a set of data, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, and the hidden basis function, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and then we have some output, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. Also, we have weights from the hidden layer to output layer. The output is just the linear sum of <math>\displaystyle \phi</math>’s. <br />
<br />
Now consider probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weights <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in one dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It's as if we put some Gaussian over data. And for each Gaussian, we consider the center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
We can see the graph on the left which plots the density of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance, if the point gets far from the center <math>\displaystyle \mu_{1}</math>, then it will reduce <math>\displaystyle \phi_{1}</math> to become nearly zero. Remember that, we can usually find a non-linear regression or classification of input space by doing a linear one in some extended space or some feature space (more details in Aside). Here, the <math>\displaystyle \phi</math>’s actually produce that feature space. <br />
<br />
So, one way to look at this is that this <math>\displaystyle \phi</math> is telling us that given an input, how likely the probability of presence of a particular feature is. Say, for example, we define the features as the centers of these Gaussian distributions. Then, this <math>\displaystyle \phi</math> function somehow computes the possibility given certain data points, of this kind of feature appearing. If the data point is right at the center, then the value of that <math>\displaystyle \phi</math> would be one, i.e. the probability is 1. If the point is far from the center, then the probability (<math>\displaystyle \phi</math> function value) will be close to zero, that is, it’s less likely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given data. <br />
<br />
When we have those features, then <math>\displaystyle y</math> is the linear combination of the features. Hence, any of the weights <math>\displaystyle w</math>, which is equal to <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> will appear given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> shows the probability of class membership given feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose, we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>. Each data point has d features. And these n data points consist of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we don’t perform Least Square on this <math>\displaystyle X</math> matrix, we do Least Square on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point with its center. Hence, the new set of features are actually representing M centers in our data set, and for each data point, its new features check how this point is similar to the first center; how it is similar to the second center; and how it is similar to the <math>\displaystyle M^{th}</math> center. And this checking process will apply to all data points. Therefore, feature space gives another representation of our data set. <br />
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit to the training data, we often want to increase the complexity of our RBF network. By construction, the only way to change the complexity of an RBF network is to add or remove basis functions: a larger number of basis functions yields a more complex network. In theory, with enough basis functions the RBF network can fit any training set exactly; however, that does not mean such a model will generalize well. Therefore, to avoid overfitting (see Notes below), we only want to increase the number of basis functions to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is estimate the training error. Working through the training error, we will see that it can in fact be decomposed, and one of its components is related to the Mean Squared Error (MSE). In the later notes, we will find that our final goal is to obtain a good estimate of the MSE; to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn't always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model fits only the training data and is useless for prediction when new data points are introduced.<br />
<br />
> n <- 20;<br />
> x <- seq(1, 10, length=n);<br />
> alpha <- 2.5;<br />
> beta <- 1.75;<br />
> y <- alpha + beta*x + rnorm(n);   # linear model plus N(0,1) noise<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');  # the true linear model<br />
> lines(spline(x, y), col = 2);     # interpolating spline: over-fits the noise<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on training data, although our goal really is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on the testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
But training error and testing error do not have a simple monotone relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error can increase dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to the training data at the expense of losing generality. In the extreme case, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, it does not represent the features of the true underlying data source, and this is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data: a model adapts to the training data, so the training error is an overly optimistic estimate of the testing error. An obvious way to estimate the testing error well is to add a penalty term to the training error to compensate for this optimism. SURE is developed from exactly this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the total squared error over the <math>N</math> training points.<br />
*<math>\displaystyle Err=\sum_{i=1}^M (\hat y_i-y_i)^2 </math> denote the ''test error'', the corresponding quantity on an independent test sample of size <math>M</math>.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations are <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}^N_{i=1}</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>, then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, where <math>\displaystyle y_i</math> and <math>\hat f_i</math> both have the same mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E[g'(Z)]</math>.<br />
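The lemma is easy to check numerically. The following Monte Carlo sketch is an illustration only, not part of the derivation; the choices <math>g(z)=z^2</math>, <math>\mu=1</math>, <math>\sigma=2</math> are arbitrary:

```python
import numpy as np

# Monte Carlo check of Stein's lemma: E[g(Z)(Z - mu)] = sigma^2 * E[g'(Z)]
# for g(z) = z^2 (so g'(z) = 2z), with Z ~ N(mu, sigma^2).
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
Z = rng.normal(mu, sigma, size=2_000_000)

lhs = np.mean(Z**2 * (Z - mu))      # Monte Carlo estimate of E[g(Z)(Z - mu)]
rhs = sigma**2 * np.mean(2 * Z)     # Monte Carlo estimate of sigma^2 * E[g'(Z)]
exact = sigma**2 * 2 * mu           # closed form: both sides equal 8 here

print(lhs, rhs, exact)
```

For this choice of <math>g</math> both sides equal <math>2\mu\sigma^2=8</math> in expectation, and the two sample averages agree up to Monte Carlo error.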
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>: Let <math>\,Z = \epsilon_i</math>. Then <math>g(Z) = \hat f_i-f_i</math>, since <math>\,y_i = f_i + \epsilon_i</math> and <math>\,f_i</math> is a constant. So <math>\,\mu = 0</math> and <math>\,\sigma^2</math> is the variance of <math>\,\epsilon_i</math>.<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model, not a function of the observations <math>\displaystyle y_i</math>, we have <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
<br />
====Two Different Cases====<br />
For SURE applied to RBF networks, see<br />
[http://www.cs.ualberta.ca/~papersdb/uploaded_files/801/paper_automatic-basis-selection-for.pdf Automatic basis selection for RBF networks using Stein’s unbiased risk estimator, Ali Ghodsi and Dale Schuurmans]<br />
<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is used to assess the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent, so <math>\displaystyle cov(y_i,\hat f)=0</math>. (Equivalently, <math>\frac{\partial \hat f}{\partial y_i}=0</math>: since <math>\hat f</math> is estimated from the training data alone, it does not depend on the new point <math>\,y_i</math>.) Therefore <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Based on the notation defined above, we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross validation, to avoid overfitting or underfitting, the validation data set is kept independent of the data used to estimate the model.<br />
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing the model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored, because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can, however, be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the number of basis functions should be chosen to minimize the estimated generalization error. For the Radial Basis Function Network, setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Using the cyclic property of the trace, <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=d</math>, where <math>\displaystyle d</math> is the number of columns of <math>\displaystyle \Phi</math>, i.e. the number <math>\,M</math> of basis functions onto which the input matrix <math>\,X</math> is projected. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then, <math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
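The trace identity <math>\,Trace(H)=d</math> used above can be checked numerically. In this sketch the sizes (<math>N=50</math> samples, <math>d=7</math> basis functions) are arbitrary:

```python
import numpy as np

# Numerical check that trace(H) equals the number of columns of Phi,
# where H = Phi (Phi^T Phi)^{-1} Phi^T is the hat matrix.
rng = np.random.default_rng(1)
N, d = 50, 7
Phi = rng.normal(size=(N, d))     # random design matrix (full rank a.s.)

H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
print(np.trace(H))                # = d = 7, up to floating-point error
```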
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimal number of basis functions by choosing the model with the smallest estimated MSE over the set of models considered. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\,M</math>, we compute the training error <math>\displaystyle err(M)</math> for each.<br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples and the noise variance <math>\sigma^2</math> can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Figure 27.1 on the right shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math> (here <code>err</code> is the total squared training error):<br />
<br />
err = sum((output - expected_output) .^ 2);<br />
sure_Err = err - num_data_point * sigma ^ 2 + 2 * sigma ^ 2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated from the training error:<br />
<br />
err = sum((output - expected_output) .^ 2);   % total squared training error<br />
sigma2 = err / (num_data_point - 1);          % estimate of the noise variance<br />
sure_Err = err - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) <math>= \sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) <math>= \sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of basis functions plus intercept) <math>= \sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then the term <math>\displaystyle N\sigma^2</math> is a constant and has no impact on the minimization, so we can ignore it. We only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and the estimate <math>\,\hat \sigma</math> changes as <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math> as follows.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then,<br />
<br />
<math>\displaystyle \hat\sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1) to get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error decreases and the MSE increases as the number of hidden units grows (i.e. as the model becomes more complex).<br />
<br />
<br />
As the number of hidden units grows, the training error decreases until it approaches <math>\displaystyle 0 </math>. If the training error equals <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, (28.3) says the estimate of the MSE approaches <math>\displaystyle 0 </math> as well. In fact this does not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the true MSE increases rather than approaching <math>\displaystyle 0 </math>. We can see this in Figure 28.1.<br />
<br />
<br />
Note that <math>\displaystyle \hat\sigma^2 </math> is essentially an average of <math>\displaystyle err </math>. To deal with this problem, we can average the variance estimate over the candidate models, for example over models with 1 up to 10 hidden units. Since in reality <math>\, \sigma^2</math> is a property of the data and does not depend on <math>\,M+1</math>, using an averaged <math>\,\hat\sigma^2</math> value across the candidate models is well justified.<br />
<br />
We can also see that, unlike the classical Cross Validation (CV) or Leave-one-out (LOO) techniques, the SURE technique does not need a held-out validation set to find the optimal model. Hence, SURE uses less data than CV or LOO, which makes it suitable when there is not enough data for validation. However, to implement SURE we need <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
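A minimal sketch of the whole selection procedure, in Python rather than MATLAB, is given below. The data, the evenly spaced centers (the lecture uses K-means centers), the RBF width, and the candidate range 1..10 are all illustrative assumptions:

```python
import numpy as np

# Sketch of SURE-based selection of the number of RBF basis functions.
rng = np.random.default_rng(2)
N = 60
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)   # toy regression data

def design(x, M, s=0.15):
    """Gaussian RBF design matrix with an intercept column (M+1 columns)."""
    centers = np.linspace(0, 1, M)
    Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s**2))
    return np.column_stack([np.ones_like(x), Phi])

Ms = list(range(1, 11))
errs = []
for M in Ms:
    Phi = design(x, M)
    W, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # least squares weights
    errs.append(np.sum((Phi @ W - y) ** 2))         # training error err(M)

# Average the noise-variance estimate err/(N-1) over the candidate models,
# as suggested above, instead of using a per-model estimate.
sigma2 = np.mean([e / (N - 1) for e in errs])
mse_hat = [e - N * sigma2 + 2 * sigma2 * (M + 1) for M, e in zip(Ms, errs)]
best_M = Ms[int(np.argmin(mse_hat))]
print(best_M)
```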
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering Kmeans clustering] is a method of cluster analysis which aims to partition <math>n</math> observations into <math>k</math> clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units is the same as the number of clusters: each cluster <math>j</math> contributes one basis function <math>\displaystyle \phi_{j} </math>.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, with the same functional form for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. for each center, reassign the cluster memberships (every point in a cluster should be closer to that cluster's center than to any other center).<br />
<br />
2. compute the mean of each cluster and make it the new center of that cluster.<br />
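The two alternating steps can be sketched in a few lines of Python (a minimal illustration; MATLAB's built-in <code>kmeans</code>, used in the example below, handles initialization and convergence more robustly):

```python
import numpy as np

# Minimal K-means: alternate (1) assign each point to its nearest center
# and (2) recompute each center as the mean of its cluster.
def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                       # step 1: reassign points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):                   # converged
            break
        centers = new                                   # step 2: update means
    return labels, centers
```

Applied to two well-separated blobs, this recovers the two groups and their means.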
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden values)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
>> 30 1<br />
>> size(C) <br />
>> 2 80<br />
>> size(sumD) <br />
>> 2 1<br />
>> c1=sum(IDX==1)<br />
>> 14<br />
>> c2=sum(IDX==2)<br />
>> 16<br />
>> sumD<br />
>> 85.6643<br />
>> 101.0419<br />
>> v1=sumD(1,1)/c1 <br />
>> 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
>> 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points in 80 dimensions, and then apply the “kmeans” method to separate <math>X</math> into 2 clusters. IDX is a vector of 1s and 2s indicating the cluster of each point, and its size is 30*1. <math>\displaystyle C </math> contains the centers (means) of the clusters, with size 2*80; sumD is the sum of squared distances between the data points and the center of their cluster. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the numbers of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the variance of the first cluster <math>\displaystyle (v1=\sigma_1^2)</math>; <math>\displaystyle v2 </math> is the variance of the second cluster <math>\displaystyle (v2=\sigma_2^2)</math>. Now we can get <math>\displaystyle \Phi </math>, <math>\displaystyle W </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> from the following equations. Finally, we compute the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is the EM algorithm for the Gaussian mixture model. Generally speaking, the Gaussian mixture model gives a soft clustering while <math>K</math>-means gives a hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
In the E-step, each point is assigned a weight (responsibility) for each cluster based on the likelihood under the corresponding Gaussian. Unlike <math>K</math>-means, where a point is assigned 1 for the closest cluster and 0 for the others, these weights are soft values between 0 and 1. <br />
<br />
In the M-step, compute the weighted means and covariances and use them as the new means and covariances for each cluster.<br />
<br />
>> [P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
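As an illustration of the soft E-step assignments (a hedged Python sketch; <code>mdgEM</code> above is a course-provided routine whose internals are not reproduced here, and all parameter values below are made up):

```python
import numpy as np

# One E-step of a 1-D Gaussian mixture: each point gets a soft weight
# (responsibility) for each component, proportional to mixing weight
# times Gaussian likelihood.
def e_step(x, means, sigmas, priors):
    # likelihood of each point under each component: shape (n, k)
    lik = (np.exp(-(x[:, None] - means[None, :]) ** 2 / (2 * sigmas**2))
           / (np.sqrt(2 * np.pi) * sigmas))
    r = priors * lik
    return r / r.sum(axis=1, keepdims=True)   # normalize rows to sum to 1

x = np.array([-2.0, 0.0, 2.0])
r = e_step(x, means=np.array([-2.0, 2.0]),
           sigmas=np.array([1.0, 1.0]),
           priors=np.array([0.5, 0.5]))
print(r)   # rows sum to 1; the middle point is split 50/50
```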
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier: they construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized into what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a high-dimensional, transformed version of the feature space.<br />
<br />
The original basis for the SVM was published in the 1960s by [http://en.wikipedia.org/wiki/Vapnik Vapnik], Chervonenkis and colleagues; however, the ideas did not gain much attention until strong results were demonstrated in the early 1990s.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a maximum-margin hyperplane, or set of hyperplanes, in a high- or infinite-dimensional space. The set of points near the class boundaries, the support vectors, define the model, which can be used for classification, regression or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separated by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
We can see that the data points, belonging to two classes in <math>\displaystyle \mathbb{R}^{2} </math>, can be separated by a linear boundary. If a dataset is indeed linearly separable, then there exist infinitely many possible separating hyperplanes for the training data, including the two black lines in the figure. The question is: which solution will be best when new data are introduced? <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: <math>\displaystyle d_i</math> is a signed quantity: it is positive if the data point is on the <math>\displaystyle +1 </math> side and negative if it is on the <math>\displaystyle -1 </math> side.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>; we can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers in the Linearly separable case====<br />
Choose the line farthest from both classes, i.e. the line with the maximum distance from its closest point (maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since the points are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Let us rewrite <math>\displaystyle Margin=min\{y_id_i\}</math> by using the following properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Take two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane. Then:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle (\beta^{T}x_1+\beta_0)-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence, <math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math> (any vector lying in the hyperplane), and <math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the unit vector normal to the hyperplane.<br /><br />
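Property 1 can be verified on a concrete hyperplane (a toy numerical check; the hyperplane <math>x_1+2x_2-3=0</math> and the two points on it are arbitrary choices):

```python
import numpy as np

# Property 1: beta is orthogonal to any vector lying in the hyperplane
# beta^T x + beta_0 = 0. Example hyperplane: x1 + 2*x2 - 3 = 0.
beta, beta0 = np.array([1.0, 2.0]), -3.0
x1, x2 = np.array([3.0, 0.0]), np.array([1.0, 1.0])   # both on the hyperplane

assert beta @ x1 + beta0 == 0 and beta @ x2 + beta0 == 0
print(beta @ (x1 - x2))   # 0.0: beta is orthogonal to x1 - x2
```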
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
So for any point on the hyperplane, <math>\displaystyle \beta^{T}x_0</math> gives the negative of the intercept of the hyperplane. <br/><br />
<br />
<br />
3. The signed distance from any point <math>\displaystyle x_i </math> to the hyperplane is <math>\displaystyle d_i=\beta^{*T}(x_i-x_0)</math>, where <math>\displaystyle x_0 </math> is any point on the hyperplane. <br/>Since only the direction of <math>\displaystyle \beta </math> matters, we use the unit normal <math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|}</math>:<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
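Property 3 can be checked on the same kind of toy example: the formula <math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}</math> agrees with projecting <math>\displaystyle (x_i-x_0)</math> onto the unit normal (all numbers below are arbitrary):

```python
import numpy as np

# Property 3: the signed distance from x to the hyperplane beta^T x + beta_0 = 0
# is (beta^T x + beta_0) / ||beta||. Checked against the projection of
# (x - x0) onto the unit normal, for a point x0 on the hyperplane.
beta, beta0 = np.array([1.0, 2.0]), -3.0
x0 = np.array([3.0, 0.0])                 # lies on the hyperplane
x = np.array([3.0, 2.0])                  # an off-hyperplane point

d_formula = (beta @ x + beta0) / np.linalg.norm(beta)
d_geometric = (beta / np.linalg.norm(beta)) @ (x - x0)
print(d_formula, d_geometric)             # both equal 4/sqrt(5)
```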
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
We had <math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math>, and since we now know how to compute <math>\displaystyle d_i \Rightarrow</math> <br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane. Then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta^{T} </math> only the direction is important, so dividing by <math>\displaystyle c </math> does not change its direction, and the hyperplane stays the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie, T., Tibshirani, R., Friedman, J. (2008). The Elements of Statistical Learning: 129-130<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
We have now derived the Support Vector Machine for the case where the two classes are separable in the given feature space. The margin can be written as <math>\,min\{y_id_i\}</math>, the distance of the closest point from the hyperplane, where <math>\,d_i</math> is the distance and <math>\,y_i</math> supplies the sign.<br />
===Margin Maximizing Problem for the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, but <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane.<br />
<br />
This implies <math>\,\exists C>0</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> define a hyperplane; they can take different values, but only the direction matters, and dividing through by a constant does not change the direction of the hyperplane. Thus, by assuming scaled values for <math>\,\beta, \beta_0</math> we eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>, implying that the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math>.<br />
<br />
Now, in order to maximize the margin, we simply need to maximize <math>\,\frac{1}{\|\beta\|}</math>, i.e. minimize <math>\,\|\beta\|</math>. <br />
<br />
In other words, our optimization problem is to minimize <math>\,\|\beta\|</math>, under the constraint that <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. There are many possible choices of norm, in general the [http://en.wikipedia.org/wiki/P-norm#p-norm p-norm]. The 1-norm of a vector is the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance); it tends to produce sparser solutions, but has a discontinuity in the derivative. The 2-norm, or Euclidean norm (the intuitive measure of the length of a vector), is easier to work with: <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>, where the constant 1/2 is added for simplification; minimizing the squared norm is equivalent to minimizing the norm itself.<br />
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the constraints are satisfied, as well as to find an optimal solution (the optimal saddle point of the Lagrangian for this classic quadratic optimization). The problem will be solved in the dual space by introducing the dual variables <math>\,\alpha_i</math>, in contrast to solving the problem in the primal space as a function of the betas. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,L(\alpha)=\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left(y_i(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0) - 1\right)}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is the dual representation of the maximum-margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \ \forall i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the equation that we need to minimize, as well as two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are looking for the maximizing <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>\max_{\alpha}\; L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* <math>\,S</math> is the <math>\,n \times n</math> matrix with entries <math>\,S_{ij} = y_iy_jx_i^Tx_j = (y_ix_i)^T(y_jx_j)</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's built in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use <code>quadprog</code> to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's <code>quadprog</code> function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. One point needs care: <code>quadprog</code> ''minimizes'' its objective, while we want to ''maximize'' <math>\,L(\alpha)</math>. Maximizing <math>\underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math> is equivalent to minimizing <math>\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{\alpha}^T\underline{1}</math>, so we pass <math>\,H = S</math> and <math>\,f = -\underline{1}</math>.<br />
<br />
We'll use a simple one-dimensional data set: two classes of points centred at -1 and +1 with Gaussian noise, labelled <math>\,y = -1</math> and <math>\,y = +1</math> respectively. (Note: you could put the values straight into the <code>quadprog</code> call; they are separated here for clarity.)<br />
<br />
 x = [mvnrnd(-1,0.01,100); mvnrnd(1,0.01,100)]; % 200x1 column of 1-D inputs<br />
 y = [-ones(100,1); ones(100,1)];
 S = (y.*x)*(y.*x)';   % S(i,j) = y_i*y_j*x_i*x_j<br />
 f = -ones(200,1);     % quadprog minimizes (1/2)*a'*S*a + f'*a<br />
 Aeq = y';             % equality constraint: sum(alpha.*y) = 0<br />
 beq = 0;<br />
 lb = zeros(200,1);    % alpha >= 0<br />
 ub = [];              % there is no upper bound<br />
 alpha = quadprog(S,f,[],[],Aeq,beq,lb,ub);<br />
<br />
This gives us the optimal <math>\,\alpha</math>. Most of the resulting <math>\,\alpha_i</math> are (numerically) zero; the few strictly positive entries correspond to the support vectors discussed below.<br />
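The same setup can be cross-checked outside Matlab. Below is a sketch in Python (assuming <code>numpy</code> is available; variable names mirror the Matlab ones) that builds <math>\,S</math> for the toy data set and confirms the properties a QP solver relies on: symmetry and positive semi-definiteness.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes of 1-D points centred at -1 and +1 (mirrors the Matlab example).
x = np.concatenate([rng.normal(-1, 0.1, 100), rng.normal(1, 0.1, 100)])
y = np.concatenate([-np.ones(100), np.ones(100)])

# S(i,j) = y_i * y_j * x_i * x_j; for 1-D inputs the inner product is a product.
yx = y * x
S = np.outer(yx, yx)

# quadprog minimizes (1/2)*a'*S*a + f'*a, so maximizing the dual needs f = -1.
f = -np.ones(200)

# Properties the QP relies on: S is symmetric and positive semi-definite.
assert np.allclose(S, S.T)
assert np.linalg.eigvalsh(S).min() > -1e-8  # PSD (here rank one, so mostly zeros)
```

Since <math>\,S</math> is only positive ''semi''-definite, the dual has many optima along directions of zero curvature; the constraints pin the solution down.<br />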
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0,\; i = 1, \dots, n</math>. If <math>\,f</math> and the <math>\,g_i</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial \hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum{\alpha_ig_i'(\hat{x})}=0</math> (Stationarity)<br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> cannot be a local minimum.<br />
<br />
The most interesting of these is condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Intuitively, the support vectors are the training points that actually determine the optimal separating hyperplane we are looking for. They are also the most difficult points to classify, and at the same time the most informative ones for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>.<br />In order for this condition to be satisfied, either<br /><math>\,\alpha_i= 0</math> or<br /><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
By primal feasibility, every training point satisfies <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1</math>; that is, each point lies on or outside the margin, never inside it.<br />
<br />
'''Case 1: a point off the margin, <math>\,y_i(\beta^Tx_i+\beta_0) > 1</math>'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point on the margin, <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math>'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
If the corresponding <math>\,\alpha_i>0</math>, then point <math>\, x_i</math> lies on the margin.<br />
<br />
<br />
Points on the margin, with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
<br />
===Using support vectors===<br />
<br />
Support vectors are important because they make the support vector machine insensitive to points far from the boundary. If <math>\,\alpha_i = 0</math>, then point <math>\,x_i</math> contributes nothing to the solution of the SVM problem; only points on the margin -- the support vectors -- contribute. Hence the model given by the SVM is entirely defined by the set of support vectors, a subset of the training set. This is interesting because in earlier neural-network methods (and, more generally, in classical statistical learning) the configuration of the model had to be specified in advance. Here we have a data-driven, 'nonparametric' model: the training set and the algorithm together determine the support vectors.<br />
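The sparsity can be made concrete with a tiny hand-checked sketch (Python; the three 1-D points and their multipliers below are hand-derived for illustration, not solver output). Dropping every point with <math>\,\alpha_i = 0</math> leaves the decision function unchanged:<br />

```python
# Hand-derived toy solution (1-D): support vectors at x = -1 and x = +1;
# the point at x = -2 is outside the margin, so its multiplier is zero.
xs     = [-2.0, -1.0, 1.0]
ys     = [-1.0, -1.0, 1.0]
alphas = [ 0.0,  0.5, 0.5]   # illustrative values, not solver output
beta0  = 0.0

def decision(x, points):
    """f(x) = sum_i alpha_i * y_i * <x_i, x> + beta_0 over the given points."""
    return sum(a * y * xi * x for xi, y, a in points) + beta0

everything = list(zip(xs, ys, alphas))
only_svs   = [(xi, y, a) for xi, y, a in everything if a > 0]

# The alpha_i = 0 point drops out: both sums give the same decision values.
for x in [-3.0, -0.2, 0.7, 2.5]:
    assert decision(x, everything) == decision(x, only_svs)
```

Here only two of the three training points need to be stored to classify any future input.<br />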
<br />
References:<br />
Wang, L. (2005). ''Support Vector Machines: Theory and Applications''. Springer, p. 3.<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
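On a data set small enough to solve by hand, the three steps can be sketched end-to-end (a Python illustration standing in for <code>quadprog</code>; with two points <math>\,x = \pm 1</math>, the constraint <math>\sum{\alpha_iy_i}=0</math> forces <math>\,\alpha_1 = \alpha_2</math>, and the dual maximum is at <math>\,\alpha = 2/(x_1-x_2)^2</math>):<br />

```python
# Step 1 (solver stand-in): for x1 = -1 (y = -1) and x2 = +1 (y = +1), the
# equality constraint gives alpha1 = alpha2 = a, and the dual reduces to
# L(a) = 2a - (1/2) * a^2 * (x1 - x2)^2, maximized at a = 2 / (x1 - x2)^2.
x = [-1.0, 1.0]
y = [-1.0, 1.0]
a = 2.0 / (x[0] - x[1]) ** 2
alpha = [a, a]                      # both points are support vectors here

# Step 2: beta = sum_i alpha_i * y_i * x_i
beta = sum(ai * yi * xi for ai, yi, xi in zip(alpha, y, x))

# Step 3: pick a support vector (alpha_i > 0) and solve y_i*(beta*x_i + beta0) = 1.
i = 0                               # alpha[0] > 0, so x[0] is a support vector
beta0 = y[i] - beta * x[i]          # since y_i^2 = 1, this solves the equation

# Both training points sit exactly on the margin:
assert all(abs(yi * (beta * xi + beta0) - 1.0) < 1e-12 for xi, yi in zip(x, y))
```

For this symmetric data set the result is <math>\,\beta = 1,\ \beta_0 = 0</math>, i.e. the separating point is the midpoint of the two classes, as expected.<br />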
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab's built-in SVM routines (found in the Bioinformatics Toolbox) to do classification through support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]<br />
=='''Non-linear hypersurfaces and Non-Separable classes - November 20, 2009'''==<br />
==='''Kernel Trick'''===<br />
We talked about the curse of dimensionality at the beginning of this course; now we turn to the power of high dimensions in order to find a separating hyperplane between two classes of data points. To understand this, imagine a two-dimensional prison holding a two-dimensional person. If we magically give the person a third dimension, he can escape from the prison: the prison and the person are now separable with respect to the third dimension. The intuition behind the "kernel trick" is to map the data to a higher dimension so that the classes become linearly separable by a hyperplane.<br />
<br />
We have seen the SVM as a linear classification problem: finding the max-margin hyperplane in the given input space. However, many real-world problems require a more complex decision boundary. The following simple method was devised in order to solve the same linear classification problem, but in a (usually higher-dimensional) 'feature space' in which the max-margin hyperplane is better suited to the data.<br />
<br />
Let <math>\,\phi</math> be a mapping,<br />
<br />
<math>\phi:\Re^d \rightarrow \Re^D </math><br /><br /><br />
<br />
We wish to find a <math>\,\phi</math> such that the transformed data are suited for separation by a hyperplane. Given this function, we are led to solving the previous constrained quadratic optimization on the transformed data set,<br /><br /><br />
<br />
<math>\min_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br /><br /><br />
<br />
The solution to this optimization problem is now well known; however, a workable <math>\,\phi</math> must still be determined. Possibly the largest drawback of this method is that we must compute inner products of vectors in the high-dimensional feature space. As the dimension of the feature space increases, these inner products become computationally intensive or impossible to evaluate directly.<br />
<br />
However, we have a very useful result that says that there exists a class of functions, <math>\,\Phi</math>, which satisfy the above requirements and that for any function <math>\,\phi \in \Phi</math>,<br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j) </math><br /><br /><br />
<br />
Where K is the kernel function in the input space satisfying [http://en.wikipedia.org/wiki/Mercer%27s_condition Mercer's condition] (to guarantee that it indeed corresponds to certain mapping function <math>\,\phi</math>). As a result, if the objective function depends on inner products but not on coordinates, we can always use the kernel function to implicitly calculate in the feature space without storing the huge data. Not only does this solve the computation problems but it no longer requires us to explicitly determine a specific mapping function in order to use this method. In fact, it is now possible to use an infinite dimensional feature space in SVM without even knowing the function <math>\,\phi</math>.<br />
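The identity <math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j)</math> can be checked directly in a small case. For <math>\,x \in \Re^2</math>, the degree-2 polynomial kernel <math>\,K(x,z)=(x^Tz)^2</math> corresponds to the explicit map <math>\,\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)</math> -- a standard textbook example, sketched here in Python:<br />

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def k_poly(x, z):
    """The same quantity evaluated in the input space: K(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# The kernel and the explicit inner product agree on every pair of points.
for x, z in [((1.0, 2.0), (3.0, -1.0)), ((0.5, 0.0), (2.0, 2.0))]:
    assert abs(dot(phi(x), phi(z)) - k_poly(x, z)) < 1e-9
```

Note that evaluating <code>k_poly</code> costs the same regardless of how large the feature space is, while the explicit map grows with its dimension -- which is the whole point of the trick.<br />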
<br />
==='''Mercer's Theorem in detail'''===<br />
Let <math>\,\phi</math> be a mapping to a high dimensional [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] <math>\,H</math><br /><br />
<br />
<br />
<math>\phi:x \in \Re^d \rightarrow H </math><br /><br /><br />
<br />
The transformed coordinates can be defined as,<br /><br />
<br />
<math>\phi_1(x)\dots\phi_d(x)\dots </math><br /><br /><br />
<br />
By Hilbert - Schmidt theory we can represent an inner product in Hilbert space as,<br /><br /><br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = \sum_{r=1}^{\infty}a_r\phi_r(x_i)\phi_r(x_j) = K(x_i,x_j), \ a_r \ge 0 </math><br /><br /><br />
where K is symmetric; Mercer's theorem then gives necessary and sufficient conditions on K for it to admit such a representation.<br><br><br />
<br />
'''Mercer's Theorem'''<br />
<br />
Let C be a compact subset of <math>\Re^d</math> and K a function <math> \in L^2(C) </math>, if<br /><br /><br />
<br />
<math>\, \int_C\int_C K(u,v)g(u)g(v)dudv \ge 0, \ \forall g \in L^2(C)</math> <br /><br /><br />
<br />
then,<br /><br /><br />
<br />
<math>\sum_{r=1}^{\infty}a_r\phi_r(u)\phi_r(v)</math> converges absolutely and uniformly to a symmetric function <math>\,K(u,v)</math><br />
<br />
References:<br />
Vapnik, V. (1998). ''Statistical Learning Theory''. John Wiley & Sons, p. 423.<br />
<br />
==='''Kernel Functions'''===<br />
There are various kernel functions, for example:<br />
<br />
* Linear kernel: <math>\,k(x,y)=x \cdot y</math><br />
* Polynomial kernel: <math>\,k(x,y)=(x \cdot y)^d</math><br />
* Gaussian kernel: <math>\,k(x,y)=e^{-\frac{\|x-y\|^2}{2\sigma^2}}</math><br />
<br />
If <math>\,X</math> is a <math>\,d \times n</math> matrix in the original space, and <math>\,\phi(X)</math> is a <math>\,D \times n</math> matrix in the [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] (good explanation video: [http://www.youtube.com/watch?v=V2pBdH7YzX0 part 1] [http://www.youtube.com/watch?v=YRY5xlk3TC0 part 2]), then <math>\,\phi^T(X) \cdot \phi(X)</math> is an <math>\,n \times n</math> matrix. <br />
The inner product can be interpreted as a measure of correlation, or similarity, between data points. This gives us some insight into how to choose a kernel: the choice depends on prior knowledge of the problem and on how we believe the similarity of our data should be measured. In practice, the Gaussian (RBF) kernel often works well. Besides the common kernel functions mentioned above, many specialized kernels have been proposed for particular problem domains, such as text classification and gene classification.<br />
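To illustrate the similarity reading of the kernels above (a small Python sketch, not from the lecture): the Gaussian kernel's Gram matrix is symmetric with ones on the diagonal, and its entries shrink as points move apart.<br />

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-|x - z|^2 / (2 * sigma^2)) for 1-D inputs."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))

points = [-1.5, 0.0, 0.2, 3.0]
K = [[gaussian_kernel(a, b) for b in points] for a in points]

n = len(points)
assert all(K[i][i] == 1.0 for i in range(n))                    # self-similarity
assert all(K[i][j] == K[j][i] for i in range(n) for j in range(n))  # symmetry
assert K[1][2] > K[1][3]   # nearby points are more similar than distant ones
```

The bandwidth <code>sigma</code> controls how quickly the similarity decays; choosing it is part of the prior knowledge the text mentions.<br />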
<br />
These kernel functions can be applied to many algorithms to derive the "kernel version". For example, kernel PCA, kernel LDA, etc..<br />
<br />
==='''SVM: non-separable case'''===<br />
We have seen how SVMs find an optimal separating hyperplane between two separable classes of data, in which case the margin contains no data points. In the real world, however, data from different classes are usually mixed together near the boundary, and no hyperplane can separate them perfectly. To address this problem, we relax the classification rule to allow some data points to cross the margin. Mathematically the problem becomes,<br />
:<math>\min_{\beta, \beta_0} \frac{1}{2}|\beta|^2</math><br />
:<math>\,y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math><br />
:<math>\xi_i \geq 0</math><br />
<br />
Now each data point is allowed an error <math>\,\xi_i</math>. However, we only want points to cross the margin when they must, at minimum total sacrifice; thus, a penalty term is added to the objective function to limit the total amount by which points violate the margin. The optimization problem now becomes:<br />
<br />
:<math>\min_{\beta, \beta_0, \xi} \frac{1}{2}|\beta|^2+\gamma\sum_{i=1}^n{\xi_i}</math><br />
:<math>\,s.t.</math> <math>y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math> <br />
:<math>\xi_i \geq 0</math><br />
<br />
[[File:non-separable.JPG|350px|thumb|right|Figure non-separable case]]<br />
<br />
Note that <math>\,\xi_i</math> is not necessarily smaller than one, which means a point can not only enter the margin but can also cross the separating hyperplane.<br />
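The smallest slack consistent with the relaxed constraint is <math>\,\xi_i = \max(0,\ 1 - y_i(\beta^Tx_i+\beta_0))</math>. A quick one-dimensional Python sketch (with an illustrative hyperplane <math>\,\beta = 1, \beta_0 = 0</math>, not fitted from data) shows the three regimes:<br />

```python
def slack(x, y, beta=1.0, beta0=0.0):
    """Smallest xi satisfying y * (beta * x + beta0) >= 1 - xi, with xi >= 0."""
    return max(0.0, 1.0 - y * (beta * x + beta0))

assert slack( 2.0, 1.0) == 0.0   # outside the margin: no slack needed
assert slack( 0.5, 1.0) == 0.5   # inside the margin, still correctly classified
assert slack(-0.5, 1.0) == 1.5   # wrong side of the hyperplane: xi > 1
```

The third case is exactly the situation described above: a slack larger than one means the point has crossed the separating hyperplane itself, not just the margin.<br />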
<br />
References:<br />
<br />
Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209:415-446.<br />
<br />
==Support Vector Machine algorithm for non-separable cases - November 23, 2009==<br />
<br />
With the formulation above, we can form the Lagrangian, apply the KKT conditions, and arrive at the equation to optimize. As we will see, the objective that we optimize in the SVM algorithm for non-separable data sets is the same as in the separable case, with slightly different constraints.<br />
<br />
===Forming the Lagrangian===<br />
<br />
Introducing multipliers <math>\,\alpha_i \geq 0</math> for the constraints <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math> and <math>\,\lambda_i \geq 0</math> for the constraints <math>\,\xi_i \geq 0</math>, the Lagrangian is<br />
<br />
:<math>\,L = \frac{1}{2}|\beta|^2 + \gamma\sum_{i=1}^n{\xi_i} - \sum_{i=1}^n{\alpha_i[y_i(\beta^Tx_i+\beta_0)-1+\xi_i]} - \sum_{i=1}^n{\lambda_i\xi_i}</math><br />
<br />
===Applying KKT conditions===<br />
<br />
Setting the derivatives of the Lagrangian to zero gives<br />
<br />
:<math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0 \Rightarrow \beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math><br />
:<math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math><br />
:<math>\,\frac{\partial L}{\partial \xi_i} = \gamma - \alpha_i - \lambda_i = 0 \Rightarrow \gamma = \alpha_i + \lambda_i</math><br />
<br />
Complementary slackness requires <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1+\xi_i] = 0</math> and <math>\,\lambda_i\xi_i = 0</math>. Substituting <math>\,\beta</math> back into the Lagrangian yields the same dual objective <math>\,L(\alpha)</math> as in the separable case; and since <math>\,\lambda_i = \gamma - \alpha_i \geq 0</math>, the multipliers now satisfy <math>\,0 \leq \alpha_i \leq \gamma</math>.<br />
<br />
===Putting it all together===<br />
<br />
With our KKT conditions and the Lagrangian equation, we can now use quadratic programming to find <math>\,\alpha</math>.<br />
<br />
In matrix form, we want to solve the following optimization:<br />
:<math>\max_{\alpha}\; L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:<math>\,s.t.</math> <math>\underline{0} \leq \underline{\alpha} \leq \gamma</math>, <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Solving this gives us <math>\,\underline{\alpha}</math>, which we can use to find <math>\,\underline{\beta}</math> as before:<br />
:<math>\,\underline{\beta} = \sum{\alpha_i y_i \underline{x_i}}</math><br />
<br />
However, we cannot find <math>\,\beta_0</math> in the same way as before, even if we choose a point with <math>\,\alpha_i > 0</math>, because we do not know the value of <math>\,\xi_i</math> in the equation<br />
:<math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 + \xi_i = 0</math><br />
<br />
From our discussion on the KKT conditions, we know that <math>\,\lambda_i \xi_i = 0</math> and <math>\,\gamma = \alpha_i + \lambda_i</math>.<br />
<br />
So, if <math>\,\alpha_i < \gamma</math> then <math>\,\lambda_i > 0</math> and consequently <math>\,\xi_i = 0</math>.<br />
<br />
Therefore, we can solve for <math>\,\beta_0</math> if we choose a point where:<br />
:<math>\,0 < \alpha_i < \gamma</math><br />
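In code, the rule amounts to filtering the multipliers (a Python sketch with illustrative values, not solver output): <math>\,\alpha_i = 0</math> marks points outside the margin, <math>\,\alpha_i = \gamma</math> marks points that used slack, and any <math>\,0 < \alpha_i < \gamma</math> lies exactly on the margin, so it can be used to solve for <math>\,\beta_0</math>.<br />

```python
gamma = 1.0
# Illustrative multipliers for five points, not solver output:
alpha = [0.0, 0.35, 1.0, 0.6, 0.0]

# Margin support vectors: 0 < alpha_i < gamma implies lambda_i > 0 and xi_i = 0,
# so y_i * (beta' x_i + beta0) = 1 holds exactly for these points.
on_margin = [i for i, a in enumerate(alpha) if 0.0 < a < gamma]

# Margin violators: alpha_i = gamma leaves xi_i unknown, so they cannot
# be used to recover beta0.
violators = [i for i, a in enumerate(alpha) if a == gamma]

assert on_margin == [1, 3]   # safe choices for solving for beta0
assert violators == [2]      # unusable: their slack is unknown
```

Any single index from <code>on_margin</code> suffices; in practice, averaging <math>\,\beta_0</math> over all such points is a common way to reduce numerical error.<br />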
<br />
====The SVM algorithm for non-separable data sets====<br />
<br />
The algorithm, then, for non-separable data sets is:<br />
<br />
# Use <code>quadprog</code> (or another quadratic programming technique) to solve the above optimization and find <math>\,\alpha</math><br />
# Find <math>\,\underline{\beta}</math> by solving <math>\,\underline{\beta} = \sum{\alpha_i y_i x_i}</math><br />
# Find <math>\,\beta_0</math> by choosing a point where <math>\,0 < \alpha_i < \gamma</math> and then solving <math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 = 0</math></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=5506stat8412009-11-23T23:42:04Z<p>Ipargaru: /* SVM: non-separable case */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classfication-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data-mining, bioinformatics, and machine learning, classification has becomes a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to learn from data. Potential application areas include handwritten post codes recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
'''Definition''': The problem of Prediction a discrete random variable <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math> is called Classification.<br />
<br />
In classification,, we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors and <math> \mathcal{Y} </math>, a finite set of labels, We try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data which are identical independent distributions, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X{_\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> be the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier(h) is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate(training error rate)'''' of a classifier(h) is defined as the frequency that <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case that <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>. Consider the probability that <math>\,r(X)=P\{Y=1|X=x\}</math>. Given <math>\,X=x</math>, By ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<sub></sub><br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set fo classifier <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> (\hat r) </math> of the function <math> r </math> and deifne <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimension cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate, that is for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking this theorem is saying we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when<math></math> the probability of being of type <math>\,y</math> for <math>\,x</math> is more than probability of being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & if P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remark:<br />
<br />
1)Bayes classification rule is optimal. Proof:[http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2)We still need any other method, since we cannot define prior probability in realistic.<br />
<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifier based on Bayes Classifier: Naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN) and general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first one considers probability as changing based on observation while the second one considers probablity as having objective existence. Actually, they represent two different schools in statistics.<br />
<br />
During the history of statistics, there are two major classification methods : Bayesian and frequentist. The two methods represent two different ways of thoughts and hold different view to define probability. The followings are the main differences between Bayes and Frequentist.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample(there is a frequency).<br />
#Parameters are fixed and unknown constant.<br />
#Not applicable to single event. For example, a frequentist cannot predict the weather of tomorrow because tomorrow is only one unique event, and cannot be referred to a frequency in a lot of samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or beliefs. For example, Bayesian can predict tomorrow's weather, such as having the probability of <math>\,50%</math> of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In Bayesian method, at first, one can see this man (object), and then judge whether his name is Jack (label). On the other hand, in Frequentist method, one doesn’t see the man (object), but can see the photos (label) of this man to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Representing the optima method, Bayes classifier cannot be used in most practical situations though, since usually the prior probability is unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization:Choose a set fo classifier <math>\mathcal{H}</math> and find <math>\,h^* \epsilon H</math>, minimize some estimate of <math>\,L(H)</math>.<br />
<br />
2 Regression:Find an estimate <math> (\hat r) </math> of the function <math> r </math> and deifne <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density Estimation: Estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math>, then classify using Bayes' rule. <br />
<br />
The third approach, in this form, is not popular because density estimation doesn't work very well when the dimension is greater than 2.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
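The final line of the derivation gives the boundary coefficients explicitly, which can be checked numerically; the sketch below (a Python/NumPy illustration, not course code, with made-up means and covariance) confirms that with equal priors the midpoint of the two means lies on the boundary.<br />

```python
import numpy as np

def lda_boundary(mu_k, mu_l, Sigma, pi_k, pi_l):
    """Coefficients (a, b) of the boundary a^T x + b = 0 from the last line
    of the derivation: a = Sigma^{-1}(mu_k - mu_l) and
    b = log(pi_k/pi_l) - 0.5*(mu_k' Sigma^{-1} mu_k - mu_l' Sigma^{-1} mu_l)."""
    Sinv = np.linalg.inv(Sigma)
    a = Sinv @ (mu_k - mu_l)
    b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)
    return a, b

# Hypothetical means and shared covariance
mu_k = np.array([1.0, 1.0])
mu_l = np.array([4.0, 3.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

a, b = lda_boundary(mu_k, mu_l, Sigma, 0.5, 0.5)
midpoint = (mu_k + mu_l) / 2
# With equal priors, the midpoint of the two means lies exactly on the boundary.
print(abs(a @ midpoint + b) < 1e-9)   # prints True
```

Points on the class-<math>k</math> side give <math>\,a^\top x + b > 0</math> and points on the class-<math>l</math> side give a negative value, which is exactly how the linear rule assigns labels.<br />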
<br />
===QDA===<br />
The concept is the same idea of finding a boundary where the error rate for classification between classes are equal, except the assumption that each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math> is removed.<br />
<br />
<br />
Continuing from the step where QDA diverges from LDA:<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because, unlike in LDA, the two covariance matrices differ, so the quadratic term <math>\,x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x</math> does not cancel.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
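The two discriminant scores in the theorem can be written directly as code. The following Python/NumPy sketch (an illustration only; the lecture uses Matlab, and the means, covariances, and priors below are hypothetical) implements both and shows that when the covariances are equal the two rules agree.<br />

```python
import numpy as np

def delta_qda(x, mu, Sigma, pi):
    """Quadratic discriminant score delta_k(x) (general covariance)."""
    Sinv = np.linalg.inv(Sigma)
    d = x - mu
    return -0.5 * np.log(np.linalg.det(Sigma)) - 0.5 * d @ Sinv @ d + np.log(pi)

def delta_lda(x, mu, Sigma, pi):
    """Linear discriminant score (shared covariance Sigma)."""
    Sinv = np.linalg.inv(Sigma)
    return x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(pi)

def classify(x, mus, Sigmas, pis, delta):
    """h(x) = argmax_k delta_k(x)."""
    return int(np.argmax([delta(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]))

# Hypothetical data: two classes sharing the identity covariance.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
pis = [0.5, 0.5]

x = np.array([0.5, 0.5])
# With equal covariances the quadratic and linear rules agree.
print(classify(x, mus, Sigmas, pis, delta_qda),
      classify(x, mus, Sigmas, pis, delta_lda))   # prints 0 0
```

When the covariances are shared, <math>\,\delta_k</math> differs from the quadratic score only by terms that are the same for every class, which is why the argmax is unchanged.<br />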
<br />
===In practice===<br />
In practice the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we use their sample estimates in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
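The estimates above translate directly into code. This Python/NumPy sketch (an illustration with a tiny made-up sample, not course code) computes the ML estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> and the pooled covariance used by LDA.<br />

```python
import numpy as np

def estimate_parameters(X, y):
    """ML estimates of (pi_k, mu_k, Sigma_k) per class, plus the pooled
    covariance Sigma = sum_r n_r Sigma_r / sum_l n_l used by LDA."""
    n = len(y)
    params, pooled = {}, 0.0
    for k in np.unique(y):
        Xk = X[y == k]
        nk = len(Xk)
        mu = Xk.mean(axis=0)
        Sigma = (Xk - mu).T @ (Xk - mu) / nk   # ML estimate: divide by n_k
        params[k] = (nk / n, mu, Sigma)
        pooled = pooled + nk * Sigma
    return params, pooled / n

# Tiny hypothetical sample: three points of class 0, two of class 1.
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [5.0, 5.0], [7.0, 5.0]])
y = np.array([0, 0, 0, 1, 1])
params, pooled = estimate_parameters(X, y)
# For class 1: pi_1 = 2/5, mu_1 = [6, 5], Sigma_1 = [[1, 0], [0, 0]].
```

Note the division by <math>\,n_k</math> rather than <math>\,n_k-1</math>: these are the maximum likelihood estimates used in the derivation, not the unbiased sample covariances.<br />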
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
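Case 1 reduces to nearest-mean classification with a log-prior correction, which the following Python/NumPy sketch demonstrates (an illustration only; the centers and priors are made-up values).<br />

```python
import numpy as np

def classify_identity_cov(x, mus, pis):
    """When Sigma_k = I, delta_k reduces to -0.5*||x - mu_k||^2 + log(pi_k);
    pick the class whose prior-adjusted center is closest."""
    scores = [-0.5 * np.sum((x - mu) ** 2) + np.log(pi)
              for mu, pi in zip(mus, pis)]
    return int(np.argmax(scores))

mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]

# Equal priors: the boundary is the perpendicular bisector at x1 = 2.
print(classify_identity_cov(np.array([1.9, 0.0]), mus, [0.5, 0.5]))  # prints 0
# A larger prior on class 1 shifts the boundary toward class 0's mean.
print(classify_identity_cov(np.array([1.9, 0.0]), mus, [0.1, 0.9]))  # prints 1
```

The second call shows the effect of the <math>\,log(\pi_k)</math> term: the same point changes class when the prior on class 1 grows.<br />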
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
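The transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be verified numerically: applied to the covariance it produces the identity, so Case 1 applies to the transformed points. The Python/NumPy sketch below (an illustration with a made-up covariance) checks this.<br />

```python
import numpy as np

# A hypothetical covariance matrix
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])

# Eigendecomposition of the symmetric matrix: Sigma = U S U^T
S_vals, U = np.linalg.eigh(Sigma)
W = np.diag(S_vals ** -0.5) @ U.T   # the map x -> x* = S^(-1/2) U^T x

# The covariance of the transformed points is W Sigma W^T = I,
# so Euclidean distance to each transformed mean is the right metric.
print(np.round(W @ Sigma @ W.T, 10))
```

Algebraically, <math>\,W \Sigma W^\top = S^{-\frac{1}{2}}U^\top (U S U^\top) U S^{-\frac{1}{2}} = I</math>, matching the derivation above.<br />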
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to in order to choose its transformation. All classes therefore need to have the same shape (covariance) for this method to be applicable, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose we have two classes with different shapes, and consider transforming them to the same shape. Given a data point, which transformation should we use to decide which class the point belongs to? If we use the transformation of class A, then we have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare each given class against the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> pairwise differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
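The two counting formulas above are simple enough to encode directly; the sketch below (a plain Python illustration, not course code) mirrors the plot's comparison.<br />

```python
def lda_param_count(K, d):
    """(K-1) pairwise boundaries, each a^T x + b with d + 1 parameters."""
    return (K - 1) * (d + 1)

def qda_param_count(K, d):
    """Each boundary x^T a x + b^T x + c: a symmetric d x d matrix
    (d(d+1)/2 entries), a d-vector, and a constant,
    i.e. d(d+3)/2 + 1 parameters per boundary."""
    return (K - 1) * (d * (d + 1) // 2 + d + 1)

# For d = 10 and K = 2: LDA needs 11 parameters, QDA needs 66.
print(lda_param_count(2, 10), qda_param_count(2, 10))
```

The quadratic growth of the QDA count in <math>d</math> is exactly why QDA is less robust than LDA for high-dimensional data.<br />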
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function, but here we analyze its source code to see how it differs from the SVD method. The following is the code of <code>princomp</code> with explanations of the emphasized steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the numbers of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations, columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
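The same equivalence can be checked in a short Python/NumPy sketch (an illustration only; the course code above is Matlab, and the random matrix below merely stands in for the 2_3 data).<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 400))   # shaped like the 2_3 data: 64 dims, 400 points

# princomp convention: rows are observations, so work with X transposed,
# centered by subtracting off the column means.
Xc = X.T - X.T.mean(axis=0)

# SVD route: Xc = U d V^T; princomp's coefficients are the columns of V.
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
score = Xc @ Vt.T        # representation in the principal-component space

# The scores are uncorrelated: their Gram matrix is diagonal (diag of d^2),
# and multiplying back by V^T recovers the centered data exactly.
print(np.allclose(score.T @ score, np.diag(d ** 2)))   # prints True
```

This mirrors the Matlab comparison: the SVD scores <code>X1'*v</code> coincide with <code>princomp</code>'s <code>score</code>, and the right singular vectors play the role of the principal-component coefficients.<br />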
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the entries of another symmetric <math>\,d \times d</math> covariance matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
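The augmentation itself is a one-liner; the Python/NumPy sketch below (an illustration, not the course's Matlab loop) maps each point <math>x</math> to <math>x^*=(x, x^2)</math>, so any linear rule <math>\underline{w}^{*T}x^*</math> in the new space is a quadratic rule in the original space.<br />

```python
import numpy as np

def augment_quadratic(X):
    """Append the squared features, mapping each row x to (x, x^2).
    A linear rule w*^T x* in the augmented space is quadratic in x."""
    return np.hstack([X, X ** 2])

X = np.array([[1.0, 2.0],
              [3.0, -1.0]])
print(augment_quadratic(X))
# [[ 1.  2.  1.  4.]
#  [ 3. -1.  9.  1.]]
```

The same idea extends to cubes or to other feature maps such as <math>\,sin(x)</math>, exactly as described above.<br />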
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)  # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively, the points of each class form a cloud around the mean of the class, and each class may have a different size. To separate the two classes, we must determine which class's mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the squared Euclidean distance between the projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities gives<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices, which are positive semi-definite; assuming at least one of them is positive definite, <math>\, S_{W}</math> is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>; it gives the discriminant direction directly, without an eigendecomposition.<br />
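This closed-form result can be sanity-checked numerically. The following is a NumPy sketch (not course code; the data and variable names are mine) verifying that the top eigenvector of <math>S_{w}^{-1} S_{B}</math> is parallel to <math>S_{w}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian classes with a shared covariance (illustrative data).
cov = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1, 1], cov, 300)
X2 = rng.multivariate_normal([5, 3], cov, 300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1.T) + np.cov(X2.T)            # within-class covariance
Sb = np.outer(mu1 - mu2, mu1 - mu2)         # between-class covariance

# Top eigenvector of Sw^{-1} Sb ...
vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = np.real(vecs[:, np.argmax(np.real(vals))])

# ... is parallel to Sw^{-1}(mu1 - mu2), as derived above.
w_direct = np.linalg.solve(Sw, mu1 - mu2)
w_direct /= np.linalg.norm(w_direct)
assert np.isclose(abs(w_eig @ w_direct), 1.0)
```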
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Generate 300 samples from each of two bivariate normal distributions, with means <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right)</math> and common covariance <math>\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes separately.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the first principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space. (Since <math>S_B</math> has rank one in the two-class case, only the first direction carries discriminant information.)<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(With <math>k \geq 3</math> classes it is reasonable to have at least 2 discriminant directions.)<br />
<br />
The within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (The class scatter matrices are left unnormalized here so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification is to note<br />
that the total scatter <math>\mathbf{S}_{T}</math> of the data is fixed; since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
There is also a direct construction of <math>\mathbf{S}_{B}</math>. Denote the total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain (the cross terms vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})=\mathbf{0}</math>)<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
The first term is the within class covariance <math>\mathbf{S}_{W}</math>. We define the second term to be<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
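The decomposition is easy to verify numerically using the unnormalized scatter matrices that appear in the derivation above. A NumPy sketch with made-up three-class data (all names are mine):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
# Three classes of 2-D points with different sizes (made-up data).
classes = [rng.normal(loc=m, size=(n, 2)) for m, n in [(0, 40), (3, 25), (6, 35)]]
X = np.vstack(classes)
mu = X.mean(axis=0)                       # total mean

def scatter(A, m):
    """Unnormalized scatter: sum of (x - m)(x - m)^T over the rows of A."""
    return (A - m).T @ (A - m)

St = scatter(X, mu)                       # total scatter S_T
Sw = sum(scatter(C, C.mean(axis=0)) for C in classes)          # S_W
Sb = sum(len(C) * np.outer(C.mean(axis=0) - mu, C.mean(axis=0) - mu)
         for C in classes)                # S_B with n_i weights

assert np.allclose(St, Sw + Sb)           # S_T = S_W + S_B
```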
<br />
Recall that in the two-class problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Since <math>n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu}) = -n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})</math>, both forms are rank-one matrices proportional to <math>(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}</math>; a constant factor does not change the eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math>, so either form yields the same discriminant directions.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues in<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function:<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
As in the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which is a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\leq k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
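Putting the multi-class procedure together, the discriminant directions can be computed as below. This is a NumPy sketch under my own naming (not course code), using made-up three-class data in four dimensions:<br />

```python
import numpy as np

def fda_directions(X, y, n_dirs):
    """Columns of W: top eigenvectors of Sw^{-1} Sb (sketch; helper name is mine)."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)            # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(np.real(vals))[::-1]          # largest eigenvalues first
    return np.real(vecs[:, order[:n_dirs]])

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, size=(30, 4)) for m in (0, 4, 8)])
y = np.repeat([0, 1, 2], 30)
W = fda_directions(X, y, 2)   # k = 3 classes -> at most k - 1 = 2 directions
assert W.shape == (4, 2)
```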
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{d}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given <math>\,n</math> data points <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> with labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
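The normal-equations solution and the hat matrix can be checked numerically. A NumPy sketch on synthetic data (the data and names are mine):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
# Design matrix with a column of ones in the first position (for beta_0).
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 3))])
true_beta = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ true_beta + 0.1 * rng.normal(size=n)

# Normal-equations solution: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with a generic least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

# Fitted values via the hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
assert np.allclose(H @ y, X @ beta_hat)
```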
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, so that x is a <math>3\times 400</math> matrix.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot each data point, coloured according to whether its fitted value is above or below 0.5.<br />
<br />
[[File: linearregression.png|center|frame|Classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== The Logistic Function ===<br />
A logistic function, or logistic curve, is the most common sigmoid curve:<br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y\, dx = \ln(1 + e^{x}) + C</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
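Properties 1–3 of the list above can be verified numerically. A NumPy sketch:<br />

```python
import numpy as np

def sigma(x):
    """Logistic function y = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 801)
y = sigma(x)

# Property 1: dy/dx = y(1 - y), checked by central differences.
h = 1e-6
num_deriv = (sigma(x + h) - sigma(x - h)) / (2 * h)
assert np.allclose(num_deriv, y * (1 - y), atol=1e-8)

# Property 2: y(0) = 1/2.
assert sigma(0.0) == 0.5

# Property 3: an antiderivative is ln(1 + e^x), so its gradient is y.
assert np.allclose(np.gradient(np.log1p(np.exp(x)), x), y, atol=1e-2)
```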
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
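By construction the two posterior probabilities are complementary and lie strictly between 0 and 1. A quick NumPy check with an arbitrary <math>\underline{\beta}</math> and <math>\underline{x}</math> of my choosing:<br />

```python
import numpy as np

beta = np.array([0.8, -1.2])
x = np.array([2.0, 1.0])

p1 = np.exp(beta @ x) / (1 + np.exp(beta @ x))   # P(Y=1 | X=x)
p0 = 1 / (1 + np.exp(beta @ x))                  # P(Y=0 | X=x)

assert np.isclose(p1 + p0, 1.0)                  # probabilities sum to one
assert 0.0 < p1 < 1.0                            # and stay in (0, 1)
```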
<br />
===Fitting a Logistic Regression===<br />
Logistic regression fits a distribution to the data. The model is usually fit by maximum likelihood, using the conditional likelihood <math>\Pr(Y|X)</math>. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of observing the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, taking the log of both sides gives<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
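Before turning to Newton-Raphson, the gradient formula <math>\sum_i (y_i - p(\underline{x}_i;\underline{\beta}))\underline{x}_i</math> can be checked against finite differences of the log-likelihood. A NumPy sketch with made-up data (names are mine):<br />

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 40, 3
X = rng.normal(size=(n, d))               # rows are the x_i
y = (rng.random(n) < 0.5).astype(float)   # arbitrary 0/1 labels
beta = rng.normal(size=d)

def loglik(b):
    """l(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]"""
    z = X @ b
    return np.sum(y * z - np.log1p(np.exp(z)))

p = 1.0 / (1.0 + np.exp(-X @ beta))       # p(x_i; beta)
grad = X.T @ (y - p)                      # sum_i (y_i - p_i) x_i

# Finite-difference check of each gradient component.
h = 1e-6
for j in range(d):
    e = np.zeros(d)
    e[j] = h
    num = (loglik(beta + e) - loglik(beta - e)) / (2 * h)
    assert abs(num - grad[j]) < 1e-4
```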
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x_i} \frac{(exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+exp(\underline{\beta}^T \underline{x}_i))-\underline{x}_i exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)}{(1+exp(\underline{\beta}^T \underline{x}))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math> you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], it's a very useful website including a Matrix Reference Manual that you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> in the gradient to one, using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
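As a minimal illustration of the update and the stopping rule, consider an intercept-only logistic model (a hypothetical one-parameter example, not from the lecture), where the Newton-Raphson iterate can be checked against the closed-form MLE:<br />

```python
import math

# Intercept-only logistic regression: k successes in n trials.
# l(b) = k*b - n*log(1 + e^b), maximized at b = log(k / (n - k)).
k, n = 3, 10
b = 0.0                                  # starting value beta^old
for _ in range(100):
    p = 1.0 / (1.0 + math.exp(-b))       # P(Y=1) under the current b
    grad = k - n * p                     # l'(b)
    hess = -n * p * (1.0 - p)            # l''(b) < 0: l is concave
    b_new = b - grad / hess              # the Newton-Raphson update
    if abs(b_new - b) < 1e-10:           # stop when b^new is close to b^old
        b = b_new
        break
    b = b_new

assert abs(b - math.log(k / (n - k))) < 1e-8
```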
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
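As a quick numerical check (a Python/NumPy sketch with made-up data, since the course itself uses MATLAB), the summation form of the WLS estimator agrees with the matrix form <math>\left[\mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}\underline{y}</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 15
X = rng.normal(size=(d, n))         # columns x_i, as in the notes
y = rng.normal(size=n)
w = rng.uniform(0.1, 1.0, size=n)   # positive weights w_i
W = np.diag(w)

# Summation form: [sum w_i x_i x_i^T]^{-1} [sum w_i x_i y_i]
A = sum(w[i] * np.outer(X[:, i], X[:, i]) for i in range(n))
b = sum(w[i] * X[:, i] * y[i] for i in range(n))
beta_sum = np.linalg.solve(A, b)

# Matrix form: (X W X^T)^{-1} X W y
beta_mat = np.linalg.solve(X @ W @ X.T, X @ W @ y)

assert np.allclose(beta_sum, beta_mat)
```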
<br />
Now apply this estimator to a weighted linear regression on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we constructed the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, convergence is not guaranteed. The procedure will usually converge, since the log-likelihood function is concave, but in case it does not we can only prove local convergence of the method: the iteration converges only if the initial point is close enough to the exact solution. In practice, choosing an appropriate initial value is rarely a problem; it is uncommon for an initial value to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
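The pseudo code translates almost line-for-line into NumPy. The following is a sketch on hypothetical synthetic data (the course itself uses MATLAB), with the step numbers above marked in comments:<br />

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=50):
    """Fit logistic regression by the pseudo code above.
    X is d x n (columns are observations); y is a 0/1 vector."""
    d, n = X.shape
    beta = np.zeros(d)                                         # step 1
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X.T @ beta)))                # step 3
        w = p * (1.0 - p)                                      # step 4: diag of W
        z = X.T @ beta + (y - p) / w                           # step 5
        beta_new = np.linalg.solve((X * w) @ X.T, X @ (w * z)) # step 6
        if np.max(np.abs(beta_new - beta)) < tol:              # step 7
            return beta_new
        beta = beta_new
    return beta

# Hypothetical synthetic data: two overlapping Gaussian classes,
# with a constant first row so beta includes an intercept.
rng = np.random.default_rng(0)
n = 200
x_class0 = rng.normal(0.0, 1.0, size=(2, n // 2))
x_class1 = rng.normal(2.0, 1.0, size=(2, n // 2))
X = np.vstack([np.ones(n), np.hstack([x_class0, x_class1])])   # 3 x n
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])

beta = irls_logistic(X, y)
p = 1.0 / (1.0 + np.exp(-(X.T @ beta)))
accuracy = np.mean((p > 0.5) == y)
assert accuracy > 0.85
```

Note that the classes must overlap for this to be well-behaved: on separable data the likelihood has no finite maximizer and the weights in <math>W</math> shrink toward zero.<br />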
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1, nor to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to lie between 0 and 1 and to sum to 1.<br />
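The first difference can be seen numerically: a least-squares fit of a 0/1 response produces fitted "probabilities" outside [0,1], while the logistic form cannot. A small Python sketch with made-up data:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(0, 2, size=n)
y = (x + 0.3 * rng.normal(size=n) > 0).astype(float)  # 0/1 labels

# Least-squares fit of y on (1, x): fitted values are unbounded.
A = np.vstack([np.ones(n), x]).T
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]
linear_pred = b0 + b1 * x

# Logistic outputs lie strictly in (0, 1) for any beta (values made up).
beta0, beta1 = 0.0, 3.0
logit_pred = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

assert linear_pred.min() < 0 or linear_pred.max() > 1  # escapes [0, 1]
assert np.all((logit_pred > 0) & (logit_pred < 1))
```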
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to classify the data with logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
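The same comparison can be sketched without toolboxes. The following Python/NumPy code (hypothetical Gaussian data in place of the 2_3 set) fits LDA in closed form and logistic regression by a few IRLS steps, and checks that the two linear boundaries classify the points almost identically:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X0 = rng.normal([0.0, 0.0], 1.0, size=(n, 2))   # class 1 sample
X1 = rng.normal([2.0, 2.0], 1.0, size=(n, 2))   # class 2 sample
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n), np.ones(n)])

# LDA with a shared covariance estimate: beta = Sigma^{-1}(mu1 - mu0),
# intercept chosen for equal class priors.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Sigma = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (2 * n - 2)
beta_lda = np.linalg.solve(Sigma, mu1 - mu0)
b0_lda = -0.5 * (mu0 + mu1) @ beta_lda
lda_pred = X @ beta_lda + b0_lda > 0

# Logistic regression fitted by a few IRLS steps on the same data.
Xa = np.hstack([np.ones((2 * n, 1)), X])        # prepend an intercept column
beta = np.zeros(3)
for _ in range(20):
    p = 1.0 / (1.0 + np.exp(-(Xa @ beta)))
    w = p * (1 - p)
    beta = np.linalg.solve((Xa.T * w) @ Xa,
                           Xa.T @ (w * (Xa @ beta) + (y - p)))
lr_pred = Xa @ beta > 0

# The two linear boundaries classify the points almost identically.
agreement = np.mean(lda_pred == lr_pred)
assert agreement > 0.9
```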
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K classes. The model is specified by K - 1 log-odds terms, where the class in the denominator (here the Kth) can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Viewing these fitting equations as a weighted least squares problem, as in the two-class case, makes the estimates easier to derive.<br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other, as was true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem since we don't have the same simplification.<br />
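These posterior formulas are straightforward to evaluate. The sketch below (with arbitrary made-up coefficients) verifies that the K posteriors are positive and sum to 1:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 3
betas = rng.normal(size=(K - 1, d))   # one beta_i per class i = 1, ..., K-1
x = rng.normal(size=d)

scores = betas @ x                                      # beta_i^T x
denom = 1.0 + np.sum(np.exp(scores))
post = np.append(np.exp(scores) / denom, 1.0 / denom)   # classes 1, ..., K

assert np.all(post > 0) and np.isclose(post.sum(), 1.0)
```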
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
A separating hyperplane classifier tries to separate the data using linear decision boundaries. When the classes overlap, it can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged, transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
sign(<math>(\underline{\beta}^T \underline{x} + {\beta}_0)) = sign(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math><br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the [http://en.wikipedia.org/wiki/Artificial_neural_network Artificial Neural Network] models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Because of the iterative nature of the solution, the problem has no global minimum (the criterion is not convex). The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable, the algorithm can be shown to converge to a separating hyperplane in a finite number of steps. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br />
<br />
<br />
When the classes are separable, the separating hyperplane is not unique, and the perceptron algorithm may return any of infinitely many solutions.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we do not know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
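For comparison, here is an equivalent Python/NumPy sketch of the same perceptron loop (it treats points exactly on the boundary as mistakes, a minor variation on the MATLAB code above):<br />

```python
import numpy as np

# Same training set as the MATLAB snippet above.
y = np.array([1, 1, 1, -1, -1, -1])
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float).T  # 3 x 6
b0, b = 0.0, np.ones(3)
rho = 0.5

for epoch in range(1000):
    changed = False
    for i in range(6):
        if (b @ X[:, i] + b0) * y[i] <= 0:   # misclassified (or on boundary)
            b += rho * y[i] * X[:, i]        # perceptron update
            b0 += rho * y[i]
            changed = True
    if not changed:
        break                                # converged: every point correct

assert np.all((X.T @ b + b0) * y > 0)
```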
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0=1</math> carries the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of these inputs with some weights, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> takes the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by<math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to the constant factor <math>1/\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is proportional to the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points. <br />
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. This problem can be eliminated by using basis expansions technique. To be specific, we try to find a hyperplane not in the original space, but in the enlarged space obtained by using some basis functions.<br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#A perfect separation is not always achievable, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Imagine you are on a peak and want to reach the valley as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the (negative) gradient. However, if the mountain has a saddle shape and you initially stand in the middle, then you will finally arrive at the saddle point (a stationary point, not the global minimum) and get stuck there.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we get rid of the summation over <math>\,i</math> (all data points). This is an alternative to the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, where we approximate the true gradient by evaluating it on a single training example. This means that <math>\,{\beta}</math> gets improved by the computation on only one sample. When there is a large data set, say a population database, it is very time-consuming to do a summation over millions of samples. With stochastic gradient descent, we can treat the problem sample by sample and still get decent results in practice.<br />
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <br />
<br />
<ref> Haykin, Simon (2009). Neural Networks and Learning Machines. Pearson Education, Inc. </ref><br />
A neural network resembles the brain in two respects:<br />
1. Knowledge is acquired by the network from its environment through a learning process.<br />
2. Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.<br />
<br />
<ref><br />
Theory of the Backpropagation Neural Network, R. Necht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network but it can have many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer that each represent the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1).<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates: there are maximum and minimum output values. This property ensures that the weights are bounded and therefore the search time is limited. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; for example, RBF networks, whose units are not monotonic, are also powerful models. <br />
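For the logistic activation these properties can be verified numerically. The sketch below also checks the closed-form derivative <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>, which gradient-based training relies on:<br />

```python
import numpy as np

def sigma(a):
    # logistic (inverse-logit) activation
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-10, 10, 201)
s = sigma(a)

# Smooth, monotonic, and saturating in (0, 1):
assert np.all((s > 0) & (s < 1))
assert np.all(np.diff(s) > 0)                  # monotonic
assert s[0] < 1e-4 and s[-1] > 1 - 1e-4        # saturates at both ends

# The derivative has the convenient closed form sigma(a)(1 - sigma(a)).
eps = 1e-6
num_deriv = (sigma(a + eps) - sigma(a - eps)) / (2 * eps)
assert np.allclose(num_deriv, s * (1 - s), atol=1e-8)
```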
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume for now that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to multiple output units and thus turned into a classification problem.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and produces output <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to the hidden weights <math>\displaystyle u_{jl}</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: in this diagram the roles of the indices <math>\,i</math> and <math>\,j</math> are reversed.)''<br />
<br />
Notice that the weighted sum of the outputs of the perceptrons at layer <math>\displaystyle l</math> is the input into the perceptrons at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the expected output and <math>\hat{y_k}</math> is the real output.<br />
#By propagating to the previous layers evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math> where <math>i</math> is associated to the previous layer.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
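The six steps above can be sketched in code. The following is a minimal illustration (not from the lecture) with one hidden layer of sigmoid units and a single linear output; the toy data, network sizes, and learning rate are all invented for the example.<br />

```python
import math
import random

random.seed(0)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Tiny network: d inputs -> p hidden sigmoid units -> 1 linear output (regression).
d, p, rho = 2, 4, 0.05
u = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(p)]  # hidden weights u_{jl}
w = [random.uniform(-1, 1) for _ in range(p)]                      # output weights w_i

def backprop_step(x, y):
    # 1. Forward pass: find the output of all nodes.
    z = [sigmoid(sum(u[j][l] * x[l] for l in range(d))) for j in range(p)]
    y_hat = sum(w[i] * z[i] for i in range(p))
    resid = y - y_hat
    # 2-4. Deltas and derivatives. With squared error the output delta is
    # -2*(y - y_hat); d err/d w_i = -2*(y - y_hat)*z_i, and for the hidden
    # units delta_j = sigma'(a_j) * w_j * (-2*(y - y_hat)), with
    # sigma'(a_j) = z_j * (1 - z_j) for the sigmoid.
    grad_w = [-2.0 * resid * z[i] for i in range(p)]
    deltas = [z[j] * (1.0 - z[j]) * w[j] * (-2.0 * resid) for j in range(p)]
    # 5. Gradient-descent update on all weights.
    for i in range(p):
        w[i] -= rho * grad_w[i]
    for j in range(p):
        for l in range(d):
            u[j][l] -= rho * deltas[j] * x[l]
    return resid ** 2

# 6. Repeat over the training set; toy target y = x1 + x2.
xs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(50)]
ys = [x1 + x2 for x1, x2 in xs]
first_epoch_error = sum(backprop_step(x, y) for x, y in zip(xs, ys))
for _ in range(200):
    last_epoch_error = sum(backprop_step(x, y) for x, y in zip(xs, ys))
```

After a few hundred epochs the training error should be well below its starting value, since the updates follow the negative gradient of the squared error.<br />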
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method mentioned in class is to randomize the weights before the first step. This is unlikely to be near the optimal solution, but it is simple to implement. More specifically, random values near zero (usually from [-1,1]) are a good choice for the initial weights; in this case the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output using a linear approximation of <math>\,\sigma(a_i)</math>, which finds the optimal weights in the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the number of iterations increases.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate may lead to a very slow convergence rate (a very long learning phase). The advantage of a small learning rate, however, is that it can help guarantee convergence. Thus it is generally better to choose a relatively small learning rate to ensure stability; usually <math>\,\rho</math> is chosen between 0.01 and 0.7.<br />
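For the on-line case above, one common decreasing schedule can be sketched as follows (the specific schedule and constants are an illustrative choice, not prescribed by the lecture):<br />

```python
# rho_t = rho_0 / (1 + t / tau): early iterations take large steps,
# later steps shrink toward zero, as the stochastic-approximation
# view of on-line learning requires. rho0 and tau are invented values.
rho0, tau = 0.5, 100.0

def learning_rate(t):
    return rho0 / (1.0 + t / tau)

rates = [learning_rate(t) for t in range(1000)]
```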
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we will mainly discuss how to estimate the number of hidden units at the very beginning. This initial choice should then be refined using cross-validation, leave-one-out, or other complexity-control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from a highly complicated mixture model, more hidden units are needed. <br />
<br />
The number of hidden units determines the size of the model, and therefore the total number of weights in the model. Typically, the number of weights should not be larger than the number of training data points, say N; thus N/10 weights is often a good choice. In practice, however, many well-performing models use more hidden units.<br />
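The N/10 rule of thumb above can be turned into a rough sizing sketch. For a network with d inputs, p hidden units, and one output, counting a bias term on every unit gives (d + 1)·p + (p + 1) weights; counting conventions vary, so this helper is one illustrative choice, not a rule from the lecture.<br />

```python
def hidden_units_for(n_train, d):
    """Largest p with (d+1)*p + (p+1) <= n_train/10 (at least 1 unit)."""
    budget = n_train / 10.0             # target number of weights: N/10
    p = int((budget - 1) // (d + 2))    # solve (d+2)*p + 1 <= budget for p
    return max(p, 1)
```

For example, with 1000 training points and 8 inputs the budget is 100 weights, which allows 9 hidden units (9·9 + 10 = 91 weights).<br />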
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this Neural Network, the number of nodes is reduced until we reach a layer with the number of nodes representing the desired dimensionality. (The number of nodes need not be strictly decreasing in the very first few layers, as long as some later layer reaches the smaller size.) From this bottleneck layer on, <br />
the previous layers are mirrored, so the output layer has the same number of units as the input layer. Now note that if we feed the network with each point and get an output approximately equal to the fed input, then the input has been reconstructed at the output from the middle-layer units alone. So the output of the middle-layer units can represent the input with fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
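The bottleneck idea can be sketched with a minimal linear autoencoder (an illustration, not the lecture's construction: the data, sizes, learning rate, and use of linear units are all invented for the example). Three-dimensional points lying near a one-dimensional line are reconstructed through a one-unit middle layer trained by gradient descent on the reconstruction error.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3-D points near a 1-D line, so a 1-unit bottleneck suffices.
t = rng.normal(size=(200, 1))
X = np.hstack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(200, 3))

d, k, rho = 3, 1, 0.01
W_enc = 0.3 * rng.normal(size=(d, k))   # input -> bottleneck (first half)
W_dec = 0.3 * rng.normal(size=(k, d))   # bottleneck -> output (mirrored half)

def reconstruction_error():
    return float(np.mean((X - X @ W_enc @ W_dec) ** 2))

err_before = reconstruction_error()
for _ in range(2000):
    R = X @ W_enc @ W_dec - X                 # residual at the output layer
    # Gradients of the mean squared reconstruction error.
    grad_dec = 2 * (X @ W_enc).T @ R / len(X)
    grad_enc = 2 * X.T @ (R @ W_dec.T) / len(X)
    W_dec -= rho * grad_dec
    W_enc -= rho * grad_enc
err_after = reconstruction_error()

# The low-dimensional mapping is the middle-layer output.
codes = X @ W_enc
```

After training, the codes are the one-dimensional representation of the data, and pushing them through the second half of the network reconstructs the inputs.<br />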
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math> values may become negligible and the error signal vanishes as it propagates backward. This is a numerical problem that makes the errors difficult to estimate, so in practice configuring a<br />
Neural Network with Back-propagation involves some subtleties.<br />
Deep Neural Networks became popular two or three years ago, when Geoffrey Hinton and his collaborators introduced an effective way of training them. Deep Neural Network training algorithms deal with the training of a Neural Network with a large number of layers.<br />
<br />
The approach to training the deep network is to first assume the network has only two layers and train those; after that we train the next two layers, and so on.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this is the main issue the algorithm deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, an idea inspired by physics, where the most stable configuration is the one of lowest energy; or somehow finding the output of the second layer that is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case for example. A feedforward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach provided an adaptive alternative to fixed rules. The system was used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to be modeling human brains, hence the fancy name "Neural Network". But now we know that they are essentially logistic regression layers stacked on top of each other and have little to do with how the brain actually works.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. Because of such claims, it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and tune them wisely. Partly for this reason, they are no longer as active a research area in machine learning, although NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is heavily complex with so many degrees of freedom, that we can learn every detail of the training set. Such a model will have very high precision on the training set but will show very poor ability to predict outcomes of new instances, especially outside the domain of the training set.<br />
<br />
In a Neural Network, if the network is too deep or too large, it will have many degrees of freedom and will learn every characteristic of the training data set. It will then show very precise outcomes on the training set, but will not be able to generalize the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model which fits it best. We can find a polynomial of high degree which passes through almost all the points in the training set. But in fact the training set comes from a line model. Now the problem is that although the complex model has less error on the training set, it diverges from the line in the ranges where we have no training points. Because of this, the high-degree polynomial has very poor predictive power on test cases. This is an example of an overfitted model.<br />
#Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier. The reason is that just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas; but if presented with a new banana of slightly different shape than the existing ones, for example, the classifier cannot detect it. This is the tradeoff: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data. For example, we may assume the model is a polynomial of a certain degree, or a neural network with a certain architecture; there are other choices as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to each of <math>\displaystyle x_1, \dots, x_n</math> and take the average number of misclassifications, which gives an empirical estimate of the probability that <math>h(x_{i}) \neq y_{i}</math>.<br />
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
There is a downward bias to this estimate, meaning that on average it underestimates the true error rate. <br />
<br />
As complexity increases from low to high, the training error rate always decreases. The test error, by contrast, decreases to a point and then starts to increase, since the model has not seen the test data before: training error keeps falling as we fit the model better by increasing its complexity, but, as we have seen, this complex model does not generalize well, resulting in larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is defined as the point where the error rate on the test data is at its minimum; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in \mathbb{R}</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a more important property for an estimator than just being unbiased: the mean squared error. In statistics, there are problems for which it may be good to use an estimator with a small bias. In some cases, an estimator with a small bias may have lesser mean squared error or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under transformations while the property of mean-unbiasedness may be lost under nonlinear transformations. For example, while using an unbiased estimator with large mean square error to estimate the parameter, we highly risk a big error. In contrast, a biased estimator with small mean square error will well improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus given the Mean Squared Error (MSE), if we have a low bias, then we will have a high variance and vice versa.<br />
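The decomposition <math>MSE = Variance + Bias^2</math> can be checked with a small Monte-Carlo simulation (an illustration; the distribution, sample size, and shrinkage factor are invented for the example). We estimate <math>\,f = E(X) = 2</math> with the deliberately biased shrinkage estimator <math>\hat f = 0.5\,\bar{X}</math>:<br />

```python
import random

random.seed(1)

f, n, reps = 2.0, 20, 20000
estimates = []
for _ in range(reps):
    sample = [random.gauss(f, 1.0) for _ in range(n)]
    estimates.append(0.5 * sum(sample) / n)   # biased shrinkage estimator

mean_est = sum(estimates) / reps
bias = mean_est - f                                       # E(f_hat) - f
variance = sum((e - mean_est) ** 2 for e in estimates) / reps
mse = sum((e - f) ** 2 for e in estimates) / reps         # E[(f_hat - f)^2]
```

Here the bias is close to <math>\,0.5 \cdot 2 - 2 = -1</math>, and the empirical MSE matches variance plus squared bias (the identity holds exactly for empirical moments, up to floating-point rounding).<br />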
<br />
Test error is a good estimate of the MSE. We want a somewhat balanced bias and variance (neither too high), even though the estimator will then have some bias.<br />
<br />
<br />
Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
> x <- rnorm(200,0,1)<br />
> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
> xtest <- rnorm(50,1,1)<br />
> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
> p1 <- lm(y~x)<br />
> p2 <- lm(y ~ poly(x,2))<br />
> pn <- lm(y ~ poly(x,10))<br />
> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are drawn from a <math>\,N(0,1)</math> distribution, while the test set uses <math>\,N(1,1)</math>. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random variation. Polynomial least-squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
> # calculate the mean squared error of degree 1 poly<br />
> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
[1] 1.576042<br />
> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 7.727615<br />
: Training and test mean squared errors for the linear fit. These are both quite high - and since the data is non-linear, the different mean value of the test data increases the error quite a bit.<br />
> # calculate the mean squared error of degree 2 poly<br />
> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
[1] 0.08608467<br />
> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 0.08407432<br />
: This fit is far better - and there is not much difference between the training and test error, either.<br />
> # calculate the mean squared error of degree 10 poly<br />
> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
[1] 0.07967558<br />
> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but not by much - and the test set error has risen again. The overfitting makes it a poor predictor. As the degree of the polynomial rises further, the accuracy of the computer becomes an issue - and a good fit is not even consistently produced for the training data.<br />
> # calculate mse of sin/cos fit<br />
> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
[1] 0.1105446<br />
> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 1.320404<br />
: Fitting a function of the form sin(x)+cos(x) works fairly well on the training set, but because it is not the real underlying function, it fails on test data that does not lie in the same range.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V: since the model has not seen V, and we know the true labels of all elements in V, we can count how many were misclassified<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although training error always decreases with the increasing complexity of the model, the test error starts to increase from a certain point, which is known as overfitting (see [[#prediction-error|figure 2]] above). Since test error estimates the MSE (mean squared error) best, people came up with the idea of dividing the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and select the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{(x_i,y_i) \in \nu}I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set <math>\,\nu</math>.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. In reality, however, data may be hard to collect (and we often suffer from the curse of dimensionality), so a larger data set may be hard to come by, and we may not be able to afford to sacrifice part of our limited resources. In this case we use another method that addresses this problem: K-fold cross-validation. We divide the data set into <math>\,K</math> subsets of roughly equal size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
If <math>\,K=n</math> (leave-one-out), the error estimate has low bias but high variance. Each subset contains a single element, so the model is trained with all but one point and then validated using that point.<br />
<br />
If <math>\,K</math> is small, say 2-fold or 5-fold, the estimate has higher bias but lower variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th part <math>( \,k \in [ 1, K ] )</math>, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>. The overall estimate is<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. Next we use the first three subsets as training data and the last as the validation set, then the 1st, 2nd, and 4th subsets as training data and the 3rd as the validation set, and so on, until every subset has been the validation set once (so all observations are used for both training and validation). After we get <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for a degree-n model and plot the error as a function of degree. Now we can choose the degree corresponding to the minimum error. We can also use this method to find the optimal number of hidden units in a neural network: begin with 1 unit, then 2, 3, and so on, and pick the number of hidden units with the lowest average error.<br />
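The polynomial-degree example above can be sketched in code (an illustration: the data, seed, fold count, and degree range are invented, with a quadratic true model as in the earlier R example):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic data with noise, as in the earlier R example.
n, K = 120, 4
x = rng.normal(size=n)
y = x**2 - 0.5 * x + rng.normal(scale=0.3, size=n)

# Split the shuffled indices into K folds.
idx = rng.permutation(n)
folds = np.array_split(idx, K)

def cv_error(degree):
    """Average validation MSE over the K folds for a given degree."""
    errs = []
    for k in range(K):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(x[tr], y[tr], degree)
        pred = np.polyval(coeffs, x[val])
        errs.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(errs))

errors = {deg: cv_error(deg) for deg in range(1, 7)}
best_degree = min(errors, key=errors.get)
```

The degree-1 model underfits the quadratic data, so its CV error is far larger than that of degree 2, and the minimizing degree is the model cross-validation selects.<br />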
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be denoted by <math>\mathbf{y}</math> and the vector of fitted values by <math>\hat{\mathbf{y}}</math>. Then<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
The leave-one-out cross-validation error can then be written in terms of the diagonal entries of <math>\mathbf{H}</math>, where <math>\hat f^{-i}</math> denotes the fit computed with the <math>i</math>th point left out:<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}\right]^{2}</math>,<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus one of the biggest advantages of GCV is that the trace of <math>\mathbf{H}</math> is often easier to compute than its individual diagonal elements <math>\mathbf{H}_{ii}</math>.<br />
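A small numerical sketch of these formulas for ordinary least squares (the design matrix, coefficients, noise level, and sizes are all invented for the example). For OLS, <math>trace(\mathbf{H})</math> equals the number of parameters, which is what makes the GCV denominator cheap:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy linear model: intercept plus two random predictors.
n, p = 50, 3
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(p - 1)])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

# Hat matrix H = X (X^T X)^{-1} X^T and the fitted values.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

# Exact leave-one-out error via the diagonal of H ...
loo = float(np.mean(((y - y_hat) / (1 - np.diag(H))) ** 2))
# ... and the GCV approximation, replacing H_ii by trace(H)/N.
gcv = float(np.mean(((y - y_hat) / (1 - np.trace(H) / n)) ** 2))
```

Here <math>trace(\mathbf{H}) = p = 3</math>, and the GCV value is close to the exact leave-one-out error.<br />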
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point in the original training data set to train the model, then using the left-out data point to estimate the true error. By repeating this process for every data point in the original data set, we obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every possible value of the parameter, we must train the model <math>\,K</math> times. This cost is even more pronounced in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, where <math>\,n</math> is the number of data points in the data set.<br />
<br />
Fortunately, when adding a data point to the classifier is reversible, calculating the difference between two classifiers is computationally cheaper than training the two classifiers separately. So, if the classifier trained on all the data points is known, we can simply undo the contribution of each data point in turn, <math>\,n</math> times, to calculate the leave-one-out cross-validation error rate.<br />
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as a way of achieving a robust neural network that is insensitive to noise. Since the size of the hidden layer in a neural network is usually decided by domain knowledge rather than by the data, the network can easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weight is in the vicinity of zero, the operative part of the activation function shows linear behavior. The NN then collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights close to zero.<br />
<br />
Formally, we penalize large weights (which induce nonlinearity) by adding a penalty term to the error function. The regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> are the weights of the output layer; <math>\,u_{jk}</math> are the weights of the hidden layers.<br />
<br />
If <math>\,\lambda</math> is too large, the weights <math>\,w_i</math> and <math>\,u_{jk}</math> will be shrunk too close to zero. We can use cross-validation to estimate <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivatives with respect to the weights (shown here for the weight-decay penalty):<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the linearity of the model; we may set <math>\,\lambda</math> by cross-validation. The initialization is also important: weights of exactly zero give zero derivatives and the algorithm never moves, while weights that are too large mean starting with a highly nonlinear model, which often leads to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
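The penalized update above can be sketched for a single linear layer; this is a minimal NumPy illustration in which the data, learning rate, and <math>\,\lambda</math> are arbitrary choices, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
w_true = rng.normal(size=4)
y = X @ w_true                      # noiseless toy targets

def train(lam, rho=0.01, steps=2000):
    """Gradient descent on err + lam * ||w||^2 (weight decay)."""
    w = np.zeros(4)
    for _ in range(steps):
        err_grad = 2 * X.T @ (X @ w - y) / len(y)  # d err / d w
        w -= rho * (err_grad + 2 * lam * w)        # decay adds 2*lam*w
    return w

w_plain = train(lam=0.0)   # no penalty
w_decay = train(lam=1.0)   # weight decay shrinks the weights
```

The decayed solution has a strictly smaller norm than the unpenalized one, which is the intended effect: keeping the network closer to its linear regime.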
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with a single hidden layer and an output layer, with weights only from the hidden layer to the output layer; it can be trained without back-propagation since the weights have a closed-form solution. The neurons in the hidden layer contain basis functions. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data. The radii of the RBF functions may differ. The centers and radii can be determined through clustering or an EM algorithm. When the vector x is presented at the input layer, each hidden neuron computes the radial distance from its center point and then applies the RBF to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
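These identities can be verified numerically with central finite differences; the following is a quick NumPy check on random matrices (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(3, 3))
eps = 1e-6

def num_grad(f, X):
    """Central finite-difference gradient of a scalar function of a matrix."""
    g = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        g[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return g

g1 = num_grad(lambda X: np.trace(A @ X), X)        # should equal A^T
g2 = num_grad(lambda X: np.trace(X.T @ A), X)      # should equal A
g3 = num_grad(lambda X: np.trace(X.T @ A @ X), X)  # should equal (A^T + A) X
```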
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
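Putting the pieces together, the closed-form fit can be sketched in a few lines of NumPy; the one-dimensional data, the grid of centers, and the width <math>\,\sigma</math> below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + 0.05 * rng.normal(size=40)  # noisy 1-D targets

centers = np.linspace(0.0, 1.0, 8)   # mu_j placed on a grid for simplicity
sigma = 0.15

# Phi[i, j] = exp(-||x_i - mu_j||^2 / (2 sigma^2)): the design matrix
Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))

# W = (Phi^T Phi)^{-1} Phi^T y; lstsq computes the same least-squares
# solution without forming the explicit inverse
W = np.linalg.lstsq(Phi, y, rcond=None)[0]
yhat = Phi @ W
```

The fitted values are exactly the projection <math>\hat{Y} = HY</math> with <math>H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math>.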
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the <math>n \times k</math> matrix of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the <math>n \times (M+1)</math> matrix of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the <math>(M+1) \times k</math> matrix of weights.<br />
<br />
where the extra basis function <math>\Phi_{0}</math> is fixed at 1 for every data point, acting as a bias term.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function, giving the familiar form,<br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit can then be thought of as representing a feature; if there are more hidden units than input units, we essentially project into a higher-dimensional space, as we did in our earlier trick. This does not mean that an RBF network will actually do this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Nevertheless, with this power the problem of overfitting becomes more pressing: we have to control the model's complexity so that it fits a general pattern rather than an arbitrary training set.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and that the data are generated by a mixture model. According to Bayes' Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can expand the class-conditional density in the above expression by marginalizing over the hidden variable <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how a Gaussian component of the mixture is chosen and then sampled from.<br />
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When we use an RBF network for classification, we usually treat it as a regression problem and set a threshold to decide class membership. However, to gain some insight into what we are doing with an RBF network in a classification problem, we often think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as we can see in the graph on the right-hand side, that we have three random variables, <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that there are different classes, and each class can trigger a different hidden random variable <math>\displaystyle j</math>. For instance, suppose each <math>\displaystyle j</math> corresponds to a Gaussian distribution (it could be any other distribution as well), all from the same family but with different parameters. From each Gaussian triggered by each class, we sample some data points. In the end, we obtain a data set that is not strictly Gaussian but is a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
We multiply each term in the summation by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = Pr(y_{k})\sum_{j} \frac{Pr(x | j)*Pr(j | y_{k})}{Pr(x)} * \frac {Pr(j)}{Pr(j)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, the posterior of the class is just a sum over <math>\displaystyle j</math> of products of two posteriors, <math>\displaystyle Pr(j | x)</math> and <math>\displaystyle Pr(y_k | j)</math>.<br />
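The identity <math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math> relies on <math>\displaystyle x</math> being independent of <math>\displaystyle y_k</math> given <math>\displaystyle j</math>, which holds in the generative chain <math>\displaystyle y \to j \to x</math>. It can be checked numerically on a small discrete toy model (all probability tables below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
K, J, D = 2, 3, 4   # classes, hidden components, discrete values of x

p_y = np.array([0.4, 0.6])
p_j_given_y = rng.dirichlet(np.ones(J), size=K)   # row k: P(j | y_k)
p_x_given_j = rng.dirichlet(np.ones(D), size=J)   # row j: P(x | j)

# joint P(y, j, x) under the chain y -> j -> x
joint = p_y[:, None, None] * p_j_given_y[:, :, None] * p_x_given_j[None, :, :]

p_x = joint.sum(axis=(0, 1))
p_y_given_x_direct = joint.sum(axis=1) / p_x      # P(y | x), shape (K, D)

# identity: P(y_k | x) = sum_j P(j | x) P(y_k | j)
p_jx = joint.sum(axis=0)                          # P(j, x)
p_j = p_jx.sum(axis=1)
p_j_given_x = p_jx / p_x                          # (J, D)
p_y_given_j = joint.sum(axis=2) / p_j             # (K, J)
p_y_given_x_identity = p_y_given_j @ p_j_given_x
```

The two computations of the posterior agree exactly, since the conditional independence is built into the joint distribution.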
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results derived above to our RBF Network. In an RBF Network, as shown on the right-hand side, we have a set of inputs, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, the hidden basis functions, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and the outputs, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. We also have weights from the hidden layer to the output layer, and each output is just a weighted linear sum of the <math>\displaystyle \phi</math>'s. <br />
<br />
Now, if we consider the probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi_{j}(x)</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weight <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in one dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It is as if we place a Gaussian over the data, and for each Gaussian we consider its center <math>\displaystyle \mu</math>. What <math>\displaystyle \phi</math> computes is then the similarity of any data point to that center. <br />
<br />
We can see this in the graph on the left, which plots <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance: as a point moves away from the center <math>\displaystyle \mu_{1}</math>, the value of <math>\displaystyle \phi_{1}</math> falls to nearly zero. Remember that we can usually obtain a non-linear regression or classification of the input space by performing a linear one in some extended space or feature space (more details in the Aside). Here, the <math>\displaystyle \phi</math>'s actually produce that feature space. <br />
<br />
So, one way to look at this is that <math>\displaystyle \phi</math> tells us, given an input, how likely a particular feature is to be present. Say, for example, we define the features as the centers of these Gaussian distributions. Then the <math>\displaystyle \phi</math> function computes, for a given data point, how strongly that feature appears. If the data point is right at the center, the value of that <math>\displaystyle \phi</math> is one, i.e. the probability is 1. If the point is far from the center, the value of <math>\displaystyle \phi</math> is close to zero, i.e. the feature is unlikely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given the data. <br />
<br />
Once we have those features, <math>\displaystyle y</math> is a linear combination of them. Hence each weight <math>\displaystyle w</math>, which equals <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> is given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> represents the probability of class membership given the feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>, each with d features. These n data points form the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we do not perform least squares on the <math>\displaystyle X</math> matrix; we perform least squares on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then each <math>\displaystyle \phi_j</math> measures the similarity of a data point to its center. Hence the new features actually represent M centers in our data set: for each data point, the new features measure how similar the point is to the first center, to the second center, and so on up to the <math>\displaystyle M^{th}</math> center. This applies to all data points. Therefore, the feature space gives another representation of our data set. <br />
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit to the training data, we often want to increase the complexity of our RBF Network. By its construction, the only way to change the complexity of an RBF Network is to add or remove basis functions: a larger number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF Network can fit any training set exactly; however, this does not mean the network will generalize well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is estimate the training error. Working through the training error, we will see that it can in fact be decomposed, and one component is the Mean Squared Error (MSE). In the later notes, we will find that our final goal is to obtain a good estimate of the MSE; to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn't always a good thing. Sometimes [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left-hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>\,N(0,\sigma^2)</math>. The red curve displayed in the graph shows an over-fitted model. Clearly, this over-fitted model only fits the training data, and is useless for prediction when new data points are introduced.<br />
<br />
> n <- 20                                  # sample size<br />
> x <- seq(1, 10, length = n)<br />
> alpha <- 2.5                             # true intercept<br />
> beta <- 1.75                             # true slope<br />
> y <- alpha + beta * x + rnorm(n)         # linear model plus Gaussian noise<br />
> plot(y ~ x, pch = 16, lwd = 3, cex = 0.5, main = 'Overfitting')<br />
> abline(alpha, beta, col = 'blue')        # the true linear model<br />
> lines(spline(x, y), col = 2)             # interpolating spline: over-fits the noise<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model could be selected according to the best observed performance on the training data, although our real goal is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficiently large number of basis functions.<br />
<br />
<br />
But training error and testing error do not move in lockstep. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error can increase dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to the training data at the expense of losing generality. In the extreme case, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, it does not represent the features of the true underlying data source, which is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model adapts to the training data, and hence the training error is an overly optimistic estimate of the testing error. An obvious way to better estimate the testing error is to add a penalty term to the training error to compensate; SURE is developed from exactly this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the sum of squared errors over the <math>\,N</math> training points.<br />
*<math>\displaystyle Err=\sum_{i=1}^M (\hat y_i-y_i)^2 </math> denote the ''test error'', the sum of squared errors over an independent test sample of <math>\,M</math> points.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations are <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>\,N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}^N_{i=1}</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>; then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, since <math>\displaystyle y_i</math> has mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E[g'(Z)]</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>: Let <math>\,Z = \epsilon</math>. Then <math>g(Z) = \hat f-f</math>, since <math>\,y = f + \epsilon</math> and <math>\,f</math> is a constant. So <math>\,\mu = 0</math> and <math>\,\sigma^2</math> is the variance of <math>\,\epsilon</math>.<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model and not a function of the observations <math>\displaystyle y_i</math>, we have <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
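Identity <math>\displaystyle (2)</math> can be checked numerically. The sketch below is illustrative (the shrinkage estimator <math>\hat f = h\,y</math> and all names are my own choices, not from the lecture): for <math>\hat f = h\,y</math> we have <math>\partial \hat f/\partial y = h</math>, so Stein's lemma predicts the cross term equals <math>\sigma^2 h</math>.

```python
import numpy as np

# Monte Carlo check of identity (2): E[eps*(fhat - f)] = sigma^2 * E[d fhat / dy].
# Toy estimator fhat = h*y (a linear shrinkage), so d fhat / dy = h exactly.
rng = np.random.default_rng(0)
f, sigma, h, n = 2.0, 1.0, 0.3, 1_000_000

eps = rng.normal(0.0, sigma, n)
y = f + eps
fhat = h * y

cross_term = np.mean(eps * (fhat - f))   # left-hand side, by simulation
prediction = sigma**2 * h                # right-hand side, from Stein's lemma

print(cross_term, prediction)            # the two values should nearly agree
```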
<br />
====Two Different Cases====<br />
SURE for RBF networks is developed in:<br />
[http://www.cs.ualberta.ca/~papersdb/uploaded_files/801/paper_automatic-basis-selection-for.pdf Automatic basis selection for RBF networks using Stein’s unbiased risk estimator, Ali Ghodsi and Dale Schuurmans]<br />
<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is used to assess the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent, so <math>\displaystyle cov(y_i,\hat f)=0</math>. (Equivalently, consider <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, it has no influence on <math>\hat f</math>, because <math>\hat f</math> is estimated from the training data alone, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>.) Hence <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
Empirically, this expectation corresponds to <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>, where <math>\,m</math> is the size of the validation set.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Using the notation defined above, we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross validation, to avoid overfitting or underfitting, the validation data set is kept independent of the data used to estimate the model.<br />
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing the model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored, because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can instead be estimated by Stein's lemma, which was originally proposed for estimating the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator, in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the number of basis functions should be chosen to minimize the generalization error <math>\displaystyle err</math>. For the Radial Basis Function Network, setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. This simplifies further: <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=d</math>, where <math>\displaystyle d</math> is the number of columns of <math>\displaystyle \Phi</math>. Since <math>\displaystyle \Phi</math> is a projection of the input matrix <math>\,X</math> onto a set spanned by <math>\,M</math> basis functions, <math>\,d=M</math>; if an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
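The identity <math>\,Trace(H)=M+1</math> is easy to verify numerically. Below is a small NumPy sketch (illustrative only; the Gaussian basis, centers and sample sizes are arbitrary choices of mine):

```python
import numpy as np

# Sketch: build a design matrix Phi with M Gaussian basis functions plus an
# intercept column, form the hat matrix H, and check Trace(H) = M + 1.
rng = np.random.default_rng(1)
N, M = 50, 4
x = rng.uniform(-3, 3, N)
centers = np.linspace(-2, 2, M)

Phi = np.exp(-(x[:, None] - centers[None, :])**2 / 2.0)  # N x M basis columns
Phi = np.column_stack([np.ones(N), Phi])                  # add intercept

H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T              # hat matrix
print(np.trace(H))                                        # ~ M + 1 = 5
```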
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimum number of basis functions by choosing the model with the smallest MSE over the set of models considered. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\,M</math>, compute the training error <math>\displaystyle err(M)</math> for each. <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples and the noise variance <math>\sigma^2</math> can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Figure 27.1 on the right shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math> (here <code>err</code> is the total squared training error <math>\sum_{i=1}^N (\hat y_i - y_i)^2</math>):<br />
<br />
err = sum((output - expected_output) .^ 2);<br />
sure_err = err - num_data_point * sigma ^ 2 + 2 * sigma ^ 2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated using the error.<br />
<br />
sq_err = (output - expected_output) .^ 2;<br />
sigma2 = sum(sq_err) / (num_data_point - 1);<br />
sure_err = sum(sq_err) - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
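Putting the pieces together, the whole selection procedure can be sketched as follows (Python/NumPy, for illustration only; polynomial features stand in for the RBF basis, and all names are mine — the per-model noise estimate follows the <math>\hat\sigma^2 = err/(N-1)</math> formula above):

```python
import numpy as np

def sure_select(x, y, max_M):
    """Pick the number of basis functions minimizing the SURE estimate
    MSE(M) = err(M) - N*sigma2 + 2*sigma2*(M + 1).
    Polynomial features stand in for RBF bases in this sketch."""
    N = len(x)
    scores = {}
    for M in range(1, max_M + 1):
        Phi = np.column_stack([x**k for k in range(M + 1)])  # intercept + M terms
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        err = np.sum((Phi @ w - y)**2)                       # training error
        sigma2 = err / (N - 1)                               # noise estimate
        scores[M] = err - N * sigma2 + 2 * sigma2 * (M + 1)
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 100)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.1, 100)  # true model has M = 2
best_M, _ = sure_select(x, y, 8)
print(best_M)
```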
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) <math>= \sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) <math>= \sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units) <math>= \sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then the term <math>\displaystyle N\sigma^2</math> is a constant and has no impact on the choice of model,<br />
so we can ignore it; we only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, so we must estimate it; note that the estimate <math>\,\hat \sigma</math> changes as <math>\displaystyle (M+1) </math> changes.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>\sim N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error decreases while the MSE increases as the number of hidden units grows (i.e. as the model becomes more complex).<br />
<br />
<br />
As the number of hidden units gets larger, the training error decreases, eventually approaching <math>\displaystyle 0 </math>. If the training error equals <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) the estimate of MSE approaches <math>\displaystyle 0 </math> as well. In fact this does not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the MSE should increase instead of approaching <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
The problem is that the estimate <math>\displaystyle \hat\sigma^2 </math> in (28.2) is proportional to <math>\displaystyle err </math>, so it shrinks along with the training error. To deal with this, we can instead average the estimates of <math>\displaystyle \sigma^2</math> obtained over a range of model sizes, for example over models with 1 up to 10 hidden units. Since in reality <math>\, \sigma^2</math> is a property of the noise in the data and does not depend on <math>\,M+1</math>, using this averaged <math>\,\sigma^2</math> value has a firm theoretical basis.<br />
<br />
We can also see that, unlike the classical Cross Validation (CV) or Leave One Out (LOO) techniques, the SURE technique does not need a separate validation step to find the optimal model. Hence, the SURE technique uses less data than CV or LOO, which makes it suitable when there is not enough data for validation. However, to implement SURE we need <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering K-means clustering] is a method of cluster analysis which aims to partition <math>n</math> observations into <math>k</math> clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units is the same as the number of clusters; each cluster <math>j</math> contributes one basis function <math>\displaystyle \phi_j </math>.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, with the same form for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. For each point, identify its cluster: assign it to the nearest center (every point in a cluster should be closer to that cluster's center than to any other center).<br />
<br />
2. Compute the mean of each cluster and make it the new center of that cluster.<br />
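The two alternating steps can also be sketched from scratch (Python/NumPy, for illustration only; the function and variable names are my own — the example below uses MATLAB's built-in <code>kmeans</code>):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means sketch: alternate the two steps from the notes.
    X is (n_points, n_dims); returns labels and cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Step 1: assign each point to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 2: recompute each center as the mean of its cluster
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs: each should end up in its own cluster.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-5, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centers = kmeans(X, 2)
print(labels)
```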
<br />
<br />
Example:<br /><br />
Partition the data into 2 clusters (2 hidden units)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
>> 30 1<br />
>> size(C) <br />
>> 2 80<br />
>> size(sumD) <br />
>> 2 1<br />
>> c1=sum(IDX==1)<br />
>> 14<br />
>> c2=sum(IDX==2)<br />
>> 16<br />
>> sumD<br />
>> 85.6643<br />
>> 101.0419<br />
>> v1=sumD(1,1)/c1 <br />
>> 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
>> 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points in 80 dimensions (MATLAB's <code>kmeans</code> clusters the rows of <math>X</math>), and then apply the <code>kmeans</code> method to separate <math>X</math> into 2 clusters. IDX is a 30*1 vector containing 1 or 2, indicating which of the 2 clusters each point belongs to. <math>\displaystyle C </math> is the center (mean) of each cluster, with size 2*80; sumD is the sum of the squared distances between the data points and the center of their cluster. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the numbers of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the variance of the first cluster <math>\displaystyle (v1=\sigma_1^2)</math>; <math>\displaystyle v2 </math> is the variance of the second cluster <math>\displaystyle (v2=\sigma_2^2)</math>. Now we can get <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> from the following equations. Finally, we can compute the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
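The pipeline just described, from centers and variances to <math>\Phi</math>, <math>W</math> and <math>\hat Y</math>, can be sketched as follows (Python/NumPy, for illustration only; the centers and widths below are hand-picked stand-ins for the K-means output):

```python
import numpy as np

# Sketch of the RBF pipeline above: given cluster centers mu_j and widths
# sigma_j^2, build Phi, solve for W by least squares, and predict Y_hat = H Y.
rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, (40, 1))
Y = np.sin(X[:, 0]) + rng.normal(0, 0.05, 40)

mu = np.array([[-1.0], [1.0]])        # stand-ins for the K-means centers
sig2 = np.array([0.5, 0.5])           # per-cluster variances (v1, v2)

sq_dist = ((X[:, None, :] - mu[None, :, :])**2).sum(axis=2)
Phi = np.exp(-sq_dist / (2 * sig2))   # Phi_j(x) = exp(-||x - mu_j||^2 / 2 sigma_j^2)

W = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)   # W = (Phi^T Phi)^{-1} Phi^T Y
Y_hat = Phi @ W                                # = H Y
print(np.mean((Y_hat - Y)**2))                 # training error of the fit
```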
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is EM algorithm with respect to Gaussian mixture model. Generally speaking, the Gaussian mixture model is referred to as a soft clustering while <math>K</math>-means is hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
E-step: each point is assigned a weight (responsibility) for each cluster, based on its likelihood under the corresponding Gaussian. Unlike <math>K</math>-means, where a point gets weight 1 for its nearest cluster and 0 for all others, these weights are fractional. <br />
<br />
M-step: compute the weighted means and covariances and make them the new means and covariances of each cluster.<br />
<br />
>> [P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
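Since <code>mdgEM</code> is a course-provided MATLAB routine, here is a minimal one-dimensional sketch of the E and M steps described above (Python/NumPy, for illustration only; all names and settings are mine):

```python
import numpy as np

# A minimal 1-D Gaussian-mixture EM sketch of the E/M steps described above.
rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-4, 1, 200), rng.normal(4, 1, 200)])

mu, var, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(50):
    # E-step: soft responsibilities from the Gaussian likelihoods.
    lik = pi * np.exp(-(x[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = lik / lik.sum(axis=1, keepdims=True)
    # M-step: weighted means, variances, and mixing weights.
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu)**2).sum(axis=0) / nk
    pi = nk / len(x)
print(mu)   # should approach the true component means -4 and 4
```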
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier, which constructs linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized into what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a high-dimensional, transformed version of the feature space.<br />
<br />
The original basis for SVM was published in the 1960s by [http://en.wikipedia.org/wiki/Vapnik Vapnik], Chervonenkis and colleagues; however, the ideas did not gain wide attention until strong results were shown in the early 1990s.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a maximum-margin hyperplane or set of hyperplanes in a high- or infinite-dimensional space. The set of points near the class boundaries, the support vectors, define the model, which can be used for classification, regression or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separated by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
In Figure 28.2 the data points belong to two classes in <math>\displaystyle \mathbb{R}^{2} </math> and can be separated by a linear boundary. If a dataset is indeed linearly separable, then there exist infinitely many possible separating hyperplanes for the training data, the black lines in the figure being two of them. But which solution is best when new data are introduced? <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: <math>\displaystyle d_i </math> is a signed quantity: it is positive for points on the <math>\displaystyle +1 </math> side and negative for points on the <math>\displaystyle -1 </math> side, so for correctly classified points <math>\displaystyle y_id_i>0</math>.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>; we can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers in the Linearly separable case====<br />
Choose the line farthest from both classes, i.e. the line with the maximum distance from the closest point (maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since we are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Let us rewrite <math>\displaystyle Margin=min\{y_id_i\}</math> by using the following properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Take two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane. Then:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle (\beta^{T}x_1+\beta_0)-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence,<math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and<math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
That is, for any point <math>\displaystyle x_0</math> on the hyperplane, <math>\displaystyle \beta^{T}x_0</math> gives the negative of the intercept. <br/><br />
<br />
<br />
3. The signed distance from any point <math>\displaystyle x_i </math> to the hyperplane is the projection of <math>\displaystyle (x_i-x_0)</math> onto the unit normal, where <math>\displaystyle x_0 </math> is any point on the hyperplane. Since only the direction of <math>\displaystyle \beta </math> matters, we normalize by its length:<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
We had <math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math>, and since we now know how to compute <math>\displaystyle d_i \Rightarrow</math> <br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane. Then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta </math> only the direction is important, so rescaling to <math>\displaystyle \frac{\beta}{c} </math> does not change the direction, and the hyperplane stays the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
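The quantities above are easy to compute directly. A small NumPy sketch (the hyperplane <math>\beta,\beta_0</math> and the points are hand-picked by me, for illustration):

```python
import numpy as np

# Sketch: signed distances d_i = (beta^T x_i + beta_0)/||beta|| and the
# margin min_i y_i d_i for a hand-picked hyperplane and four labelled points.
beta = np.array([1.0, 1.0])
beta0 = 0.0

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

d = (X @ beta + beta0) / np.linalg.norm(beta)  # signed distances
margin = np.min(y * d)                          # distance of the closest point
print(d, margin)
```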
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie, T., Tibshirani, R., Friedman, J. (2008). The Elements of Statistical Learning: 129-130<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
We now derive the Support Vector Machine for the case where the two classes are separable in the given feature space. The margin can be written as <math>\,min\{y_id_i\}</math>, in terms of the distance of each point from the hyperplane, where <math>\,d_i</math> is the distance and <math>\,y_i</math> supplies the sign.<br />
===Margin Maximizing Problem for the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, but <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane.<br />
<br />
This implies <math>\,\exists C>0</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> compose a hyperplane that can be written with different values, but we only care about the direction, and dividing through by a constant does not change the direction of the hyperplane. Thus, by assuming scaled values for <math>\,\beta, \beta_0</math> we eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>. This implies that the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math>.<br />
<br />
Now, in order to maximize the margin, we need to maximize <math>\,\frac{1}{\|\beta\|}</math>, i.e. minimize <math>\,\|\beta\|</math>. <br />
<br />
In other words, our optimization problem is to minimize <math>\,\|\beta\|</math>, under the constraint that <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. There are many possible choices of norm, in general the [http://en.wikipedia.org/wiki/P-norm#p-norm p-norm]. The 1-norm of a vector is simply the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance); it tends to produce sparser solutions but has a discontinuity in its derivative. The 2-norm, or Euclidean norm (the intuitive measure of the length of a vector), is easier to work with: <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>, where the constant 1/2 has been added for simplification; minimizing this function is the same as minimizing the norm itself.<br />
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the constraints are satisfied, as well as to find an optimal solution (the optimal saddle point of the Lagrangian for the classic quadratic optimization). The problem will be solved in the dual space by introducing the dual variables <math>\,\alpha_i</math>, in contrast to solving the problem in the primal space as a function of the betas. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left(y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right)}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is the dual representation of the maximum-margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \,\forall\, i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the equation that we need to optimize, together with two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are looking to solve for <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the problem in matrix form, as an equivalent minimization suitable for a QP solver:<br />
<br />
<math>\min_{\underline{\alpha}}\; \frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{\alpha}^T\underline{1}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> has entries <math>S_{ij} = y_iy_jx_i^Tx_j</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's built in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use <code>quadprog</code> to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's <code>quadprog</code> function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. Note that <code>quadprog</code> ''minimizes'' its objective, while we want to ''maximize'' <math>\,L(\alpha)</math>; maximizing <math>\underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math> is equivalent to minimizing <math>\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{\alpha}^T\underline{1}</math>, so we call <code>quadprog</code> with <math>\,H = S</math> and <math>\,f = -\underline{1}</math>.<br />
<br />
We'll use a simple one-dimensional data set: 100 points near -1 labelled -1 and 100 points near +1 labelled +1, with a little Gaussian noise. (Note: you could put the values straight into the <code>quadprog</code> call; they are separated for clarity.)<br />
<br />
 x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]'; % 1-by-200 row of inputs<br />
 y = [-ones(100,1); ones(100,1)];                        % 200-by-1 column of labels<br />
 z = x .* y';            % z(i) = y_i * x_i<br />
 S = z' * z;             % S(i,j) = y_i y_j x_i' x_j<br />
 f = -ones(200,1);       % negated linear term, since quadprog minimizes<br />
 Aeq = y';               % equality constraint alpha' * y = 0<br />
 beq = 0;<br />
 lb = zeros(200,1);      % alpha_i >= 0, one bound per component<br />
 ub = [];                % no upper bound<br />
 alpha = quadprog(S,f,[],[],Aeq,beq,lb,ub);<br />
<br />
This gives us the optimal <math>\,\alpha</math>. Most entries will be numerically zero; the strictly positive entries correspond to the support vectors.<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{x}</math>. If <math>\,f</math> and <math>\,g</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial \hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum_i{\alpha_ig_i'(\hat{x})}=0</math> (Stationarity)<br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> cannot be a local minimum.<br />
<br />
Conditions 1, 2 and 4 are straightforward here; the interesting one is condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
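To make the conditions concrete, here is a minimal Python check on a toy problem (not the SVM itself): minimize <math>\,f(x)=x^2</math> subject to <math>\,g(x)=x-1 \geq 0</math>, whose solution is <math>\hat{x}=1</math> with multiplier <math>\,\alpha=2</math>:<br />

```python
# Toy problem: f(x) = x^2, g(x) = x - 1 >= 0; optimum x_hat = 1, alpha = 2.
x_hat, alpha = 1.0, 2.0

f_prime = 2 * x_hat   # f'(x) = 2x
g_prime = 1.0         # g'(x) = 1
g_val = x_hat - 1.0   # g(x_hat)

assert f_prime - alpha * g_prime == 0   # 1. stationarity
assert alpha >= 0                       # 2. dual feasibility
assert alpha * g_val == 0               # 3. complementary slackness
assert g_val >= 0                       # 4. primal feasibility
```

Here the constraint is active (<math>\,g(\hat{x})=0</math>), so a strictly positive multiplier is allowed; at an inactive constraint, complementary slackness would force the multiplier to zero.<br />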
<br />
===Support Vectors===<br />
<br />
The support vectors are the training points that actually determine the optimal separating hyperplane we are looking for. They are the points closest to the boundary, and hence the most difficult to classify and the most informative for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br />In order for this condition to be satisfied, either<br /><math>\,\alpha_i= 0</math> or<br /><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
Every training point <math>\,x_i</math> satisfies <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1</math>: it lies either exactly on the margin (equality) or strictly beyond it.<br />
<br />
'''Case 1: a point off the margin, <math>\displaystyle y_i(\beta^Tx_i+\beta_0) > 1</math>'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point on the margin, <math>\displaystyle y_i(\beta^Tx_i+\beta_0) = 1</math>'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
Points on the margin, with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
<br />
===Using support vectors===<br />
<br />
Support vectors are important because the solution depends only on them: if <math>\,\alpha_i = 0</math>, then point <math>\,x_i</math> contributes nothing to the solution of the SVM problem; only points on the margin -- the support vectors -- contribute. Hence the model given by SVM is entirely defined by the set of support vectors, a subset of the training set. This is interesting because in earlier methods such as neural networks (and classical statistical learning more generally), the configuration of the model had to be specified in advance; here we have a data-driven, 'nonparametric' model in which the training set and the algorithm together determine the support vectors.<br />
<br />
References:<br />
Wang, L. (2005). ''Support Vector Machines: Theory and Applications''. Springer, p. 3.<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math> (equivalently, minimize <math>\,-L(\alpha)</math>)<br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
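As a hand-checkable illustration of steps 2 and 3, here is a Python sketch on a two-point toy set (not the course's Matlab). With <math>\,x_1=-1, y_1=-1</math> and <math>\,x_2=+1, y_2=+1</math>, the constraint <math>\sum\alpha_iy_i=0</math> forces <math>\,\alpha_1=\alpha_2=\alpha</math>, and the dual reduces to <math>\,L(\alpha)=2\alpha-2\alpha^2</math>, maximized at <math>\,\alpha=1/2</math>:<br />

```python
# Toy training set in 1-D: x1 = -1 with y1 = -1, x2 = +1 with y2 = +1.
x = [-1.0, 1.0]
y = [-1, 1]
alpha = [0.5, 0.5]   # maximizer of L(alpha) = 2a - 2a^2, solved by hand

# Step 2: beta = sum alpha_i y_i x_i
beta = sum(a * yi * xi for a, yi, xi in zip(alpha, y, x))

# Step 3: pick a support vector (alpha_i > 0) and solve y_i(beta*x_i + beta0) = 1;
# since y_i is +/-1, this gives beta0 = y_i - beta*x_i.
beta0 = y[0] - beta * x[0]

assert beta == 1.0 and beta0 == 0.0
# Both points lie exactly on the margin:
assert all(yi * (beta * xi + beta0) == 1 for yi, xi in zip(y, x))
```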
<br />
===Example in Matlab===<br />
<br />
The following code, taken from the lecture, shows how to use Matlab's built-in SVM routines (found in the Bioinformatics Toolbox) to do classification with support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]<br />
=='''Non-linear hypersurfaces and Non-Separable classes - November 20, 2009'''==<br />
==='''Kernel Trick'''===<br />
We talked about the curse of dimensionality at the beginning of this course; now we turn to the power of high dimensions, using them to find a separating hyperplane between two classes of data points. To understand this, imagine a two-dimensional prison constraining a two-dimensional person. If we magically give the person a third dimension, he can escape: the prison and the person are now linearly separable with respect to the third dimension. The intuition behind the "kernel trick" is to map the data to a higher-dimensional space in which the classes become linearly separable by a hyperplane.<br />
<br />
We have seen SVM as a linear classification problem: finding the max-margin hyperplane in the given input space. For many real-world problems, however, a more complex decision boundary is required. The following simple method was devised in order to solve the same linear classification problem, but in a (usually higher-dimensional) 'feature space' in which the max-margin hyperplane is better suited to the data.<br />
<br />
Let <math>\,\phi</math> be a mapping,<br />
<br />
<math>\phi:\Re^d \rightarrow \Re^D </math><br /><br /><br />
<br />
We wish to find a <math>\,\phi</math> such that the transformed data are well suited for separation by a hyperplane. Given this function, we are led to solving the previous constrained quadratic optimization on the transformed dataset,<br /><br /><br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br /><br /><br />
<br />
The solution to this optimization problem is now well known; however, a workable <math>\,\phi</math> must be determined. Possibly the largest drawback of this method is that we must compute inner products of vectors in the high-dimensional space. As the dimension of the feature space grows, these inner products become computationally intensive or impossible to evaluate directly.<br />
<br />
However, we have a very useful result that says that there exists a class of functions, <math>\,\Phi</math>, which satisfy the above requirements and that for any function <math>\,\phi \in \Phi</math>,<br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = K(x_i,x_j) </math><br /><br /><br />
<br />
Where <math>\,K</math> is a kernel function in the input space satisfying [http://en.wikipedia.org/wiki/Mercer%27s_condition Mercer's condition] (which guarantees that it indeed corresponds to some mapping function <math>\,\phi</math>). As a result, if the objective function depends only on inner products and not on coordinates, we can use the kernel function to compute implicitly in the feature space without ever forming the high-dimensional representation. Not only does this solve the computational problem, it also frees us from having to determine a specific mapping function explicitly. In fact, it is now possible to use an infinite-dimensional feature space in SVM without even knowing the function <math>\,\phi</math>.<br />
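For instance, the degree-2 polynomial kernel <math>\,k(x,y)=(x \cdot y)^2</math> on <math>\Re^2</math> corresponds to the explicit feature map <math>\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)</math>; a short Python check (with illustrative points) confirms that the kernel computes the feature-space inner product without ever forming <math>\,\phi</math>:<br />

```python
import math

def kernel(u, v):
    # degree-2 polynomial kernel: (u . v)^2, evaluated in the input space R^2
    return (u[0] * v[0] + u[1] * v[1]) ** 2

def phi(u):
    # one explicit feature map into R^3 realizing this kernel
    return (u[0] ** 2, math.sqrt(2) * u[0] * u[1], u[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
assert abs(kernel(x, y) - dot(phi(x), phi(y))) < 1e-12
```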
<br />
==='''Mercer's Theorem in detail'''===<br />
Let <math>\,\phi</math> be a mapping to a high dimensional [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] <math>\,H</math><br /><br />
<br />
<br />
<math>\phi:x \in \Re^d \rightarrow H </math><br /><br /><br />
<br />
The transformed coordinates can be defined as,<br /><br />
<br />
<math>\phi_1(x)\dots\phi_d(x)\dots </math><br /><br /><br />
<br />
By Hilbert - Schmidt theory we can represent an inner product in Hilbert space as,<br /><br /><br />
<br />
<math>\,\phi(x_i)^T\phi(x_j) = \sum_{r=1}^{\infty}a_r\phi_r(x_i)\phi_r(x_j) = K(x_i,x_j), \ a_r \ge 0 </math><br /><br /><br />
where <math>\,K</math> is symmetric. Mercer's theorem gives necessary and sufficient conditions on <math>\,K</math> for it to satisfy the above relation.<br><br><br />
<br />
'''Mercer's Theorem'''<br />
<br />
Let C be a compact subset of <math>\Re^d</math> and K a function <math> \in L^2(C) </math>, if<br /><br /><br />
<br />
<math>\, \int_C\int_C K(u,v)g(u)g(v)dudv \ge 0, \ \forall g \in L^2(C)</math> <br /><br /><br />
<br />
then,<br /><br /><br />
<br />
<math>\sum_{r=1}^{\infty}a_r\phi_r(u)\phi_r(v)</math> converges absolutely and uniformly to a symmetric function <math>\,K(u,v)</math><br />
<br />
References:<br />
Vapnik, V. (1998). ''Statistical Learning Theory''. John Wiley & Sons, p. 423.<br />
<br />
==='''Kernel Functions'''===<br />
There are various kernel functions, for example:<br />
<br />
* Linear kernel: <math>\,k(x,y)=x \cdot y</math><br />
* Polynomial kernel: <math>\,k(x,y)=(x \cdot y)^d</math><br />
* Gaussian kernel: <math>\,k(x,y)=e^{-\frac{\|x-y\|^2}{2\sigma^2}}</math><br />
<br />
If <math>\,X</math> is a <math>\,d \times n</math> matrix in the original space, and <math>\,\phi(X)</math> is a <math>\,D \times n</math> matrix in the [http://en.wikipedia.org/wiki/Hilbert_space Hilbert space] (good explanation video: [http://www.youtube.com/watch?v=V2pBdH7YzX0 part 1] [http://www.youtube.com/watch?v=YRY5xlk3TC0 part 2]), then <math>\,\phi^T(X) \cdot \phi(X)</math> is an <math>\,n \times n</math> matrix. <br />
The inner product can be interpreted as a measure of similarity between data points. This gives some insight into how to choose a kernel: the choice depends on prior knowledge of the problem and on how we believe the similarity of our data should be measured. In practice, the Gaussian (RBF) kernel usually works well. Besides the common kernel functions above, many specialized kernels have been proposed for particular problem domains such as text classification and gene classification.<br />
<br />
These kernel functions can be applied to many algorithms to derive the "kernel version". For example, kernel PCA, kernel LDA, etc..<br />
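To see the "similarity" interpretation concretely, a small Python sketch (with arbitrary sample points) builds a Gaussian kernel matrix: it is symmetric, has ones on the diagonal, and assigns larger values to nearby points:<br />

```python
import math

def rbf(a, b, sigma=1.0):
    # Gaussian (RBF) kernel on scalars: exp(-|a - b|^2 / (2 sigma^2))
    return math.exp(-((a - b) ** 2) / (2 * sigma ** 2))

pts = [0.0, 1.0, 3.0]
K = [[rbf(a, b) for b in pts] for a in pts]

assert all(K[i][j] == K[j][i] for i in range(3) for j in range(3))  # symmetric
assert all(K[i][i] == 1.0 for i in range(3))                        # unit diagonal
assert K[0][1] > K[0][2]   # closer points are "more similar"
```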
<br />
==='''SVM: non-separable case'''===<br />
We have seen how SVMs find an optimal separating hyperplane between two separable classes of data, in which case the margin contains no data points. In the real world, however, data from different classes are usually mixed together at the boundary, and it is often impossible to find a hyperplane that separates them perfectly. To address this, we relax the constraints and allow data points to cross the margin, introducing a slack variable <math>\,\xi_i</math> for each point. Mathematically the problem becomes,<br />
:<math>\min_{\beta, \beta_0} \frac{1}{2}|\beta|^2</math><br />
:<math>\,y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math><br />
:<math>\xi_i \geq 0</math><br />
<br />
Now each data point can have some error <math>\,\xi_i</math>. However, we only want data to cross the boundary when they have to and make the minimum sacrifice; thus, a penalty term is added correspondingly in the objective function to constrain the number of points that cross the margin. The optimization problem now becomes:<br />
<br />
:<math>\min_{\beta, \beta_0, \xi} \frac{1}{2}|\beta|^2+\gamma\sum_{i=1}^n{\xi_i}</math><br />
:<math>\,s.t.</math> <math>y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i</math> <br />
:<math>\xi_i \geq 0</math><br />
<br />
[[File:non-separable.JPG|350px|thumb|right|Figure non-separable case]]<br />
<br />
Note that <math>\,\xi_i</math> is not necessarily smaller than one, which means data can not only enter the margin but can also cross the separating hyperplane.<br />
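Since the objective charges <math>\,\gamma</math> per unit of slack, at the optimum each <math>\,\xi_i</math> takes its smallest feasible value, <math>\,\xi_i = \max(0,\ 1 - y_i(\beta^Tx_i+\beta_0))</math>. A minimal Python sketch (with an arbitrary fixed 1-D hyperplane, <math>\,\beta=1, \beta_0=0</math>) shows the three regimes:<br />

```python
beta, beta0 = 1.0, 0.0   # illustrative fixed hyperplane in 1-D

def slack(x, y):
    # xi = max(0, 1 - y * (beta * x + beta0))
    return max(0.0, 1.0 - y * (beta * x + beta0))

assert slack(2.0, 1) == 0.0    # outside the margin: no slack needed
assert slack(0.5, 1) == 0.5    # inside the margin, correct side: 0 < xi < 1
assert slack(-0.5, 1) == 1.5   # wrong side of the hyperplane: xi > 1
```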
<br />
References:<br />
<br />
Mercer, J. (1909). Functions of positive and negative type and their connection<br />
with the theory of integral equations. Philos. Trans. Roy. Soc. London, A<br />
209: 415-446.<br />
<br />
==Support Vector Machine algorithm for non-separable cases - November 23, 2009==<br />
<br />
With the optimization problem above, we can form the Lagrangian, apply the KKT conditions, and arrive at the dual problem to optimize. As we will see, the objective for the non-separable case is the same as in the separable case, with slightly different constraints.<br />
<br />
===Forming the Lagrangian===<br />
<br />
We introduce multipliers <math>\,\alpha_i \geq 0</math> for the constraints <math>\,y_i(\beta^Tx_i+\beta_0) - 1 + \xi_i \geq 0</math> and <math>\,\lambda_i \geq 0</math> for the constraints <math>\,\xi_i \geq 0</math>, giving the Lagrangian<br />
:<math>\,L(\beta, \beta_0, \underline{\xi}, \underline{\alpha}, \underline{\lambda}) = \frac{1}{2}|\beta|^2 + \gamma\sum_{i=1}^n{\xi_i} - \sum_{i=1}^n{\alpha_i[y_i(\beta^Tx_i+\beta_0)-1+\xi_i]} - \sum_{i=1}^n{\lambda_i\xi_i}</math><br />
<br />
===Applying KKT conditions===<br />
<br />
Setting the derivatives of the Lagrangian with respect to the primal variables to zero gives<br />
:<math>\frac{\partial L}{\partial \beta} = 0 \Rightarrow \beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math><br />
:<math>\frac{\partial L}{\partial \beta_0} = 0 \Rightarrow \sum_{i=1}^n{\alpha_iy_i} = 0</math><br />
:<math>\frac{\partial L}{\partial \xi_i} = 0 \Rightarrow \gamma = \alpha_i + \lambda_i</math><br />
Complementary slackness adds <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1+\xi_i] = 0</math> and <math>\,\lambda_i\xi_i = 0</math>. Since <math>\,\lambda_i \geq 0</math>, the condition <math>\,\gamma = \alpha_i + \lambda_i</math> implies <math>\,0 \leq \alpha_i \leq \gamma</math>. Substituting <math>\,\beta</math> back into the Lagrangian yields the same dual objective <math>\,L(\alpha)</math> as in the separable case, now subject to this box constraint.<br />
<br />
===Putting it all together===<br />
<br />
With our KKT conditions and the Lagrangian equation, we can now use quadratic programming to find <math>\,\alpha</math>.<br />
<br />
In matrix form, we want to solve the following optimization:<br />
:<math>\max_{\underline{\alpha}}\; L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:<math>\,s.t.</math> <math>\underline{0} \leq \underline{\alpha} \leq \gamma\underline{1}</math>, <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Solving this gives us <math>\,\underline{\alpha}</math>, which we can use to find <math>\,\underline{\beta}</math> as before:<br />
:<math>\,\underline{\beta} = \sum{\alpha_i y_i \underline{x_i}}</math><br />
<br />
However, we cannot find <math>\,\beta_0</math> in the same way as before, even if we choose a point with <math>\,\alpha_i > 0</math>, because we do not know the value of <math>\,\xi_i</math> in the equation<br />
:<math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 + \xi_i = 0</math><br />
<br />
From our discussion on the KKT conditions, we know that <math>\,\lambda_i \xi_i = 0</math> and <math>\,\gamma = \alpha_i + \lambda_i</math>.<br />
<br />
So, if <math>\,\alpha_i < \gamma</math> then <math>\,\lambda_i > 0</math> and consequently <math>\,\xi_i = 0</math>.<br />
<br />
Therefore, we can solve for <math>\,\beta_0</math> if we choose a point where:<br />
:<math>\,0 < \alpha_i < \gamma</math><br />
<br />
====The SVM algorithm for non-separable data sets====<br />
<br />
The algorithm, then, for non-separable data sets is:<br />
<br />
# Use <code>quadprog</code> (or another quadratic programming technique) to solve the above optimization and find <math>\,\alpha</math><br />
# Find <math>\,\underline{\beta}</math> by solving <math>\,\underline{\beta} = \sum{\alpha_i y_i x_i}</math><br />
# Find <math>\,\beta_0</math> by choosing a point where <math>\,0 < \alpha_i < \gamma</math> and then solving <math>\,y_i(\underline{\beta}^Tx_i + \beta_0) - 1 = 0</math></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=5455stat8412009-11-21T20:06:26Z<p>Ipargaru: /* Maximum Margin Classifiers */</p>
<hr />
<div><br />
<br />
<br />
Thus, given a new input <math>\,X \in \mathcal{X}</math>, the classification rule lets us predict a corresponding label <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits as apples or oranges by considering certain features of the fruit, for instance, colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented with its features <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, the rule <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:The ''''true error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:The ''''empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator such that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
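The empirical error rate is straightforward to compute; a small Python sketch (with a made-up rule and data, for illustration only):<br />

```python
def empirical_error(h, data):
    # fraction of training pairs (x, y) that the rule h misclassifies
    return sum(1 for x, y in data if h(x) != y) / len(data)

h = lambda x: 1 if x > 0 else 0                         # illustrative rule
data = [(-2, 0), (-1, 0), (0.5, 1), (3, 1), (-0.5, 1)]  # last point misclassified
assert empirical_error(h, data) == 0.2                  # 1 error out of 5 points
```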
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the probability <math>\,r(X)=P\{Y=1|X=x\}</math>. Given <math>\,X=x</math>, by ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> (\hat r) </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimension cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remark:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since the prior probability usually cannot be specified in realistic settings.<br />
<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
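The arithmetic of this example can be reproduced in a few lines of Python (the likelihood values 0.05 and 0.20 are back-solved from the numerator 0.025 and denominator 0.125 above, since both priors are 0.5; they stand in for the class conditional probabilities read off the table):<br />

```python
def posterior(lik1, lik0, prior1=0.5):
    # r(X) = P(Y=1|X=x) by Bayes formula, two-class case
    prior0 = 1 - prior1
    return lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

r = posterior(lik1=0.05, lik0=0.20)
assert abs(r - 0.2) < 1e-12   # r(X) < 1/2, so the student is classified as "fail"
```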
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN), and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as having an objective existence. In fact, they represent two different schools of statistics.<br />
<br />
During the history of statistics, there have been two major schools of thought: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data are a repeatable random sample (there is a frequency).<br />
#Parameters are fixed and unknown constant.<br />
#Not applicable to a single event. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g. a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In Bayesian method, at first, one can see this man (object), and then judge whether his name is Jack (label). On the other hand, in Frequentist method, one doesn’t see the man (object), but can see the photos (label) of this man to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Though it represents the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2 Regression: Find an estimate <math> (\hat r) </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density Estimation: Estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of this third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the priors and conditional densities of most data are not known, so we must estimate them if we want to classify new data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each class has the same covariance matrix <math>\,\Sigma</math>, taken to be the mean of the <math>\Sigma_k</math> over all <math>\,k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, the decision boundary is always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the priors of the two classes are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
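As a quick check of the algebra above, here is a minimal NumPy sketch (not part of the original notes; the means, covariance and priors are hypothetical) that evaluates the boundary expression and confirms that, with equal priors, the midpoint between the two means lies exactly on the boundary, and that the expression is indeed linear (affine) in <math>x</math>.<br />

```python
import numpy as np

# Hypothetical two-class setup: shared covariance, equal priors.
mu_k = np.array([0.0, 0.0])
mu_l = np.array([4.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
pi_k = pi_l = 0.5

Sigma_inv = np.linalg.inv(Sigma)

def boundary(x):
    # log(pi_k/pi_l) - 1/2 (mu_k' S^-1 mu_k - mu_l' S^-1 mu_l
    #                       - 2 x' S^-1 (mu_k - mu_l))
    return (np.log(pi_k / pi_l)
            - 0.5 * (mu_k @ Sigma_inv @ mu_k
                     - mu_l @ Sigma_inv @ mu_l
                     - 2 * x @ Sigma_inv @ (mu_k - mu_l)))

# With equal priors the midpoint of the means lies on the boundary.
mid = (mu_k + mu_l) / 2
print(abs(boundary(mid)) < 1e-12)
```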
<br />
===QDA===<br />
The concept is the same: find the boundary where the error rates for classification between the classes are equal, except that the assumption that each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the <math>\Sigma_k \forall k</math>, is removed.<br />
<br />
<br />
We continue from the point where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because the terms <math>x^\top\Sigma_k^{-1}x</math> and <math>x^\top\Sigma_l^{-1}x</math> no longer cancel when <math>\Sigma_k \ne \Sigma_l</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
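The theorem translates directly into code. The following Python/NumPy sketch (hypothetical parameters, not the course data) implements both discriminant functions and applies the arg max rule; with a shared covariance the quadratic and linear rules pick the same class.<br />

```python
import numpy as np

def delta_qda(x, mu, Sigma, pi):
    """Quadratic discriminant score from the theorem."""
    diff = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * diff @ np.linalg.inv(Sigma) @ diff
            + np.log(pi))

def delta_lda(x, mu, Sigma_inv, pi):
    """Linear discriminant score under a shared covariance."""
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

# Hypothetical parameters for two classes.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = np.eye(2)
pis = [0.5, 0.5]
Sigma_inv = np.linalg.inv(Sigma)

x = np.array([2.5, 2.9])
h_qda = np.argmax([delta_qda(x, m, Sigma, p) for m, p in zip(mus, pis)])
h_lda = np.argmax([delta_lda(x, m, Sigma_inv, p) for m, p in zip(mus, pis)])
print(h_qda, h_lda)  # both rules agree when the covariances are equal
```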
<br />
===In practice===<br />
In practice, the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we use their sample estimates in their place, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
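The estimates above can be sketched in a few lines of NumPy. The data below are made up (two hypothetical Gaussian classes); the point is only to show the sample prior, class means, per-class ML covariances, and the pooled covariance used by LDA.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical labelled sample: 60 points from class 1, 40 from class 2.
X = np.vstack([rng.normal([0, 0], 1.0, size=(60, 2)),
               rng.normal([3, 1], 1.0, size=(40, 2))])
y = np.array([1] * 60 + [2] * 40)

n = len(y)
pi_hat, mu_hat, Sigma_hat = {}, {}, {}
for k in (1, 2):
    Xk = X[y == k]
    nk = len(Xk)
    pi_hat[k] = nk / n                        # pi_k = n_k / n
    mu_hat[k] = Xk.mean(axis=0)               # class mean
    centred = Xk - mu_hat[k]
    Sigma_hat[k] = centred.T @ centred / nk   # ML covariance (divide by n_k)

# Pooled covariance for LDA: weighted average of the class covariances.
Sigma_pooled = sum(len(X[y == k]) * Sigma_hat[k] for k in (1, 2)) / n
print(pi_hat[1], pi_hat[2])
```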
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
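Case 1 amounts to a nearest-centre rule with a prior correction. A minimal sketch (class centres and priors are hypothetical):<br />

```python
import numpy as np

# Hypothetical class centres and priors.
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
pis = np.array([0.5, 0.25, 0.25])

def classify_identity_cov(x):
    # delta_k = -1/2 ||x - mu_k||^2 + log(pi_k)   (the log|I| term vanishes)
    deltas = -0.5 * np.sum((mus - x) ** 2, axis=1) + np.log(pis)
    return np.argmax(deltas)

print(classify_identity_cov(np.array([3.0, 0.5])))  # nearest centre wins: class 1
```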
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>;<br />
so if <math>\, X</math> is symmetric, we have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
which is the squared Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise, we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for classification to be applicable using this method, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes, and we transform each to the same shape. Given a data point, which transformation do we apply before deciding which class the point belongs to? If we use the transformation of class A, we have already assumed that the point belongs to class A.<br />
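The transformation <math>x^* \leftarrow S^{-\frac{1}{2}}U^\top x</math> can be sketched in a few lines of NumPy (the covariance below is hypothetical). Here the decomposition uses <code>eigh</code>, which for a symmetric positive-definite matrix gives the same <math>U</math> and <math>S</math> as the SVD; the check confirms that the Mahalanobis distance in the original space equals the squared Euclidean distance after the transformation.<br />

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])   # hypothetical shared covariance
mu = np.array([1.0, -1.0])
x = np.array([2.0, 0.5])

# Sigma = U S U^T (eigendecomposition; Sigma is symmetric)
S, U = np.linalg.eigh(Sigma)
W = np.diag(S ** -0.5) @ U.T     # the map x -> S^{-1/2} U^T x

# Mahalanobis distance in the original space ...
d_mahal = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
# ... equals the squared Euclidean distance after the transformation.
d_eucl = np.sum((W @ x - W @ mu) ** 2)
print(np.isclose(d_mahal, d_eucl))
```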
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class against each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
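The two counts above are easy to tabulate. A small Python sketch (using only the formulas just derived) shows how quickly QDA's parameter count grows with the dimension <math>d</math>:<br />

```python
# Parameter counts from the text, as functions of K classes and dimension d.
def lda_params(K, d):
    return (K - 1) * (d + 1)

def qda_params(K, d):
    return (K - 1) * (d * (d + 3) // 2 + 1)

for d in (2, 10, 64):
    print(d, lda_params(2, d), qda_params(2, d))
```

For two classes in 64 dimensions (the raw 2_3 data), LDA needs 65 parameters while QDA needs 2145.<br />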
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct for only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file gives the full details of this function; here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off the column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code> (possibly up to a sign flip of some columns, since singular vectors are only defined up to sign).<br />
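The same equivalence can be sketched outside Matlab. The NumPy version below (random hypothetical data, not the 2_3 set) follows the princomp recipe — centre the columns, take the SVD of the scaled data — and checks that the resulting <code>latent</code> values really are the eigenvalues of the sample covariance matrix:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))        # rows = observations, columns = variables

# princomp-style: centre the columns, then SVD of the scaled, centred data.
Xc = X - X.mean(axis=0)
n = X.shape[0]
U, d, Vt = np.linalg.svd(Xc / np.sqrt(n - 1), full_matrices=False)
pc = Vt.T                             # coefficients (the V of X = U d V')
score = Xc @ pc                       # data in the principal-component space
latent = d ** 2                       # eigenvalues of the covariance of X

# The "latent" values match the eigenvalues of the sample covariance.
evals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
print(np.allclose(latent, evals))
```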
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that are non-linear functions, such as squares, of our original data. We then do LDA on our new higher-dimensional data. The linear boundary LDA finds there, viewed back in the original lower-dimensional space, is a quadratic boundary.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus a second <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^{d}</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> (where <math>v</math> is a diagonal matrix) that we cannot estimate with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> in any dimension; we could extend a <math>d \times n</math> matrix to a quadratic dimension by appending another <math>d \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
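A tiny 1-D illustration of the trick (all data and weights below are made up, and the weights are hand-picked rather than fitted by LDA): one class sits in the middle of the line and the other on both sides, so no single threshold on <math>x</math> separates them; but after appending <math>x^2</math>, the linear rule <math>x^2 - 2 > 0</math> separates them perfectly, and in the original space it is the quadratic boundary <math>x = \pm\sqrt{2}</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 1-D data that is NOT linearly separable:
# class 1 in the middle, class 2 on both sides.
x1 = rng.uniform(-1, 1, 50)                                            # class 1
x2 = np.concatenate([rng.uniform(-3, -2, 25), rng.uniform(2, 3, 25)])  # class 2

def predict(x):
    # Hypothetical weights; a linear rule in (x, x^2) space.
    w_star = np.array([0.0, 1.0])
    b = -2.0
    x_star = np.array([x, x ** 2])   # augment x with x^2
    return 2 if w_star @ x_star + b > 0 else 1

print(all(predict(v) == 1 for v in x1), all(predict(v) == 2 for v in x2))
```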
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA contrasts with that of our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS) # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively, the points of each class form a cloud around the class mean, and each class may have a different size. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities gives<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, and so it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
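The closed-form direction can be checked numerically. The lecture's examples use Matlab; the following is an equivalent NumPy sketch (the toy data, seed, and variable names are hypothetical, chosen only for illustration):<br />

```python
import numpy as np

# Hypothetical two-class toy data (illustrative only, not from the lecture).
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[1.0, 1.0], size=(300, 2))
X2 = rng.normal(loc=[5.0, 3.0], size=(300, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)  # within class covariance S_W

# The FDA direction is proportional to S_W^{-1} (mu1 - mu2).
w = np.linalg.solve(Sw, mu1 - mu2)
w /= np.linalg.norm(w)

# Distance between the projected class means along w.
sep = abs(w @ mu1 - w @ mu2)
```

Projecting onto this direction separates the two projected class means well, which is exactly the goal of the derivation above.<br />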
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of a "2" or a "3". Here X1 holds all the "2"s and X2 all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(It is more reasonable to have at least 2 directions)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (The class scatter matrices are left unnormalized here so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain directly. One simplification<br />
is to note that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
fixed; since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
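The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically. A NumPy sketch (the toy data and seed are hypothetical; the scatter matrices are the unnormalized sums used in the derivation above):<br />

```python
import numpy as np

# Hypothetical three-class toy data (illustrative only).
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = rng.normal(size=(90, 2)) + np.repeat(means, 30, axis=0)
y = np.repeat([0, 1, 2], 30)

mu = X.mean(axis=0)
St = (X - mu).T @ (X - mu)                      # total scatter S_T

Sw = np.zeros((2, 2))
Sb = np.zeros((2, 2))
for k in range(3):
    Xk = X[y == k]
    mk = Xk.mean(axis=0)
    Sw += (Xk - mk).T @ (Xk - mk)               # within class scatter S_W
    Sb += len(Xk) * np.outer(mk - mu, mk - mu)  # between class scatter S_B
```

The identity St = Sw + Sb holds exactly, since it is purely algebraic.<br />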
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Evidently, the two expressions are very similar, differing only in the class-size weights <math>n_{i}</math>.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this optimization problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function:<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\leq k-1</math>.<br />
<br />
Therefore, the solution is the same as in the two class case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
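The multi-class procedure can be sketched end to end. A NumPy sketch (the data, dimensions, and seed are hypothetical; the scatter matrices are the unnormalized sums from the derivation above):<br />

```python
import numpy as np

# Hypothetical k=3 class data in d=4 dimensions (illustrative only).
rng = np.random.default_rng(2)
k, d, n_per = 3, 4, 50
means = np.zeros((k, d))
means[1, 0] = 5.0
means[2, 1] = 5.0
X = rng.normal(size=(k * n_per, d)) + np.repeat(means, n_per, axis=0)
y = np.repeat(np.arange(k), n_per)

mu = X.mean(axis=0)
Sw = np.zeros((d, d))
Sb = np.zeros((d, d))
for c in range(k):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)               # within class scatter
    Sb += len(Xc) * np.outer(mc - mu, mc - mu)  # between class scatter

# Columns of W are the eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(evals.real)[::-1]
W = evecs[:, order[:k - 1]].real   # d x (k-1) transformation matrix
Z = X @ W                          # projected data, one (k-1)-vector per point
```

Only <math>k-1</math> eigenvalues are (numerically) nonzero here, matching the rank argument above.<br />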
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
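The least squares solution and the hat matrix can be sketched as follows (a NumPy sketch with hypothetical data and seed; <math>\mathbf{H}</math> is symmetric and idempotent, which the test below relies on):<br />

```python
import numpy as np

# Hypothetical regression data (illustrative only).
rng = np.random.default_rng(3)
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # n x (d+1), 1s in first column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X^T X)^{-1} X^T y
H = X @ np.linalg.solve(X.T @ X, X.T)         # hat matrix H
y_hat = H @ y                                 # fitted values at the inputs
```

The hat matrix "puts the hat on" y: multiplying y by H yields the fitted values directly.<br />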
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab; the code and an explanation of each step follow.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample;ones(1,400)];<br />
Construct x by appending a row of ones to the data.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| The figure shows the classification of the data points in 2_3.m by the linear regression model.]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== logistic function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y\, dx = \ln(1 + e^{x})</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative x, which slows to linear growth of slope 1/4 near x = 0, then approaches y = 1 with an exponentially decaying gap.<br />
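Properties 1 and 2 can be checked numerically; a NumPy sketch (illustrative only):<br />

```python
import numpy as np

def logistic(x):
    """The logistic function y = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 1001)
y = logistic(x)

# Property 1: dy/dx = y(1 - y), checked against a finite-difference derivative.
dydx = np.gradient(y, x)
# Property 2: y(0) = 1/2.
y0 = logistic(0.0)
```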
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression fits a distribution to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood, using <math>\,P(Y|X)</math>. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{(exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T)(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; see the [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html Matrix Reference Manual], a useful reference on linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiating <math>\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math> with respect to <math>\underline{\beta}^T</math>.<br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
A weighted linear regression is then run on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
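The equivalence derived above can be checked numerically. Below is a minimal sketch in Python/NumPy (the notes otherwise use MATLAB); the data are randomly generated purely for illustration:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))        # d x n input matrix, as in the notes
y = rng.integers(0, 2, size=n)     # 0/1 labels
beta_old = rng.normal(size=d)

p = 1.0 / (1.0 + np.exp(-X.T @ beta_old))     # P(x_i; beta_old) for each i
W = np.diag(p * (1.0 - p))

# Newton-Raphson step: beta_old + (X W X^T)^{-1} X (y - p)
newton = beta_old + np.linalg.solve(X @ W @ X.T, X @ (y - p))

# Weighted least squares on the adjusted response z = X^T beta_old + W^{-1}(y - p)
z = X.T @ beta_old + np.linalg.solve(W, y - p)
wls = np.linalg.solve(X @ W @ X.T, X @ W @ z)
```

The two vectors agree to machine precision, confirming that one Newton-Raphson step is exactly one weighted least squares fit.<br />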
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we constructed the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case, though it does not guarantee convergence. The procedure will usually converge because the log-likelihood function is concave. When it does not, only local convergence of the method can be proven, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: it is uncommon for the starting point to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving will resolve the problem when it does arise. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
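The pseudo code above translates into a few lines of code. A Python/NumPy sketch (the notes otherwise use MATLAB; the function name and tolerance are our own choices):<br />

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=100):
    """Iteratively reweighted least squares for logistic regression.
    X is d x n (one column per observation); y is a length-n 0/1 vector."""
    d, n = X.shape
    beta = np.zeros(d)                            # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X.T @ beta))     # step 3: P(x_i; beta)
        w = p * (1.0 - p)                         # step 4: diagonal of W
        z = X.T @ beta + (y - p) / w              # step 5: adjusted response
        beta_new = np.linalg.solve((X * w) @ X.T, (X * w) @ z)  # step 6
        if np.max(np.abs(beta_new - beta)) < tol: # step 7: convergence check
            return beta_new
        beta = beta_new
    return beta
```

`X * w` scales each column of <math>X</math> by its weight, so `(X * w) @ X.T` is <math>XWX^T</math> without forming the <math>n\times n</math> matrix <math>W</math> explicitly.<br />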
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression we have so far considered only the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to lie between 0 and 1 and to sum to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to perform logistic regression to classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2\geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
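The posteriors and the classification rule above can be evaluated directly from the fitted coefficients. A Python sketch for illustration (the session above is in MATLAB; the helper names are our own):<br />

```python
import numpy as np

beta = np.array([0.1861, -5.5917, -3.0547])   # [intercept, X1, X2] from mnrfit above

def posteriors(x1, x2):
    """Return (P(Y=1|x), P(Y=2|x)) for a point in PCA coordinates."""
    a = beta @ np.array([1.0, x1, x2])
    p1 = np.exp(a) / (1.0 + np.exp(a))
    return p1, 1.0 - p1

def classify(x1, x2):
    """hat{Y} = 1 if 0.1861 - 5.5917*X1 - 3.0547*X2 >= 0, else 2."""
    return 1 if beta @ np.array([1.0, x1, x2]) >= 0 else 2
```
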
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is the decision boundary from logistic regression. The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2-class classification problem with a linear function bounded on the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the general case with K classes. The model is specified by K - 1 such log-odds terms; the class placed in the denominator (here the Kth) can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Viewing these equations as a weighted least squares problem makes the estimates easier to derive.<br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general the posteriors are no longer complements of each other, as they are in the 2-class problem, where we could write <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem is not as 'nice' as in the 2-class problem, since we do not have the same simplification.<br />
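The K-class posterior formulas above can be computed directly. A Python sketch (the function name and example coefficients are our own, for illustration only):<br />

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """Posteriors under the K-class logistic model.
    betas: the K-1 coefficient vectors beta_1 .. beta_{K-1}; class K is the
    reference class appearing in the denominator."""
    scores = np.array([b @ x for b in betas])   # beta_i^T x for i = 1..K-1
    denom = 1.0 + np.sum(np.exp(scores))
    p = np.exp(scores) / denom                  # P(Y=i|X=x), i = 1..K-1
    return np.append(p, 1.0 / denom)            # P(Y=K|X=x)
```

With a single coefficient vector (K = 2) this reduces exactly to the two-class logistic model of the previous lecture.<br />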
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
Separating hyperplane classifiers try to separate the data using linear decision boundaries. When the classes overlap, this approach can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged, transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>\operatorname{sign}(\underline{\beta}^T \underline{x} + \beta_{0}) = \operatorname{sign}(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math><br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods each determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem has no unique global minimum (it is not convex). The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between the classes. If the classes are separable, the algorithm can be shown to converge to some separating hyperplane; the proof of this convergence is known as the ''perceptron convergence theorem''. For overlapping classes, however, convergence cannot be guaranteed.<br />
<br />
<br />
If the separating hyperplane between the 2 classes is not unique, there will be infinitely many solutions obtainable from the perceptron algorithm.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1, but the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron gets all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
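The MATLAB loop above translates directly into Python/NumPy. A sketch of the same perceptron update on the same data with the same learning rate (one deviation, noted in the comment: points exactly on the boundary are treated as misclassified, so the loop cannot stall on a zero margin):<br />

```python
import numpy as np

y = np.array([1, 1, 1, -1, -1, -1])
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float).T
b0, b = 0.0, np.ones(3)   # initial guess, as in the MATLAB code
rho = 0.5                 # learning rate

for _ in range(100):                         # at most 100 passes over the data
    changed = False
    for i in range(X.shape[1]):
        if y[i] * (b @ X[:, i] + b0) <= 0:   # misclassified (boundary counts too)
            b += rho * y[i] * X[:, i]        # perceptron update
            b0 += rho * y[i]
            changed = True
    if not changed:                          # a full pass with no mistakes: done
        break
```

Since this data set is linearly separable, the perceptron convergence theorem guarantees the loop terminates with every point on the correct side of the boundary.<br />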
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by<math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\|\underline{\beta}\|=1</math>, which we may assume by rescaling; otherwise divide by <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, the algorithm converges quickly; if the gap is small, it converges slowly. This problem can be mitigated by using a basis expansion: we seek a hyperplane not in the original space but in the enlarged space obtained through some basis functions.<br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#A perfect separation is not always attainable, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is likely overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Consider yourself on a peak wanting to reach the ground as fast as possible. Which direction should you step? Intuitively it should be the direction in which the height decreases fastest, which is given by the negative of the gradient. However, if the mountain has a saddle shape and you start in the middle, you may end up at the saddle point (a stationary point that is not a minimum) and get stuck there.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we dropped the summation over <math>\,i</math> (all data points). This is an alternative to the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, where we approximate the true gradient by evaluating it on a single training example. This means <math>\,{\beta}</math> is improved using the computation of only one sample at a time. For a large data set, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent, we can process the problem sample by sample and still get decent results in practice.<br />
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network, but a neural network can take many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer that each represent the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1).<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
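The logistic activation has the convenient derivative <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>, which keeps the back-propagation updates cheap to compute. A quick numerical check (Python sketch; the grid of test points is arbitrary):<br />

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-4.0, 4.0, 9)
analytic = sigma(a) * (1.0 - sigma(a))               # sigma'(a) = sigma(a)(1 - sigma(a))
h = 1e-6
numeric = (sigma(a + h) - sigma(a - h)) / (2.0 * h)  # central finite difference
```
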
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates, meaning it has maximum and minimum output values. This ensures that the weights are bounded and therefore the search time is limited. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; RBF networks, for example, use non-monotonic activations. <br />
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> and his collaborators popularized an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output units, turning it into a classification problem.<br />
<br />
For simplicity, there is only 1 unit at the end and assume for the moment we are doing regression.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: the indices i and j are reversed in the figure relative to the text.)''<br />
<br />
Notice that the weighted sum on the output of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math> and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer, on which the error depends, so we need to sum over <math>i</math> in the chain rule:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math>, using the activation function <math>\,\sigma(\cdot)</math>.<br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the expected output and <math>\hat{y_k}</math> is the actual output produced by the network.<br />
#By propagating to the previous layers evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math> where <math>i</math> is associated to the previous layer.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
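The steps above can be sketched in code. The following is a minimal Python illustration (not from the lecture): a network with one hidden layer of sigmoid units and a linear output unit, trained by stochastic gradient descent on an illustrative one-dimensional regression task. The learning rate, number of hidden units, and target function are arbitrary choices.<br />

```python
import math
import random

random.seed(0)

def sigmoid(a):
    a = max(-60.0, min(60.0, a))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-a))

# Tiny network: 1 input -> H sigmoid hidden units (weights u) -> 1 linear output (weights w).
H, rho = 5, 0.05
u = [random.uniform(-1, 1) for _ in range(H)]  # hidden weights u_j
w = [random.uniform(-1, 1) for _ in range(H)]  # output weights w_j

# Illustrative training set: y = sin(x) on [-2, 2].
data = [(i / 10.0, math.sin(i / 10.0)) for i in range(-20, 21)]

for epoch in range(3000):
    for x, y in data:
        # Step 1: forward pass -- a_j = u_j x, z_j = sigma(a_j), y_hat = sum_j w_j z_j
        a = [uj * x for uj in u]
        z = [sigmoid(aj) for aj in a]
        y_hat = sum(wj * zj for wj, zj in zip(w, z))
        # Steps 2-4: back-propagate the error to get all derivatives
        d_out = -2.0 * (y - y_hat)  # d err / d y_hat
        dw = [d_out * z[j] for j in range(H)]  # d err / d w_j
        du = [d_out * w[j] * z[j] * (1.0 - z[j]) * x for j in range(H)]  # chain rule through sigma
        # Step 5: gradient descent step
        for j in range(H):
            w[j] -= rho * dw[j]
            u[j] -= rho * du[j]

mse = sum((y - sum(wj * sigmoid(uj * x) for wj, uj in zip(w, u))) ** 2
          for x, y in data) / len(data)
print("training MSE:", round(mse, 4))
```

The training MSE falls well below the error of the best constant predictor, showing that the propagated gradients are adjusting both layers of weights.<br />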
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method of choosing weights mentioned in class was to randomize the weights before the first step. This is not likely to be near the optimal solution, but is simple to implement. To be more specific, random values near zero (usually drawn from [-1,1]) are a good choice for the initial weights. In this case, the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output by using a linear approximation of <math>\,\sigma(a_i)</math>, which finds optimal weights in the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the iterations increase.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate may lead to a very slow convergence rate (a very long learning phase). However, the advantage of a small learning rate is that it can guarantee convergence. Thus, generally, it is better to choose a relatively small learning rate to ensure stability; usually <math>\,\rho</math> is chosen between 0.01 and 0.7.<br />
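As a sketch of the decreasing-rate idea for on-line learning, here is a schedule of the common form <math>\,\rho_t = \rho_0/(1+t/\tau)</math>; the constants <math>\,\rho_0</math> and <math>\,\tau</math> are illustrative, not values from the lecture.<br />

```python
# Illustrative decreasing schedule for on-line learning: rho_t = rho0 / (1 + t/tau).
rho0, tau = 0.5, 100.0

def learning_rate(t):
    return rho0 / (1.0 + t / tau)

# The rate starts at rho0 and decays toward zero as iterations accumulate.
rates = [learning_rate(t) for t in (0, 100, 1000)]
print(rates)
```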
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we will mainly discuss how to estimate the number of hidden units at the very beginning. Obviously, we should then adjust it to be more precise using CV, LOO or other complexity control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from a highly complicated mixture model, more hidden units are needed. <br />
<br />
Actually, the number of hidden units determines the size of the model, and therefore the total number of weights in the model. Generally speaking, the number of weights should not be larger than the number of training data points, say N. Thus N/10 is sometimes a good choice. However, in practice, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this Neural Network, the number of nodes is reduced, until we reach a layer whose number of nodes is the desired dimensionality. (The number of nodes need not be strictly decreasing in the first few layers, as long as the network eventually reaches a layer with fewer nodes.) From this bottleneck layer onward,<br />
the previous layers are mirrored, so the output layer has the same number of units as the input layer. Now note that if we feed the network with each point and get an output approximately equal to the fed input, then the same input is reconstructed at the output from the middle-layer units alone. So the output of the middle-layer units can represent the input with fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
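A minimal numerical sketch of this bottleneck idea (in Python, not from the lecture, using a purely linear network with a one-unit middle layer; all sizes and rates are illustrative): the network learns to reconstruct two-dimensional points that lie near a line, so the single code value captures most of the information.<br />

```python
import random

random.seed(1)

# Illustrative 2-D points lying near the line y = 2x; a 1-unit bottleneck should recover it.
data = [(i / 10.0, 2 * (i / 10.0) + random.gauss(0, 0.05)) for i in range(-10, 11)]

# Encoder weights (e1, e2) map (x, y) to the 1-D code c; decoder weights (d1, d2) map c back.
e1, e2, d1, d2 = 0.5, 0.1, 0.1, 0.5
rho = 0.02
for epoch in range(1000):
    for x, y in data:
        c = e1 * x + e2 * y        # bottleneck code (the low-dimensional representation)
        xr, yr = d1 * c, d2 * c    # reconstruction at the output layer
        gx, gy = -2 * (x - xr), -2 * (y - yr)  # d err / d xr, d err / d yr
        gd1, gd2 = gx * c, gy * c              # gradients for the decoder weights
        gc = gx * d1 + gy * d2                 # back-propagate through the bottleneck
        ge1, ge2 = gc * x, gc * y              # gradients for the encoder weights
        d1 -= rho * gd1; d2 -= rho * gd2
        e1 -= rho * ge1; e2 -= rho * ge2

# Average reconstruction error after training: small, since the data is nearly 1-D.
err = sum((x - d1 * (e1 * x + e2 * y)) ** 2 + (y - d2 * (e1 * x + e2 * y)) ** 2
          for x, y in data) / len(data)
print("reconstruction error:", round(err, 4))
```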
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math>s may become negligible and the errors vanish. This is a numerical problem in which it is difficult to estimate the errors, so in practice configuring a Neural Network with back-propagation involves some subtleties.<br />
Deep Neural Networks became popular two or three years ago with the introduction of layer-wise training methods, popularized by Geoffrey Hinton and his collaborators. Deep Neural Network training algorithms deal with the training of a Neural Network with a large number of layers.<br />
<br />
The approach to training the deep network is to first treat the network as having only two layers and train those; then we train the next two layers, and so on.<br />
<br />
Although we know the input and expect a particular output, we do not know the correct outputs of the hidden layers, and this is the main issue the algorithm deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, inspired by theory in statistical physics concerning the most stable configuration; or somehow finding the output of the second layer that is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case for example. A feedforward neural network was trained using back-propagation to assist in the marketing control of airline seat allocations. The neural approach was adaptive, and the system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to model human brains, hence the fancy name "Neural Network". But now we know that they are just layers of logistic regressions stacked on top of each other, and have nothing to do with how the brain actually functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic human brains, but this is unfounded. As a result of these kinds of claims, it is important to keep the right perspective on what this field of study is trying to accomplish: for example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and tune them wisely. For this reason, they are no longer a very active research area in machine learning; however, NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model will have very high precision on the training set but will show very poor ability to predict outcomes for new instances, especially outside the domain of the training set.<br />
<br />
In a Neural Network, if the network is too deep, it will have many degrees of freedom and will learn every characteristic of the training data set. That means it will give very precise outcomes on the training set, but will not be able to generalize the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model which fits it best. We can find a polynomial of high degree which passes through almost all the points in the training set. But in fact the training set comes from a line model. The problem is that although the complex model has less error on the training set, it diverges from the line in ranges where we have no training points. Because of this, the high-degree polynomial has very poor predictive power on test cases. This is an example of an overfitted model.<br />
#Now consider a training set which comes from a polynomial of degree two. If we model this training set with a polynomial of degree one, our model will have a high error rate on the training set and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that any yellow fruit is a banana, then it is not a good classifier: just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas; but if presented with a new banana of a slightly different shape, for example, it will not be detected. This is the tradeoff: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data. For example, we may assume the model comes from a family of polynomials or from a class of neural networks; there are other choices as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average to get the empirical error rate, an estimate of the probability that <br />
<math>h(x_{i}) \neq y_{i}</math>.<br />
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
There is a downward bias to this estimate, meaning that it tends to be less than the true error rate. <br />
<br />
As complexity increases from low to high, the training error rate always decreases. When we apply our model to test data, the error rate will decrease to a point but then increase, since the model hasn't seen that data before. This can be explained as follows: training error decreases as we fit the model better by increasing its complexity, but as we have seen, this complex model will not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is defined as the point where the error rate on the test data is minimized; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in R</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a property more important for an estimator than unbiasedness: the mean squared error. In statistics there are problems for which it may be good to use an estimator with a small bias. An estimator with a small bias may have a smaller mean squared error, or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under transformations, while mean-unbiasedness may be lost under nonlinear transformations. For example, when using an unbiased estimator with a large mean squared error, we risk a large error; in contrast, a biased estimator with a small mean squared error can improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a fixed MSE, a low bias implies a high variance and vice versa.<br />
<br />
Test error is a good estimate of the MSE. We want a model with somewhat balanced bias and variance (neither too high), even though it will then have some bias.<br />
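The decomposition <math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f})</math> can be checked numerically. The Python sketch below (the shrunken-mean estimator and all constants are illustrative, not from the lecture) simulates a deliberately biased estimator of a mean and verifies the identity on the Monte Carlo sample.<br />

```python
import random

random.seed(2)

# Illustrative setup: estimate the mean f of N(f, 1) samples with the deliberately
# biased estimator 0.8 * (sample mean), which trades bias for lower variance.
f, n, trials = 1.0, 10, 20000
ests = []
for _ in range(trials):
    xs = [random.gauss(f, 1.0) for _ in range(n)]
    ests.append(0.8 * sum(xs) / n)

mean_est = sum(ests) / trials
bias = mean_est - f                                      # E(f_hat) - f, about -0.2
var = sum((e - mean_est) ** 2 for e in ests) / trials    # E[(f_hat - E(f_hat))^2]
mse = sum((e - f) ** 2 for e in ests) / trials           # E[(f_hat - f)^2]
gap = mse - (var + bias ** 2)                            # should be ~0: MSE = Var + Bias^2
print(round(bias, 3), round(var, 3), round(gap, 10))
```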
<br />
<br />
Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
> x <- rnorm(200,0,1)<br />
> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
> xtest <- rnorm(50,1,1)<br />
> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
> p1 <- lm(y~x)<br />
> p2 <- lm(y ~ poly(x,2))<br />
> pn <- lm(y ~ poly(x,10))<br />
> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are based on a <math>\,N(0,1)</math> distribution, while the test set has a <math>\,N(1,1)</math> distribution. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random variation. Polynomial least square fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,sin(x)+cos(x)</math>.<br />
<br />
> # calculate the mean squared error of degree 1 poly<br />
> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
[1] 1.576042<br />
> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 7.727615<br />
: Training and test mean squared errors for the linear fit. These are both quite high - and since the data is non-linear, the different mean value of the test data increases the error quite a bit.<br />
> # calculate the mean squared error of degree 2 poly<br />
> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
[1] 0.08608467<br />
> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 0.08407432<br />
: This fit is far better - and there is not much difference between the training and test error, either.<br />
> # calculate the mean squared error of degree 10 poly<br />
> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
[1] 0.07967558<br />
> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but not by much - and the test set error has risen again. The overfitting makes it a poor predictor. As the degree of the polynomial rises further, numerical precision becomes an issue - and a good fit is not even consistently produced for the training data.<br />
> # calculate mse of sin/cos fit<br />
> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
[1] 0.1105446<br />
> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 1.320404<br />
: Fitting a function of the form sin(x)+cos(x) works fairly well on the training set, but because it is not the real underlying function, it fails on test data that does not lie in the same range as the training data.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V; since the model hasn't seen V, and we know the proper labels of all elements in V, we can count how many were misclassified.<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although training error always decreases with increasing model complexity, the test error starts to increase from a certain point, which indicates overfitting (see [[#prediction-error|figure 2]] above). Since test error estimates the MSE (mean squared error) best, people came up with the idea of dividing the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{X_i \in \nu}I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in reality it may be hard to collect data (and in high dimensions we also suffer from the curse of dimensionality, so even more data is needed), and a larger data set may be hard to come by. Consequently, we may not be able to afford to sacrifice part of our limited resources. In this case we use another method that addresses this problem: K-fold cross-validation. We divide the data set into <math>\,K</math> subsets of roughly equal size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
if <math>\,K=n</math>, leave one out, low bias, high variance. Each subset contains a single element, so the model is trained with all except one point, and then validated using that point.<br />
<br />
if <math>\,K</math> is small, say 2-fold or 5-fold: high bias, low variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th <math>( \,k \in [ 1, K ] )</math> part, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>, where<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. Next we use the first three subsets as training sets and the last as the validation set, then the 1st, 2nd and 4th subsets as the training set and the 3rd as the validation set, and so on, until every subset has been the validation set once (so all observations are used for both training and validation). After we get <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for each degree-n model and plot the resulting curve, then choose the degree corresponding to the minimum error. We can also use this method to find the optimal number of hidden units in a neural network: begin with 1 unit, then 2, 3, and so on, and pick the number of hidden units with the lowest average error.<br />
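The procedure can be sketched as follows (in Python, not from the lecture; the two candidate models, a constant fit and a simple linear fit, are illustrative stand-ins for the polynomial family in the example, and all constants are arbitrary).<br />

```python
import random

random.seed(3)

# Illustrative data from y = 2x + 1 + noise; compare two candidate models
# by K-fold cross-validation.
data = [(i / 10.0, 2 * (i / 10.0) + 1 + random.gauss(0, 0.2)) for i in range(40)]
random.shuffle(data)
K = 4
folds = [data[i::K] for i in range(K)]  # K roughly equal subsets

def fit_constant(train):
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def fit_linear(train):
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    b = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
    a = my - b * mx
    return lambda x: a + b * x

def cv_error(fit):
    # Train on K-1 folds, validate on the held-out fold, average the K errors.
    errs = []
    for k in range(K):
        train = [p for j in range(K) if j != k for p in folds[j]]
        h = fit(train)
        errs.append(sum((y - h(x)) ** 2 for x, y in folds[k]) / len(folds[k]))
    return sum(errs) / K  # the average L_hat over folds

err_const, err_lin = cv_error(fit_constant), cv_error(fit_linear)
print(err_lin < err_const)  # the linear model should win on linear data
```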
<br />
=== Generalized Cross-validation ===<br />
If the vector of observed values is denoted by <math>\mathbf{y}</math>, and the vector of fitted values by <math>\hat\mathbf{y}</math>, then<br />
<br />
<math>\hat\mathbf{y} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
and the leave-one-out cross-validation error can be written as<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}\right]^{2}</math>, where <math>\hat f^{-i}</math> is the fit computed with the <math>i</math>th observation left out.<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that the trace of <math>\mathbf{H}</math> is more easily computed than its individual diagonal elements.<br />
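As a sketch of why the trace form is convenient: for ordinary least squares, the trace of <math>\mathbf{H}</math> equals the number of fitted parameters, so the GCV score can be computed without forming <math>\mathbf{H}</math> at all. The Python example below (all data and constants are illustrative) does this for simple linear regression.<br />

```python
import random

random.seed(4)

# Simple linear regression on illustrative data y = 3x - 1 + noise.
N = 50
data = [(i / 10.0, 3 * (i / 10.0) - 1 + random.gauss(0, 0.5)) for i in range(N)]
mx = sum(x for x, _ in data) / N
my = sum(y for _, y in data) / N
b = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
a = my - b * mx

# trace(H) = trace(X (X^T X)^{-1} X^T) = number of columns of X = 2 (intercept + slope),
# so the GCV score needs no explicit hat matrix.
trace_H = 2.0
gcv = sum(((y - (a + b * x)) / (1.0 - trace_H / N)) ** 2 for x, y in data) / N
print("GCV:", round(gcv, 3))
```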
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point of the original training set to train our model, then using the left-out data point to estimate the true error. By repeating this process for every data point in the original data set, we can obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every possible value of the parameter, we must train the model <math>\,K</math> times. This deficiency is even more obvious in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, where <math>\,n</math> is the number of data points in the data set.<br />
<br />
Fortunately, when adding data points to the classifier is reversible, calculating the difference between two classifiers is computationally more efficient than calculating the two classifiers separately. So, if the classifier trained on all the data points is known, we simply undo the changes from one data point at a time, <math>\,n</math> times, to calculate the leave-one-out cross-validation error rate.<br />
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as a way of obtaining a robust neural network that is insensitive to noise. Since the number of hidden units and layers in a NN is usually decided by domain knowledge, the network may easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weight is in the vicinity of zero, the operative part of the activation function shows linear behavior. The NN then collapses to an approximately linear model. Note that a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This gives us a hint to initialize the random weights to be close to zero.<br />
<br />
Formally, we discourage nonlinearity by penalizing large weights, adding a penalty term to the error function. The regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation, <math>\,w_i</math> are the weights of the output layer, and <math>\,u_{jk}</math> are the weights of the hidden layers.<br />
<br />
Usually, too large a <math>\,\lambda</math> will make the weights <math>\,w_i</math> and <math>\,u_{jk}</math> too small. We can use cross-validation to estimate <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and linearity; as noted above, it can be set by cross-validation. The starting weights also matter: weights of exactly zero lead to zero derivatives and the algorithm will not move, while starting with weights that are too large means starting with a nonlinear model, which often leads to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
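The regularized update can be sketched in a few lines of code. Below, <code>d_err</code> is a stand-in for the back-propagated gradient <math>\frac{\partial err}{\partial w}</math>, and the step size and <math>\,\lambda</math> are illustrative.<br />

```python
# One regularized gradient step for a single weight, matching the update rules above:
# w_new = w_old - rho * (d err / d w + 2 * lambda * w).
def decay_step(w, d_err, rho=0.1, lam=0.01):
    return w - rho * (d_err + 2.0 * lam * w)

# With a zero data gradient (d_err = 0), the penalty alone shrinks the weight toward 0,
# which is exactly the "decay" in weight decay.
w = 1.0
for _ in range(100):
    w = decay_step(w, d_err=0.0)
print(round(w, 4))
```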
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with an output layer and a single hidden layer, with weights from the hidden layer to the output layer, and can be trained without back propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data. The radii of the RBF functions may differ, and the centers and radii can be determined through clustering or an EM algorithm. When the vector x is given from the input layer, each hidden neuron computes the radial distance from the neuron’s center point and then applies the RBF function to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
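For concreteness, the matrix <math>\,\Phi</math> of Gaussian basis activations can be built as follows (a sketch in Python/NumPy; the function name and broadcasting layout are our own, and the centers and widths are assumed to have been chosen already, e.g. by clustering or EM):<br />

```python
import numpy as np

def rbf_design_matrix(X, centers, sigma):
    """Phi[i, j] = exp(-||x_i - mu_j||^2 / (2 * sigma_j^2)).
    X: (n, d) inputs, centers: (m, d) RBF centers, sigma: (m,) widths."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, m)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# A point sitting exactly on a center activates that basis function fully;
# a distant center contributes almost nothing.
Phi = rbf_design_matrix(np.array([[0.0, 0.0]]),
                        np.array([[0.0, 0.0], [3.0, 4.0]]),
                        np.ones(2))
Y_hat = Phi @ np.array([[1.0], [2.0]])   # Y_hat = Phi W for some weights W
```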
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
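A minimal sketch of this closed-form fit (Python/NumPy; `np.linalg.lstsq` computes the same least-squares solution as <math>\ (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math> but is numerically more stable than forming the inverse explicitly; the function name is our own):<br />

```python
import numpy as np

def fit_rbf_weights(Phi, Y):
    """Least-squares weights W minimizing ||Y - Phi W||^2,
    i.e. W = (Phi^T Phi)^{-1} Phi^T Y when Phi has full column rank."""
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return W

# On an invertible Phi the fit is exact: Y_hat = Phi W = H Y reproduces Y.
Phi = np.array([[1.0, 0.0],
                [0.0, 2.0]])
Y = np.array([[2.0],
              [4.0]])
W = fit_rbf_weights(Phi, Y)
```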
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the matrix (n by k) of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & \vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the matrix (n by M+1) of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the matrix (M+1 by k) of weights.<br />
<br />
where the extra basis function <math>\phi_{0}</math> is set to 1.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function. Giving the familiar form,<br /><br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
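The normalization is a one-line change on top of the unnormalized network (a small sketch in Python/NumPy; the function name is our own, and Phi is assumed to hold strictly positive basis activations):<br />

```python
import numpy as np

def normalized_rbf_output(Phi, W):
    """y_hat = Phi* W, where each row of Phi is rescaled so the basis
    activations sum to one (Phi must have positive entries row-wise)."""
    Phi_star = Phi / Phi.sum(axis=1, keepdims=True)
    return Phi_star @ W

# Each output becomes a convex combination of the weights: with basis
# activations (1, 3) the normalized activations are (0.25, 0.75).
out = normalized_rbf_output(np.array([[1.0, 3.0]]), np.array([[0.0], [10.0]]))
```

One consequence, visible here, is that a normalized RBF output always lies between the smallest and largest weight for that output unit.<br />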
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit, then, can be thought to represent a feature; we can see that, if there are more hidden units than input units, we can essentially project to a higher-dimensional space, as we did in our earlier trick. However, this does not mean that an RBF network will actually do this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Precisely because of this great power, however, the problem of overfitting becomes more important: we have to control the model's complexity so that it does not merely fit an arbitrary training set but generalizes well.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and that the data is generated by a mixture model. According to Bayes' Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can expand the class-conditional density in the above expression by marginalizing over the hidden variable <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When using an RBF network for classification, we usually treat the task as a regression problem and set a threshold to decide each data point's class membership. However, to gain some insight into what we are doing, in terms of the RBF network, when we classify, we often think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as we can see in the graph on the right hand side, that we have three random variables, <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that there are different classes, and each class can trigger a different hidden random variable <math>\displaystyle j</math>. To understand this, we can assume that, for instance, this random variable <math>\displaystyle j</math> has a Gaussian distribution (it could have any other distribution as well) and that all the <math>\displaystyle j</math>’s have the same type of distribution (Gaussian), but with different parameters. From each Gaussian distribution triggered by each class, we sample some data points. Therefore, in the end, we get a set of data which is not strictly Gaussian, but is actually a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not involve the index <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
Inside the summation, we multiply each term by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})*\frac {Pr(j)}{Pr(j)}}{Pr(x)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, this is just the sum over <math>\displaystyle j</math> of the product of two posterior probabilities.<br />
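This identity can be checked numerically on a tiny discrete model that respects the generative structure <math>\displaystyle y \rightarrow j \rightarrow x</math> (a Python/NumPy sketch; all probability tables below are arbitrary illustrative numbers, and the only requirement is that <math>\displaystyle x</math> is independent of <math>\displaystyle y</math> given <math>\displaystyle j</math>):<br />

```python
import numpy as np

# joint[y, j, x] = Pr(y) * Pr(j|y) * Pr(x|j): x depends on y only through j.
p_y = np.array([0.6, 0.4])
p_j_given_y = np.array([[0.7, 0.3],      # rows: y, cols: j
                        [0.2, 0.8]])
p_x_given_j = np.array([[0.9, 0.1],      # rows: j, cols: x
                        [0.4, 0.6]])
joint = p_y[:, None, None] * p_j_given_y[:, :, None] * p_x_given_j[None, :, :]

p_x = joint.sum(axis=(0, 1))             # Pr(x)
post_direct = joint.sum(axis=1) / p_x    # Pr(y|x) by direct marginalization

p_j_given_x = joint.sum(axis=0) / p_x    # Pr(j|x)
p_yj = joint.sum(axis=2)                 # Pr(y, j)
p_y_given_j = p_yj / p_yj.sum(axis=0)    # Pr(y|j)

# Pr(y|x) = sum_j Pr(y|j) Pr(j|x), as derived above.
post_identity = p_y_given_j @ p_j_given_x
```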
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results that we derived above to our RBF Network. In a RBF Network, as we can see on the right hand side, we have a set of data, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, and the hidden basis function, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and then we have some output, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. Also, we have weights from the hidden layer to output layer. The output is just the linear sum of <math>\displaystyle \phi</math>’s. <br />
<br />
Now, taking the probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi_{j}(x)</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weight <math>\displaystyle w_{jk}</math>, the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in the one-dimensional case. Suppose<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It is as if we put a Gaussian over the data, and for each Gaussian we consider its center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
We can see the graph on the left, which plots the densities of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance: if a point gets far from the center <math>\displaystyle \mu_{1}</math>, then <math>\displaystyle \phi_{1}</math> becomes nearly zero. Remember that we can usually find a non-linear regression or classification of the input space by doing a linear one in some extended space or feature space (more details in the Aside). Here, the <math>\displaystyle \phi</math>’s actually produce that feature space. <br />
<br />
So, one way to look at this is that <math>\displaystyle \phi</math> tells us, given an input, how likely a particular feature is to be present. Say, for example, we define the features as the centers of these Gaussian distributions. Then this <math>\displaystyle \phi</math> function computes, for a given data point, the possibility of that feature appearing. If the data point is right at the center, then the value of that <math>\displaystyle \phi</math> is one, i.e. the probability is 1. If the point is far from the center, then the probability (the <math>\displaystyle \phi</math> function value) will be close to zero; that is, the feature is less likely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given the data. <br />
<br />
When we have those features, <math>\displaystyle y</math> is a linear combination of them. Hence, each weight <math>\displaystyle w_{jk}</math>, which equals <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> is to appear given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> represents the probability of class membership given a feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>, each with d features. These n data points form the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we don’t perform least squares on this <math>\displaystyle X</math> matrix; we do least squares on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point with its center. Hence, the new set of features are actually representing M centers in our data set, and for each data point, its new features check how this point is similar to the first center; how it is similar to the second center; and how it is similar to the <math>\displaystyle M^{th}</math> center. And this checking process will apply to all data points. Therefore, feature space gives another representation of our data set. <br />
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit to the training data, we often want to increase the complexity of our RBF network. By its construction, the only way to change the complexity of an RBF network is to add or remove basis functions: a larger number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF network can fit any training set exactly; however, this does not mean the model will generalize well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is estimate the training error. After working through the training error, we will see that it can in fact be decomposed, and one component of this decomposition is the Mean Squared Error (MSE). In the notes that follow, we will find that our final goal is to get a good estimate of the MSE; to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the true model (unknown to us); ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn’t always a good thing. Sometimes [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>~N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model only works for the training data, and is useless for prediction when new data points are introduced.<br />
<br />
> n<-20;<br />
> x<-seq(1,10,length=n);<br />
> alpha<-2.5;<br />
> beta<-1.75;<br />
> y<-alpha+beta*x+rnorm(n);<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');<br />
> lines(spline(x, y), col = 2);<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for a given data set. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on the training data, although our goal really is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
But training error and testing error do not have a simple monotone relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error can increase dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that, in the process of minimizing the training error, after a certain point the model begins to over-fit the training set. Over-fitting in this context means fitting the model to the training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, it does not represent the features of the true underlying data source, and this is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model typically adapts to the training data, and hence the training error is an overly optimistic estimate of the testing error. An obvious way to estimate the testing error well is to add a penalty term to the training error to compensate; SURE is developed based on this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the sum of squared losses over the <math>\,N</math> training points.<br />
*<math>\displaystyle Err=\sum_{i=1}^M (\hat y_i-y_i)^2 </math> denote the ''test error'', the sum of squared losses over <math>\,M</math> points of an independent test sample.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations are <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}_{i=1}^{N}</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>; then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f)</math>, since <math>\displaystyle y_i</math> and <math>\hat f_i</math> both have the same mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E(g'(Z))</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>: taking <math>\,Z=y_i</math>, <math>\,\mu=f_i</math> and <math>\displaystyle g(Z)=\hat f-f</math>,<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon_i]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model and not a function of the observations <math>\displaystyle y_i</math>, we have <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
<br />
<br />
<br />
====Two Different Cases====<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is used to assess the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent, and therefore <math>\displaystyle cov(y_i,\hat f)=0</math>. (Equivalently, think about <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, it has no influence on <math>\hat f</math>, because <math>\hat f</math> is estimated from the training data alone, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>.) Thus <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
Approximating this expectation by the average over the <math>\,m</math> validation points, <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Using the notation defined above, we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross-validation: since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross-validation, to avoid overfitting or underfitting, the validation data set is kept independent of the estimated model.<br />
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing the model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored, because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can instead be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>. Applying Stein's lemma gives <math>\displaystyle (2)</math>, proved above; substituting it into <math>\displaystyle (1)</math> yields<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
Approximating this expectation by the average over the <math>\,N</math> training points, <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator, in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the optimal number of basis functions is the one that minimizes the generalization error. For the Radial Basis Function Network, setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least-squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. This simplifies further: <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=d</math>, where <math>\displaystyle d</math> is the number of columns of <math>\displaystyle \Phi</math>, i.e. the number <math>\,M</math> of basis functions onto which the input matrix <math>\,X</math> is projected. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
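The identity <math>\,Trace(H)=d</math> can be verified numerically. The Python sketch below builds a small design matrix <math>\displaystyle \Phi</math> (an intercept column plus one basis column, with made-up values), computes the diagonal of the hat matrix explicitly, and sums it; the result equals the number of columns of <math>\displaystyle \Phi</math>.<br />

```python
# Phi: 3 data points, 2 columns (intercept + one basis function); values are arbitrary
Phi = [[1.0, 0.5],
       [1.0, 1.5],
       [1.0, 3.0]]

# A = Phi^T Phi (2x2), inverted with the explicit 2x2 formula
A = [[sum(row[i] * row[j] for row in Phi) for j in range(2)] for i in range(2)]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
Ainv = [[ A[1][1] / det, -A[0][1] / det],
        [-A[1][0] / det,  A[0][0] / det]]

# H_kk = phi_k^T Ainv phi_k; sum over the diagonal of the hat matrix
trace_H = sum(
    sum(Phi[k][i] * Ainv[i][j] * Phi[k][j] for i in range(2) for j in range(2))
    for k in range(3)
)
# trace_H equals 2, the number of columns of Phi
```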
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimal number of basis functions by choosing, over the set of models considered, the model with the smallest MSE. We are given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\,M</math>, each with training error <math>\displaystyle err(M)</math>. <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples and the noise,<math>\sigma^2</math>, can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Please see Figure 27.1 on the right, which shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math> (the variable is named <code>train_err</code> below to avoid shadowing Matlab's built-in <code>error</code> function):

 train_err = sum((output - expected_output) .^ 2);
 sure_err = train_err - num_data_point * sigma .^ 2 + 2 * sigma .^ 2 * (num_basis_functions + 1);

If <math>\,\sigma</math> is not known, it can be estimated from the training error as in the formula above:

 train_err = sum((output - expected_output) .^ 2);
 sigma2 = train_err / (num_data_point - 1);  % estimate of the noise variance
 sure_err = train_err - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) <math>= \sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) <math>= \sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units, including the intercept) <math>= \sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then the term <math>\displaystyle N\sigma^2</math> is a constant and has no impact on the choice of model, so we can ignore it. We only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and it changes when <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math>.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise, <math>\epsilon_i \sim N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substituting (28.2) into (28.1), we get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
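As a sanity check, (28.3) agrees with substituting the estimate <math>\displaystyle \sigma^2=err/(N-1)</math> from (28.2) directly into (28.1); a short Python sketch with made-up numbers:<br />

```python
def sure_mse(err, N, M, sigma2):
    # eq. (28.1): MSE = err - N*sigma^2 + 2*sigma^2*(M+1)
    return err - N * sigma2 + 2 * sigma2 * (M + 1)

err, N, M = 12.0, 61, 4                  # arbitrary illustrative values
sigma2_hat = err / (N - 1)               # eq. (28.2)
via_28_1 = sure_mse(err, N, M, sigma2_hat)
via_28_3 = err * (2 * M + 1) / (N - 1)   # eq. (28.3)
# the two routes give the same value
```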
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error decreases and the MSE increases as the number of hidden units increases (i.e. as the model becomes more complex).<br />
<br />
<br />
As the number of hidden units gets larger and larger, the training error decreases until it approaches <math>\displaystyle 0 </math>. If the training error approached <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) the MSE would appear to approach <math>\displaystyle 0 </math> as well. In fact this does not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the MSE increases instead of approaching <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
Note that <math>\displaystyle \sigma^2 </math> is estimated here as the average of <math>\displaystyle err </math>. To deal with this problem, we can compute the average of <math>\displaystyle err</math> separately for each candidate number of hidden units; for example, we can first fit with 1 hidden unit, then with 10 hidden units, and so on.<br />
<br />
We can also see that, unlike the classical cross-validation (CV) or leave-one-out (LOO) techniques, the SURE technique does not need a validation step to find the optimal model. Hence, SURE uses less data than CV or LOO, which makes it suitable when there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering K-means clustering] is a method of cluster analysis which aims to partition <math>\,n</math> observations into <math>\,k</math> clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units is the same as the number of clusters; each cluster contributes one basis function <math>\displaystyle \phi </math>.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, with the same functional form for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. For each existing center, re-identify its cluster (every point in the cluster should be closer to this center than to any other center).<br />
<br />
2. Compute the mean of each cluster and make it the new center of that cluster.<br />
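The two steps above can be sketched in a few lines of plain Python (toy 1-D points and a fixed iteration count, chosen for illustration):<br />

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Basic K-means; points is a list of tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # step 1: assign every point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # step 2: the mean of each cluster becomes its new center
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(q[d] for q in cl) / len(cl)
                                   for d in range(len(cl[0])))
    return centers, clusters

pts = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
centers, clusters = kmeans(pts, 2)
# centers converge to the two cluster means, 0.1 and 5.1
```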
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden values)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
>> 30 1<br />
>> size(C) <br />
>> 2 80<br />
>> size(sumD) <br />
>> 2 1<br />
>> c1=sum(IDX==1)<br />
>> 14<br />
>> c2=sum(IDX==2)<br />
>> 16<br />
>> sumD<br />
>> 85.6643<br />
>> 101.0419<br />
>> v1=sumD(1,1)/c1 <br />
>> 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
>> 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set of 30 data points (rows) in 80 dimensions, and then apply the “kmeans” method to separate <math>X</math> into 2 clusters. IDX is a 30*1 vector of 1s and 2s indicating the cluster of each point. <math>\displaystyle C </math> holds the center (mean) of each cluster, with size 2*80; sumD is the sum of the squared distances between the data points and the center of their cluster. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the numbers of data points in clusters 1 and 2. <math>\displaystyle v1 </math> and <math>\displaystyle v2 </math> are the average squared distances within the first and second cluster, used as the variance estimates <math>\displaystyle \sigma_1^2</math> and <math>\displaystyle \sigma_2^2</math>. Now we can get <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> from the following equations. Finally, we compute the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
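Putting these equations together, a least-squares RBF fit can be sketched in plain Python. The toy data, the two centers and the shared <math>\displaystyle \sigma^2</math> below are made-up illustrative values, and the <math>2\times2</math> normal equations <math>\displaystyle \Phi^{T}\Phi W=\Phi^{T}Y</math> are solved with the explicit inverse formula:<br />

```python
import math

X = [0.0, 0.2, 1.0, 1.2]          # toy 1-D inputs
Y = [0.1, 0.3, 1.1, 1.3]          # toy targets
mu = [0.1, 1.1]                   # two centers (e.g. from K-means)
s2 = 0.5                          # shared sigma^2 for both basis functions

def phi_row(x):
    return [math.exp(-(x - m) ** 2 / (2 * s2)) for m in mu]

Phi = [phi_row(x) for x in X]

# W = (Phi^T Phi)^{-1} Phi^T Y via an explicit 2x2 solve
A = [[sum(r[i] * r[j] for r in Phi) for j in range(2)] for i in range(2)]
b = [sum(r[i] * yv for r, yv in zip(Phi, Y)) for i in range(2)]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
W = [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
     (A[0][0] * b[1] - A[1][0] * b[0]) / det]

Yhat = [sum(w * p for w, p in zip(W, row)) for row in Phi]
# the fitted values track the targets closely on this toy set
```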
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is EM algorithm with respect to Gaussian mixture model. Generally speaking, the Gaussian mixture model is referred to as a soft clustering while <math>K</math>-means is hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
In the E-step, each point is assigned a weight (responsibility) for each cluster, based on the likelihood of the corresponding Gaussian; this is a soft assignment. In the hard-assignment limit of <math>K</math>-means, an observation is assigned 1 for the cluster whose center it is closest to, and 0 for the other clusters. <br />
<br />
In the M-step, compute the weighted means and covariances and make them the new means and covariances of each cluster.<br />
<br />
>>[P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
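The E-step's soft assignment can be illustrated for a one-dimensional mixture: each responsibility is a normalized Gaussian likelihood. All numbers in the Python sketch below are made up.<br />

```python
import math

def responsibilities(x, means, sigmas, weights):
    """E-step for a 1-D Gaussian mixture: posterior weight of each
    component for a single observation x (a soft assignment)."""
    dens = [w * math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
            for m, s, w in zip(means, sigmas, weights)]
    total = sum(dens)
    return [d / total for d in dens]

r = responsibilities(0.9, means=[0.0, 1.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
# r sums to 1; the component centered at 1.0 gets the larger weight
```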
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier, which constructs linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized in what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support vector machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separable by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
The figure shows data points from two classes in <math>\displaystyle \mathbb{R}^{2} </math> that can be separated by a linear boundary. If a dataset is indeed linearly separable, then there exist infinitely many possible separating hyperplanes for the training data, including the two black lines in the figure. The question is which solution is best when new data are introduced. <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers such as (28.4) that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: <math>\displaystyle y_id_i</math> is positive for correctly classified points, since <math>\displaystyle d_i>0</math> on the <math>\displaystyle +1 </math> side and <math>\displaystyle d_i<0</math> on the <math>\displaystyle -1 </math> side.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>; we can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers====<br />
Choose the line farthest from both classes; that is, choose the line with the maximum distance from the closest point (i.e. maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since we are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Let us rewrite <math>\displaystyle Margin=min\{y_id_i\}</math> by using the following properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Take two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle (\beta^{T}x_1+\beta_0)-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence, <math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and <math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the unit vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
For any point <math>\displaystyle x_0</math> on the hyperplane, <math>\displaystyle \beta^{T}x_0</math> gives the negative of the intercept of the hyperplane. <br/><br />
<br />
<br />
3. The signed distance from any point <math>\displaystyle x_i </math> to the hyperplane is proportional to <math>\displaystyle \beta^{T}(x_i-x_0)</math>. <br/>Since <math>\displaystyle \beta </math> is not necessarily a unit vector, we normalize by its length:<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
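The distance formula just derived translates directly into code; the Python sketch below evaluates <math>\displaystyle d_i</math> for each point and the margin <math>\displaystyle min\{y_id_i\}</math> of a candidate hyperplane (toy points and an arbitrarily chosen <math>\displaystyle \beta,\beta_0</math>):<br />

```python
import math

def margin(points, labels, beta, beta0):
    """min_i y_i * (beta^T x_i + beta0) / ||beta|| for a candidate hyperplane."""
    norm = math.sqrt(sum(b * b for b in beta))
    return min(y * (sum(b * xv for b, xv in zip(beta, p)) + beta0) / norm
               for p, y in zip(points, labels))

pts = [(0.0, 2.0), (0.0, -2.0), (1.0, 3.0)]
lbl = [1, -1, 1]
m = margin(pts, lbl, beta=(0.0, 1.0), beta0=0.0)   # closest point is at distance 2
```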
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
We had <math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math>, and since we now know how to compute <math>\displaystyle d_i \Rightarrow</math> <br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane. Then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta </math> only the direction is important, so rescaling <math>\displaystyle \beta </math> and <math>\displaystyle \beta_0 </math> by <math>\displaystyle c </math> does not change the direction, and the hyperplane stays the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
<br />
<br />
<br />
<br />
Refrence:<br /><br />
Hastie,T.,Tibshirani,R., Friedman,J.,(2008).The Elements of Statistical Learning:129-130<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
The Support Vector Machine is used to find a maximum margin hyperplane, assuming the two classes are separable. This margin can be written as <math>\,min\{y_id_i\}</math>, or the distance of each point from the hyperplane, where <math>\,d_i</math> is the distance and <math>\,y_i</math> is used as the sign.<br />
===Maximizing the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, but <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane (and is correctly classified).<br />
<br />
This implies <math>\,\exists C</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> define the hyperplane, and only their direction matters: dividing through by a constant does not change the hyperplane. Thus, by scaling <math>\,\beta, \beta_0</math> we can eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>, implying that the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math>.<br />
<br />
Now, in order to maximize the margin <math>\,\frac{1}{\|\beta\|}</math>, we simply need to minimize <math>\,\|\beta\|</math>. <br />
<br />
In other words, find the minimum <math>\,\|\beta\|</math>, s.t. <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. The 1-norm of a vector is the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance), but it has a discontinuity in its derivative. The 2-norm, the Euclidean norm (the intuitive length of the vector), is easier to work with: <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>.<br />
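For a concrete comparison of the two norms (with an arbitrary vector):<br />

```python
import math

beta = [3.0, -4.0]
l1 = sum(abs(b) for b in beta)             # 1-norm (taxicab): 3 + 4 = 7
l2 = math.sqrt(sum(b * b for b in beta))   # 2-norm (Euclidean): 5
half_sq = 0.5 * sum(b * b for b in beta)   # (1/2)*||beta||_2^2 = 12.5
```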
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the conditions are satisfied, as well as finding an optimal solution. <math>\,\alpha_i</math> are introduced as dual constraints. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left(y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right)}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is a dual representation of the maximum-margin problem, which is maximized over <math>\,\alpha</math>.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \,\forall\, i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the equation that we need to minimize, as well as two constraints.<br />
<br />
The Support Vector Machine problem boils down to the dual problem:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are looking to solve for <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>\max_{\alpha}\; L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> is <math>\,n \times n</math> with entries <math>\,S_{ij} = y_iy_jx_i^Tx_j = (y_ix_i)^T(y_jx_j)</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's build in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use quadprog to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's quadprog function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. Note that quadprog minimizes while we want to maximize <math>\,L(\alpha)</math>; maximizing <math>\underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math> is the same as minimizing <math>\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{1}^T\underline{\alpha}</math>, which matches quadprog's form with <math>\,H=S</math> and <math>\,f=-\underline{1}</math>.<br />
<br />
We'll use a simple one-dimensional data set, which is essentially y = -1 or 1 + Gaussian noise. (Note: you could easily put the values straight into the quadprog equation; they are separated for clarity)<br />
<br />
 x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]';
 y = [-ones(100,1); ones(100,1)];
 z = x' .* y;          % z_i = y_i * x_i
 S = z * z';           % S(i,j) = y_i * y_j * x_i' * x_j
 f = -ones(200,1);     % quadprog minimizes (1/2)*a'*S*a + f'*a
 A = []; b = [];       % no inequality constraints besides the bounds
 Aeq = y';             % enforces sum(alpha_i * y_i) = 0
 beq = 0;
 lb = zeros(200,1);    % alpha_i >= 0 for every component
 ub = [];              % there is no upper bound
 alpha = quadprog(S,f,A,b,Aeq,beq,lb,ub);
<br />
This gives us the optimal <math>\,\alpha</math>. With the lower bound given as a full vector of zeros, the returned values satisfy <math>\,\alpha_i \geq 0</math> up to solver tolerance; most of them are numerically zero, and the few clearly positive entries correspond to the support vectors.<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{i}</math>. If <math>\,f</math> and <math>\,g_i</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial x}\Big|_{\hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum_i{\alpha_ig_i'(\hat{x})}=0</math> (Stationarity)<br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> cannot be an optimal solution.<br />
<br />
These are all trivial except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that actually determine the optimal separating hyperplane which we are looking for. They are also the most difficult points to classify, and the most informative for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br />In order for this condition to be satisfied, either <br/><math>\,\alpha_i= 0</math> or <br/><math>\,y_i(\beta^Tx_i+\beta_0)=1</math>.<br />
<br />
In the canonical representation, every point <math>x_i</math> is at scaled distance either exactly 1 or more than 1 from the hyperplane.<br />
<br />
'''Case 1: a point <math>\displaystyle x_i</math> at distance > 1 from the hyperplane'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point <math>\displaystyle x_i</math> at distance 1 from the hyperplane'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
<br />If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
<br />
Points on the margin, points with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
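The two cases can be checked mechanically. The Python sketch below verifies dual feasibility, primal feasibility and complementary slackness given the multipliers <math>\,\alpha_i</math> and the slacks <math>\,g_i = y_i(\beta^Tx_i+\beta_0)-1</math> (toy numbers):<br />

```python
def kkt_ok(alpha, g, tol=1e-6):
    """alpha_i >= 0 (dual feasibility), g_i >= 0 (primal feasibility),
    and alpha_i * g_i = 0 (complementary slackness) for every i."""
    return all(a >= -tol and gi >= -tol and abs(a * gi) <= tol
               for a, gi in zip(alpha, g))

ok = kkt_ok([0.5, 0.0], [0.0, 1.3])    # a support vector plus an interior point
bad = kkt_ok([0.2, 0.0], [0.4, 1.3])   # alpha_1 > 0 with g_1 > 0: slackness fails
```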
<br />
===Using support vectors===<br />
<br />
Support vectors are important because the solution depends only on them: if <math>\,\alpha_i = 0</math>, the corresponding term contributes nothing to <math>\,\beta = \sum_i{\alpha_iy_ix_i}</math>, and hence nothing to the solution of the SVM problem; only points on the margin -- support vectors -- contribute.<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's <code>quadprog</code> (which minimizes, so pass the negated objective) to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math> for <math>\,\beta_0</math><br />
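The steps above can be sketched numerically. The following Python/NumPy fragment is an illustrative sketch, not the course's Matlab code: the toy data set and the dual solution <math>\,\alpha</math> (derived by hand for this particular data set, where only the two closest points are support vectors) are assumptions for this example. It carries out steps 2 and 3 and checks the margin conditions.<br />

```python
# Illustrative sketch (assumed toy data, hand-derived alpha): given the dual
# solution, recover beta and beta_0 and verify the margin conditions.
import numpy as np

X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Hand-derived dual solution for this toy set: support vectors are x_1, x_3.
alpha = np.array([0.5, 0.0, 0.5, 0.0])
assert np.isclose(alpha @ y, 0)          # dual constraint: sum alpha_i y_i = 0

beta = (alpha * y) @ X                   # step 2: beta = sum alpha_i y_i x_i
sv = 0                                   # any index with alpha_i > 0
beta0 = y[sv] - beta @ X[sv]             # step 3: y_i(beta^T x_i + beta_0) = 1

margins = y * (X @ beta + beta0)
print(beta, beta0)                       # [1. 0.] 0.0
print(margins)                           # [1. 2. 1. 2.]
```

Note that the margins equal exactly 1 for the support vectors and exceed 1 for the other points, consistent with the KKT conditions above.<br />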
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab's built-in SVM routines (found in the Bioinformatics Toolbox) to do classification through support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]</div>
<hr />
<div>
<br />
<br />
Thus, given a new input <math>\,X \in \mathcal{X} </math>,<br />
we can use the classification rule to predict the corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
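As a concrete illustration of the formula above, the following Python sketch (with assumed labels, not course data) computes the empirical error rate of a classifier on a small training set.<br />

```python
# Illustrative sketch (assumed labels): empirical error rate of a classifier h.
y_true = [1, 0, 1, 1, 0]         # the Y_i
y_pred = [1, 0, 0, 1, 1]         # h(X_i) for each training point

# L_hat = (1/n) * sum of indicators I(h(X_i) != Y_i)
n = len(y_true)
errors = sum(1 for yp, yt in zip(y_pred, y_true) if yp != yt)
empirical_error = errors / n
print(empirical_error)           # 2 misclassifications out of 5 -> 0.4
```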
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the probability <math>\,r(X)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimensional cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate, that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says that we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of its being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & \mathrm{if}\ P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remark:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since in practice we usually cannot determine the prior probability.<br />
<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
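The posterior computation above can be sketched in a few lines of Python. The class-conditional probabilities below are assumed values (the table image is not reproduced here) chosen so that <math>\,P(X=(0,1,0)|Y=1)P(Y=1)=0.025</math> and the evidence equals <math>\,0.125</math>, matching the numbers in the text.<br />

```python
# Illustrative sketch (assumed likelihoods) of the posterior r(X) in the
# student example, computed via Bayes formula.
p_y1 = 0.5                       # prior P(Y=1)
p_y0 = 0.5                       # prior P(Y=0)
p_x_given_y1 = 0.05              # assumed P(X=(0,1,0)|Y=1)
p_x_given_y0 = 0.20              # assumed P(X=(0,1,0)|Y=0)

numerator = p_x_given_y1 * p_y1                 # 0.025
evidence = numerator + p_x_given_y0 * p_y0      # 0.125
r = numerator / evidence
print(r)                         # 0.2 < 1/2, so predict class 0 (fail)
```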
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes Classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN) and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first one considers probability as changing based on observation, while the second one considers probability as having an objective existence. In fact, they represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics there have been two major schools: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between Bayesians and Frequentists.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a single unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or belief. For example, a Bayesian can predict tomorrow's weather, such as a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In Bayesian method, at first, one can see this man (object), and then judge whether his name is Jack (label). On the other hand, in Frequentist method, one doesn’t see the man (object), but can see the photos (label) of this man to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Although it represents the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2 Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density estimation, estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation doesn't work very well with dimension greater than 2.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
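The linear boundary derived above has the form <math>\,a^Tx+b=0</math> with <math>\,a=\Sigma^{-1}(\mu_k-\mu_l)</math> and <math>\,b=\log(\pi_k/\pi_l)-\frac{1}{2}(\mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l)</math>. The following NumPy sketch (with assumed Gaussian parameters, purely for illustration) computes these coefficients and verifies the midpoint property for equal priors.<br />

```python
# Illustrative sketch (assumed parameters): LDA boundary a^T x + b = 0.
import numpy as np

mu_k = np.array([2.0, 0.0])
mu_l = np.array([-2.0, 0.0])
Sigma = np.eye(2)                # shared covariance -- the LDA assumption
pi_k = pi_l = 0.5                # equal priors

Sinv = np.linalg.inv(Sigma)
a = Sinv @ (mu_k - mu_l)
b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)

# With equal priors the boundary passes through the midpoint of the means:
midpoint = (mu_k + mu_l) / 2
print(a @ midpoint + b)          # 0.0 -- the midpoint lies on the boundary
```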
<br />
===QDA===<br />
The concept is the same idea of finding a boundary where the error rate for classification between classes are equal, except the assumption that each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math> is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because the quadratic terms in <math>x</math> no longer cancel when the class covariances differ.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
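The rule in the theorem can be sketched directly. The following NumPy fragment (illustrative only; the Gaussian parameters are assumed, not course data) evaluates the quadratic discriminant <math>\,\delta_k</math> for each class and classifies a point by <math>\,\arg\max_k</math>.<br />

```python
# Illustrative sketch (assumed parameters): h(x) = argmax_k delta_k(x)
# with delta_k = -0.5 log|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k
import numpy as np

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
pis = [0.5, 0.5]

def delta(x, mu, Sigma, pi):
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * (x - mu) @ Sinv @ (x - mu)
            + np.log(pi))

x = np.array([0.5, 0.5])
scores = [delta(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
print(int(np.argmax(scores)))    # 0 -- x is much closer to the first mean
```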
<br />
===In practice===<br />
In practice the prior, means, and covariances are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{r=1}^{k}n_r} </math><br />
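The plug-in estimates above can be sketched as follows. The tiny data set is assumed for illustration; the code computes <math>\,\hat{\pi}_k</math>, <math>\,\hat{\mu}_k</math>, the ML covariance <math>\,\hat{\Sigma}_k</math>, and the pooled common covariance.<br />

```python
# Illustrative sketch (assumed data): sample estimates for LDA/QDA.
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [7.0, 5.0]])
y = np.array([1, 1, 2, 2])
n = len(y)

pi_hat, mu_hat, Sigma_hat, n_k = {}, {}, {}, {}
for k in np.unique(y):
    Xk = X[y == k]
    n_k[k] = len(Xk)
    pi_hat[k] = n_k[k] / n                  # pi_hat_k = n_k / n
    mu_hat[k] = Xk.mean(axis=0)             # class mean
    D = Xk - mu_hat[k]
    Sigma_hat[k] = D.T @ D / n_k[k]         # ML covariance (divide by n_k)

# Pooled (common) covariance: weighted average of the class covariances
Sigma_pooled = sum(n_k[k] * Sigma_hat[k] for k in n_k) / n
print(pi_hat, mu_hat[1], Sigma_pooled)
```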
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
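The whitening transformation <math>\, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be sketched with an eigendecomposition. The shared covariance below is an assumed example; the code checks that the transformed covariance is the identity, as in Case 1.<br />

```python
# Illustrative sketch (assumed covariance): the whitening x* = S^{-1/2} U^T x.
import numpy as np

Sigma = np.array([[4.0, 0.0], [0.0, 1.0]])

# Sigma = U S U^T (eigendecomposition of a symmetric matrix)
s, U = np.linalg.eigh(Sigma)
W = np.diag(s ** -0.5) @ U.T          # W = S^{-1/2} U^T

x = np.array([2.0, 3.0])
x_star = W @ x                        # the transformed point

# After the transformation the covariance becomes the identity:
print(W @ Sigma @ W.T)
```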
<br />
Note that when we have multiple classes, they must all use the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for classification to be applicable using this method. This is why the method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose you have two classes with different shapes and you transform them both to a common shape. Given a data point, which transformation should you apply to decide its class? For example, if you use the transformation of class A, then you have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare a given class against each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
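The two counts above can be tabulated for a few dimensions (per pairwise difference; multiply by <math>\,K-1</math> for <math>\,K</math> classes), which makes the robustness gap in the plot below concrete.<br />

```python
# Quick sketch of the parameter counts per pairwise difference.
def lda_params(d):
    return d + 1                      # a^T x + b

def qda_params(d):
    return d * (d + 3) // 2 + 1       # symmetric quadratic + linear + constant

for d in (2, 10, 64):
    print(d, lda_params(d), qda_params(d))
# e.g. d = 64: LDA needs 65 parameters per difference, QDA needs 2145
```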
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
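For readers without the Statistics Toolbox, the linear rule that <code>classify(sample, sample, group, 'linear')</code> applies can be sketched in Python/NumPy: estimate the class means and a pooled covariance, then assign each point to the class with the larger linear discriminant score. This is an illustrative sketch on synthetic two-class data (the generated data and all variable names are ours, not the lecture's, and this is not the toolbox implementation):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 2-3 data: two Gaussian classes, 200 points each.
X1 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], 200)
X2 = rng.multivariate_normal([2, 2], [[1, 0.3], [0.3, 1]], 200)
sample = np.vstack([X1, X2])
group = np.concatenate([np.ones(200), 2 * np.ones(200)])

# LDA estimates: per-class means and one pooled covariance.
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = 0.5 * (np.cov(X1.T) + np.cov(X2.T))
Si = np.linalg.inv(Sigma)

# Linear discriminant scores (equal priors); the larger score wins.
s1 = sample @ Si @ mu1 - 0.5 * mu1 @ Si @ mu1
s2 = sample @ Si @ mu2 - 0.5 * mu2 @ Si @ mu2
pred = np.where(s1 >= s2, 1, 2)

n_correct = int((pred == group).sum())
error_rate = 1 - n_correct / len(group)
```

With reasonably separated classes the empirical error rate comes out well below chance, in the same spirit as the 0.0775 seen above.<br />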
<br />
'''Recall: An analysis of the <code>princomp</code> function in Matlab.'''<br />
<br />In Assignment 1 we learned how to perform principal component analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA directly. The Matlab help file gives the full details of this function; here we analyze the code of <code>princomp()</code> itself to see how it differs from the SVD method. The following is the code of <code>princomp</code> with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the numbers of rows and columns of matrix x <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, note the following points of comparison with the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data from Assignment 1, we therefore take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients of the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA with the SVD method and with <code>princomp</code> respectively, obtaining the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can verify that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code> (up to sign).<br />
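The same equivalence can be checked in Python/NumPy. This is a sketch with randomly generated data standing in for the 2_3 matrix (all names are illustrative): the loadings from an SVD of the centered data match the eigenvectors of the sample covariance matrix, up to sign.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 400))    # stand-in for the 64 x 400 matrix in 2_3

# princomp convention: observations in rows, so work with X'.
Xt = X.T                              # 400 x 64
Xc = Xt - Xt.mean(axis=0)             # center by subtracting column means

# SVD route: right singular vectors of the centered data are the loadings.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
coeff_svd = Vt.T                      # columns play the role of Matlab's pc
score_svd = Xc @ coeff_svd            # representation in principal component space

# Covariance route: eigenvectors of the sample covariance span the same directions.
evals, evecs = np.linalg.eigh(np.cov(Xc.T))
order = np.argsort(evals)[::-1]
coeff_eig = evecs[:, order]

# The leading directions agree up to sign, just as princomp agrees with SVD.
agree = np.allclose(np.abs(coeff_svd[:, 0]), np.abs(coeff_eig[:, 0]), atol=1e-6)
```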
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus a second <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> parameters to estimate make QDA less robust when we have few data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector and <math>x \in \mathbb{R}^{d}</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> (where <math>\,v</math> is a diagonal matrix) that we cannot estimate with LDA directly.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
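As a sketch of this trick outside Matlab, the following Python/NumPy example (the synthetic data and all names are ours, not from the lecture) builds a data set that no single line can separate, appends squared features, and reruns a minimal hand-rolled LDA; the augmented LDA then behaves like a quadratic classifier:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Inner cluster vs. a surrounding ring: no single line separates these classes.
X1 = 0.5 * rng.standard_normal((n, 2))
theta = rng.uniform(0, 2 * np.pi, n)
X2 = 3 * np.c_[np.cos(theta), np.sin(theta)] + 0.3 * rng.standard_normal((n, 2))
X = np.vstack([X1, X2])
y = np.r_[np.zeros(n), np.ones(n)]

def lda_error(X, y):
    """Minimal LDA (pooled covariance, equal priors); returns training error."""
    mu0, mu1 = X[y == 0].mean(0), X[y == 1].mean(0)
    S = 0.5 * (np.cov(X[y == 0].T) + np.cov(X[y == 1].T))
    Si = np.linalg.inv(S)
    s0 = X @ Si @ mu0 - 0.5 * mu0 @ Si @ mu0
    s1 = X @ Si @ mu1 - 0.5 * mu1 @ Si @ mu1
    return float(((s1 >= s0).astype(float) != y).mean())

err_linear = lda_error(X, y)              # near chance: the boundary is a line
X_star = np.hstack([X, X ** 2])           # append x^2 and y^2 as new features
err_quad = lda_error(X_star, y)           # the same LDA now cuts along a circle
```

The linear boundary in the augmented four-dimensional space collapses to a circle in the original two dimensions, which is exactly the effect exploited below.<br />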
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
Note the distinction: LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function (from the <code>MASS</code> package, which also provides <code>mvrnorm</code>), given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, and each class may have a different size. To separate the two classes we must determine which class's mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves the maximum separation of the classes after projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\,\underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, and so it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So <math>\underline{w}</math> is proportional to <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />
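This proportionality is easy to verify numerically. The following Python/NumPy sketch (illustrative names; data generated with the same parameters as the Matlab example below) computes <math>\underline{w}</math> both as the top eigenvector of <math>S_{W}^{-1}S_{B}</math> and via the closed form <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>, and checks that the two directions agree:<br />

```python
import numpy as np

rng = np.random.default_rng(3)

# Two classes with the covariance used in the lecture's example.
Sigma = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1, 1], Sigma, 300)
X2 = rng.multivariate_normal([5, 3], Sigma, 300)

mu1, mu2 = X1.mean(0), X2.mean(0)
S_B = np.outer(mu1 - mu2, mu1 - mu2)      # between-class covariance
S_W = np.cov(X1.T) + np.cov(X2.T)         # within-class covariance

# Route 1: top eigenvector of S_W^{-1} S_B.
evals, evecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w_eig = np.real(evecs[:, np.argmax(np.real(evals))])

# Route 2: closed form, w proportional to S_W^{-1} (mu1 - mu2).
w_closed = np.linalg.solve(S_W, mu1 - mu2)

# The two directions coincide up to scale and sign.
cosine = abs(w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
```

The closed form is preferable in practice for the two-class case, since it avoids an eigendecomposition entirely.<br />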
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(For <math>\,k>2</math> classes this gives at least two projection directions.)<br />
<br />
The within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \frac{1}{n_{i}}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to treat the total covariance <math>\mathbf{S}_{T}</math> of the data as<br />
fixed; since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is another, more general derivation of <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
2\left[(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}\right]<br />
\end{align}<br />
</math><br />
where the last equality uses <math>\,n_{1}=n_{2}</math>, so that <math>\mathbf{\mu}_{1}-\mathbf{\mu} = -(\mathbf{\mu}_{2}-\mathbf{\mu})</math>.<br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Clearly, the two forms are proportional, and so they lead to the same discriminant directions.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which is actually a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
In fact, <math>\mathbf{\Lambda}</math> can have at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B})\le k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues in<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
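Numerically, the whole derivation above reduces to one eigen-decomposition. The following sketch is in Python/NumPy rather than the Matlab used elsewhere in these notes; the function name and the toy data are illustrative only. It builds the between-class scatter S_B and within-class scatter S_W and extracts the top k-1 eigenvectors of S_W^{-1}S_B:

```python
import numpy as np

def fda_directions(X, y, k):
    """Eigen-directions of S_W^{-1} S_B, sorted by decreasing eigenvalue.

    X: (n, d) data matrix with one observation per row;
    y: length-n integer labels in {0, ..., k-1}.
    """
    d = X.shape[1]
    mu = X.mean(axis=0)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in range(k):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter
        S_W += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-vals.real)
    # only the k-1 largest eigenvalues can be nonzero, since rank(S_B) <= k-1
    return vals.real[order][:k - 1], vecs.real[:, order][:, :k - 1]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.1, size=(20, 2)) for m in ([0, 0], [2, 0], [0, 2])])
y = np.repeat([0, 1, 2], 20)
vals, W = fda_directions(X, y, 3)
```

With three well-separated classes in two dimensions, both returned eigenvalues are strictly positive, consistent with rank(S_B) = k - 1 = 2 here.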
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the input<br />
<math>\,\mathbf{x} \in \mathbb{R}^{d}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the least squares method.<br />
<br />
Note that the vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, e.g. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, e.g. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
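We can verify the closed-form solution and the two standard properties of the hat matrix (the fitted values are <math>\mathbf{H}\mathbf{y}</math>, and <math>\mathbf{H}</math> is idempotent) on a toy example; a sketch in Python/NumPy (illustrative, not part of the lecture's Matlab code):

```python
import numpy as np

def ols_fit(X, y):
    """Least squares: X is n x (d+1) with a leading column of ones."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y        # (X^T X)^{-1} X^T y
    H = X @ XtX_inv @ X.T               # hat matrix
    return beta_hat, H

# data generated exactly from y = 1 + 2x, so the fit should recover (1, 2)
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 1.0 + 2.0 * np.arange(5.0)
beta_hat, H = ols_fit(X, y)
```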
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, so that x is a 3-by-400 matrix whose columns are the inputs with an intercept term.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== logistic function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y dx = ln(1 + e^{x})</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
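Properties 1, 2 and 4 above are easy to verify numerically; a small sketch (Python/NumPy, illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x0, h = 0.7, 1e-6
# property 1: dy/dx = y(1 - y), checked with a centered finite difference
num_deriv = (sigmoid(x0 + h) - sigmoid(x0 - h)) / (2 * h)
exact_deriv = sigmoid(x0) * (1 - sigmoid(x0))
# property 4: truncated Taylor series around 0 (note the 1/480 coefficient)
series = 0.5 + 0.1 / 4 - 0.1**3 / 48 + 0.1**5 / 480
```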
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood, using Pr(Y|X). The maximum likelihood of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the known distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, take the log of both sides, we get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
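As a sanity check on the simplification above, both forms of the log-likelihood should agree on arbitrary data; a small numerical sketch (Python/NumPy, illustrative data only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))            # rows are x_i^T
y = rng.integers(0, 2, size=10)
beta = rng.normal(size=3)

eta = X @ beta                          # beta^T x_i for each i
p = np.exp(eta) / (1 + np.exp(eta))     # P(x_i; beta)

# first line of the derivation: sum y_i log p_i + (1 - y_i) log(1 - p_i)
ll_bernoulli = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
# simplified last line: sum y_i beta^T x_i - log(1 + exp(beta^T x_i))
ll_simplified = np.sum(y * eta - np.log(1 + np.exp(eta)))
```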
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
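The gradient formula can likewise be checked against a finite-difference approximation of the simplified log-likelihood; a hedged sketch (Python/NumPy, illustrative data):

```python
import numpy as np

def loglik(beta, X, y):
    """Simplified log-likelihood: sum y_i beta^T x_i - log(1 + exp(beta^T x_i))."""
    eta = X @ beta
    return np.sum(y * eta - np.log(1 + np.exp(eta)))

def grad(beta, X, y):
    """Gradient from the lecture: sum_i (y_i - p_i) x_i."""
    p = 1 / (1 + np.exp(-(X @ beta)))   # P(x_i; beta)
    return X.T @ (y - p)

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
y = rng.integers(0, 2, size=8).astype(float)
beta = rng.normal(size=2)

# centered finite-difference approximation of the gradient
h = 1e-6
fd = np.array([(loglik(beta + h * e, X, y) - loglik(beta - h * e, X, y)) / (2 * h)
               for e in np.eye(2)])
```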
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares solves <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math> (with <math>X</math> the <math>{d}\times{n}</math> input matrix as above),<br />
<br />
which gives <math>\underline{\hat{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
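With equal weights the WLS estimator must reduce to the ordinary least squares estimator, which gives a quick check of the formula above; a sketch (Python/NumPy, illustrative data):

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares; X is n x d with rows x_i^T, weights w_i > 0."""
    XtW = X.T * w                       # scales column i of X^T by w_i
    return np.linalg.solve(XtW @ X, XtW @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 3))
y = rng.normal(size=12)
# with equal weights, WLS must reduce to ordinary least squares
beta_wls = wls(X, y, np.ones(12))
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

Note also that rescaling all weights by a common constant leaves the estimate unchanged, since the constant cancels.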
<br />
In our case, a weighted linear regression is applied to the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, a <math>d\times{1}</math> vector, because we constructed the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> seems to be a suitable starting value for the Newton-Raphson iteration procedure in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. If it does not, we can only prove local convergence of the method, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: it is uncommon for the initial point to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_i,i</math> to <math>P(\underline{x}_i;\underline{\beta}))[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
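The pseudo code above translates almost line for line into code. This sketch is in Python/NumPy rather than Matlab; it follows the lecture's convention that X is d × n with observations as columns, and the toy data are illustrative only:

```python
import numpy as np

def logistic_irls(X, y, tol=1e-8, max_iter=100):
    """Newton-Raphson / IRLS iteration following the pseudo code.

    X: d x n matrix whose columns are the inputs x_i (lecture convention);
    y: length-n vector of 0/1 labels.
    """
    d, n = X.shape
    beta = np.zeros(d)                                  # step 1
    for _ in range(max_iter):
        p = 1 / (1 + np.exp(-(X.T @ beta)))             # step 3: P(x_i; beta)
        w = p * (1 - p)                                 # step 4: diagonal of W
        z = X.T @ beta + (y - p) / w                    # step 5
        # step 6: beta <- (X W X^T)^{-1} X W z; (X * w) scales column i by w_i
        beta_new = np.linalg.solve((X * w) @ X.T, (X * w) @ z)
        if np.max(np.abs(beta_new - beta)) < tol:       # step 7
            return beta_new
        beta = beta_new
    return beta

# non-separable toy data, so the maximum likelihood estimate is finite
X = np.vstack([np.ones(6), [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat = logistic_irls(X, y)
```

At the maximizer the gradient <math>X(\underline{Y}-\underline{P})</math> should vanish, which gives a convenient convergence check.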
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to apply logistic regression to classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
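These posteriors sum to one by construction, and for K = 2 they reduce to the two-class logistic model; a quick numerical sketch (Python/NumPy, with illustrative coefficient vectors):

```python
import numpy as np

def multiclass_posteriors(x, betas):
    """Posteriors for classes 1..K, with class K as the reference class.

    betas: list of K-1 coefficient vectors beta_i (class K has implicit score 0).
    """
    scores = np.array([b @ x for b in betas])           # beta_i^T x
    denom = 1 + np.sum(np.exp(scores))
    return np.append(np.exp(scores) / denom, 1 / denom) # classes 1..K-1, then K

x = np.array([0.7, -1.2])
p3 = multiclass_posteriors(x, [np.array([1.0, -0.5]), np.array([0.2, 0.3])])  # K = 3
p2 = multiclass_posteriors(x, [np.array([1.0, -0.5])])                        # K = 2
```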
<br />
Viewing the fitting of these equations as a weighted least squares problem makes them easier to derive.<br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general, the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.<br />
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
A separating hyperplane classifier tries to separate the data using linear decision boundaries. When the classes overlap, it can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged and transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>sign(\underline{\beta}^T \underline{x} + {\beta}_0) = sign(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math> (written out here for two dimensions)<br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Particular to the iterative nature of the solution, the problem has no global minimum (it is not convex). The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable, then the algorithm can be shown to converge to some separating hyperplane; the proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br />
<br />
<br />
If we find a hyperplane that is not unique between 2 classes, there will be infinitely many solutions obtained from the perceptron algorithm.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line; we just have to give it some labelled examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
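For comparison, the same procedure can be written in Python (a hypothetical translation of the MATLAB snippet above, not from the lecture; here a point lying exactly on the boundary is also treated as misclassified, which guarantees a strictly separating solution on separable data):<br />

```python
# Perceptron training on the toy data set above (illustrative Python sketch).
# b is the weight vector, b_0 the intercept, rho the learning rate,
# mirroring the MATLAB variable names.

def train_perceptron(xs, ys, rho=0.5, max_epochs=100):
    b = [1.0, 1.0, 1.0]  # initial weight vector (same guess as above)
    b_0 = 0.0            # initial intercept
    for _ in range(max_epochs):
        changed = False
        for x, y in zip(xs, ys):
            d = (sum(bi * xi for bi, xi in zip(b, x)) + b_0) * y
            if d <= 0:  # misclassified (or on the boundary): move the boundary
                b = [bi + rho * xi * y for bi, xi in zip(b, x)]
                b_0 += rho * y
                changed = True
        if not changed:  # every point classified correctly: stop
            break
    return b, b_0

xs = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
ys = [1, 1, 1, -1, -1, -1]
b, b_0 = train_perceptron(xs, ys)
```

On the toy data above this converges in a handful of passes to a hyperplane that classifies every training point correctly.<br />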
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a weighted linear combination of these inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> takes the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (an initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points, as illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points lie on the correct side of the boundary, and it terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (assuming <math>\underline{\beta}</math> is normalized so that <math>\|\underline{\beta}\|=1</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step of predetermined size in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this approach is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
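As a quick numerical sanity check (an illustrative Python sketch with a made-up two-dimensional data set, not part of the lecture), the analytic gradient of <math>\phi</math> can be compared against a finite-difference approximation; since <math>\phi</math> is locally linear in <math>\underline{\beta}</math> when no point sits near the boundary, the two agree essentially exactly:<br />

```python
# Check the gradient of the perceptron criterion
# phi(beta, beta0) = -sum_{i in M} y_i (beta^T x_i + beta0),
# where M is the set of misclassified points (toy data, made up).

def phi(beta, beta0, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        v = beta[0] * x[0] + beta[1] * x[1] + beta0
        if v * y < 0:          # point is misclassified
            total -= y * v
    return total

xs = [(2.0, 0.0), (-1.0, 1.0), (0.5, 2.0)]
ys = [1, 1, -1]
beta, beta0 = [1.0, 0.0], 0.0

# Analytic gradient over the misclassified set M:
M = [i for i, (x, y) in enumerate(zip(xs, ys))
     if (beta[0] * x[0] + beta[1] * x[1] + beta0) * y < 0]
grad_beta = [-sum(ys[i] * xs[i][k] for i in M) for k in (0, 1)]
grad_beta0 = -sum(ys[i] for i in M)

# Finite-difference approximation of d(phi)/d(beta_1):
h = 1e-4
fd0 = (phi([beta[0] + h, beta[1]], beta0, xs, ys)
       - phi([beta[0] - h, beta[1]], beta0, xs, ys)) / (2 * h)
```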
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, the algorithm converges quickly; if the gap is small, it converges slowly. This problem can be alleviated by using a basis expansion: instead of seeking a hyperplane in the original space, we seek one in the enlarged space obtained by applying basis functions to the features.<br />
#If the classes are separable, there exist infinitely many solutions to the Perceptron problem, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if the value is too large the algorithm may "skip over" the minimum it is trying to find, possibly oscillating forever between points on either side of the minimum.<br />
#A perfect separation is not always achievable, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found, for example, in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref>Christopher M. Bishop, Pattern Recognition and Machine Learning, p. 194.</ref>.<br />
====Comment on gradient descent algorithm====<br />
Imagine standing on a mountain peak and wanting to reach the ground as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and you start in the middle, you may end up at the saddle point (a local minimum along your path) and get stuck there.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we dropped the summation over <math>\,i</math> (all data points). This is an alternative to the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, in which we approximate the true gradient by evaluating it at a single training example, so <math>\,{\beta}</math> is improved using the computation of only one sample at a time. When the data set is large, say a population database, it is very time-consuming to sum over millions of samples; with stochastic gradient descent we can treat the problem sample by sample and still get decent results in practice.<br />
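The difference between batch and stochastic updates can be sketched on a toy least-squares problem (an illustrative Python example; the data and learning rate are made up, and the data are generated exactly from <math>\,y=2x</math> so both variants approach <math>\,w=2</math>):<br />

```python
# Batch vs. stochastic gradient descent for fitting y = w*x by least squares.
# Toy data generated with w = 2 (illustrative only).

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
rho = 0.01  # learning rate

# Batch gradient descent: each step uses the gradient summed over all points.
w_batch = 0.0
for _ in range(200):
    grad = sum(2 * (w_batch * x - y) * x for x, y in zip(xs, ys))
    w_batch -= rho * grad

# Stochastic gradient descent: each step uses a single point's gradient.
w_sgd = 0.0
for _ in range(200):
    for x, y in zip(xs, ys):
        w_sgd -= rho * 2 * (w_sgd * x - y) * x
```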
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <ref><br />
Theory of the Backpropagation Neural Network, R. Necht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network but it can have many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a ''k''-class classification problem, there are usually ''k'' units in the output layer, each representing the probability of one class, and each <math>\displaystyle y_k</math> is coded as 0 or 1.<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
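A short Python sketch of this activation function (illustrative only); the identity <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math> that it implements is what makes the logistic function convenient for the gradient computations in back-propagation:<br />

```python
import math

def sigma(a):
    """Logistic (inverse-logit) activation function."""
    return 1.0 / (1.0 + math.exp(-a))

def sigma_prime(a):
    """Derivative of the logistic function: sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = sigma(a)
    return s * (1.0 - s)
```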
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates: its output has a maximum and a minimum value. This property ensures that the weights are bounded and therefore the search time is limited. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; RBF networks, for example, use non-monotonic activations and are also powerful models. <br />
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output units and thus turn into a classification problem.<br />
<br />
For simplicity, there is only 1 unit at the end and assume for the moment we are doing regression.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and produces output <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: the indices i and j are reversed in this figure relative to the text.)''<br />
<br />
Notice that the weighted sums of the outputs of the units at layer <math>\displaystyle l</math> are the inputs into the units at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
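The derivation above can be checked numerically. The following Python sketch (a hypothetical one-hidden-layer regression network with made-up weights, matching the single-output case discussed in the text) computes the back-propagated gradients and compares one of them against a finite-difference approximation:<br />

```python
import math

def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, U, w):
    """a_j = sum_l U[j][l] x_l, z_j = sigma(a_j), yhat = sum_j w_j z_j."""
    a = [sum(U[j][l] * x[l] for l in range(len(x))) for j in range(len(U))]
    z = [sigma(aj) for aj in a]
    yhat = sum(w[j] * z[j] for j in range(len(w)))
    return a, z, yhat

def backprop(x, y, U, w):
    """Gradients of err = (y - yhat)^2 w.r.t. w and U, via the chain rule."""
    a, z, yhat = forward(x, U, w)
    derr_dyhat = -2.0 * (y - yhat)
    grad_w = [derr_dyhat * z[j] for j in range(len(w))]
    # delta_j = d err / d a_j = sigma'(a_j) * w_j * d err / d yhat
    delta = [sigma(a[j]) * (1 - sigma(a[j])) * w[j] * derr_dyhat
             for j in range(len(w))]
    grad_U = [[delta[j] * x[l] for l in range(len(x))] for j in range(len(U))]
    return grad_w, grad_U

# Tiny made-up network and data point for checking the derivation.
x, y = [0.5, -1.0], 0.8
U = [[0.1, 0.2], [-0.3, 0.4], [0.5, -0.6]]
w = [0.7, -0.8, 0.9]
grad_w, grad_U = backprop(x, y, U, w)

# Finite-difference check of one hidden weight:
h = 1e-6
U[1][0] += h; err_plus = (y - forward(x, U, w)[2]) ** 2
U[1][0] -= 2 * h; err_minus = (y - forward(x, U, w)[2]) ** 2
U[1][0] += h  # restore the weight
fd = (err_plus - err_minus) / (2 * h)
```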
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the target output and <math>\hat{y_k}</math> is the network's actual output.<br />
#By propagating to the previous layers evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math> where <math>i</math> is associated to the previous layer.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
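The steps above can be sketched as a small training loop (an illustrative Python example on made-up data: a 1-input, 2-hidden-unit, 1-output regression network without bias terms, trained by stochastic gradient steps):<br />

```python
import math

def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical toy data; weights initialized to small fixed values.
data = [(0.0, 0.1), (0.5, 0.4), (1.0, 0.9)]
U = [[0.1], [-0.1]]   # hidden weights u_{jl}
w = [0.1, 0.1]        # output weights w_j
rho = 0.1             # learning rate

def total_error():
    err = 0.0
    for x, y in data:
        z = [sigma(U[j][0] * x) for j in range(2)]
        err += (y - sum(w[j] * z[j] for j in range(2))) ** 2
    return err

err_before = total_error()
for _ in range(500):
    for x, y in data:                        # step 1: feed a training point
        a = [U[j][0] * x for j in range(2)]
        z = [sigma(aj) for aj in a]
        yhat = sum(w[j] * z[j] for j in range(2))
        d_yhat = -2.0 * (y - yhat)           # step 2: error at the output
        delta = [sigma(a[j]) * (1 - sigma(a[j])) * w[j] * d_yhat
                 for j in range(2)]          # step 3: propagate back
        for j in range(2):                   # steps 4-5: gradient step
            w[j] -= rho * d_yhat * z[j]
            U[j][0] -= rho * delta[j] * x
err_after = total_error()
```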
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method mentioned in class is to randomize the weights before the first step. This is not likely to be near the optimal solution in every case, but it is simple to implement. To be more specific, random values near zero (usually from [-1,1]) are a good choice for the initial weights; in this case the model evolves from a nearly linear one into a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output using a linear approximation of <math>\,\sigma(a_i)</math>, which finds the optimal weights in the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the number of iterations increases.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate leads to a very slow convergence rate (a very long learning phase). The advantage of a small learning rate, however, is that it helps guarantee convergence. Thus it is generally better to choose a relatively small learning rate to ensure stability; a common choice for <math>\,\rho</math> is between 0.01 and 0.7.<br />
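The effect of the learning rate can be seen on the one-dimensional objective <math>\,f(\beta)=\beta^2</math> (a minimal Python sketch, not from the lecture): the gradient step is <math>\beta \leftarrow \beta(1-2\rho)</math>, which converges only when <math>\,|1-2\rho|<1</math>, i.e. <math>\,0<\rho<1</math>:<br />

```python
# Gradient descent on f(b) = b^2 (gradient 2b):
# each step is b <- b - rho * 2b = b * (1 - 2*rho).

def descend(rho, b=1.0, steps=50):
    for _ in range(steps):
        b -= rho * 2 * b
    return b

b_small = descend(0.1)   # converges toward the minimum at 0
b_large = descend(1.1)   # overshoots and oscillates with growing magnitude
```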
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we mainly discuss how to estimate the number of hidden units at the outset. This estimate should then be refined using cross-validation (CV), leave-one-out (LOO) or other complexity-control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from a highly complicated mixture model, more hidden units are needed. <br />
<br />
In fact, the number of hidden units determines the size of the model, and therefore the total number of weights in the model. Generally speaking, the number of weights should not be larger than the number of training points, say N; thus N/10 is sometimes a good choice. In practice, however, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this Neural Network, the number of nodes is reduced until we reach a layer whose number of nodes equals the desired dimensionality. (In the very first few layers the number of nodes need not be strictly decreasing, as long as a layer with fewer nodes is eventually reached.) From this middle layer on, the previous layers are mirrored, so the output layer has the same number of units as the input layer. Now note that if we feed the network with each point and get an output approximately equal to the input, then the input has been reconstructed from the middle-layer units alone, so the output of the middle-layer units can represent the input in fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
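The bottleneck architecture can be sketched as follows (a hypothetical Python example with random, untrained weights; it only illustrates the layer dimensions of the mirrored configuration, not actual training):<br />

```python
import math, random

random.seed(0)

def layer(vec, n_out):
    """One fully connected sigmoid layer with (hypothetical) random weights."""
    return [1.0 / (1.0 + math.exp(-sum(random.uniform(-1, 1) * v for v in vec)))
            for _ in range(n_out)]

# Bottleneck configuration: 8 -> 5 -> 2 -> 5 -> 8.
x = [random.random() for _ in range(8)]   # an 8-dimensional input point
outputs = []
h = x
for n in (5, 2, 5, 8):
    h = layer(h, n)
    outputs.append(h)

low_dim = outputs[1]         # middle-layer output: the 2-dimensional code
reconstruction = outputs[-1] # same dimension as the input
```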
<br />
=== Deep Neural Network ===<br />
Back-propagation may not work well in practice when there are too many hidden layers, since the <math>\,\delta</math> values may become negligible and the error signal vanishes. This is a numerical problem in which it is difficult to estimate the errors, so in practice configuring a Neural Network with back-propagation involves some subtleties.<br />
Deep Neural Networks became popular two or three years ago, when effective layer-by-layer training methods were introduced by Geoffrey Hinton and his collaborators. A Deep Neural Network training algorithm deals with the training of a Neural Network with a large number of layers.<br />
<br />
The approach to training the deep network is to first assume the network has only two layers and train those two layers; after that we train the next two layers, and so on.<br />
<br />
Although we know the input and expect a particular output, we do not know the correct output of the hidden layers; this is the main issue the algorithm deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, inspired by the theory in statistical physics concerning the most stable configuration; or somehow finding the output of the second layer that is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case, for example. A feed-forward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach adapted to changes in the booking rules, and the system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to model human brains, hence the fancy name "Neural Network". But now we know that they are just layers of logistic regressions stacked on top of each other and have nothing to do with the actual working principles of the brain.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. As a result of such claims, it is important to keep the right perspective on what this field of study is trying to accomplish: the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, the Neural Network lacks a strong learning theory to back up its "success", which makes it hard to apply and adjust wisely. Partly for this reason, it is no longer an active research area in machine learning, although NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model has very high precision on the training set but very poor ability to predict outcomes for new instances, especially outside the domain of the training set.<br />
<br />
In a Neural Network, if the network is too deep it will have many degrees of freedom and will learn every characteristic of the training data set. It will then give very precise outcomes on the training set but will not be able to generalize the commonality of the training set to predict the outcomes of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model that fits it best. We can find a polynomial of high degree that passes through almost all the points in the training set. But in fact the training set comes from a line model. Although the complex model has less error on the training set, it diverges from the line in the ranges where we have no training points; because of that, the high-degree polynomial has very poor predictive power on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that any yellow fruit is a banana, then it is not a good classifier: just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas, but a new banana of a slightly different shape, for example, would not be detected. This is the tradeoff: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data, for example that it belongs to a family of polynomials or to a class of neural networks. There are other choices as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average to get the empirical error rate, which estimates the probability that <math>h(x_{i}) \neq y_{i}</math>.<br />
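For instance (a made-up Python example, not from the lecture), with a simple threshold classifier <math>\,h</math> and five labelled points:<br />

```python
# Empirical error rate L_h: the fraction of training points the classifier h
# gets wrong (the classifier and labels below are made up for illustration).

def h(x):
    return 1 if x >= 0 else -1

points = [(-2, -1), (-1, -1), (0.5, 1), (1, 1), (2, -1)]  # (x_i, y_i) pairs
L_h = sum(1 for x, y in points if h(x) != y) / len(points)
```

Here only the last point is misclassified, so the empirical error rate is 1/5.<br />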
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
There is a downward bias to this estimate, meaning that it tends to underestimate the true error rate. <br />
<br />
As complexity increases from low to high, the training error rate keeps decreasing. When we apply the model to test data, however, the error rate decreases to a point and then increases, since the model has not seen these data before. This can be explained as follows: the training error decreases as we fit the model better by increasing its complexity, but as we have seen, an overly complex model does not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is where the error rate on the test data is minimized; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in \mathbb{R}</math>. To estimate it, we use an estimator, which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, unbiasedness is not the only property that matters for an estimator: the mean squared error is often more important. In statistics, there are problems for which it may be good to use an estimator with a small bias. An estimator with a small bias may have a smaller mean squared error, or may be median-unbiased rather than mean-unbiased (the standard unbiasedness property); median-unbiasedness is invariant under monotone transformations, while mean-unbiasedness may be lost under nonlinear transformations. For example, using an unbiased estimator with a large mean squared error puts us at high risk of a large error, whereas a biased estimator with a small mean squared error can improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From Figure 3, we can see that the three quantities are related by:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a given mean squared error (MSE), a low bias comes with a high variance and vice versa.<br />
<br />
Test error is a good estimate of the MSE. We want a balanced trade-off between bias and variance (neither too high), even though this means accepting some bias.<br />
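This decomposition can be checked by simulation (a hypothetical setup: a deliberately biased estimator <math>\,0.9\bar{X}</math> of a Gaussian mean; the identity holds exactly for the empirical moments):<br />

```python
import random

random.seed(0)

# Estimate the mean of N(mu, 1) with a deliberately biased estimator
# f_hat = 0.9 * sample_mean, and check MSE = Variance + Bias^2.
mu, n, trials = 2.0, 10, 20000
estimates = []
for _ in range(trials):
    sample = [random.gauss(mu, 1) for _ in range(n)]
    estimates.append(0.9 * sum(sample) / n)

mean_est = sum(estimates) / trials
bias = mean_est - mu                                          # E[f_hat] - f
var = sum((e - mean_est) ** 2 for e in estimates) / trials    # Var(f_hat)
mse = sum((e - mu) ** 2 for e in estimates) / trials          # E[(f_hat - f)^2]

print(abs(mse - (var + bias ** 2)))  # ~0: MSE = Var + Bias^2
```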
<br />
<br />
Referring to Figure 2, overfitting begins at the point where the error on the training data (training sample line) continues to decrease while the error on the test data (test sample line) starts to increase. There are two main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimate<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimate<br />
<br />
<math>\hookrightarrow</math> Cross-validation is a fast way to obtain such an estimate<br />
<br />
<math>\hookrightarrow</math> An error bound can be computed analytically using a probability inequality<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
>> x <- rnorm(200,0,1)<br />
>> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
>> xtest <- rnorm(50,1,1)<br />
>> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
>> p1 <- lm(y~x)<br />
>> p2 <- lm(y ~ poly(x,2))<br />
>> pn <- lm(y ~ poly(x,10))<br />
>> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are based on a <math>\,N(0,1)</math> distribution, while the test set has a <math>\,N(1,1)</math> distribution. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random variation. Polynomial least-squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
>> # calculate the mean squared error of the degree 1 polynomial<br />
>> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
>> [1] 1.576042<br />
>> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 7.727615<br />
: Training and test mean squared errors for the linear fit. Both are quite high, and since the data is non-linear, the different mean value of the test data increases the error considerably.<br />
>> # calculate the mean squared error of the degree 2 polynomial<br />
>> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
>> [1] 0.08608467<br />
>> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 0.08407432<br />
: This fit is far better, and there is little difference between the training and test errors.<br />
>> # calculate the mean squared error of the degree 10 polynomial<br />
>> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
>> [1] 0.07967558<br />
>> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but not by much, while the test error rises sharply: overfitting makes the model a poor predictor. As the degree rises further, floating-point precision also becomes an issue, and a good fit is not even consistently produced for the training data.<br />
>> # calculate the mean squared error of the sin/cos fit<br />
>> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
>> [1] 0.1105446<br />
>> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 1.320404<br />
: Fitting a function of the form <math>\,\sin(x)+\cos(x)</math> works fairly well on the training set, but because it is not the true underlying function, it fails on test data drawn from a different region of the input space.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V: the model has not seen V during training, and since we know the proper label of every element in V, we can count how many were misclassified.<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although the training error always decreases with increasing model complexity, the test error starts to increase from a certain point, which is known as overfitting (see [[#prediction-error|figure 2]] above). Since the test error is the best estimate of the MSE (mean squared error), the idea is to divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{(x_i, y_i) \in \nu} I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
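A minimal Python sketch of this train/validation split (the data set, seed, and split sizes here are invented for illustration, echoing the R example above):<br />

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadratic data with noise; train on T only, estimate error on the
# untouched validation set V.
x = rng.normal(0, 1, 200)
y = x**2 - 0.5 * x + rng.normal(0, 0.3, 200)

x_tr, y_tr = x[:150], y[:150]    # training set T
x_va, y_va = x[150:], y[150:]    # validation set V, unseen during fitting

def val_mse(deg):
    p = np.polyfit(x_tr, y_tr, deg)                  # fit using T only
    return np.mean((np.polyval(p, x_va) - y_va) ** 2)

errors = {deg: val_mse(deg) for deg in (1, 2, 10)}
print(errors)  # the quadratic fit should have much lower validation error than the linear one
```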
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in reality data may be hard to collect (and we often suffer from the curse of dimensionality), so a larger data set may be hard to come by, and we may not be able to afford to sacrifice part of our limited resources as a separate validation set. In this case we use another method that addresses this problem, K-fold cross-validation. We divide the data set into <math>\,K</math> subsets of roughly equal size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
if <math>\,K=n</math> (leave-one-out): low bias, high variance. Each subset contains a single element, so the model is trained with all but one point and then validated on that point.<br />
<br />
if <math>\,K</math> is small (say 2-fold or 5-fold): higher bias, lower variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th <math>( \,k \in [ 1, K ] )</math> part, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>; the overall estimate is the average<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. Next we use the first three sets as training sets and the last as validation set, then the 1st, 2nd, 4th subsets as training set and the 3rd as validation set, so on and so forth until all the subsets have been the validation set once (all observations are used for both training and validation). After we get <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for degree 1 model. Similarly, we can estimate the error for n degree model and generate a simulating curve. Now we are able to choose the right degree which corresponds to the minimum error. Also, we can use this method to find the optimal unit number of hidden layers of neural networks. We can begin with 1 unit number, then 2, 3 and so on and so forth. Then find the unit number of hidden layers with lowest average error.<br />
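The polynomial-degree example above can be sketched in Python (a toy illustration: the data, seed, fold count, and candidate degrees are all made up):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# 4-fold cross-validation to choose a polynomial degree: each fold is the
# validation set once, and the K per-fold errors are averaged.
x = rng.normal(0, 1, 120)
y = x**2 - 0.5 * x + rng.normal(0, 0.3, 120)
K = 4
folds = np.array_split(np.arange(120), K)

def cv_error(deg):
    errs = []
    for k in range(K):
        va = folds[k]                                            # k-th part: validation
        tr = np.concatenate([folds[i] for i in range(K) if i != k])  # other K-1 parts
        p = np.polyfit(x[tr], y[tr], deg)
        errs.append(np.mean((np.polyval(p, x[va]) - y[va]) ** 2))
    return np.mean(errs)                     # L_hat = (1/K) sum of L_hat_k

scores = {d: cv_error(d) for d in range(1, 7)}
print(min(scores, key=scores.get))  # the degree minimizing the average CV error
```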
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be denoted by <math>\mathbf{y}</math> and the vector of fitted values by <math>\hat{\mathbf{y}}</math>. Then<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
For linear smoothers, the leave-one-out cross-validation error can be computed from a single fit:<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}]^{2}</math>,<br />
where <math>\hat f^{-i}</math> denotes the fit computed with the <math>i</math>th observation left out.<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that only the trace of <math>\mathbf{H}</math> is needed, which is more easily computed than the individual diagonal entries.<br />
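Both the leave-one-out identity and the GCV approximation are easy to verify numerically for ordinary least squares (a toy linear model; the data and seed are invented):<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# For a linear smoother y_hat = H y, leave-one-out CV has the closed form
# (1/N) sum ((y_i - f_hat(x_i)) / (1 - H_ii))^2, and GCV replaces each H_ii
# by the average trace(H)/N. Check the LOO identity against brute force.
N = 30
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(0, 0.5, N)

H = X @ np.linalg.inv(X.T @ X) @ X.T
resid = y - H @ y

loo_shortcut = np.mean((resid / (1 - np.diag(H))) ** 2)
gcv = np.mean((resid / (1 - np.trace(H) / N)) ** 2)

# Brute force: refit N times, each time leaving one point out.
loo_brute = 0.0
for i in range(N):
    keep = np.arange(N) != i
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo_brute += (y[i] - X[i] @ beta) ** 2
loo_brute /= N

print(abs(loo_shortcut - loo_brute))  # ~0: the identity holds exactly
```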
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point of the original training set to train the model, then using the left-out point to estimate the true error. By repeating this process for every data point in the data set, we obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every possible value of the parameter, we must train the model <math>\,K</math> times. This deficiency is even more obvious in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, where <math>\,n</math> is the number of data points in the data set.<br />
<br />
Fortunately, when adding data points to the classifier is reversible, calculating the difference between two classifiers is computationally cheaper than calculating the two classifiers separately. So, if the classifier trained on all the data points is known, we simply undo the contribution of one data point at a time, <math>\,n</math> times, to calculate the leave-one-out cross-validation error rate.<br />
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as a way to obtain a robust neural network that is insensitive to noise. Since the number of hidden units in a neural network is usually decided by domain knowledge, the network can easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function shows linear behavior, and the NN then collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights to values close to zero.<br />
<br />
Formally, we penalize nonlinear weights by adding a penalty term in the error function. Now the regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
Usually, too large a <math>\,\lambda</math> will make the weights <math>\,w_i</math> and <math>\,u_{jk}</math> too small. We can use cross-validation to choose <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take the partial derivatives with respect to the weights and update them by gradient descent:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the degree of linearity; in practice we may set <math>\,\lambda</math> by cross-validation. The starting weights matter: weights of exactly zero lead to zero derivatives and the algorithm does not move, while starting with weights that are too large means starting with a nonlinear model, which often leads to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
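The update rule <math>w^{new} \leftarrow w^{old} - \rho(\frac{\partial err}{\partial w} + 2\lambda w)</math> can be sketched on a toy least-squares problem (made-up data; not a full neural network, just the penalized gradient step):<br />

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy problem: linear least squares fitted by the weight-decay update
# w <- w - rho * (d err/d w + 2 * lam * w).
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 50)

def fit(lam, rho=0.01, steps=2000):
    w = rng.normal(0, 0.01, 3)     # start with small weights, near the linear regime
    for _ in range(steps):
        grad_err = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the squared error
        w = w - rho * (grad_err + 2 * lam * w)      # penalty term shrinks w toward 0
    return w

w_free = fit(lam=0.0)
w_decay = fit(lam=1.0)
print(np.linalg.norm(w_decay) < np.linalg.norm(w_free))  # True: decay shrinks the weights
```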
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with a single hidden layer and an output layer, with weights only from the hidden layer to the output layer; it can be trained without back-propagation since the weights have a closed-form solution. The neurons in the hidden layer contain basis functions. A choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data. The radii of the RBF functions may differ, and the centers and radii can be determined through clustering or an EM algorithm. When the vector x is given from the input layer, each hidden neuron computes the radial distance from the neuron’s center and then applies the RBF function to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
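These identities can be checked numerically by finite differences; for example, for the third one (random matrices chosen only for illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(4)

# Finite-difference check of the identity d Tr(X^T A X) / dX = (A^T + A) X.
A = rng.normal(size=(3, 3))
X = rng.normal(size=(3, 2))

analytic = (A.T + A) @ X

eps = 1e-6
numeric = np.zeros_like(X)
for i in range(3):
    for j in range(2):
        Xp = X.copy(); Xp[i, j] += eps
        Xm = X.copy(); Xm[i, j] -= eps
        # central difference of Tr(X^T A X) in the (i, j) entry of X
        numeric[i, j] = (np.trace(Xp.T @ A @ Xp) - np.trace(Xm.T @ A @ Xm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```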
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
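A minimal sketch of the whole procedure in Python (the centers and width here are simply assumed for illustration, rather than found by clustering or EM as described above; the data is a made-up noisy sine curve):<br />

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy 1-D regression data.
x = np.linspace(-3, 3, 100)
y = np.sin(x) + rng.normal(0, 0.1, 100)

centers = np.linspace(-3, 3, 8)        # assumed centers mu_j
sigma = 1.0                            # assumed common width

# Design matrix Phi: Gaussian radial basis functions evaluated at each x.
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

# Closed-form weights: W = (Phi^T Phi)^{-1} Phi^T Y.
W = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
y_hat = Phi @ W

print(np.mean((y_hat - y) ** 2))  # training mean squared error
```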
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the matrix (n by k) of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the matrix (n by M+1) of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the matrix (M+1 by k) of weights.<br />
<br />
where the extra basis function <math>\Phi_{0}</math> is set to 1.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function. Giving the familiar form,<br /><br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
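A quick numerical check that the normalized basis functions sum to one at every input (toy centers, chosen only for illustration):<br />

```python
import numpy as np

# Normalized radial basis functions: at every input x the phi* values
# sum to one, so the output is a weighted combination of the w_jk.
x = np.linspace(-2, 2, 50)
centers = np.array([-1.0, 0.0, 1.0])   # toy centers mu_j
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / 2)
Phi_star = Phi / Phi.sum(axis=1, keepdims=True)   # phi*_j(x) = phi_j(x) / sum_r phi_r(x)

print(np.allclose(Phi_star.sum(axis=1), 1.0))  # True
```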
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit, then, can be thought of as representing a feature; if there are more hidden units than input units, we essentially project to a higher-dimensional space, as in our earlier trick. This does not mean that an RBF network actually does this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Precisely because of this power, however, overfitting becomes a more pressing concern: we have to control the model's complexity so that it fits the general pattern rather than an arbitrary training set.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can expand the class-conditional density in the above expression by marginalizing over the hidden variable <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When using an RBF network to do classification, we usually treat it as a regression problem and set a threshold to decide each data point's class membership. However, to gain some insight into what the RBF network is doing in classification terms, we can think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as shown in the graph on the right-hand side, that we have three random variables: <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that each class can trigger a different hidden random variable <math>\displaystyle j</math>. For instance, we can assume that each <math>\displaystyle j</math> corresponds to a Gaussian distribution (it could be any other distribution as well), all of the same form but with different parameters. From each Gaussian triggered by each class, we sample some data points. In the end, we therefore obtain a data set that is not strictly Gaussian, but is a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
We multiply each term of the sum by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math> (note that <math>\displaystyle j</math> is the summation index, so this factor belongs inside the sum). Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})}{Pr(x)} \sum_{j} Pr(x | j)*Pr(j | y_{k})*\frac {Pr(j)}{Pr(j)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, the posterior of the class is just a sum over <math>\displaystyle j</math> of the product of two posteriors, <math>\displaystyle Pr(j | x)</math> and <math>\displaystyle Pr(y_k | j)</math>.<br />
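This identity can be verified numerically on a small discrete model in which <math>\displaystyle x</math> depends on <math>\displaystyle y_k</math> only through <math>\displaystyle j</math> (the probability tables below are made up):<br />

```python
import numpy as np

# Numeric check of Pr(y|x) = sum_j Pr(j|x) * Pr(y|j) on a small discrete
# model with joint Pr(y, j, x) = Pr(y) Pr(j|y) Pr(x|j).
p_y = np.array([0.6, 0.4])                    # Pr(y_k)
p_j_given_y = np.array([[0.7, 0.3],           # rows: y, cols: j
                        [0.2, 0.8]])
p_x_given_j = np.array([[0.5, 0.3, 0.2],      # rows: j, cols: x
                        [0.1, 0.4, 0.5]])

p_yj = p_y[:, None] * p_j_given_y             # Pr(y, j)
p_j = p_yj.sum(axis=0)                        # Pr(j)
p_jx = p_j[:, None] * p_x_given_j             # Pr(j, x)
p_x = p_jx.sum(axis=0)                        # Pr(x)

# Left side: Pr(y | x) computed from the full joint distribution.
p_yjx = p_yj[:, :, None] * p_x_given_j[None, :, :]
lhs = p_yjx.sum(axis=1) / p_x[None, :]

# Right side: sum over j of Pr(y | j) * Pr(j | x).
p_j_given_x = p_jx / p_x[None, :]
p_y_given_j = p_yj / p_j[None, :]
rhs = p_y_given_j @ p_j_given_x

print(np.allclose(lhs, rhs))  # True
```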
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results that we derived above to our RBF Network. In a RBF Network, as we can see on the right hand side, we have a set of data, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, and the hidden basis function, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and then we have some output, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. Also, we have weights from the hidden layer to output layer. The output is just the linear sum of <math>\displaystyle \phi</math>’s. <br />
<br />
Now consider probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weights <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in the one-dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, where <math>\displaystyle j</math> runs from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi_j</math> is a radial basis function: it is as if we place a Gaussian over the data, each with its own center <math>\displaystyle \mu_j</math>. Then, what <math>\displaystyle \phi_j</math> computes is the similarity of any data point to that center. <br />
<br />
We can see the graph on the left, which plots the density of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance: as a point moves far from the center <math>\displaystyle \mu_{1}</math>, <math>\displaystyle \phi_{1}</math> drops to nearly zero. Remember that we can usually find a non-linear regression or classification of the input space by doing a linear one in some extended space or feature space (more details in the Aside). Here, the <math>\displaystyle \phi</math>’s actually produce that feature space. <br />
<br />
So, one way to look at this is that <math>\displaystyle \phi</math> tells us, given an input, how likely a particular feature is to be present. Say, for example, we define the features as the centers of these Gaussian distributions. Then, this <math>\displaystyle \phi</math> function computes, for a given data point, the probability of this kind of feature appearing. If the data point is right at the center, the value of that <math>\displaystyle \phi</math> is one, i.e. the probability is 1. If the point is far from the center, the probability (the <math>\displaystyle \phi</math> function value) will be close to zero, that is, it is less likely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given the data. <br />
<br />
When we have those features, then <math>\displaystyle y</math> is the linear combination of the features. Hence, any of the weights <math>\displaystyle w</math>, which is equal to <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> will appear given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> shows the probability of class membership given feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
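As a small sanity check on this probabilistic reading, here is a sketch in Python. All numbers (centers, widths, weights) are made up for illustration. To make the interpretation exact, the <math>\displaystyle \phi_j(x)</math> are normalized to sum to 1 so they can play the role of <math>\displaystyle Pr(j|x)</math>, and each weight row sums to 1 so <math>\displaystyle w_{jk}</math> can play the role of <math>\displaystyle Pr(y_k|j)</math>; this sidesteps the inconsistency raised in the Note above.<br />

```python
import math

def phi(x, mu, sigma):
    """Gaussian radial basis function: similarity of x to center mu."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# Hypothetical 1-D network: M = 2 basis functions, K = 2 classes.
centers = [0.0, 4.0]
sigma = 1.0
# w[j][k] plays the role of Pr(y_k | j): each row sums to 1.
w = [[0.9, 0.1],
     [0.2, 0.8]]

def rbf_posterior(x):
    """Pr(y_k | x) = sum_j Pr(j | x) * Pr(y_k | j), with the phi_j(x)
    normalized so that they sum to 1 and can serve as Pr(j | x)."""
    raw = [phi(x, mu, sigma) for mu in centers]
    total = sum(raw)
    p_j_given_x = [r / total for r in raw]
    return [sum(p_j_given_x[j] * w[j][k] for j in range(len(centers)))
            for k in range(len(w[0]))]

post = rbf_posterior(0.1)
print(post)  # a point near the first center mostly inherits that center's class weights
```

With these normalizations the output is a proper distribution over classes: the two entries are non-negative and sum to 1.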
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>, each with d features. These n data points form the columns of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we don’t perform least squares on this <math>\displaystyle X</math> matrix; we do least squares on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point with its center. Hence, the new set of features are actually representing M centers in our data set, and for each data point, its new features check how this point is similar to the first center; how it is similar to the second center; and how it is similar to the <math>\displaystyle M^{th}</math> center. And this checking process will apply to all data points. Therefore, feature space gives another representation of our data set. <br />
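The idea of "least squares on the feature space rather than on <math>\displaystyle X</math>" can be sketched in Python. The data set, the two centers, and <math>\displaystyle \sigma = 1</math> below are all arbitrary choices for illustration; with only M = 2 basis functions the fit is rough, but it shows the mechanics.<br />

```python
import math

def phi(x, mu, sigma=1.0):
    """Gaussian basis function with a fixed width."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# Hypothetical 1-D data set (d = 1, n = 5) and M = 2 centers.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 0.8, 0.5, 0.2, 0.0]
centers = [0.0, 4.0]

# Feature-space representation: row i of Phi is (phi_1(x_i), phi_2(x_i)).
Phi = [[phi(x, mu) for mu in centers] for x in xs]   # n x M

# Least squares on Phi (not on X): solve (Phi^T Phi) w = Phi^T y.
# For M = 2 we can use the closed-form 2x2 inverse.
a = sum(r[0] * r[0] for r in Phi)
b = sum(r[0] * r[1] for r in Phi)
c = sum(r[1] * r[1] for r in Phi)
t0 = sum(r[0] * y for r, y in zip(Phi, ys))
t1 = sum(r[1] * y for r, y in zip(Phi, ys))
det = a * c - b * b
w = [(c * t0 - b * t1) / det, (a * t1 - b * t0) / det]

def predict(x):
    """Regression in the feature space: linear in phi, non-linear in x."""
    return w[0] * phi(x, centers[0]) + w[1] * phi(x, centers[1])

print(w)
```

The resulting predictor is linear in the new features but non-linear in the original input, which is exactly the point of working in the feature space.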
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit for the training data, we often want to increase the complexity of our RBF Network. By its construction, the only way to change the complexity of an RBF Network is to add or remove basis functions: a larger number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF Network can fit any training set; however, that does not mean the model generalizes well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is estimate the training error. After working through the training error, we will see that it can in fact be decomposed, and one component of the decomposition is the Mean Squared Error (MSE). In the later notes, we will find that our final goal is to get a good estimate of the MSE. Moreover, in order to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn’t always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left-hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model only works for the training data, and is useless for prediction when new data points are introduced.<br />
<br />
> n<-20;<br />
> x<-seq(1,10,length=n);<br />
> alpha<-2.5;<br />
> beta<-1.75;<br />
> y<-alpha+beta*x+rnorm(n);          # linear model plus Gaussian noise<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');   # the true linear model<br />
> lines(spline(x, y), col = 2);      # interpolating spline: over-fits the noise<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection (Stein's Unbiased Risk Estimate) - November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on training data, although our goal really is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on the testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
But training error and testing error do not have a simple monotone relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error often increases dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, the model does not represent the features of the true underlying data source, and this is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model typically adapts to the training data, and hence the training error will be an overly optimistic estimate of the testing error. An obvious way to better estimate the testing error is to add a penalty term to the training error to compensate. SURE is developed on exactly this basis.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the sum of squared errors over the training sample.<br />
*<math>\displaystyle Err=\sum_{i=1}^M (\hat y_i-y_i)^2 </math> denote the ''test error'', the sum of squared errors on an independent test sample of size <math>M</math>.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations are <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>\sim N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}^N_{i=1}</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>, then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, where <math>\displaystyle y_i</math> and <math>\hat f_i</math> both have the same mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E(g'(Z))</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>:<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model, not the function of the observations <math>\displaystyle y_i</math>, then <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
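Stein's lemma can also be checked numerically by Monte Carlo simulation. In the sketch below we pick <math>\displaystyle g(z)=z^2</math>, so <math>\displaystyle g'(z)=2z</math> and both sides of the lemma should approach <math>\displaystyle 2\mu\sigma^2</math>; the values of <math>\displaystyle \mu</math>, <math>\displaystyle \sigma</math> and the sample size are arbitrary choices for illustration.<br />

```python
import random

# Monte Carlo sanity check of Stein's lemma with g(z) = z^2, so g'(z) = 2z.
# For Z ~ N(mu, sigma^2): E[g(Z)(Z - mu)] should equal sigma^2 * E[g'(Z)],
# and both sides work out analytically to 2 * mu * sigma^2.
random.seed(0)
mu, sigma, n = 1.5, 0.7, 200_000

lhs = rhs = 0.0
for _ in range(n):
    z = random.gauss(mu, sigma)
    lhs += z * z * (z - mu)        # sample of g(Z)(Z - mu)
    rhs += sigma ** 2 * 2 * z      # sample of sigma^2 * g'(Z)
lhs /= n
rhs /= n

exact = 2 * mu * sigma ** 2        # analytic value, here 1.47
print(lhs, rhs, exact)
```

With 200,000 samples both Monte Carlo averages should land well within a few hundredths of the analytic value.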
<br />
<br />
<br />
====Two Different Cases====<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is introduced to the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent. Therefore <math>\displaystyle cov(y_i,\hat f)=0</math> (or think about <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, it has no influence on <math>\hat f</math>, because the estimation of <math>\hat f</math> is from the training data, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>), and <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Based on the notation defined above, we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross validation, to avoid overfitting or underfitting, a validation data set is kept independent from the estimated model.<br />
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing the model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored, because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can instead be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator, in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the optimum number of basis functions should be the one that gives the minimum generalization error. For the Radial Basis Function Network, by setting <math>\frac{\partial err}{\partial W}</math> equal to zero, we get the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, we can tell that <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Thus, we obtain the further simplification <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=M</math>, since <math>\displaystyle \Phi</math> projects the input matrix <math>\,X</math> onto the space spanned by the <math>\,M</math> basis functions. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
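The identity <math>\displaystyle Trace(H)=M</math> can be verified numerically. The sketch below uses a hypothetical 1-D data set with M = 2 Gaussian basis functions (all values made up for illustration) and computes the diagonal of the hat matrix directly.<br />

```python
import math

def phi(x, mu, sigma=1.0):
    """Gaussian basis function."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
centers = [0.0, 4.0]          # M = 2 basis functions
Phi = [[phi(x, mu) for mu in centers] for x in xs]   # n x M design matrix

# (Phi^T Phi)^{-1} for M = 2 via the closed-form 2x2 inverse.
a = sum(r[0] * r[0] for r in Phi)
b = sum(r[0] * r[1] for r in Phi)
c = sum(r[1] * r[1] for r in Phi)
det = a * c - b * b
inv = [[c / det, -b / det], [-b / det, a / det]]

# Diagonal entries H_ii = phi(x_i)^T (Phi^T Phi)^{-1} phi(x_i); sum them up.
trace_H = 0.0
for r in Phi:
    hii = sum(r[j] * sum(inv[j][k] * r[k] for k in range(2)) for j in range(2))
    trace_H += hii
print(trace_H)  # ≈ 2.0 = M
```

The trace equals M regardless of the particular data, because <math>\displaystyle Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})=Trace(I_M)</math>.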
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimum number of basis functions by choosing the model with the smallest MSE over the set of models considered. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\displaystyle M</math>, each with training error <math>\displaystyle err(M)</math>: <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples, and the noise variance <math>\sigma^2</math> can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Please see Figure 27.1 on the right, which shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math>. Here err is the training error, i.e. the sum of squared residuals.<br />
<br />
err = sum((output - expected_output) .^ 2);<br />
sure_Err = err - num_data_point * sigma ^ 2 + 2 * sigma ^ 2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated from the training error as <math>\hat \sigma^2 = err/(N-1)</math>.<br />
<br />
err = sum((output - expected_output) .^ 2);<br />
sigma2 = err / (num_data_point - 1);<br />
sure_Err = err - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) <math>= \sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) <math>= \sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units) <math>= \sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then <math>\displaystyle -N\sigma^2 </math> is a constant (i.e. it has no impact),<br />
and we can ignore it. We only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and it changes when <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math>.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>\sim N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
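The algebra from (28.1) and (28.2) to (28.3) is easy to check numerically; the sketch below tries a few made-up values of <math>\displaystyle N</math>, <math>\displaystyle M</math> and <math>\displaystyle err</math> for illustration.<br />

```python
# Numeric check of the simplification (28.1)+(28.2) -> (28.3):
# with sigma^2 = err/(N-1),
#   err - N*sigma^2 + 2*sigma^2*(M+1)  ==  err*(2M+1)/(N-1).
for N, M, err in [(20, 3, 5.0), (100, 10, 1.3), (7, 1, 0.42)]:
    sigma2 = err / (N - 1)
    lhs = err - N * sigma2 + 2 * sigma2 * (M + 1)
    rhs = err * (2 * M + 1) / (N - 1)
    assert abs(lhs - rhs) < 1e-12
print("MSE identity (28.3) holds")
```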
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error will decrease and the MSE will increase when increasing the number of hidden units (i.e. the model is more complex).<br />
<br />
<br />
When the number of hidden units gets larger and larger, the training error will decrease until it approaches <math>\displaystyle 0 </math>. If the training error approaches <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) we can see the MSE will approach <math>\displaystyle 0 </math> as well. However, in fact this does not happen, since when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] happens, and the MSE will increase instead of staying close to <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
Note that <math>\displaystyle \sigma^2 </math> in (28.2) is estimated as an average of <math>\displaystyle err </math> itself. To deal with this problem, we can average <math>\displaystyle err</math> over models with different numbers of hidden units; for example, we can first take 1 hidden unit, and take 10 hidden units the next time.<br />
<br />
We can also see that, unlike the classical Cross Validation (CV) or Leave-one-out (LOO) techniques, the SURE technique does not need a separate validation set to find the optimal model. Hence, SURE uses less data than CV or LOO, which makes it suitable when there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering K-means clustering] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units is the same as the number of clusters; each cluster provides one basis function <math>\displaystyle \phi_j </math>.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, with the same form for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. For each existing center, we re-identify its cluster (every point in this cluster should be closer to this center than to any other).<br />
<br />
2. Compute the mean of each cluster and take it as the new center of that cluster.<br />
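The two iterated steps above can be sketched in plain Python; the 1-D points below are made up for illustration (in practice one would use a library routine, such as the MATLAB kmeans call shown in the example that follows).<br />

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means on 1-D points: alternate (1) assigning each point to
    its nearest center, and (2) moving each center to its cluster mean."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # k initial centers from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # step 1: reassign points
            j = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[j].append(p)
        new = [sum(c) / len(c) if c else centers[j]
               for j, c in enumerate(clusters)]   # step 2: recompute means
        if new == centers:                    # converged: assignments stable
            break
        centers = new
    return centers, clusters

pts = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]
centers, clusters = kmeans(pts, 2)
print(sorted(centers))  # one center near 0.15, one near 5.03
```

For these well-separated points the iteration converges to the two group means regardless of which data points are drawn as initial centers.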
<br />
<br />
Example:<br /><br />
Partition the data into 2 clusters (2 hidden units)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
>> 30 1<br />
>> size(C) <br />
>> 2 80<br />
>> size(sumD) <br />
>> 2 1<br />
>> c1=sum(IDX==1)<br />
>> 14<br />
>> c2=sum(IDX==2)<br />
>> 16<br />
>> sumD<br />
>> 85.6643<br />
>> 101.0419<br />
>> v1=sumD(1,1)/c1 <br />
>> 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
>> 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set; since MATLAB's kmeans treats rows as observations, <math>X</math> here contains 30 data points with 80 dimensions each. We then apply the kmeans method to separate <math>X</math> into 2 clusters. IDX is a vector of 1's and 2's indicating the cluster of each point, and its size is 30*1. <math>\displaystyle C </math> holds the centers (means) of the clusters, with size 2*80; sumD is the sum of squared distances between the data points and the center of their cluster. The <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> indicate the number of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the average squared distance within the first cluster (an estimate of <math>\displaystyle \sigma_1^2</math>); <math>\displaystyle v2 </math> is the same for the second cluster (<math>\displaystyle \sigma_2^2</math>). Now we can get <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> by the following equations. Finally, we will get the <math>\displaystyle MSE </math> and predict the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is the EM algorithm for the Gaussian mixture model. Generally speaking, the Gaussian mixture model performs soft clustering, while <math>K</math>-means performs hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
E-step: each point is assigned a weight (responsibility) for each cluster, based on its likelihood under the corresponding Gaussian. Unlike <math>K</math>-means, these weights lie between 0 and 1 rather than being hard 0/1 assignments. <br />
<br />
M-step: compute the weighted means and covariances and take them as the new means and covariances of the clusters.<br />
<br />
>> [P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier, which constructs linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized to what is known as the support vector machine. It produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separable by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
We can see that the data points, which fall into two classes in <math>\displaystyle \mathbb{R}^{2} </math>, can be separated by a linear boundary. If a dataset is indeed linearly separable, then there exist infinitely many possible separating hyperplanes for the training data, including the two black lines in the figure. The question is: which solution is the best when new data are introduced? <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers such as (28.4) that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: a distance must be positive, so we work with <math>\displaystyle y_id_i </math>. For a correctly classified point on the <math>\displaystyle +1 </math> side, <math>\displaystyle d_i>0</math>; on the <math>\displaystyle -1 </math> side, <math>\displaystyle d_i<0</math>; in both cases <math>\displaystyle y_id_i>0</math>.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>. We can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers====<br />
Choose the line farthest from both classes, i.e., the line with the maximum distance from the closest point (maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
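As a quick numerical illustration of these definitions, the signed distances and the margin can be computed directly; the hyperplane and points below are made up for this sketch, and Python is used here rather than the Matlab seen later in the notes:

```python
from math import sqrt

# Hypothetical separating hyperplane in R^2: beta^T x + beta_0 = 0
beta = [1.0, 1.0]
beta_0 = -1.0
norm_beta = sqrt(sum(b * b for b in beta))

# Toy labelled points, y_i in {-1, +1}
points = [([2.0, 2.0], 1), ([1.5, 1.0], 1), ([0.0, 0.0], -1), ([-1.0, 0.5], -1)]

def signed_distance(x):
    """Signed distance from x to the hyperplane: (beta^T x + beta_0) / ||beta||."""
    return (sum(b * xi for b, xi in zip(beta, x)) + beta_0) / norm_beta

# y_i * d_i is positive for correctly classified points;
# the margin is the smallest such value over the training set.
margin = min(y * signed_distance(x) for x, y in points)
```

For this toy configuration the closest point is <math>(0, 0)</math>, so the margin works out to <math>1/\sqrt{2}\approx 0.707</math>.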
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since the data are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
'''Properties''':<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Take two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle (\beta^{T}x_1+\beta_0)-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence,<math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and<math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
For any point on the hyperplane, <math>\displaystyle \beta^{T}x_0</math> gives the negative of the intercept. <br/><br />
<br />
<br />
3. The signed distance from any point <math>\displaystyle x_i </math> to the hyperplane is the projection of <math>\displaystyle (x_i-x_0)</math> onto the unit normal <math>\displaystyle \frac{\beta}{\|\beta\|} </math>, where <math>\displaystyle x_0 </math> is any point on the hyperplane:<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
<math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane and is correctly classified; then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math>, since the training set is finite<br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
Only the direction of <math>\displaystyle \beta </math> matters, so rescaling to <math>\displaystyle \frac{\beta}{c} </math> does not change that direction; the hyperplane is the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
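The claim that rescaling <math>\displaystyle \beta,\beta_0 </math> leaves the hyperplane unchanged can be checked directly. In this Python sketch the coefficients, the constant <math>c</math>, and the test points are all arbitrary:

```python
def classify(beta, beta_0, x):
    """Predicted label: the sign of beta^T x + beta_0."""
    s = sum(b * xi for b, xi in zip(beta, x)) + beta_0
    return 1 if s > 0 else -1

# Arbitrary coefficients, then the same coefficients divided by c > 0
beta, beta_0 = [2.0, -4.0], 6.0
c = 2.0
beta_c = [b / c for b in beta]
beta_0_c = beta_0 / c

points = [[0.0, 0.0], [3.0, 1.0], [5.0, 2.0], [-2.0, -1.0]]
# Rescaling changes neither the hyperplane nor any classification
same = all(classify(beta, beta_0, x) == classify(beta_c, beta_0_c, x) for x in points)
```

Every test point receives the same label under both parameterizations, which is why the canonical scaling can be imposed without loss of generality.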
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie, T., Tibshirani, R., Friedman, J. (2008). The Elements of Statistical Learning: 129-130
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
The Support Vector Machine is used to find a maximum margin hyperplane, assuming the two classes are separable. This margin can be written as <math>\,min\{y_id_i\}</math>, or the distance of each point from the hyperplane, where <math>\,d_i</math> is the distance and <math>\,y_i</math> is used as the sign.<br />
===Maximizing the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, and <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is correctly classified and not on the hyperplane<br />
<br />
This implies <math>\,\exists C > 0</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math> for all <math>\,i</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
Scaling <math>\,\beta, \beta_0</math> by a constant does not change the hyperplane, since only the direction matters. Thus, by rescaling <math>\,\beta, \beta_0</math> we can eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>; that is, the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math>.<br />
<br />
Now, in order to maximize the margin, we need to maximize <math>\,\frac{1}{\|\beta\|}</math>. <br />
<br />
In other words, find the minimum <math>\,\|\beta\|</math>, s.t. <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. The 1-norm of a vector is the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance); it is sometimes preferred for its robustness, but its derivative is discontinuous. The 2-norm, the Euclidean norm (the intuitive length of the vector), is easier to work with: <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>, which has the same minimizer as <math>\,\|\beta\|_2</math>.<br />
<br />
This is a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the constraints are satisfied while finding an optimal solution; the <math>\,\alpha_i</math> are introduced as dual variables (Lagrange multipliers). A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left(y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right)}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is the dual representation of the maximum-margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \,\forall\, i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the function that we need to maximize, along with two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are solving for <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>S</math> has entries <math>S_{ij} = y_iy_jx_i^Tx_j = (y_ix_i)^T(y_jx_j)</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's built-in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
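Before calling a QP solver, it can help to see what the matrix <math>S</math> looks like. The following sketch (in Python with invented toy data, not the 2_3 set used later) builds <math>S_{ij}=y_iy_jx_i^Tx_j</math> entry by entry:

```python
# Toy training set: three points in R^2 with labels in {-1, +1}
X = [[1.0, 2.0], [2.0, 0.0], [-1.0, -1.0]]
y = [1, 1, -1]
n = len(X)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# S[i][j] = y_i * y_j * x_i^T x_j, the matrix in the quadratic term of L(alpha)
S = [[y[i] * y[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]
```

By construction <math>S</math> is symmetric (it is a Gram matrix of the vectors <math>y_ix_i</math>), which is what makes the problem a well-posed quadratic program.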
<br />
===Quadprog example===<br />
<br />
Let's use quadprog to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's quadprog function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. Note that <code>quadprog</code> minimizes, while we want to maximize <math>\,L(\alpha)</math>; equivalently, we minimize <math>\,-L(\alpha) = \frac{1}{2}\alpha^TS\alpha - \underline{1}^T\alpha</math>, so we pass <math>\,H = S</math> and <math>\,f</math> equal to a vector of <math>\,-1</math>s.<br />
<br />
We'll use a simple one-dimensional data set: x is essentially <math>-1</math> or <math>+1</math> plus Gaussian noise, and y holds the corresponding labels. (Note: you could put the values straight into the quadprog call; they are separated for clarity)<br />
<br />
x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]';<br />
y = [-ones(100,1); ones(100,1)];<br />
S = (x'.*y) * (x'.*y)'; % S(i,j) = y_i*y_j*x_i'*x_j<br />
f = -ones(200,1);       % quadprog minimizes (1/2)*a'*S*a + f'*a = -L(a)<br />
A = [];                 % no inequality constraints beyond the bounds<br />
b = [];<br />
Aeq = y';               % enforces sum(alpha_i*y_i) = 0<br />
beq = 0;<br />
lb = zeros(200,1);      % alpha_i >= 0<br />
ub = [];                % there is no upper bound<br />
alpha = quadprog(S,f,A,b,Aeq,beq,lb,ub);<br />
<br />
This gives us the optimal <math>\,\alpha</math>. Two details are easy to get wrong here: <code>lb</code> must be the vector <code>zeros(200,1)</code> rather than the scalar <code>0</code>, and <code>f</code> must be a vector of <math>\,-1</math>s (since <code>quadprog</code> minimizes the negated dual); with either mistake, some of the returned <math>\,\alpha_i</math> can come back negative.<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{i}</math>. If <math>\,f</math> and <math>\,g_i</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial \hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum{\alpha_ig'(\hat{x})}=0</math><br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, <math>\hat{x}</math> cannot be an optimal solution.<br />
<br />
These are all straightforward except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that actually determine the optimal separating hyperplane we are looking for. They are the most difficult points to classify, and the most informative ones for classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br />In order for this condition to be satisfied, either <br/><math>\,\alpha_i= 0</math> or <br/><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
All points <math>x_i</math> are either exactly at (canonical) distance 1 from the hyperplane or farther than 1 away.<br />
<br />
'''Case 1: a point <math>\displaystyle x_i</math> > 1 away from the margin'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point <math>\displaystyle x_i</math> 1 away from the margin'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
<br />If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
<br />
Points on the margin, points with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
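Complementary slackness can be verified on a hand-solved toy problem. In this Python sketch, the one-dimensional data and the dual solution <math>\alpha</math> are chosen by hand (not produced by the course's Matlab routines):

```python
# Hand-solved toy problem: 1-D data, labels, and dual solution alpha
x = [-1.0, 1.0, 2.0]
y = [-1, 1, 1]
alpha = [0.5, 0.5, 0.0]   # the point at x = 2 is not a support vector

# beta = sum_i alpha_i y_i x_i
beta = sum(a * yi * xi for a, yi, xi in zip(alpha, y, x))
# beta_0 from a support vector: y_i(beta x_i + beta_0) = 1, and 1/y_i = y_i
beta_0 = y[1] - beta * x[1]

# KKT condition 3 (complementary slackness): alpha_i (y_i(beta x_i + beta_0) - 1) = 0
slack = [a * (yi * (beta * xi + beta_0) - 1) for a, yi, xi in zip(alpha, y, x)]
support_vectors = [xi for a, xi in zip(alpha, x) if a > 0]
```

The point at <math>x = 2</math> satisfies <math>y(\beta x+\beta_0) = 2 > 1</math>, so its <math>\alpha</math> is 0 and the product still vanishes; for the two margin points the constraint is active instead.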
<br />
===Using support vectors===<br />
<br />
Support vectors are important because the solution depends only on them; the support vector machine is therefore insensitive to points far from the boundary. If <math>\,\alpha_i = 0</math>, the corresponding term contributes nothing to the solution of the SVM problem; only points on the margin -- support vectors -- contribute.<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem:<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
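These three steps can be traced end to end on a minimal two-point problem. With <math>x_1=-1, y_1=-1</math> and <math>x_2=+1, y_2=+1</math>, the equality constraint forces <math>\alpha_1=\alpha_2=a</math>, and for this data the dual reduces to <math>L(a)=2a-2a^2</math>. The Python sketch below maximizes it by a simple grid scan in place of quadprog:

```python
# Step 1: solve the dual. With y = (-1, +1) the constraint sum(alpha_i y_i) = 0
# forces alpha_1 = alpha_2 = a, and for this data L(a) = 2a - 2a^2.
def dual(a):
    return 2 * a - 2 * a * a

grid = [i / 1000.0 for i in range(2001)]   # scan a in [0, 2]
a_star = max(grid, key=dual)               # the analytic maximizer is a = 0.5

# Step 2: beta = sum_i alpha_i y_i x_i
x = [-1.0, 1.0]
y = [-1, 1]
beta = sum(a_star * yi * xi for yi, xi in zip(y, x))

# Step 3: pick a support vector (alpha_i > 0) and solve y_i(beta x_i + beta_0) = 1
beta_0 = y[1] - beta * x[1]   # the decision boundary is x = 0, as expected
```

Both points are support vectors here, and the recovered boundary sits midway between them.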
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab built-in SVM routines (found in the Bioinformatics toolkit) to do classification through support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]</div>Ipargaru — http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=5453 — stat841 — 2009-11-21T19:53:17Z<p>Ipargaru: /* Maximum Margin Classfifiers */</p>
<hr />
<div>
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to learn from data. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing, and so on. <br />
<br />
'''Definition''': The problem of predicting a discrete random variable <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math> is called classification.<br />
<br />
In classification, we attempt to approximate a function <math>\,h</math> by using a training data set, so that it can accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> independent and identically distributed ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>, <math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{color}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented with its features <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
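The empirical error rate is just the fraction of misclassified training points. As an illustration, here is a Python sketch with an invented one-dimensional rule and five invented training pairs:

```python
# Invented rule: predict class 1 when x >= 0, else class 0
def h(x):
    return 1 if x >= 0 else 0

# Invented training pairs (x_i, y_i)
training = [(-2.0, 0), (-0.5, 0), (0.3, 1), (1.2, 1), (-0.1, 1)]

# Empirical error rate: (1/n) * sum of indicators I(h(x_i) != y_i)
n = len(training)
empirical_error = sum(1 for x, y in training if h(x) != y) / n
```

Exactly one of the five points is misclassified, so the empirical error rate is 0.2, while the true error rate would require the (usually unknown) joint distribution of <math>(X, Y)</math>.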
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two classes, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the posterior probability <math>\,r(x)=P(Y=1|X=x)</math>. By ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<sub></sub><br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> (\hat r) </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimension cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remark:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since in practice we cannot specify the prior probability.<br />
<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
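The arithmetic above is a direct application of Bayes formula. In this Python sketch, the class-conditional values 0.05 and 0.2 are the ones implied by the quoted numerator 0.025 and denominator 0.125 (the table itself is in the image):

```python
# Priors and class-conditional likelihoods for the observed X = (G=0, M=1, H=0)
prior_pass, prior_fail = 0.5, 0.5
lik_pass = 0.05   # P(X | Y = 1), implied by the numerator 0.025 = 0.05 * 0.5
lik_fail = 0.20   # P(X | Y = 0), implied by the denominator 0.125

# Bayes formula for the posterior r(X) = P(Y = 1 | X)
posterior_pass = (lik_pass * prior_pass) / (lik_pass * prior_pass + lik_fail * prior_fail)

# Bayes rule: classify as pass only if r(X) > 1/2
prediction = 1 if posterior_pass > 0.5 else 0
```

The posterior works out to 0.2, below 1/2, so the rule predicts class 0 (fail), matching the calculation above.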
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes Classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN) and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first one considers probability as changing based on observation, while the second one considers probability as having an objective existence. In fact, they represent two different schools of statistics.<br />
<br />
Throughout the history of statistics there have been two major schools: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be associated with a frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or belief. For example, a Bayesian can predict tomorrow's weather, such as a 50% chance of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In Bayesian method, at first, one can see this man (object), and then judge whether his name is Jack (label). On the other hand, in Frequentist method, one doesn’t see the man (object), but can see the photos (label) of this man to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier is the optimal method, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2 Regression: Find an estimate <math> (\hat r) </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density Estimation: Estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in dimensions greater than 2.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the class covariances <math>\Sigma_k \,\forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(G=k|X=x)=Pr(G=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>p</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
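The boundary derived above has the form <math>\,a^\top x + b = 0</math> with <math>\,a = \Sigma^{-1}(\mu_k - \mu_l)</math>. A minimal Python sketch (means, covariance, and priors below are made-up values) computes these coefficients and verifies the halfway property for equal priors:

```python
import numpy as np

# Coefficients of the linear boundary a^T x + b = 0 between classes k, l.
# Means, shared covariance, and priors are made-up values for illustration.
mu_k = np.array([1.0, 1.0])
mu_l = np.array([3.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
pi_k, pi_l = 0.5, 0.5

Sinv = np.linalg.inv(Sigma)
a = Sinv @ (mu_k - mu_l)
b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)

# With equal priors the boundary passes through the midpoint of the means:
midpoint = (mu_k + mu_l) / 2
print(np.isclose(a @ midpoint + b, 0.0))   # True
```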
<br />
===QDA===<br />
The concept is the same idea of finding a boundary where the error rate for classification between classes are equal, except the assumption that each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math> is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because, with unequal covariance matrices, the quadratic terms in <math>x</math> no longer cancel.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
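A minimal Python sketch of the two discriminant functions in the theorem, applied to a hypothetical two-class problem (all parameter values made up):

```python
import numpy as np

# delta_k for QDA (quadratic) and, with a shared covariance, LDA (linear),
# following the theorem above.
def delta_qda(x, mu, Sigma, pi):
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * (x - mu) @ Sinv @ (x - mu) + np.log(pi))

def delta_lda(x, mu, Sigma, pi):
    Sinv = np.linalg.inv(Sigma)
    return x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(pi)

# Hypothetical two-class setup; classify x by the argmax of delta_k(x).
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigma = np.eye(2)
pis = [0.5, 0.5]

x = np.array([3.5, 3.0])
scores = [delta_lda(x, mu, Sigma, pi) for mu, pi in zip(mus, pis)]
label = int(np.argmax(scores))
print(label)   # 1: x is closer to the second mean
```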
<br />
===In practice===<br />
We need to estimate the prior, so in order to do this, we use the sample estimates of <math>\,\pi,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
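These sample estimates can be computed directly; a Python sketch on synthetic labeled data (all values made up for illustration) follows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic labeled data (made up): 60 points in class 0, 40 in class 1.
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(3, 1, (40, 2))])
y = np.array([0] * 60 + [1] * 40)
n = len(y)

pi_hat, mu_hat, Sigma_hat, n_k = {}, {}, {}, {}
for k in np.unique(y):
    Xk = X[y == k]
    n_k[k] = len(Xk)
    pi_hat[k] = n_k[k] / n                          # prior estimate n_k / n
    mu_hat[k] = Xk.mean(axis=0)                     # class mean
    centered = Xk - mu_hat[k]
    Sigma_hat[k] = centered.T @ centered / n_k[k]   # ML covariance estimate

# Pooled (common) covariance for LDA, weighted by the class sizes:
Sigma_pooled = sum(n_k[k] * Sigma_hat[k] for k in n_k) / n
print(pi_hat[0], pi_hat[1])   # 0.6 0.4
```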
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
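A small Python illustration of Case 1: with <math>\, \Sigma_k = I </math>, classification reduces to choosing the mean with the smallest squared Euclidean distance, adjusted by the log prior (the means and priors below are hypothetical):

```python
import numpy as np

# With Sigma_k = I, delta_k reduces to minus half the squared Euclidean
# distance to mu_k plus log(pi_k). Means and priors are hypothetical.
mus = np.array([[0.0, 0.0], [5.0, 0.0]])
pis = np.array([0.5, 0.5])

def classify(x):
    d2 = ((x - mus) ** 2).sum(axis=1)        # squared distance to each mean
    return int(np.argmax(-0.5 * d2 + np.log(pis)))

print(classify(np.array([1.0, 0.0])))   # 0
print(classify(np.array([4.0, 0.0])))   # 1
```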
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
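The whitening transform <math>\, x^* = S^{-\frac{1}{2}}U^\top x </math> can be sketched in a few lines of Python (<math>\,\Sigma</math> below is a made-up positive-definite matrix); after the transformation the covariance is the identity:

```python
import numpy as np

# Sigma is a made-up symmetric positive-definite matrix.
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
S, U = np.linalg.eigh(Sigma)        # Sigma = U diag(S) U^T
W = np.diag(S ** -0.5) @ U.T        # x* = W x  is the whitening transform

# In the transformed coordinates the covariance becomes the identity:
print(np.allclose(W @ Sigma @ W.T, np.eye(2)))   # True
```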
<br />
Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose you have two classes with different shapes and you transform them both to the same shape. Given a new data point, which transformation should you apply before deciding which class it belongs to? If you use the transformation of class A, then you have already assumed that the point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare the differences between one given class and remaining <math>\,K-1</math> classes, totally, there are <math>\,K-1</math> differences. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
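The two counting formulas above can be checked with a small Python sketch:

```python
# Parameter counts from the formulas above.
def lda_params(K, d):
    # (K-1) pairwise boundaries a^T x + b, each with d + 1 parameters
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # each boundary x^T a x + b^T x + c: d(d+1)/2 + d + 1 parameters
    return (K - 1) * (d * (d + 3) // 2 + 1)

print(lda_params(2, 10), qda_params(2, 10))   # 11 66
```

The quadratic growth of the QDA count in <math>d</math> is exactly what the plot below illustrates.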
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function, but here we analyze its code to see how it differs from the SVD method. The following is the code of princomp with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations, and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x\ \epsilon\ \Re^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
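As a language-neutral illustration of the idea, the following Python sketch (synthetic data with a circular true boundary, made up for this example) appends squared features and fits a purely linear rule in the augmented space, recovering a quadratic boundary in the original space:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (made up): the true boundary is the circle x1^2 + x2^2 = 1,
# which no straight line in (x1, x2) can capture.
X = rng.uniform(-2, 2, (500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(float)

# The trick: append squared features, then fit a purely *linear* rule
# in the augmented 4-D space [x1, x2, x1^2, x2^2].
X_star = np.hstack([X, X ** 2])
A = np.hstack([np.ones((len(X), 1)), X_star])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

pred = (A @ w > 0.5).astype(float)
acc = (pred == y).mean()
print(round(acc, 2))   # well above what a linear rule in (x1, x2) achieves
```

The least-squares fit is only a stand-in for LDA here; the point is that the linear rule in the augmented space corresponds to a quadratic rule in the original coordinates.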
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS) # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the pre-knowledge that data points belong to two classes. Intuitively speaking points of each class form a cloud around the mean of the class, with each class having possibly different size. To be able to separate the two classes we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\partial L}{\partial \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive semi-definite matrices; provided it is positive definite (as is typically the case in practice), it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So <math>\underline{w}</math> is proportional to <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />
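As a numerical sanity check (not part of the lecture), the eigenvector route and the closed form <math>S_{w}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> can be compared on synthetic data. A minimal NumPy sketch, with illustrative means and covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian classes with a shared covariance (illustrative values)
Sigma = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1.0, 1.0], Sigma, size=300)
X2 = rng.multivariate_normal([5.0, 3.0], Sigma, size=300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = np.cov(X1.T) + np.cov(X2.T)          # within class covariance
S_B = np.outer(mu1 - mu2, mu1 - mu2)       # between class covariance

# Route 1: leading eigenvector of S_W^{-1} S_B
vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
w_eig = vecs[:, np.argmax(vals.real)].real
w_eig /= np.linalg.norm(w_eig)

# Route 2: closed form S_W^{-1}(mu1 - mu2)
w_closed = np.linalg.solve(S_W, mu1 - mu2)
w_closed /= np.linalg.norm(w_closed)

# The two unit vectors agree up to sign
print(abs(float(w_eig @ w_closed)))   # ~1.0
```

Both routes give the same direction because <math>S_{B}</math> has rank one, exactly as argued above.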
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Generate 300 points from each of two multivariate normal distributions with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right), ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathworks.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(With <math>k \ge 3</math> classes it is reasonable to have at least two projection directions.)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \frac{1}{n_{i}}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One possible simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is another, more general form for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
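The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> (with unnormalized scatter sums, as in the expansion above) can be verified numerically; a NumPy sketch with made-up three-class data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(90, 3))            # 90 illustrative points in R^3
y = np.repeat([0, 1, 2], 30)            # three classes of 30 points each

mu = X.mean(axis=0)                     # total mean
S_T = (X - mu).T @ (X - mu)             # total scatter

S_W = np.zeros((3, 3))
S_B = np.zeros((3, 3))
for k in range(3):
    Xk = X[y == k]
    mk = Xk.mean(axis=0)
    S_W += (Xk - mk).T @ (Xk - mk)                 # within class scatter
    S_B += len(Xk) * np.outer(mk - mu, mk - mu)    # between class scatter

print(np.allclose(S_T, S_W + S_B))      # True
```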
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Since <math>\mathbf{\mu}_{1}-\mathbf{\mu}</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu}</math> are both multiples of <math>\mathbf{\mu}_{1}-\mathbf{\mu}_{2}</math>, both <math>\mathbf{S}_{B^{\ast}}</math> and <math>\mathbf{S}_{B}</math> are proportional to <math>(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}</math>, so the two definitions give the same discriminant direction.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,k-1<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the <math>k-1</math> largest<br />
eigenvalues of<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function:<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\mathbf{k-1}</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B}) \le k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the <math>k-1</math> largest<br />
eigenvalues of<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
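Putting the multi-class procedure together, a NumPy sketch on synthetic three-class data (the scatter matrices here are unnormalized sums; class locations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
k, d, n_c = 3, 5, 40
# Three well-separated classes in R^5 (illustrative)
X = np.vstack([rng.normal(loc=c, size=(n_c, d)) for c in (0.0, 3.0, 6.0)])
y = np.repeat(np.arange(k), n_c)

mu = X.mean(axis=0)
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for c in range(k):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_W += (Xc - mc).T @ (Xc - mc)                  # within class scatter
    S_B += len(Xc) * np.outer(mc - mu, mc - mu)     # between class scatter

# Columns of W: eigenvectors of S_W^{-1} S_B for the k-1 largest eigenvalues
vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(vals.real)[::-1]
W = vecs[:, order[:k - 1]].real     # d x (k-1) transformation matrix
Z = X @ W                           # data projected to k-1 = 2 dimensions
print(Z.shape)                      # (120, 2)
```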
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors (the least squares method).<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
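A NumPy sketch of the least-squares solution on random data (illustrative, not the course dataset), checking two standard facts: the hat matrix is idempotent, and the residual is orthogonal to the columns of <math>\mathbf{X}</math>:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 3
# Design matrix with a leading column of 1s, plus random inputs and outputs
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
y_hat = X @ beta_hat                           # same as H @ y

print(np.allclose(H @ H, H))                   # H is idempotent
print(np.allclose(X.T @ (y - y_hat), 0))       # residual orthogonal to columns of X
```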
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
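This defect is easy to exhibit numerically: fitting least squares to 0/1 labels can produce fitted values outside <math>[0,1]</math>. A small sketch with illustrative one-dimensional data:

```python
import numpy as np

# 0/1 labels on a line; least squares fits E(Y|X=x) with a straight line
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
X = np.vstack([np.ones_like(x), x]).T          # intercept plus slope
beta = np.linalg.solve(X.T @ X, X.T @ y)
r = X @ beta                                   # fitted "probabilities"
print(r.max())      # > 1, so r(x) is not a valid probability
```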
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample;ones(1,400)];<br />
Construct x by appending a row of ones to the data.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Classify each point by thresholding its fitted value at 0.5 and plot the two predicted classes in different colours.<br />
<br />
[[File: linearregression.png|center|frame|The classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== Logistic function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y\, dx = \ln(1 + e^{x})</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
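Properties 1, 2 and 4 can be checked numerically; a small Python sketch (the derivative is verified with a central finite difference, and the evaluation points 0.7 and 0.1 are arbitrary):

```python
import math

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

# Property 2: y(0) = 1/2
print(sigma(0.0))    # 0.5

# Property 1: dy/dx = y(1-y), checked with a central finite difference
x, h = 0.7, 1e-6
num_deriv = (sigma(x + h) - sigma(x - h)) / (2 * h)
print(abs(num_deriv - sigma(x) * (1 - sigma(x))) < 1e-8)    # True

# Property 4: series 1/2 + x/4 - x^3/48 + x^5/480 near x = 0
x = 0.1
approx = 0.5 + x / 4 - x**3 / 48 + x**5 / 480
print(abs(approx - sigma(x)) < 1e-10)                       # True
```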
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression fits a conditional distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood, using <math>P(Y|X)</math>. The maximum likelihood estimate of <math>\underline\beta</math> is the value that maximizes the probability of obtaining the observed data <math>\displaystyle{x_{1},...,x_{n}}</math> under the model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)\,exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website with a Matrix Reference Manual where you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first rewriting the gradient with the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>, so that <math>\underline{\beta}</math> appears only once,<br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
which gives <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Now consider a weighted linear regression on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
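To make the estimator concrete, here is a minimal Python sketch of the weighted least squares estimator for the scalar (d = 1) case; the function name and data are illustrative, not from the lecture.<br />

```python
# Weighted least squares with a single predictor (d = 1):
#   beta_hat = (sum_i w_i x_i^2)^(-1) * (sum_i w_i x_i y_i)
def wls_scalar(x, y, w):
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    return sxy / sxx

# With all weights equal to 1 this reduces to ordinary least squares
# through the origin; for y = 2x the estimate is exactly 2.
beta = wls_scalar([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 1.0])
```

Down-weighting a point (small <math>w_i</math>) reduces its influence on the fit, which is exactly how the iteration above discounts observations whose predicted probabilities are near 0 or 1.<br />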
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, the coefficient vector will be <math>(d+1)\times{1}</math>.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, convergence is not guaranteed in general. The procedure usually converges, since the log-likelihood function is concave. When it does not, we can only prove local convergence of the method, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, it is rare for an initial value to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving can resolve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
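The pseudo code can be sketched in Python for the simplest case, a single feature and no intercept, so that <math>\underline{\beta}</math> is a scalar. The data below are made up and chosen to overlap, so the maximum likelihood estimate is finite.<br />

```python
import math

def irls_scalar(x, y, n_iter=50, tol=1e-10):
    """Logistic regression with one feature and no intercept,
    fitted by iteratively reweighted least squares."""
    beta = 0.0                                                             # step 1
    for _ in range(n_iter):
        p = [math.exp(beta * xi) / (1 + math.exp(beta * xi)) for xi in x]  # step 3
        w = [pi * (1 - pi) for pi in p]                                    # step 4
        z = [xi * beta + (yi - pi) / wi
             for xi, yi, pi, wi in zip(x, y, p, w)]                        # step 5
        beta_new = (sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
                    / sum(wi * xi * xi for wi, xi in zip(w, x)))           # step 6
        if abs(beta_new - beta) < tol:                                     # step 7
            return beta_new
        beta = beta_new
    return beta

# Overlapping classes: positive x is mostly (but not always) class 1.
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 0, 1, 0, 1, 1]
beta = irls_scalar(x, y)
```

At the fixed point the score equation <math>\sum_i (y_i - p_i)x_i = 0</math> holds, which gives a quick way to verify the implementation.<br />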
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
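The parameter counts above are easy to tabulate; a small Python sketch (the helper names are ours):<br />

```python
def logistic_params(d):
    # one coefficient per feature in the beta^T x model (no intercept counted)
    return d

def lda_params(d):
    # two d-dimensional means + symmetric d x d covariance + two priors
    return 2 * d + d * (d + 1) // 2 + 2

# Logistic regression grows linearly, LDA quadratically:
# d = 2  ->  2 vs 9;   d = 10  ->  10 vs 77;   d = 100  ->  100 vs 5252
counts = [(d, logistic_params(d), lda_params(d)) for d in (2, 10, 100)]
```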
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to perform logistic regression on the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
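These posteriors are straightforward to compute; a small Python sketch with the Kth class as the reference (the coefficients below are arbitrary illustrations):<br />

```python
import math

def posteriors(x, betas):
    """Posteriors for the K-class logistic model; `betas` holds
    beta_1 .. beta_{K-1}, and class K is the reference class."""
    scores = [math.exp(sum(bj * xj for bj, xj in zip(beta, x)))
              for beta in betas]
    denom = 1.0 + sum(scores)                 # shared normalizer
    return [s / denom for s in scores] + [1.0 / denom]

# K = 3 classes in d = 2 dimensions.
p = posteriors([1.0, 2.0], [[0.5, -0.2], [0.1, 0.3]])
```

By construction the K probabilities share one denominator, so they always sum to 1.<br />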
<br />
Viewing these equations as a weighted least squares problem makes the estimates easier to derive.<br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.<br />
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
Separating hyperplane classifiers try to separate the data using linear decision boundaries. When the classes overlap, this approach can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged and transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>\,sign(\underline{\beta}^T \underline{x} + \beta_0) = sign(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math><br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Due to the iterative nature of the solution, the problem has no global minimum (it is not convex). The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable, then the algorithm is shown to converge to a separating hyperplane. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br />
<br />
<br />
If the separating hyperplane between the 2 classes is not unique, the perceptron algorithm may return any one of infinitely many solutions.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line. We just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];                        % labels
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';  % one column per example
b_0=0;                                     % intercept
b=[1;1;1];                                 % initial weights
rho=.5;                                    % learning rate
for j=1:100                                % at most 100 passes over the data
    changed=0;
    for i=1:6
        d=(b'*x(:,i)+b_0)*y(i);            % positive iff example i is classified correctly
        if d<0                             % misclassified: adjust the boundary
            b=b+rho*x(:,i)*y(i);
            b_0=b_0+rho*y(i);
            changed=1;
        end
    end
    if changed==0                          % no misclassified points: done
        break;
    end
end
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of some weights of these inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\underline{\beta}</math> to have unit length). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points. <br />
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. This problem can be mitigated by using the basis expansion technique; specifically, we try to find a hyperplane not in the original space but in an enlarged space obtained by applying some basis functions.<br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#A perfect separation is not always achievable, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Consider yourself on a mountain peak, wanting to reach the lowest ground as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the gradient. However, depending on the shape of the surface, you may descend into a local minimum (or get stuck near a saddle point) rather than the global one.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we get rid of the summation over <math>\,i</math> (all data points). Actually, this is an alternative of the original gradient descent algorithm (sometimes called batch gradient descent) known as Stochastic gradient descent, where we approximate the true gradient by only evaluating on a single training example. This means that <math>\,{\beta}</math> gets improved by computation of only one sample. When there is a large data set, say, population database, it's very time-consuming to do summation over millions of samples. By Stochastic gradient descent, we can treat the problem sample by sample and still get decent result in practice.<br />
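The sample-by-sample update just described is exactly what the earlier MATLAB loop does; the same stochastic update can be sketched in Python (the helper names are ours):<br />

```python
def perceptron_sgd(xs, ys, rho=0.5, max_epochs=1000):
    """Perceptron trained by stochastic (one-sample-at-a-time) updates."""
    d = len(xs[0])
    b, b0 = [0.0] * d, 0.0
    for _ in range(max_epochs):
        changed = False
        for x, y in zip(xs, ys):
            # (beta^T x + beta_0) * y <= 0 means x is misclassified
            if (sum(bi * xi for bi, xi in zip(b, x)) + b0) * y <= 0:
                b = [bi + rho * y * xi for bi, xi in zip(b, x)]
                b0 += rho * y
                changed = True
        if not changed:          # a full pass with no mistakes: done
            break
    return b, b0

# The truth table from the earlier perceptron example (linearly separable).
xs = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
ys = [1, 1, 1, -1, -1, -1]
b, b0 = perceptron_sgd(xs, ys)
```

Each weight adjustment here uses a single example, so the cost per update does not depend on the size of the training set.<br />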
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network, but it can take many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer that each represent the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1).<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates, which means there are maximum and minimum output values. This property ensures that the weights are bounded and therefore the searching time is limited. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; for example, RBF networks, whose activation functions are not monotonic, are also a powerful class of models. <br />
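A quick numerical sketch of the logistic activation illustrating these properties (smoothness, saturation, and the handy derivative identity):<br />

```python
import math

def sigmoid(a):
    """Logistic activation: smooth, monotonic, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    # The identity sigma'(a) = sigma(a) * (1 - sigma(a)) is what makes
    # back-propagation through this activation cheap to compute.
    s = sigmoid(a)
    return s * (1.0 - s)

# Saturation: far from 0 the output flattens and the gradient vanishes.
outputs = [sigmoid(a) for a in (-10.0, 0.0, 10.0)]
```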
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the last output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to multiple output units and thus turned into a classification problem.<br />
<br />
For simplicity, there is only 1 unit at the end and assume for the moment we are doing regression.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i's</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find the derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: the indices <math>\,i</math> and <math>\,j</math> are reversed in this diagram.)''<br />
<br />
Notice that the weighted sums of the outputs of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the target output and <math>\hat{y_k}</math> is the network's actual output.<br />
#By propagating to the previous layers, evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math>, where <math>i</math> ranges over the units in the layer closer to the output, whose <math>\,\delta_i</math> have already been computed.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
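The steps above can be sketched in code. The following is a minimal NumPy sketch for a single-hidden-layer regression network with a logistic activation; the toy data, the network sizes, the learning rate, and the absence of bias terms are all illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(a):
    # logistic activation, as in the notes
    return 1.0 / (1.0 + np.exp(-a))

# Made-up toy data: n points in d dimensions, scalar regression target
n, d, p = 200, 3, 5                      # p hidden units (assumption)
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

u = rng.normal(scale=0.1, size=(p, d))   # hidden weights u_{jl}, small random init
w = rng.normal(scale=0.1, size=p)        # output weights w_i
rho = 0.05                               # learning rate

for _ in range(150):                     # passes over the training set
    for i in range(n):
        x = X[i]
        # step 1: forward pass, outputs of all nodes
        a = u @ x                        # inputs a_j to the hidden units
        z = sigma(a)                     # hidden outputs z_j
        yhat = w @ z                     # network output
        # step 2: delta at the single output unit
        delta_out = yhat - y[i]          # gradient of (1/2)(yhat - y)^2
        # step 3: propagate back, delta_j = sigma'(a_j) * w_j * delta_out
        delta = z * (1 - z) * w * delta_out
        # steps 4-5: gradient steps on w and u
        w -= rho * delta_out * z
        u -= rho * np.outer(delta, x)

mse = np.mean((sigma(X @ u.T) @ w - y) ** 2)
```

After training, the mean squared error on the training set should be well below the variance of <math>\,y</math>, showing that the weights have adapted.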
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method of choosing weights mentioned in class was to randomize the weights before the first step. This is not likely to be near the optimal solution, but it is simple to implement. More specifically, random values near zero (usually in [-1,1]) are a good choice for the initial weights. In this case the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output using a linear approximation of <math>\,\sigma(a_i)</math>, which finds the optimal weights in the linear model. Back-propagation is then used to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the number of iterations increases.<br />
<br />
<br />
Choosing too large a learning rate may cause instability in the system, while too small a learning rate may lead to a very slow convergence rate (a very long learning phase). The advantage of a small learning rate, however, is that it can guarantee convergence. Thus it is generally better to choose a relatively small learning rate to ensure stability; usually <math>\,\rho</math> is chosen between 0.01 and 0.7.<br />
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we will mainly discuss how to estimate the number of hidden units at the very beginning. Obviously, we should then adjust it using CV, LOO, or other complexity-control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from some highly complicated mixture model, more hidden units are needed. <br />
<br />
Actually, the number of hidden units determines the size of the model, and therefore the total number of weights in the model. Generally speaking, the number of weights should not be larger than the number of training data points, say N; thus N/10 is sometimes a good choice. In practice, however, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this Neural Network, the number of nodes is reduced until we reach a layer whose number of nodes matches the desired dimensionality. (The number of nodes need not be strictly decreasing in the first few layers, as long as the network eventually reaches a layer with fewer nodes.) From that point on,<br />
the previous layers are mirrored, so the output layer has the same number of units as the input layer. Now note that if we feed the network with each point and get an output approximately equal to the fed input, then the input has been reconstructed at the output from the middle-layer units alone. So the output of the middle-layer units can represent the input in fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
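The bottleneck idea can be illustrated with a minimal linear sketch: made-up 3-D data lying near a plane, a 2-unit middle layer, and gradient descent on the reconstruction error. All names, sizes, and rates below are assumptions; a linear bottleneck of this kind recovers the same subspace as PCA.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: 300 three-dimensional points lying near a 2-D plane
n = 300
latent = rng.normal(size=(n, 2))
mix = np.array([[1.0, 0.0], [0.5, 1.0], [0.2, -0.3]])
X = latent @ mix.T + 0.01 * rng.normal(size=(n, 3))

E = rng.normal(scale=0.3, size=(2, 3))   # encoder: input -> middle layer
D = rng.normal(scale=0.3, size=(3, 2))   # decoder: middle -> output (the mirror)
rho = 0.05

for _ in range(2000):
    Z = X @ E.T                          # middle-layer codes (low-dim mapping)
    Xhat = Z @ D.T                       # reconstruction at the output layer
    R = Xhat - X                         # reconstruction error, n x 3
    gD = 2 * R.T @ Z / n                 # gradients of mean squared
    gE = 2 * (R @ D).T @ X / n           #   reconstruction error
    D -= rho * gD
    E -= rho * gE

err = np.mean((X @ E.T @ D.T - X) ** 2)
```

The final reconstruction error should be far below the variance of the data, confirming that the 2-number codes retain most of the 3-D information.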
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math> may become negligible and the errors vanish. This is a numerical problem in which it is difficult to estimate the errors, so in practice configuring a<br />
Neural Network with back-propagation involves some subtleties.<br />
Deep Neural Networks became popular a few years ago, when layer-by-layer training procedures were introduced by Geoffrey Hinton and his collaborators. A Deep Neural Network training algorithm deals with the training of a Neural Network with a large number of layers.<br />
<br />
The approach of training the deep network is to assume the network has only two layers first and train these two layers. After that we train the next two layers, so on and so forth.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, an approach inspired by statistical physics, where systems settle into their most stable (lowest-energy) state; or somehow finding out what output of the second layer is most likely to lead us to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case for example. A feedforward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach was adaptive to the rule. The system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced they were thought to model the human brain, hence the fancy name "Neural Network". But now we know that they are just logistic regression units layered on top of each other, and have nothing to do with how the brain actually functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. As a result of such claims it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and adjust them wisely. Partly for this reason, they are not currently an active research area in machine learning; NN still has wide applications in engineering fields such as control, however.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is heavily complex with so many degrees of freedom, that we can learn every detail of the training set. Such a model will have very high precision on the training set but will show very poor ability to predict outcomes of new instances, especially outside the domain of the training set.<br />
<br />
In a Neural Network, if the depth is too great, the network will have many degrees of freedom and will learn every characteristic of the training data set. It will then be very accurate on the training set but will not be able to generalize from the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model which fits it best. We can find a polynomial of high degree which passes through almost all the points in the training set. But in fact the training set comes from a line model. The problem is that although the complex model has less error on the training set, it diverges from the line in ranges where we have no training points. Because of this, the high-degree polynomial has very poor predictive power on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier: just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas; but if presented with a new banana of slightly different shape than the existing ones, for example, it cannot be detected. This is the tradeoff: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough, both of which are undesirable. To control complexity, it is necessary to make assumptions about the model before fitting the data, for example by restricting it to a family of polynomials or to a particular neural network architecture. There are other ways as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average number of misclassifications to get the empirical error rate, which estimates the probability that<br />
<math>h(x_{i}) \neq y_{i}</math>.<br />
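As a small illustration, the empirical error rate is just the fraction of misclassified points. The rule <code>h</code> and the data below are made up for the example.

```python
import numpy as np

# Empirical error rate of a classifier h on labelled data: the fraction of
# points where h(x_i) != y_i.
def empirical_error(h, X, y):
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# toy rule: classify by the sign of the first coordinate
h = lambda x: 1 if x[0] > 0 else 0
X = np.array([[0.5, 1.0], [-1.0, 2.0], [2.0, -1.0], [-0.2, 0.0]])
y = np.array([1, 0, 0, 1])
# h gets the first two points right and the last two wrong, error rate 0.5
```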
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
This estimate is biased downward, i.e., on average it is less than the true error rate. <br />
<br />
As the complexity increases from low to high, the training error rate always decreases. When we apply our model to the test data, however, the error rate decreases only up to a point and then increases, because the model has never seen the test data. This can be explained as follows: training error decreases as we fit the model better by increasing its complexity, but as we have seen, an overly complex model does not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is defined as the point where the error rate on the test data is minimal; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in R</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a more important property for an estimator than just being unbiased: a small mean squared error. In statistics there are problems for which it may be good to use an estimator with a small bias. In some cases, an estimator with a small bias may have a smaller mean squared error, or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under transformations, while the property of mean-unbiasedness may be lost under nonlinear transformations. For example, using an unbiased estimator with a large mean squared error to estimate the parameter, we run a high risk of a big error; in contrast, a biased estimator with a small mean squared error can improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a fixed mean squared error (MSE), a low bias implies a high variance and vice versa.<br />
<br />
Test error is a good estimate of the MSE. We want a model with a reasonably balanced bias and variance (neither too high), even though it will carry some bias.<br />
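The decomposition above can be checked by Monte Carlo simulation. This is a minimal sketch with a made-up shrinkage estimator (0.8 times the sample mean of a normal sample); all the numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo check of MSE = Variance + Bias^2 for a biased estimator
# fhat = 0.8 * (sample mean), with true parameter f = 1 and n = 20 samples.
f, n, trials = 1.0, 20, 200_000
samples = rng.normal(loc=f, size=(trials, n))
fhat = 0.8 * samples.mean(axis=1)

mse = np.mean((fhat - f) ** 2)
bias = fhat.mean() - f            # close to -0.2: the estimator is biased
variance = fhat.var()             # close to 0.8^2 / 20 = 0.032
# mse equals variance + bias^2 (up to floating point error)
```

Note that the deliberately biased estimator can still have a small MSE; this is exactly the trade-off discussed above.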
<br />
<br />
Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
> x <- rnorm(200,0,1)<br />
> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
> xtest <- rnorm(50,1,1)<br />
> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
> p1 <- lm(y~x)<br />
> p2 <- lm(y ~ poly(x,2))<br />
> pn <- lm(y ~ poly(x,10))<br />
> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are based on a <math>\,N(0,1)</math> distribution, while the test set has a <math>\,N(1,1)</math> distribution. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random variation. Polynomial least squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
> # calculate the mean squared error of degree 1 poly<br />
> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
[1] 1.576042<br />
> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 7.727615<br />
: Training and test mean squared errors for the linear fit. These are both quite high - and since the data is non-linear, the different mean value of the test data increases the error quite a bit.<br />
> # calculate the mean squared error of degree 2 poly<br />
> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
[1] 0.08608467<br />
> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 0.08407432<br />
: This fit is far better - and there is not much difference between the training and test error, either.<br />
> # calculate the mean squared error of degree 10 poly<br />
> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
[1] 0.07967558<br />
> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but not by much - and the test set error has risen again. The overfitting makes it a poor predictor. As the degree of the polynomial rises further, numerical accuracy becomes an issue - and a good fit is not even consistently produced for the training data.<br />
> # calculate mse of sin/cos fit<br />
> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
[1] 0.1105446<br />
> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 1.320404<br />
: Fitting a function of the form <math>\,\sin(x)+\cos(x)</math> works quite well on the training set, but because it is not the true underlying function, it fails on test data that does not lie in the same domain.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate the empirical error rate on V: the model has not seen V, and since we know the proper labels of all elements in V, we can count how many were misclassified.<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although training error always decreases with increasing model complexity, the test error starts to increase at a certain point, which signals overfitting (see [[#prediction-error|figure 2]] above). Since test error estimates the MSE (mean squared error) best, the idea is to divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation set and the test set, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{X_i \in \nu} I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set <math>\,\nu</math>.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in reality it may be hard to collect data (and we often suffer from the curse of dimensionality), so a larger data set may be hard to come by. Consequently, we may not be able to afford to sacrifice part of our limited resources. In this case we use another method that addresses this problem, K-fold cross-validation. We divide the data set into <math>\,K</math> subsets of roughly equal size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
if <math>\,K=n</math> (leave-one-out): low bias, high variance. Each subset contains a single element, so the model is trained with all but one point and then validated using that point.<br />
<br />
if <math>\,K</math> is small (say 2-fold or 5-fold): high bias, low variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th <math>( \,k \in [ 1, K ] )</math> part, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>. The overall estimate is<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. Next we use the first three subsets as the training set and the last as the validation set, then the 1st, 2nd, and 4th subsets as the training set and the 3rd as the validation set, and so on until every subset has been the validation set once (all observations are used for both training and validation). After we get <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for a degree-n model and plot the resulting curve. Now we are able to choose the degree which corresponds to the minimum error. We can also use this method to find the optimal number of hidden units in a neural network: start with 1 unit, then 2, 3, and so on, and pick the number of hidden units with the lowest average error.<br />
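The procedure just described can be sketched as follows, with quadratic data mirroring the earlier R example; the fold construction, seed, and candidate degrees are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data from a quadratic, as in the R example earlier in the notes
x = rng.normal(size=200)
y = x**2 - 0.5 * x + rng.normal(scale=0.3, size=200)

def kfold_mse(x, y, degree, K=4):
    # average validation error L-hat over K folds for one polynomial degree
    idx = np.arange(len(x))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coefs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coefs, x[val]) - y[val]) ** 2))
    return np.mean(errs)

scores = {d: kfold_mse(x, y, d) for d in range(1, 7)}
best = min(scores, key=scores.get)   # expected to be 2, or close to it
```

The same loop works for any model parameter, e.g. the number of hidden units in a neural network, by swapping the polynomial fit for the model of interest.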
<br />
=== Generalized Cross-validation ===<br />
If the vector of observed values is denoted by <math>\mathbf{y}</math> and the vector of fitted values by <math>\hat{\mathbf{y}}</math>, then<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
The leave-one-out prediction errors can then be computed from a single fit using the diagonal of <math>\mathbf{H}</math>:<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}\right]^{2}</math>.<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that the trace of <math>\mathbf{H}</math> is often easier to compute than its individual diagonal elements.<br />
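Both the leave-one-out identity and the GCV score can be verified numerically for ordinary least squares, where <math>\mathbf{H} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math>; the design and coefficients below are made up for the check.

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up regression problem: N points, intercept plus two predictors
N, d = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, d - 1))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=0.5, size=N)

H = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix H = X (X'X)^-1 X'
yhat = H @ y

# shortcut: leave-one-out residuals from a single fit, via the H_ii identity
loo_short = (y - yhat) / (1 - np.diag(H))

# brute force: refit N times, each time holding out one point
loo_brute = np.empty(N)
for i in range(N):
    keep = np.arange(N) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo_brute[i] = y[i] - X[i] @ b

loo_mse = np.mean(loo_short ** 2)
gcv = np.mean(((y - yhat) / (1 - np.trace(H) / N)) ** 2)   # GCV score
```

The two leave-one-out computations agree to numerical precision, and the GCV score is close to the exact leave-one-out error because the <math>\mathbf{H}_{ii}</math> are all close to their average <math>\mathrm{trace}(\mathbf{H})/N</math>.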
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point in the original training data set to train our model, then using the data point that we initially left out to estimate the true error. By repeating this process for every data point in our original data set, we can obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every possible value of the parameter, we must train the model <math>\,K</math> times. This deficiency is even more obvious in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, where <math>\,n</math> is the number of data points in the data set.<br />
<br />
Fortunately, when adding data points to the classifier is reversible, calculating the difference between two classifiers is computationally more efficient than calculating the two classifiers separately. So, if the classifier trained on all the data points is known, we simply undo the changes from one data point at a time, <math>\,n</math> times, to calculate the leave-one-out cross-validation error rate.<br />
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as a way to obtain a robust neural network which is insensitive to noise. Since the number of hidden units and layers in a NN is usually decided by domain knowledge, the network may easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function shows linear behavior, and the NN then collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also gives us a hint to initialize the random weights close to zero.<br />
<br />
Formally, we penalize nonlinear weights by adding a penalty term in the error function. Now the regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
If <math>\,\lambda</math> is too large, the weights <math>\,w_i</math> and <math>\,u_{jk}</math> will be too small. We can use cross-validation to estimate <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the linearity; as noted above, it can be set by cross-validation. The starting values of the weights also matter: weights of exactly zero yield zero derivatives, so the algorithm never moves. On the other hand, starting with weights that are too large means starting with a nonlinear model, which often leads to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
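The two penalized update rules above can be sketched in a few lines of Python (a minimal sketch; the function name, learning rate, and toy weight values are ours, not from the lecture):<br />

```python
import numpy as np

def weight_decay_step(w, grad_err, lam, rho):
    """One gradient-descent update with the weight-decay penalty.

    w        -- current weight vector
    grad_err -- gradient of the unpenalized error with respect to w
    lam      -- regularization parameter (lambda in the text)
    rho      -- learning rate
    """
    # d REG / d w = d err / d w + 2 * lambda * w
    return w - rho * (grad_err + 2.0 * lam * w)

# With a zero error gradient, the penalty alone shrinks weights toward zero:
w = np.array([1.0, -2.0, 0.5])
w_new = weight_decay_step(w, np.zeros(3), lam=0.1, rho=0.5)
# each weight is multiplied by (1 - 2 * rho * lam) = 0.9
```

The same update applies to the hidden-layer weights <math>\,u_{jk}</math>.<br />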
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with a single hidden layer and an output layer, with weights only from the hidden layer to the output layer; it can be trained without back-propagation since the weights have a closed-form solution. The neurons in the hidden layer contain basis functions. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''Note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data. The radii of the RBF functions may differ, and the centers and radii can be determined through clustering or an EM algorithm. When the vector x arrives from the input layer, each hidden neuron computes the radial distance from its own center point and then applies the RBF function to this distance. The resulting values are passed to the output layer and weighted together to form the output.<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
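The forward pass just described (compute each <math>\phi_{j}(x)</math>, then weight the results together) can be sketched as follows; the toy data points, centers, and widths below are hypothetical:<br />

```python
import numpy as np

def rbf_design_matrix(X, centers, sigma):
    """Phi[i, j] = exp(-||x_i - mu_j||^2 / (2 * sigma_j^2))."""
    # squared Euclidean distance between every point and every center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# toy data: n = 4 points in d = 2 dimensions, m = 3 hypothetical centers
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
centers = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
sigma = np.array([1.0, 1.0, 1.0])

Phi = rbf_design_matrix(X, centers, sigma)   # n x m
W = np.ones((3, 1))                          # m x k weights (k = 1 output)
Y_hat = Phi @ W                              # network output, n x k
```

A point sitting exactly on a center contributes <math>\phi_j = 1</math> for that basis function, and the contribution decays toward zero with distance.<br />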
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
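The closed-form solution is easy to verify numerically; the sketch below stands in a random matrix for the RBF design matrix <math>\Phi</math> (all sizes and values hypothetical):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 50, 5, 2                   # points, hidden units, outputs (toy sizes)
Phi = rng.standard_normal((n, m))    # stand-in for the RBF design matrix
Y = rng.standard_normal((n, k))

# closed-form least squares solution: W = (Phi^T Phi)^{-1} Phi^T Y
W = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# hat matrix H maps the observed Y directly to the fitted values
H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
Y_hat = H @ Y
```

Since <math>H</math> projects onto the column space of <math>\Phi</math>, it is idempotent, i.e. <math>H^2 = H</math>.<br />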
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the (n by k) matrix of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & \vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the (n by M+1) matrix of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the (M+1 by k) matrix of weights.<br />
<br />
where the extra basis function <math>\,\phi_{0}</math> is set to 1, so the corresponding weights <math>\,w_{01},\dots,w_{0k}</math> act as bias terms.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function. Giving the familiar form,<br /><br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
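A small sketch of this normalization in Python (hypothetical one-dimensional inputs and centers); every row of the normalized design matrix sums to one, so each output is a convex combination of the weights:<br />

```python
import numpy as np

def normalized_rbf(X, centers, sigma):
    """Phi*_j(x) = Phi_j(x) / sum_r Phi_r(x); each row sums to one."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return Phi / Phi.sum(axis=1, keepdims=True)

X = np.array([[0.0], [0.3], [1.0]])    # toy 1-d inputs
centers = np.array([[0.0], [1.0]])     # M = 2 hypothetical centers
Phi_star = normalized_rbf(X, centers, sigma=1.0)
```
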
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit, then, can be thought of as representing a feature; if there are more hidden units than input units, we essentially project to a higher-dimensional space, as we did in our earlier trick. This does not mean that an RBF network actually does this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Nevertheless, precisely because of this expressive power, overfitting becomes a more important concern: we have to control the model's complexity so that it fits a general pattern rather than an arbitrary training set.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can replace the class conditional density in the above conditional probability expression by marginalizing over the hidden variable <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
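Sampling from such a mixture is a two-stage process, as the generative model in Figure 1 depicts: first pick a component <math>m</math> with probability <math>\alpha_m</math>, then draw from that component's Gaussian. A one-dimensional sketch (all parameter values hypothetical):<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical 1-d mixture: mixing proportions, means, standard deviations
alpha = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([0.5, 1.0])

n = 10_000
m = rng.choice(len(alpha), size=n, p=alpha)   # stage 1: pick a component
x = rng.normal(mu[m], sigma[m])               # stage 2: sample from it

# the sample mean approaches the mixture mean sum_m alpha_m * mu_m = 1.5
```
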
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When using an RBF network for classification, we usually treat it as a regression problem and set a threshold on the output to decide class membership. However, to gain insight into what the RBF network is doing in a classification problem, it helps to think in terms of mixture models and to make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as shown in the graph on the right-hand side, that we have three random variables, <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is the observation, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that each class can trigger a different value of the hidden variable <math>\displaystyle j</math>. To understand this, assume for instance that, given <math>\displaystyle j</math>, the data have a Gaussian distribution (it could be any other distribution as well), the same family for every <math>\displaystyle j</math> but with different parameters. From each Gaussian triggered by each class, we sample some data points. In the end we obtain a data set that is not strictly Gaussian, but is actually a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
Within the summation, we multiply each term by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = Pr(y_{k})\sum_{j} \frac{Pr(x | j)*Pr(j)}{Pr(x)} * \frac{Pr(j | y_{k})}{Pr(j)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, this is just a sum over <math>\displaystyle j</math> of the product of two posterior probabilities.<br />
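The identity <math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math> can be checked numerically on a small discrete toy model in which <math>\displaystyle y_k \rightarrow j \rightarrow x</math> forms a chain, so that <math>\displaystyle x</math> and <math>\displaystyle y_k</math> are conditionally independent given <math>\displaystyle j</math> (all probability tables below are hypothetical):<br />

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical discrete model: class y -> hidden j -> observation x
p_y = np.array([0.4, 0.6])                       # Pr(y_k)
p_j_given_y = rng.dirichlet(np.ones(3), size=2)  # Pr(j | y_k), rows sum to 1
p_x_given_j = rng.dirichlet(np.ones(4), size=3)  # Pr(x | j), rows sum to 1

# marginals
p_j = p_y @ p_j_given_y                          # Pr(j)
p_x = p_j @ p_x_given_j                          # Pr(x)

x = 2  # a particular observed value

# left-hand side: Bayes rule with the mixture class-conditional
p_x_given_y = p_j_given_y @ p_x_given_j          # Pr(x | y_k)
lhs = p_x_given_y[:, x] * p_y / p_x[x]           # Pr(y_k | x)

# right-hand side: sum_j Pr(j | x) * Pr(y_k | j)
p_j_given_x = p_x_given_j[:, x] * p_j / p_x[x]
p_y_given_j = (p_j_given_y * p_y[:, None]).T / p_j[:, None]
rhs = p_j_given_x @ p_y_given_j
```
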
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results derived above to our RBF network. In an RBF network, as shown on the right-hand side, we have a set of inputs, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, the hidden basis functions, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and then some outputs, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. We also have weights from the hidden layer to the output layer. The output is just a linear sum of the <math>\displaystyle \phi</math>'s. <br />
<br />
Now, take the probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi_{j}(x)</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weight <math>\displaystyle w_{jk}</math>; the posterior can then be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in the one-dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It is as if we place a Gaussian over the data, and for each Gaussian we consider its center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
We can see the graph on the left, which plots the densities of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance: as a point moves far from the center <math>\displaystyle \mu_{1}</math>, <math>\displaystyle \phi_{1}</math> drops to nearly zero. Remember that we can usually obtain a non-linear regression or classification of the input space by doing a linear one in some extended space or feature space (more details in the Aside). Here, the <math>\displaystyle \phi</math>'s actually produce that feature space. <br />
<br />
So, one way to look at this is that <math>\displaystyle \phi</math> tells us, given an input, how likely the presence of a particular feature is. Say, for example, we define the features as the centers of these Gaussian distributions. Then the <math>\displaystyle \phi</math> function computes, for a given data point, the likelihood of this feature appearing. If the data point is right at the center, the value of that <math>\displaystyle \phi</math> is one, i.e. the probability is 1. If the point is far from the center, the <math>\displaystyle \phi</math> value is close to zero, that is, the feature is less likely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given the data. <br />
<br />
When we have those features, <math>\displaystyle y</math> is a linear combination of the features. Hence, each weight <math>\displaystyle w_{jk}</math>, which equals <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> is given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> represents the probability of class membership given the feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose, we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>. Each data point has d features. And these n data points consist of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we do not perform least squares on this <math>\displaystyle X</math> matrix; we do least squares on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point to its center. Hence, the new set of features actually represents M centers in our data set: for each data point, the new features measure how similar the point is to the first center, to the second center, and so on up to the <math>\displaystyle M^{th}</math> center. This checking process applies to all data points. Therefore, the feature space gives another representation of our data set. <br />
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit to the training data, we often want to increase the complexity of our RBF network. By construction, the only way to change the complexity of an RBF network is to change the number of basis functions: a larger number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF network can fit any training set exactly; however, that does not mean the network generalizes well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is estimate the training error. Working through the training error, we will see that it can in fact be decomposed, and one of its components is the Mean Squared Error (MSE). In the notes that follow, our final goal is to obtain a good estimate of the MSE; to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn’t always a good thing. Sometimes [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left-hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>\,N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model fits only the training data, and is useless for predicting any new data points.<br />
<br />
> n<-20;<br />
> x<-seq(1,10,length=n);<br />
> alpha<-2.5;<br />
> beta<-1.75;<br />
> y<-alpha+beta*x+rnorm(n);<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');<br />
> lines(spline(x, y), col = 2);<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on training data, although our real goal is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
But training error and testing error do not move together in a simple way. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error can increase dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, the model does not capture the representative features of the true underlying data source, and this is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be smaller than the testing error on new data. A model typically adapts to the training data, and hence the training error is an overly optimistic estimate of the testing error. An obvious way to estimate the testing error well is to add a penalty term to the training error to compensate. SURE is developed from this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the total squared loss over the <math>\,N</math> training points.<br />
*<math>\displaystyle Err=\sum_{i=1}^M (\hat y_i-y_i)^2 </math> denote the ''test error'', the total squared loss over an independent test sample of <math>\,M</math> points.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations are <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>\,N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>\tau=\{(x_i,y_i)\}_{i=1}^N</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>; then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, since <math>\displaystyle y_i</math> has mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and if <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E(g'(Z))</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>: Take <math>\displaystyle Z=y_i</math>, <math>\displaystyle \mu=f_i</math>, and <math>\displaystyle g(y_i)=\hat f_i-f_i</math>. Then<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f_i-f_i)\epsilon_i]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f_i-f_i)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f_i}{\partial y_i}-\frac {\partial f_i}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model, not the function of the observations <math>\displaystyle y_i</math>, then <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
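Stein's lemma can also be verified by simulation. The sketch below uses the hypothetical smooth choice <math>\displaystyle g(z)=z^2</math>, for which both sides can be computed exactly (with <math>\displaystyle \mu=1</math>, <math>\displaystyle \sigma=2</math>, both equal 8):<br />

```python
import numpy as np

rng = np.random.default_rng(3)

mu, sigma = 1.0, 2.0
z = rng.normal(mu, sigma, size=1_000_000)

# hypothetical smooth choice g(z) = z^2, so g'(z) = 2z
lhs = np.mean(z ** 2 * (z - mu))     # Monte Carlo E[g(Z)(Z - mu)]
rhs = sigma ** 2 * np.mean(2 * z)    # Monte Carlo sigma^2 * E[g'(Z)]

# exact check: E[Z^3] - mu * E[Z^2] = 13 - 5 = 8 = sigma^2 * 2 * mu
```
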
<br />
<br />
<br />
====Two Different Cases====<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is introduced to the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent, so <math>\displaystyle cov(y_i,\hat f)=0</math> (equivalently, think about <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, it has no influence on <math>\hat f</math>, because <math>\hat f</math> is estimated from the training data alone, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>). Then <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Based on the notation we denote above, then we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross validation, to avoid overfitting or underfitting, a validation data set is kept independent of the estimated model.<br />
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing the model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored, because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can, however, be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator, in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important because, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the number of basis functions should be chosen to minimize the estimated generalization error. For the Radial Basis Function Network, setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least-squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we obtain <math>\frac {\partial \hat f_i}{\partial y_i}=\,H_{ii}</math>, and hence:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Using the cyclic property of the trace, <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=d</math>, where <math>\displaystyle d</math> is the number of columns of <math>\displaystyle \Phi</math>. Since <math>\displaystyle \Phi</math> projects the input matrix <math>\,X</math> onto a set of <math>\,M</math> basis functions, <math>\,d=M</math>; if an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
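The trace identity is easy to verify numerically. Here is a small Python sketch with an arbitrary random design matrix standing in for <math>\Phi</math> (the dimensions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 5                   # illustrative sample size and number of basis functions
Phi = rng.normal(size=(N, M))  # stand-in design matrix, full column rank w.h.p.

H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T  # hat matrix

print(np.trace(H))  # ~ 5.0, the number of columns of Phi
```

The hat matrix is also idempotent (<code>H @ H</code> equals <code>H</code>), as expected for a projection.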
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimal number of basis functions by choosing, over the set of models considered, the model with the smallest MSE. Consider a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\,M</math>, with corresponding training error <math>\displaystyle err(M)</math>.<br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples and the noise,<math>\sigma^2</math>, can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Figure 27.1 on the right shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math>.<br />
<br />
sure_Err = error - num_data_point * sigma .^ 2 + 2 * sigma .^2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, its square can be estimated from the training residuals.<br />
<br />
error = (output - expected_output) .^ 2;<br />
sigma2 = sum(error) / (num_data_point - 1); % estimated noise variance<br />
sure_Err = sum(error) - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) = <math>\sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) = <math>\sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units, including the intercept) = <math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then the term <math>\displaystyle -N\sigma^2</math> is a constant with no impact on the minimization,<br />
and we can ignore it. We only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and its estimate changes when <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math> from the data.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise, <math>\epsilon_i \sim N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substituting (28.2) into (28.1), we get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
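The algebra leading from (28.1) and (28.2) to (28.3) can be sanity-checked with arbitrary numbers (the values below are made up for illustration):

```python
import math

err, N, M = 3.7, 100, 8  # arbitrary illustrative values

sigma2 = err / (N - 1)                              # plug-in noise estimate, as in (28.2)
mse_step = err - N * sigma2 + 2 * sigma2 * (M + 1)  # (28.1) after substitution
mse_closed = err * (2 * M + 1) / (N - 1)            # closed form (28.3)

print(mse_step, mse_closed)  # identical up to rounding
```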
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error decreases and the MSE increases as the number of hidden units grows (i.e. as the model becomes more complex).<br />
<br />
<br />
As the number of hidden units grows, the training error decreases and approaches <math>\displaystyle 0 </math>. If the training error approaches <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) the MSE would appear to approach <math>\displaystyle 0 </math> as well. In practice this does not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the MSE increases instead of approaching <math>\displaystyle 0 </math>, as Figure 28.1 shows.<br />
<br />
<br />
We can see that <math>\displaystyle \hat\sigma^2 </math> is essentially the average of <math>\displaystyle err </math>. To deal with this dependence, we can re-estimate <math>\displaystyle err</math> separately for each candidate number of hidden units; for example, we can fit first with 1 hidden unit and then with 10 hidden units.<br />
<br />
We can also see that, unlike classical Cross-Validation (CV) or Leave-One-Out (LOO) techniques, the SURE technique does not require a separate validation step to find the optimal model. Hence, SURE uses less data than CV or LOO and is suitable when there is not enough data for validation. However, to implement SURE we need <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering K-means clustering] is a method of cluster analysis which aims to partition <math>\,n</math> observations into <math>\,k</math> clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units is the same as the number of clusters; each cluster contributes one basis function <math>\displaystyle \phi_j </math>.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>; the width <math>\displaystyle \sigma_j</math> can be set the same for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. For each existing center, reassign its cluster: every point in a cluster should be closer to that cluster's center than to any other center.<br />
<br />
2. Compute the mean of each cluster and make it the new center of that cluster.<br />
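The two steps above translate directly into a few lines of NumPy. This is an illustrative sketch, not MATLAB's <code>kmeans</code>:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means: X is (n, d); returns labels and cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Step 1: assign every point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: move each center to the mean of its cluster
        # (keep the old center if a cluster happens to be empty).
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers
```

On two well-separated blobs this recovers the blob structure.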
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden values)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
>> 30 1<br />
>> size(C) <br />
>> 2 80<br />
>> size(sumD) <br />
>> 2 1<br />
>> c1=sum(IDX==1)<br />
>> 14<br />
>> c2=sum(IDX==2)<br />
>> 16<br />
>> sumD<br />
>> 85.6643<br />
>> 101.0419<br />
>> v1=sumD(1,1)/c1 <br />
>> 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
>> 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points in 80 dimensions (MATLAB's kmeans treats each row as an observation), and then apply the “kmeans” function to separate <math>X</math> into 2 clusters. IDX is a 30*1 vector containing 1 or 2, indicating the cluster of each point. <math>\displaystyle C </math> holds the center (mean) of each cluster, with size 2*80; sumD is the sum of squared distances between the data points and the center of their cluster. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the numbers of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the variance of the first cluster <math>\displaystyle (v1=\sigma_1^2)</math>; <math>\displaystyle v2 </math> is the variance of the second cluster <math>\displaystyle (v2=\sigma_2^2)</math>. Now we can get <math>\displaystyle \Phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> from the following equations. Finally, we obtain the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
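As a sketch, the same pipeline in Python/NumPy. The centers, widths, and targets below are made up for illustration; in practice <math>\mu_j</math> and <math>\sigma_j^2</math> come from the K-means output above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M = 60, 2, 2
X = np.vstack([rng.normal(-2.0, 0.5, size=(N // 2, d)),
               rng.normal(2.0, 0.5, size=(N // 2, d))])
Y = rng.normal(size=(N, 1))                # placeholder targets

mu = np.array([[-2.0, -2.0], [2.0, 2.0]])  # cluster centers (e.g. from K-means)
s2 = np.array([0.5, 0.5])                  # per-cluster variances

# Phi[i, j] = exp(-||x_i - mu_j||^2 / (2 sigma_j^2))
D2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
Phi = np.exp(-D2 / (2 * s2))

W = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)   # least-squares weights
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T  # hat matrix
Y_hat = Phi @ W                               # fitted values, equal to H @ Y
```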
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is the EM algorithm for the Gaussian mixture model. Generally speaking, the Gaussian mixture model performs soft clustering while <math>K</math>-means performs hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
E-step: each point is assigned a weight for each cluster based on its likelihood under the corresponding Gaussian. Unlike <math>K</math>-means, where each observation is assigned 1 for its closest cluster and 0 for all others, these weights are soft values between 0 and 1.<br />
<br />
M-step: compute the weighted means and covariances and make them the new means and covariances of each cluster.<br />
<br />
>>[P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
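For intuition, the E-step's soft weights can be sketched in Python for 1-D data with assumed (made-up) mixture parameters; each row of the result sums to 1:

```python
import numpy as np

def responsibilities(x, pi, mu, sigma2):
    """E-step: soft weight of each point under each Gaussian component.
    x: (n,) data; pi, mu, sigma2: (k,) mixing weights, means, variances."""
    lik = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) \
          / np.sqrt(2 * np.pi * sigma2)
    return lik / lik.sum(axis=1, keepdims=True)

r = responsibilities(np.array([-3.0, 0.0, 3.0]),
                     pi=np.array([0.5, 0.5]),
                     mu=np.array([-3.0, 3.0]),
                     sigma2=np.array([1.0, 1.0]))
print(r[1])  # the midpoint gets weight [0.5, 0.5]
```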
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier: they construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The extensions to the nonseparable case, where the classes overlap, generalize to what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separable by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
The data points in the figure fall into two classes in <math>\displaystyle \mathbb{R}^{2} </math> and can be separated by a linear boundary. If a dataset is indeed linearly separable, then there exist infinitely many possible separating hyperplanes; the black lines in the figure are two of them. But which solution is the best when new data are introduced? <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers such as (28.4) that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: since distance is positive, if the data is on <math>\displaystyle +1 </math> side the distance is <math>\displaystyle d_i(+1)</math>. If the data is on the <math>\displaystyle -1 </math> side the distance is <math>\displaystyle d_i(-1)</math>.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>; we can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers====<br />
Choose the line farthest from both classes, i.e. the line with the maximum distance from its closest point (maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since we are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
'''Properties''':<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Consider two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence,<math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and<math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the vector normal to the hyperplane.<br /><br />
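A quick numeric check of this orthogonality, with a made-up hyperplane <math>x_1+2x_2-3=0</math>:

```python
import numpy as np

beta = np.array([1.0, 2.0])
beta0 = -3.0

# Two points on the hyperplane beta^T x + beta0 = 0, i.e. x1 + 2*x2 = 3:
p1 = np.array([3.0, 0.0])
p2 = np.array([1.0, 1.0])

print(beta @ (p1 - p2))  # 0.0: beta is orthogonal to directions within the hyperplane
```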
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
<br />
<br />
3. The signed distance of any point <math>\displaystyle x_i </math> to the hyperplane: since the length of <math>\displaystyle \beta </math> is arbitrary, we normalize it to a unit vector.<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
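This signed-distance formula is easy to verify on a made-up line; for the horizontal line <math>x_2=1</math> the signed distance is just the height above (or below) the line:

```python
import numpy as np

beta = np.array([0.0, 1.0])  # hyperplane beta^T x + beta0 = 0, i.e. the line x2 = 1
beta0 = -1.0

def signed_distance(x):
    return (beta @ x + beta0) / np.linalg.norm(beta)

print(signed_distance(np.array([5.0, 3.0])))  # 2.0 (two units above the line)
print(signed_distance(np.array([5.0, 0.0])))  # -1.0 (one unit below)
```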
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
<math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane; for separable data every correctly classified point satisfies<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta </math> only the direction is important, so rescaling to <math>\displaystyle \frac{\beta}{c} </math> does not change the direction, and the hyperplane remains the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie, T., Tibshirani, R., Friedman, J. (2008). The Elements of Statistical Learning: 129-130<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
The Support Vector Machine is used to find a maximum margin hyperplane, assuming the two classes are separable. This margin can be written as <math>\,min\{y_id_i\}</math>, or the distance of each point from the hyperplane, where <math>\,d_i</math> is the distance and <math>\,y_i</math> is used as the sign.<br />
===Maximizing the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, and <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is not on the hyperplane.<br />
<br />
This implies <math>\,\exists C</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> define the hyperplane; only their direction matters, and dividing through by a constant does not change the hyperplane. Thus, by scaling <math>\,\beta, \beta_0</math> we can eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>, implying that the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math>.<br />
<br />
Now in order to maximize the margin, we simply need to maximize <math>\,\frac{1}{|\beta|}</math>, i.e. minimize <math>\,|\beta|</math>. <br />
<br />
In other words, find the minimum <math>\,|\beta|</math>, s.t. <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. The 1-norm of a vector is the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance), but it has a discontinuity in its derivative. The 2-norm, the Euclidean norm (the intuitive length of the vector), is easier to work with - that is <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>.<br />
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the conditions are satisfied, as well as finding an optimal solution. <math>\,\alpha_i</math> are introduced as dual constraints. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left(y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right)}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is a dual representation of the maximum-margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \,\forall i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the equation that we need to minimize, as well as two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> (the dual is maximized; equivalently, we minimize <math>\,-L(\alpha)</math>)<br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are solving for <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>S = y_iy_jx_i^Tx_j = (y_ix_i)^T(y_ix_i)</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
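The matrix <math>S</math> can be built exactly as written. A small NumPy sketch (with random made-up data) comparing the element-wise definition against the matrix form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.normal(size=(n, d))     # row i holds x_i^T
y = rng.choice([-1.0, 1.0], size=n)

# Element-wise definition: S_ij = y_i y_j x_i^T x_j
S_loop = np.array([[y[i] * y[j] * (X[i] @ X[j]) for j in range(n)]
                   for i in range(n)])

# Matrix form: Z has rows y_i x_i^T, so S = Z Z^T
Z = y[:, None] * X
S_mat = Z @ Z.T
```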
<br />
Using this matrix notation, we can use Matlab's built-in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use quadprog to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's quadprog function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. Note that quadprog minimizes its objective while we want to maximize <math>\,L(\alpha)</math>; minimizing <math>\,-L(\alpha) = \frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{\alpha}^T\underline{1}</math> matches quadprog's form with <math>\,H = S</math> and <math>\,f = -\underline{1}</math>.<br />
<br />
We'll use a simple one-dimensional data set: x is centered at -1 or +1 with Gaussian noise, and y is the corresponding -1/+1 label. (Note: you could easily put the values straight into the quadprog call; they are separated for clarity)<br />
<br />
x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]; % 200x1 column of 1-D points<br />
y = [-ones(100,1); ones(100,1)];<br />
Z = y .* x; % entries y_i * x_i<br />
S = Z * Z'; % S(i,j) = y_i * y_j * x_i * x_j<br />
f = -ones(200,1); % quadprog minimizes, so the linear term of L(alpha) is negated<br />
Aeq = y'; % enforces sum(alpha .* y) = 0<br />
beq = 0;<br />
lb = zeros(200,1); % alpha >= 0<br />
ub = []; % There is no upper bound<br />
alpha = quadprog(S,f,[],[],Aeq,beq,lb,ub);<br />
<br />
This gives us the optimal <math>\,\alpha</math>. (If many of the values come out negative despite the lower bound of 0, check the sign of <math>\,f</math>: quadprog minimizes, so the linear term of the dual must be negated, and the lower bound should be a vector of zeros rather than a scalar.)<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{i}</math>. If <math>\,f</math> and <math>\,g_i</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial \hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum{\alpha_ig'(\hat{x})}=0</math><br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions are violated, then the problem is deemed not feasible.<br />
<br />
These are all straightforward except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that actually determine the optimal separating hyperplane we are looking for. They are also the most difficult points to classify, and the most informative for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br />In order for this condition to be satisfied, either <br /><math>\,\alpha_i= 0</math> or <br /><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
Every point <math>x_i</math> is either exactly 1 or more than 1 away from the hyperplane (in the scaled units of the canonical representation).<br />
<br />
'''Case 1: a point <math>\displaystyle x_i</math> more than 1 unit away from the hyperplane'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point <math>\displaystyle x_i</math> exactly 1 unit away from the hyperplane'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
<br\>If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
<br />
Points on the margin, points with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
<br />
===Using support vectors===<br />
<br />
Support vectors are important because they make the support vector machine insensitive to points far from the decision boundary. If <math>\,\alpha_i = 0</math>, the corresponding term contributes nothing to the solution of the SVM problem; only points on the margin -- support vectors -- contribute.<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem:<math>\min_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's <code>quadprog</code> (which minimizes, so the objective must be negated) to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math> for <math>\,\beta_0</math><br />
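The steps above can be sketched numerically. The following is a minimal illustration in Python (rather than the course's Matlab), using a made-up two-point training set; because the constraint <math>\sum_i{\alpha_i y_i} = 0</math> forces the two multipliers to be equal here, the dual can be maximized by a simple one-dimensional search instead of <code>quadprog</code>.<br />

```python
# Illustrative sketch only: solve the SVM dual for a made-up 2-point data set.
# x1 = (1, 0) with y1 = +1 and x2 = (-1, 0) with y2 = -1.
xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1.0, -1.0]

def dual(a):
    # The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a here,
    # so L(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
    # reduces to a function of the single scalar a.
    alphas = [a, a]
    q = 0.0
    for i in range(2):
        for j in range(2):
            dot = xs[i][0]*xs[j][0] + xs[i][1]*xs[j][1]
            q += alphas[i]*alphas[j]*ys[i]*ys[j]*dot
    return sum(alphas) - 0.5*q

# Step 1: maximize the dual over a coarse grid of feasible values a >= 0.
a_star = max((k/1000 for k in range(2001)), key=dual)

# Step 2: beta = sum_i alpha_i y_i x_i.
beta = [sum(a_star*ys[i]*xs[i][d] for i in range(2)) for d in range(2)]

# Step 3: solve y_1(beta^T x_1 + beta_0) = 1 for beta_0 using a support vector.
beta0 = 1.0/ys[0] - (beta[0]*xs[0][0] + beta[1]*xs[0][1])
```

For this toy set both points are support vectors; the search returns <math>\,\alpha_1=\alpha_2=0.5</math>, giving <math>\,\beta=(1,0)</math> and <math>\,\beta_0=0</math>, i.e. the separating hyperplane lies halfway between the two points.<br />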
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab's built-in SVM routines (found in the Bioinformatics Toolbox) to perform classification with support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]</div>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification - 2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to learn from data. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, natural language processing, and so on. <br />
<br />
'''Definition''': The problem of predicting a discrete random variable <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math> is called '''classification'''.<br />
<br />
In classification, we use a training data set to approximate a function <math>\,h</math> that will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a '''classification rule''' <math>\,h</math> such that<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, drawn independently from the same distribution (i.i.d.), <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>, <math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented with its features <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:'''True error rate''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:'''Empirical error rate (training error rate)''' of a classifier <math>\,h</math> is defined as the proportion of points in the training set that <math>\,h</math> does not correctly classify, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
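As a quick illustration (a Python sketch; the rule and data below are made up, not from the course), the empirical error rate is just the fraction of training points the rule gets wrong:<br />

```python
# Empirical (training) error rate: fraction of training points misclassified.
def h(x):
    return 1 if x > 0.5 else 0   # a hypothetical classification rule

X = [0.1, 0.4, 0.6, 0.9, 0.3]    # made-up training inputs
Y = [0,   1,   1,   1,   0]      # their true labels

n = len(X)
L_hat = sum(1 for xi, yi in zip(X, Y) if h(xi) != yi) / n
# h misclassifies only x = 0.4 (predicted 0, true label 1), so L_hat = 1/5.
```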
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the probability <math>\,r(x)=P(Y=1|X=x)</math>. By ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> (\hat r) </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimension cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate, that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of it being of any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & \mathrm{if}\ P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remark:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since in practice the prior probability cannot be determined realistically.<br />
<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
* whether the student’s GPA was above 3.0 (G),<br />
* whether the student had a strong math background (M),<br />
* whether the student was a hard worker (H),<br />
* whether the student passed or failed the course.<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
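The arithmetic above can be checked with a short sketch (Python, for illustration). The class-conditional probabilities below are not read from the table; they are the hypothetical values implied by the quoted numerator <math>\,0.025 = P(X=(0,1,0)|Y=1)\times 0.5</math> and denominator <math>\,0.125</math>:<br />

```python
# Bayes formula for the student example; priors are 0.5 as stated, and the
# likelihoods are the hypothetical values implied by 0.025 and 0.125 above.
prior = {0: 0.5, 1: 0.5}
likelihood = {1: 0.05, 0: 0.20}   # P(X=(0,1,0) | Y=y), inferred, not from the table

evidence = sum(likelihood[y] * prior[y] for y in (0, 1))   # 0.125
r = likelihood[1] * prior[1] / evidence                    # 0.025 / 0.125 = 0.2
prediction = 1 if r > 0.5 else 0                           # Bayes rule: compare to 1/2
```

Since <math>\,r = 0.2 < 1/2</math>, the rule returns class 0, matching the prediction above.<br />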
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN), and general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first treats probability as something that changes with observation, while the second treats probability as having an objective existence. In fact, they represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics there have been two major schools: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be associated with a long-run frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g. a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. Under the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). Under the frequentist approach, one does not see the man (the object), but judges from photos (the labels) of this man whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>; then the optimal rule is <math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier represents the optimal method, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1) Empirical Risk Minimization: choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2) Regression: find an estimate <math> (\hat r) </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density estimation: estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean covariance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
In fact, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math><br />
<br />
===QDA===<br />
The concept is the same: find the boundary where the error rates for classification between classes are equal, except that the assumption that each cluster has the same covariance matrix <math>\,\Sigma</math> (equal to the mean covariance of <math>\Sigma_k \forall k</math>) is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because the quadratic terms <math>\,x^\top\Sigma_k^{-1}x</math> and <math>\,x^\top\Sigma_l^{-1}x</math> no longer cancel when <math>\,\Sigma_k \neq \Sigma_l</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
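The theorem can be sketched numerically. The following Python illustration (the course's own examples use Matlab; these classes and numbers are hypothetical) evaluates both forms of <math>\,\delta_k</math> for two 2-D Gaussian classes with a shared ''diagonal'' covariance, so the inverse and determinant can be written without a linear-algebra library. With equal priors, the point halfway between the two means scores equally under both classes, confirming that it lies on the decision boundary.<br />

```python
import math

def delta_quadratic(x, mu, diag_sigma, pi):
    # delta_k = -1/2 log|Sigma_k| - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log(pi_k)
    logdet = sum(math.log(s) for s in diag_sigma)
    maha = sum((xi - mi)**2 / s for xi, mi, s in zip(x, mu, diag_sigma))
    return -0.5*logdet - 0.5*maha + math.log(pi)

def delta_linear(x, mu, diag_sigma, pi):
    # delta_k = x^T Sigma^{-1} mu_k - 1/2 mu_k^T Sigma^{-1} mu_k + log(pi_k)
    xSm = sum(xi*mi/s for xi, mi, s in zip(x, mu, diag_sigma))
    mSm = sum(mi*mi/s for mi, s in zip(mu, diag_sigma))
    return xSm - 0.5*mSm + math.log(pi)

mu1, mu2 = [0.0, 0.0], [4.0, 2.0]   # hypothetical class means
sigma = [2.0, 1.0]                  # shared diagonal covariance
pi1 = pi2 = 0.5                     # equal priors

mid = [(a + b)/2 for a, b in zip(mu1, mu2)]   # midpoint of the two means
gap_lin = delta_linear(mid, mu1, sigma, pi1) - delta_linear(mid, mu2, sigma, pi2)
gap_quad = delta_quadratic(mid, mu1, sigma, pi1) - delta_quadratic(mid, mu2, sigma, pi2)
# Both gaps are 0: the midpoint lies exactly on the decision boundary, and with
# a common covariance the quadratic discriminant agrees with the linear one.
```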
<br />
===In practice===<br />
In practice the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we use their sample estimates in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
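These estimates can be sketched as follows (in Python, for illustration; the data set is made up, and one-dimensional, so the "covariance" is a scalar variance):<br />

```python
# Sample estimates of pi_k, mu_k, Sigma_k and the pooled (common) covariance.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (10.0, 1), (12.0, 1)]   # (x_i, y_i) pairs

n = len(data)
classes = sorted({y for _, y in data})
pi_hat, mu_hat, sigma_hat, n_k = {}, {}, {}, {}
for k in classes:
    xs = [x for x, y in data if y == k]
    n_k[k] = len(xs)
    pi_hat[k] = n_k[k] / n                                        # n_k / n
    mu_hat[k] = sum(xs) / n_k[k]                                  # class mean
    sigma_hat[k] = sum((x - mu_hat[k])**2 for x in xs) / n_k[k]   # ML variance

# Common covariance (ML estimate): Sigma = sum_k n_k Sigma_k / sum_k n_k.
sigma_pooled = sum(n_k[k] * sigma_hat[k] for k in classes) / n
```

Here <math>\,\hat{\pi}_0 = 3/5</math>, <math>\,\hat{\mu}_1 = 11</math>, and the pooled variance is <math>\,(3\cdot\tfrac{2}{3} + 2\cdot 1)/5 = 0.8</math>.<br />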
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
<br />
Note that when we have multiple classes, they must all use the same transformation; otherwise we would have to assume, ahead of time, which class a data point belongs to. All classes therefore need to have the same shape for classification to be applicable using this method, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and we want to transform them to a common shape. To classify a given data point we must first choose a transformation, yet using the transformation of class A already assumes that the point belongs to class A.<br />
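The transformation of Case 2 can be illustrated with a small sketch (Python; the covariance here is hypothetical and ''diagonal'', so <math>\,U=I</math> and <math>\,S</math> simply holds the variances, avoiding the need for an eigendecomposition). The point of the transform is that the Mahalanobis distance in the original space equals the squared Euclidean distance after whitening:<br />

```python
import math

# Whitening transform x* = S^(-1/2) U^T x for a diagonal covariance (U = I).
S = [4.0, 1.0]        # eigenvalues of Sigma (here, its diagonal entries)
mu = [2.0, -1.0]      # hypothetical class mean
x = [6.0, 1.0]        # hypothetical data point

def whiten(v):
    return [vi / math.sqrt(si) for vi, si in zip(v, S)]

# Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu) in the original space ...
maha = sum((xi - mi)**2 / si for xi, mi, si in zip(x, mu, S))

# ... equals the squared Euclidean distance between the whitened vectors.
x_star, mu_star = whiten(x), whiten(mu)
eucl = sum((a - b)**2 for a, b in zip(x_star, mu_star))
# Both distances equal 8.0 here, so Case 1 applies to the transformed data.
```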
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare each of the remaining <math>\,K-1</math> classes against one given class, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
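The counts above are easy to verify with a couple of lines (a Python sketch; <math>\,d=64</math> is chosen because the raw 2_3 digit data has 64 dimensions):<br />

```python
# Number of parameters per pairwise class difference, times (K - 1) differences.
def lda_params(d, K):
    return (K - 1) * (d + 1)              # a^T x + b: d + 1 parameters

def qda_params(d, K):
    # x^T A x + b^T x + c with A symmetric: d(d+1)/2 + d + 1 = d(d+3)/2 + 1
    return (K - 1) * (d * (d + 3) // 2 + 1)

counts = {d: (lda_params(d, 2), qda_params(d, 2)) for d in (2, 10, 64)}
# For K = 2 and d = 64, LDA needs 65 parameters while QDA needs 2145.
```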
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA classifies only 2 more data points correctly than LDA did; we can see a blue point and a red point that lie on the correct side of the curve but would not lie on the correct side of LDA's line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the full details of this function, but here we analyze its code to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of some emphasized steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
Comparing the above code with the SVD method, we should pay attention to the following aspects:<br />
<br />
First, rows of <math>\,X</math> correspond to observations, columns to variables. When using <code>princomp</code> on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, <code>princomp</code> centers <math>\,X</math> by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients of the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code> (up to the sign of each column, since singular vectors are only determined up to sign).<br />
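The equivalence of the two routes can also be sketched outside Matlab. The following is a hypothetical Python/NumPy check on random data, comparing PCA via SVD of the centered data (what <code>princomp</code> does internally) against PCA via an eigendecomposition of the sample covariance:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 8))         # rows = observations, as princomp expects
Xc = X - X.mean(axis=0)                   # center by subtracting column means
m = X.shape[0]

# PCA via SVD: Xc/sqrt(m-1) = U S V'; columns of V are the PC coefficients.
U, s, Vt = np.linalg.svd(Xc / np.sqrt(m - 1), full_matrices=False)
score_svd = Xc @ Vt.T

# PCA via eigendecomposition of the sample covariance, for comparison.
w, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(w)[::-1]               # eigh returns ascending order
score_eig = Xc @ V[:, order]

# The two score matrices agree up to the sign of each component.
assert np.allclose(np.abs(score_svd), np.abs(score_eig))
```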
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that contain functions of our original data, such as its squared values. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto the lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> parameters to estimate make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>d \times n</math> matrix to a quadratic dimension by appending another <math>d \times n</math> matrix containing the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
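:As a quick numeric sanity check, the following hypothetical Python/NumPy snippet verifies that a linear function of the augmented vector <math>x^*</math> reproduces a quadratic function of <math>x</math> (restricted to the diagonal case that <math>x^*</math> can represent):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
x = rng.standard_normal(d)
w = rng.standard_normal(d)   # linear coefficients
v = rng.standard_normal(d)   # coefficients of the squared terms (diagonal case)

# Augmented vectors as in the text: w* = [w, v], x* = [x, x^2].
w_star = np.concatenate([w, v])
x_star = np.concatenate([x, x**2])

# A linear function of x* ...
y_star = w_star @ x_star
# ... equals a quadratic function of x with the diagonal matrix diag(v).
y_quad = x @ np.diag(v) @ x + w @ x
assert np.isclose(y_star, y_quad)
```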
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we make <code>X_star</code> 400-by-6 and set <code>X_star(i,j+4) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
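:The same effect can be sketched on synthetic data rather than 2_3. The following hypothetical Python/NumPy snippet uses a minimal hand-rolled two-class LDA (not Matlab's <code>classify</code>) on a blob-inside-a-ring data set, where a linear boundary fails but the squared-feature augmentation succeeds:<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# Two classes that no linear boundary can separate: a blob inside a ring.
n = 200
inner = 0.5 * rng.standard_normal((n, 2))
theta = rng.uniform(0, 2 * np.pi, n)
outer = np.c_[3 * np.cos(theta), 3 * np.sin(theta)] + 0.3 * rng.standard_normal((n, 2))
X = np.vstack([inner, outer])
labels = np.r_[np.zeros(n), np.ones(n)]

def lda_accuracy(F, y):
    """Minimal two-class LDA: pooled covariance, threshold between projected means."""
    mu0, mu1 = F[y == 0].mean(axis=0), F[y == 1].mean(axis=0)
    Sw = np.cov(F[y == 0], rowvar=False) + np.cov(F[y == 1], rowvar=False)
    w = np.linalg.solve(Sw, mu1 - mu0)      # discriminant direction
    c = w @ (mu0 + mu1) / 2                 # midpoint threshold
    pred = (F @ w > c).astype(float)
    return float((pred == y).mean())

acc_linear = lda_accuracy(X, labels)                   # near chance level
acc_quad = lda_accuracy(np.hstack([X, X**2]), labels)  # squared features appended
assert acc_quad > acc_linear
```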
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)  # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between the two classes, whereas PCA is primarily good for reducing the number of dimensions when the data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the prior knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, and each class may have a different size. To be able to separate the two classes we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive semi-definite matrices; assuming it is positive-definite, it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
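:This proportionality can be checked numerically. The following is a hypothetical Python/NumPy sketch on two synthetic Gaussian classes (the same parameters as the Matlab example in the next section), comparing the top eigenvector of <math>S_{W}^{-1}S_{B}</math> with the closed form <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([1, 1], [[1, 1.5], [1.5, 3]], 300)
X2 = rng.multivariate_normal([5, 3], [[1, 1.5], [1.5, 3]], 300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)   # within-class covariance
Sb = np.outer(mu1 - mu2, mu1 - mu2)                        # between-class covariance

# w as the top eigenvector of Sw^{-1} Sb ...
vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = np.real(vecs[:, np.argmax(np.real(vals))])

# ... is proportional to the closed form Sw^{-1}(mu1 - mu2).
w_closed = np.linalg.solve(Sw, mu1 - mu2)
w_eig /= np.linalg.norm(w_eig)
w_closed /= np.linalg.norm(w_closed)
assert np.allclose(np.abs(w_eig), np.abs(w_closed))
```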
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through the figure produced by matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the most discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(The <math>k</math> class means span at most a <math>(k-1)</math>-dimensional subspace, so <math>k-1</math> directions are enough; it is also more reasonable to have at least 2 directions.)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> is the scatter matrix of class <math>i</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (The <math>\frac{1}{n_{i}}</math> normalization is dropped here so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not easy to obtain. One of the simplifications<br />
is that we may assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant, since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is another, more general derivation of <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
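This decomposition can be verified numerically. The following hypothetical Python/NumPy check on random data uses the unnormalized (scatter-matrix) forms of <math>\mathbf{S}_{W}</math>, <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{T}</math>, for which the identity holds exactly:<br />

```python
import numpy as np

rng = np.random.default_rng(4)
k, n_i, d = 3, 50, 4
X = [rng.standard_normal((n_i, d)) + 2 * i for i in range(k)]   # k classes

mu_i = [Xi.mean(axis=0) for Xi in X]       # class means
mu = np.vstack(X).mean(axis=0)             # total mean

# Scatter-matrix versions (no 1/n factors), matching S_T = sum (x-mu)(x-mu)^T.
Sw = sum((Xi - m).T @ (Xi - m) for Xi, m in zip(X, mu_i))
Sb = sum(n_i * np.outer(m - mu, m - mu) for m in mu_i)
St = (np.vstack(X) - mu).T @ (np.vstack(X) - mu)

assert np.allclose(St, Sw + Sb)   # the decomposition S_T = S_W + S_B
```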
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Clearly, the two expressions are very similar in form.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
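This identity is easy to confirm numerically; a hypothetical NumPy check on a random matrix:<br />

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((4, 3))
# The squared Frobenius norm of X equals the trace of X^T X.
assert np.isclose(np.sum(X**2), np.trace(X.T @ X))
```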
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
In fact, <math>\mathbf{\Lambda}</math> has at most <math>\mathbf{k-1}</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\le k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues in<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
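To make the eigen-decomposition recipe concrete, the following Python/NumPy sketch computes <math>\mathbf{W}</math> from scratch (the course examples use Matlab; the function and variable names here are our own, and the data used below are synthetic):<br />

```python
import numpy as np

def fda_directions(X, y, k):
    """Return the k-1 discriminant directions (columns of W) for k classes.
    X: n x d data matrix; y: integer labels 0..k-1."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in range(k):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += Xc.shape[0] * (diff @ diff.T)    # between-class scatter
        S_W += (Xc - mu_c).T @ (Xc - mu_c)      # within-class scatter
    # eigenvectors of S_W^{-1} S_B, sorted by eigenvalue (largest first)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:k - 1]].real
```

The returned matrix has <math>k-1</math> columns, matching the rank argument above.<br />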
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Setting the first derivative to zero,<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
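As a quick numerical illustration of the closed-form solution and the hat matrix, here is a Python/NumPy sketch with made-up data (not part of the lecture):<br />

```python
import numpy as np

# toy data: n = 5 points, one feature, with an intercept column of ones first
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])      # exactly y = 1 + 2x

# beta_hat = (X^T X)^{-1} X^T y  (solve rather than invert, for stability)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# hat matrix H = X (X^T X)^{-1} X^T; fitted values are y_hat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
```

Since the toy data lie exactly on a line, the fitted values reproduce <math>\mathbf{y}</math>.<br />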
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample'; ones(1,400)];<br />
Construct x by transposing the reduced data and appending a row of ones for the intercept.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== logistic function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y\, dx = \ln(1 + e^{x}) + C</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
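Property 1 can be checked numerically; this illustrative Python/NumPy sketch compares the closed form <math>y(1-y)</math> with a finite-difference derivative:<br />

```python
import numpy as np

def logistic(x):
    """The logistic (sigmoid) function y = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 1001)
y = logistic(x)

# property 1: dy/dx = y(1 - y); compare against a finite-difference derivative
dy_formula = y * (1.0 - y)
dy_numeric = np.gradient(y, x)
```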
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
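Both class probabilities can be computed from one function, since <math>P(Y=0|X=x)=1-P(Y=1|X=x)</math>; a minimal plain-Python sketch (names our own):<br />

```python
import math

def p_class1(beta, x):
    """P(Y=1 | X=x) for the two-class logistic model."""
    t = math.exp(sum(b * xi for b, xi in zip(beta, x)))  # exp(beta^T x)
    return t / (1.0 + t)
```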
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood, using the conditional likelihood <math>\Pr(Y|X)</math>. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, take the log of both sides, we get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
-\sum_{i=1}^n \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; see the [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html Matrix Reference Manual], a very useful website with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=-\sum_{i=1}^n \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}\underline{x}_i^T</math> (by cancellation)<br />
<br />
::<math>=-\sum_{i=1}^n \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> with the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math><br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares solves the minimization <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math> (with <math>X</math> stored as a <math>d\times n</math> matrix),<br />
<br />
which gives <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}\,(Z-X^T\underline{\beta})^{T}W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
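As a sanity check of this closed form, the following Python/NumPy sketch (made-up data) verifies that the WLS estimator coincides with ordinary least squares applied to the data scaled by <math>\sqrt{w_i}</math>:<br />

```python
import numpy as np

# toy weighted least squares: n = 4 points, d = 2 (intercept + slope)
x_rows = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])   # each x_i as a row
y = np.array([0.0, 2.0, 4.0, 9.0])
w = np.array([1.0, 1.0, 1.0, 0.25])     # down-weight the outlying last point

# beta_WLS = [sum w_i x_i x_i^T]^{-1} [sum w_i x_i y_i]
A = sum(wi * np.outer(xi, xi) for wi, xi in zip(w, x_rows))
b = sum(wi * xi * yi for wi, xi, yi in zip(w, x_rows, y))
beta_wls = np.linalg.solve(A, b)
```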
<br />
This amounts to a weighted linear regression on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, the augmented <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case; however, convergence is not guaranteed. The procedure will usually converge, since the log-likelihood function is concave. When it does not, only local convergence of the method can be proved, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: it is uncommon for an initial guess to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
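The pseudo code can be translated into a short Python/NumPy sketch (the course examples use Matlab; names are our own, and <math>X</math> is stored as <math>d\times n</math> to match the notation above; the data are assumed not perfectly separable, so the weights stay nonzero):<br />

```python
import numpy as np

def logistic_irls(X, y, tol=1e-8, max_iter=100):
    """X: d x n input matrix (one observation per column); y: length-n 0/1 labels."""
    d, n = X.shape
    beta = np.zeros(d)                                       # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(beta @ X)))                # step 3: P(x_i; beta)
        w = p * (1.0 - p)                                    # step 4: diagonal of W
        z = X.T @ beta + (y - p) / w                         # step 5: adjusted response
        beta_new = np.linalg.solve((X * w) @ X.T, (X * w) @ z)  # step 6: (XWX^T)^-1 XWz
        if np.linalg.norm(beta_new - beta) < tol:            # step 7: convergence check
            return beta_new
        beta = beta_new
    return beta
```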
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to classify the data with logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|The decision boundary obtained by logistic regression. The line shows how the two classes are split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Viewing the fitting of these equations as a weighted least squares problem makes them easier to work with.<br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.<br />
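The K-class posteriors can be computed directly from the <math>K-1</math> coefficient vectors; here is a short Python/NumPy sketch (names our own):<br />

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """betas: (K-1) x d coefficient matrix; x: length-d input vector.
    Returns the K posteriors P(Y=i|X=x), with reference class K listed last."""
    scores = np.exp(betas @ x)           # exp(beta_i^T x), i = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)
```

With all coefficients zero the classes are equally likely, and in every case the posteriors sum to 1.<br />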
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
Separating hyperplane classifiers try to separate the data using linear decision boundaries. When the classes overlap, the idea can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged and transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>sign(\underline{\beta}^T \underline{x} + {\beta}_0) = sign(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math> (here with <math>d=2</math> features)<br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem is not convex and has no unique optimum: the algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable, the algorithm is guaranteed to converge to some separating hyperplane in a finite number of steps. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br />
<br />
<br />
If the separating hyperplane between the 2 classes is not unique, the perceptron algorithm may return any one of infinitely many solutions.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];                        % target labels<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';  % one example per column<br />
b_0=0;                                     % intercept<br />
b=[1;1;1];                                 % initial weight vector<br />
rho=.5;                                    % learning rate<br />
for j=1:100<br />
    changed=0;<br />
    for i=1:6<br />
        d=(b'*x(:,i)+b_0)*y(i);   % positive iff example i is correctly classified<br />
        if d<0                    % misclassified: move the boundary toward the point<br />
            b=b+rho*x(:,i)*y(i);<br />
            b_0=b_0+rho*y(i);<br />
            changed=1;<br />
        end<br />
    end<br />
    if changed==0                 % a full pass with no mistakes: converged<br />
        break;<br />
    end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of some weights of these inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> takes the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (an initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points, as illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, and it terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\underline{\beta}</math> to be normalized, so that <math>\|\underline{\beta}\|=1</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A drawback of this approach is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
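The update rule above can also be sketched in code. The following is a minimal Python illustration, not the classroom implementation; it reuses the toy data from the earlier MATLAB example, and the learning rate and iteration cap are arbitrary choices:

```python
import numpy as np

def perceptron(X, y, rho=0.5, max_iter=100):
    """Perceptron algorithm. X has one row per point; y holds +1/-1 labels."""
    beta = np.zeros(X.shape[1])    # initial guess for the hyperplane
    beta0 = 0.0
    for _ in range(max_iter):
        changed = False
        for xi, yi in zip(X, y):
            # y_i * (beta' x_i + beta0) <= 0 iff point i is misclassified
            if yi * (beta @ xi + beta0) <= 0:
                beta = beta + rho * yi * xi    # gradient-descent step
                beta0 = beta0 + rho * yi
                changed = True
        if not changed:            # a full pass with no updates: converged
            break
    return beta, beta0

# Toy data from the earlier example
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])
beta, beta0 = perceptron(X, y)
```

Since this toy data set is linearly separable, the perceptron convergence theorem guarantees that the loop terminates with every point on the correct side of the boundary.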
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, the algorithm converges quickly; if the gap is small, it converges slowly. This problem can be mitigated by using basis expansions: specifically, we try to find a hyperplane not in the original space, but in the enlarged space obtained by applying some basis functions.<br />
#If the classes are separable, there exist infinitely many solutions to the Perceptron problem, all of which are hyperplanes. <br />
#The speed of convergence also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if it is too large the algorithm may "skip over" the minimum it is trying to find, possibly oscillating forever between points on either side of the minimum.<br />
#A perfect separation is not always available, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found, for example, in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref>Christopher M. Bishop, Pattern Recognition and Machine Learning, p. 194</ref>.<br />
====Comment on gradient descent algorithm====<br />
Imagine standing on a mountain peak and wanting to reach the valley as quickly as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and you start in the middle, you may end up at the saddle point, a point where the gradient is zero but which is not the true minimum, and get stuck there.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm we drop the summation over <math>\,i</math> (all data points). This is a variant of the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, in which we approximate the true gradient by evaluating it on a single training example. This means that <math>\,{\beta}</math> is improved using the computation for only one sample at a time. With a large data set, say a population database, it is very time-consuming to sum over millions of samples at every step. With stochastic gradient descent, we can treat the problem sample by sample and still get decent results in practice.<br />
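The distinction can be sketched as follows (a hypothetical least-squares objective; the function names are illustrative, not from the lecture):

```python
import numpy as np

def batch_gradient_step(beta, X, y, rho):
    """Batch gradient descent: one step uses the gradient summed over ALL samples."""
    grad = sum(-2.0 * (yi - beta @ xi) * xi for xi, yi in zip(X, y))
    return beta - rho * grad

def stochastic_gradient_step(beta, xi, yi, rho):
    """Stochastic gradient descent: one step uses a SINGLE sample's gradient."""
    grad = -2.0 * (yi - beta @ xi) * xi
    return beta - rho * grad
```

With millions of samples, the stochastic version performs many cheap updates per pass through the data instead of one expensive one, which is why it is attractive for large data sets.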
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 shows an example of a typical neural network, but many other configurations exist.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer, each representing the probability of one of the k classes, and each <math>\displaystyle y_k</math> is coded as 0 or 1.<br />
<br />
===Activation Function===<br />
Activation function is a term that is frequently used in classification by neural networks. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of the input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
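A small numeric sketch of the logistic activation (standard facts about this function, not specific to the lecture): its derivative can be written in terms of its value, <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>, which makes gradients cheap to compute, and it saturates at 0 and 1 for large negative and positive inputs.

```python
import numpy as np

def sigma(a):
    """Logistic (inverse-logit) activation: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sigma_prime(a):
    """Derivative of the logistic: sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = sigma(a)
    return s * (1.0 - s)
```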
<br />
There are some important properties of the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates, meaning its output has a maximum and a minimum value. This property helps keep the weights bounded and therefore limits the search time. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; RBF networks, for example, use non-monotonic activations and are still powerful models. <br />
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the output layer has only one unit, so that we are working with a regression problem. Later we will see how this can be extended to multiple output units, turning it into a classification problem.<br />
<br />
For simplicity, there is only 1 unit at the end and assume for the moment we are doing regression.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math>'s are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true response and the network output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: the labels i and j are reversed in this figure.)''<br />
<br />
Notice that the weighted sum of the outputs of the units at layer <math>\displaystyle l</math> is the input into the units at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, find the error at the output, and then propagate the error back to the previous layers; for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. With these derivatives at hand, we adjust the weight of each edge by taking steps proportional to the negative of the gradient, decreasing the error at the output. The next step is to apply the next input from the training set and repeat the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the target output and <math>\hat{y_k}</math> is the actual network output.<br />
#By propagating to the previous layers, evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math>, where <math>i</math> ranges over the units of the layer closer to the output, whose <math>\,\delta_i</math>s have already been computed.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
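The six steps above can be sketched for a network with one hidden layer and a single linear output unit (a toy illustration; the data, layer sizes, and learning rate are invented, and bias terms are included for each unit):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def train(X, y, n_hidden=4, rho=0.1, n_epochs=2000):
    """One-hidden-layer regression network trained by back-propagation."""
    d = X.shape[1]
    U = rng.uniform(-1, 1, (n_hidden, d))   # hidden weights u_jl (random start)
    b = np.zeros(n_hidden)                  # hidden biases
    w = rng.uniform(-1, 1, n_hidden)        # output weights w_i
    w0 = 0.0                                # output bias
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            # Step 1: forward pass -- outputs of all nodes
            a = U @ xi + b                  # inputs a_j to the hidden units
            z = sigma(a)                    # hidden outputs z_j
            y_hat = w @ z + w0              # linear output unit
            # Step 2: delta at the output unit, for err = (y - y_hat)^2 / 2
            delta_out = -(yi - y_hat)
            # Step 3: propagate back -- delta_j = sigma'(a_j) * sum_i delta_i u_ij
            delta = z * (1.0 - z) * (delta_out * w)
            # Steps 4-5: evaluate the gradients and step against them
            w -= rho * delta_out * z
            w0 -= rho * delta_out
            U -= rho * np.outer(delta, xi)
            b -= rho * delta
        # Step 6: the loops above feed the next point / next pass
    return U, b, w, w0

def predict(U, b, w, w0, X):
    return np.array([w @ sigma(U @ xi + b) + w0 for xi in X])
```

On an easy one-dimensional regression problem this sketch drives the training error close to zero after a few thousand passes.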
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method of choosing weights mentioned in class was to randomize the weights before the first step. This is not likely to be near the optimal solution in every case, but it is simple to implement. To be more specific, random values near zero (usually from [-1,1]) are a good choice for the initial weights. In this case, the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output by using a linear approximation of <math>\,\sigma(a_i)</math>, which finds the optimal weights for the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the number of iterations increases.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate may lead to a very slow convergence rate (a very long learning phase). The advantage of a small learning rate, however, is that it can guarantee convergence. Thus, it is generally better to choose a relatively small learning rate to ensure stability; usually <math>\,\rho</math> is chosen between 0.01 and 0.7.<br />
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we will mainly discuss how to estimate the number of hidden units at the very beginning. Obviously, we should then adjust it using cross-validation (CV), leave-one-out (LOO) or other complexity control methods. <br />
<br />
Basically, if the patterns are well separated, few hidden units are enough. If the patterns are drawn from a highly complicated mixture model, more hidden units are needed. <br />
<br />
The number of hidden units determines the size of the model, and therefore the total number of weights in the model. Typically, the number of weights should not be larger than the number of training data points, say N; thus N/10 is sometimes a good choice. In practice, however, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this Neural Network, the number of nodes is reduced until we reach a layer whose number of nodes matches the desired dimensionality. (In the first few layers the number of nodes need not be strictly decreasing, as long as a layer with fewer nodes is eventually reached.) From this middle layer on, the previous layers are mirrored, so the output layer has the same number of units as the input layer. Now note that if we feed the network with a point and get an output approximately equal to that input, the input has been reconstructed from the middle-layer units alone, so the output of the middle-layer units can represent the input in fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
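A minimal numeric sketch of the bottleneck idea, using a single linear middle layer instead of a full mirrored network (the data and all sizes here are hypothetical; a real application would use several nonlinear layers as described above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 3-dimensional data that actually lies near a 1-dimensional line
t = rng.uniform(-1, 1, 50)
X = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(50, 3))

W1 = rng.normal(scale=0.1, size=(1, 3))   # encoder: input -> 1-D bottleneck
W2 = rng.normal(scale=0.1, size=(3, 1))   # decoder: bottleneck -> reconstruction
rho = 0.01
for _ in range(500):
    for x in X:
        z = W1 @ x                        # low-dimensional code (middle layer)
        x_hat = W2 @ z                    # reconstruction at the output layer
        err = x_hat - x                   # reconstruction error
        # Back-propagate the error through both halves of the network
        W2 -= rho * np.outer(err, z)
        W1 -= rho * np.outer(W2.T @ err, x)

# Mean squared reconstruction error after training
recon_error = np.mean((X - (X @ W1.T) @ W2.T) ** 2)
```

After training, `X @ W1.T` is the low-dimensional mapping and `W2` performs the reconstruction; because this data is essentially one-dimensional, the reconstruction error should end up far below the variance of the data.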
<br />
=== Deep Neural Network ===<br />
Back-propagation may not work well in practice when there are too many hidden layers, since the <math>\,\delta</math>s may become negligible and the errors vanish. This is a numerical problem in which it is difficult to estimate the errors, so in practice configuring a Neural Network with back-propagation involves some subtleties.<br />
Deep Neural Networks became popular two or three years ago, when introduced by Bradford Nill in his PhD thesis. The Deep Neural Network training algorithm deals with the training of a Neural Network with a large number of layers.<br />
<br />
The approach of training the deep network is to assume the network has only two layers first and train these two layers. After that we train the next two layers, so on and so forth.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, an idea inspired by statistical physics, where systems settle into their most stable (lowest-energy) state; or somehow finding the output of the second layer that is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case as an example: a feedforward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach was adaptive to the booking rules, and the system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to model human brains, hence the fancy name "Neural Network". But now we know that they are just logistic regression layers stacked on top of each other, and have nothing to do with how the brain actually works.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. Given such claims, it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and tune them wisely. Partly for this reason, they are no longer as active a research area in machine learning, though NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model will have very high accuracy on the training set but very poor ability to predict outcomes for new instances, especially outside the domain of the training set.<br />
<br />
In a Neural Network, if the network is too deep it will have many degrees of freedom and will learn every characteristic of the training data set. That means it will give very accurate outcomes on the training set, but will not be able to generalize from the commonality of the training set to predict outcomes for new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model which fits it best. We can find a polynomial of high degree which passes through almost all the points in the training set. But in fact the training set comes from a line model. Although the complex model has less error on the training set, it diverges from the line in ranges where we have no training points; because of this, the high-degree polynomial has very poor predictive performance on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a degree-two polynomial model. If we model this training set with a polynomial of degree one, our model will have a high error rate on the training set, because it is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier: just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas; but if presented with a new banana of slightly different shape, for example, the classifier cannot detect it. This is the tradeoff: what is the right level of complexity?<br />
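The first example above can be checked numerically. A minimal sketch (the line model, noise level, and degrees are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Training data generated from a LINE model, y = 1 + 2x + noise
x_train = np.linspace(0, 1, 10)
y_train = 1 + 2 * x_train + 0.2 * rng.normal(size=10)
x_test = np.linspace(0, 1, 200)
y_test = 1 + 2 * x_test                  # noiseless truth at the test points

def fit_and_errors(degree):
    """Fit a polynomial of the given degree; return (train error, test error)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_err, test_err

tr1, te1 = fit_and_errors(1)   # the simple (correct) line model
tr9, te9 = fit_and_errors(9)   # degree 9 can interpolate all 10 training points
```

With this setup, the degree-9 fit drives the training error nearly to zero yet generalizes worse than the line fit: lower training error, larger test error.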
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex, and underfitting occurs when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data, for example by restricting it to a family of polynomials or to a particular neural network architecture. There are other ways as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier, and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average to get the empirical error rate, an estimate of the probability that <math>h(x_{i}) \neq y_{i}</math>.<br />
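In code, the empirical error rate is just the fraction of training pairs on which the classifier disagrees with the label (the classifier and data below are invented for illustration):

```python
import numpy as np

def empirical_error_rate(h, X, y):
    """L_h = (1/n) * sum of I(h(x_i) != y_i)."""
    return np.mean([h(xi) != yi for xi, yi in zip(X, y)])

# Toy example: a threshold classifier on 1-D points
h = lambda x: 1 if x > 0 else -1
X = np.array([-2.0, -1.0, 0.5, 1.0])
y = np.array([-1, 1, 1, 1])     # the point at -1.0 gets misclassified
rate = empirical_error_rate(h, X, y)
```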
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
There is a downward bias to this estimate, meaning that on average it underestimates the true error rate. <br />
<br />
As we increase the complexity from low to high, the training error rate always decreases. When we apply our model to test data, the error rate decreases at first but then increases, since the model has not seen the test data before. This is because training error decreases as we fit the model better by increasing its complexity, but as we have seen, an overly complex model does not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample curve shown in Figure 2) to get our empirical error rate.<br />
The right complexity is the one at which the error rate on the test data is at its minimum; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in \mathbb{R}</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of an estimator is that it is correct on average, that is, unbiased: <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a property more important than unbiasedness alone: the mean squared error. In some problems it pays to use an estimator with a small bias, because it may achieve a smaller mean squared error, or be median-unbiased rather than mean-unbiased (the standard unbiasedness property). Median-unbiasedness is invariant under monotone transformations, while mean-unbiasedness may be lost under nonlinear transformations. For example, an unbiased estimator with a large mean squared error still risks a large error on any given sample, whereas a slightly biased estimator with a small mean squared error can improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a fixed MSE, a lower bias forces a higher variance and vice versa.<br />
<br />
The test error is a good estimate of the MSE. We want an estimator with balanced bias and variance (neither too high), even though it will then carry some bias.<br />
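The decomposition above can be checked numerically. The following Python sketch uses a deliberately biased (shrunken) estimator of a mean; the estimator and the distribution are hypothetical choices made only to illustrate the identity:<br />

```python
import numpy as np

# Monte Carlo check of MSE(f^) = Var(f^) + Bias^2(f^) for a shrunken
# (biased) estimator of the mean: f^ = sum(X_i) / (n + 5).
rng = np.random.default_rng(0)
f_true, n, reps = 2.0, 20, 200000

samples = rng.normal(f_true, 1.0, size=(reps, n))
f_hat = samples.sum(axis=1) / (n + 5)       # shrunken estimator, biased toward 0

mse  = np.mean((f_hat - f_true) ** 2)
bias = np.mean(f_hat) - f_true
var  = np.var(f_hat)

print(mse, var + bias ** 2)  # the two quantities agree
```

Here the bias is negative (the estimator shrinks toward zero), yet the decomposition holds exactly for the empirical moments.<br />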
<br />
<br />
Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
> x <- rnorm(200,0,1)<br />
> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
> xtest <- rnorm(50,1,1)<br />
> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
> p1 <- lm(y~x)<br />
> p2 <- lm(y ~ poly(x,2))<br />
> pn <- lm(y ~ poly(x,10))<br />
> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are based on a <math>\,N(0,1)</math> distribution, while the test set has a <math>\,N(1,1)</math> distribution. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + \varepsilon</math> with noise <math>\,\varepsilon \sim N(0,0.3^2)</math>, a quadratic function with some random variation. Polynomial least squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
> # calculate the mean squared error of degree 1 poly<br />
> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
[1] 1.576042<br />
> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 7.727615<br />
: Training and test mean squared errors for the linear fit. These are both quite high - and since the data is non-linear, the different mean value of the test data increases the error quite a bit.<br />
> # calculate the mean squared error of degree 2 poly<br />
> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
[1] 0.08608467<br />
> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 0.08407432<br />
: This fit is far better - and there is not much difference between the training and test error, either.<br />
> # calculate the mean squared error of degree 10 poly<br />
> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
[1] 0.07967558<br />
> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but not by much, while the test error has risen sharply. The overfitting makes it a poor predictor. As the degree of the polynomial rises further, numerical accuracy becomes an issue, and a good fit is not even consistently produced for the training data.<br />
> # calculate mse of sin/cos fit<br />
> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
[1] 0.1105446<br />
> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 1.320404<br />
: Fitting a function of the form <math>\,\sin(x)+\cos(x)</math> works quite well on the training set, but because it is not the true underlying function, it fails on test data that does not lie in the same region as the training data.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V: the model has not seen V, and since we know the proper labels of all elements in V, we can count how many were misclassified.<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
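The split-and-validate procedure above can be sketched as follows. The classifier here (1-nearest-neighbour on one-dimensional points) and the 70/30 split are hypothetical choices for illustration only:<br />

```python
import random

# Hold-out validation: train on T, estimate the error on the untouched set V.
def nn_classify(train, x):
    # hypothetical classifier: return the label of the nearest training point
    return min(train, key=lambda p: abs(p[0] - x))[1]

random.seed(1)
data = [(x, int(x > 0)) for x in (random.gauss(0, 1) for _ in range(100))]
random.shuffle(data)
T, V = data[:70], data[70:]          # 70/30 split into training and validation

error_rate = sum(nn_classify(T, x) != y for x, y in V) / len(V)
print(error_rate)                    # empirical error rate on unseen points
```

Because V is untouched during training, this error rate behaves like an error on genuinely new data.<br />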
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although training error always decreases with increasing model complexity, the test error starts to increase beyond a certain point, which is known as overfitting (see [[#prediction-error|figure 2]] above). Since the test error is the best estimate of the MSE (mean squared error), the idea is to divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{x_i \in \nu} I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set and <math>\,I</math> is the indicator function.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in reality it may be hard to collect data, especially in high dimensions where we suffer from the curse of dimensionality, and a larger data set may be hard to come by. Consequently, we may not be able to afford to sacrifice part of our limited data as a pure validation set. In this case we use another method that addresses this problem, K-fold cross-validation. We divide the data set into <math>\,K</math> subsets of roughly equal size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
if <math>\,K=n</math> (leave-one-out): low bias, high variance. Each subset contains a single element, so the model is trained on all points except one, and then validated using that point.<br />
<br />
if <math>\,K</math> is small (say 2-fold or 5-fold): higher bias, lower variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th part <math>( \,k \in [ 1, K ] )</math>, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate its prediction error <math>\hat L_k</math>. The overall estimate is the average<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. Next we use the first three subsets for training and the last as the validation set; then the 1st, 2nd, and 4th subsets for training and the 3rd for validation, and so on until every subset has served as the validation set once (so all observations are used for both training and validation). After we get <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for the degree-n model and plot the resulting curve, then choose the degree corresponding to the minimum error. The same method can be used to find the optimal number of hidden units in a neural network: begin with 1 unit, then 2, 3, and so on, and pick the number with the lowest average error.<br />
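The procedure above can be sketched in Python. The data mimic the quadratic R example from earlier, with <code>np.polyfit</code> standing in for <code>lm</code>; the sample size, seed, and candidate degrees are illustrative choices:<br />

```python
import numpy as np

# K-fold CV (K=4) to pick a polynomial degree on quadratic data
# generated as in the earlier R example: y = x^2 - 0.5x + noise.
rng = np.random.default_rng(2)
x = rng.normal(0, 1, 80)
y = x**2 - 0.5 * x + rng.normal(0, 0.3, 80)

K = 4
folds = np.array_split(rng.permutation(80), K)   # 4 disjoint index subsets

def cv_error(degree):
    errs = []
    for k in range(K):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[trn], y[trn], degree)     # fit on K-1 parts
        pred = np.polyval(coef, x[val])               # validate on the k-th
        errs.append(np.mean((y[val] - pred) ** 2))
    return np.mean(errs)                              # average over the K folds

scores = {d: cv_error(d) for d in (1, 2, 10)}
print(scores)   # the degree with the lowest average error is chosen
```

On this quadratic data the degree-2 model should achieve a far lower average validation error than the linear model.<br />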
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be denoted by <math>\mathbf{y}</math> and the vector of fitted values by <math>\hat{\mathbf{y}}</math>. Then<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}]^{2}</math>,<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that the trace of <math>\mathbf{H}</math> is often easier to compute than its individual diagonal elements.<br />
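A sketch comparing the exact leave-one-out formula above with its GCV approximation, for an ordinary linear regression; the data are simulated purely for illustration:<br />

```python
import numpy as np

# GCV for a linear smoother y_hat = H y, with H = X (X^T X)^{-1} X^T as above.
rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.5, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
resid = y - H @ y

# Exact LOOCV uses the diagonal entries H_ii; GCV replaces each H_ii
# by their average trace(H)/N.
loocv = np.mean((resid / (1 - np.diag(H))) ** 2)
gcv   = np.mean((resid / (1 - np.trace(H) / n)) ** 2)
print(loocv, gcv)   # close, since trace(H)/N averages the leverages
```

Note that for a projection matrix, <math>trace(\mathbf{H})</math> equals the number of parameters (here 2), which is trivially available.<br />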
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point in the original training data set to train our model, then using the data point that we initially left out to estimate the true error. By repeating this process for every data point in our original data set, we can obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every candidate value of the parameter, we must train the model <math>\,K</math> times. This cost is even more obvious in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, where <math>\,n</math> is the number of data points in the data set.<br />
<br />
Fortunately, when adding data points to the classifier is reversible, calculating the difference between two classifiers is computationally cheaper than calculating the two classifiers separately. So, if the classifier trained on all the data points is known, we simply undo the contribution of one data point at a time, <math>\,n</math> times, to calculate the leave-one-out cross-validation error rate.<br />
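For linear regression this shortcut is exact. The sketch below compares <math>\,n</math> brute-force refits against the single-fit hat-matrix identity given in the generalized cross-validation section; the simulated data are illustrative:<br />

```python
import numpy as np

# Leave-one-out CV done two ways: n separate refits, versus the closed-form
# identity (y_i - f(x_i)) / (1 - H_ii), which needs only one fit.
rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 3.0 - 1.0 * X[:, 1] + rng.normal(0, 0.4, n)

# Brute force: refit with point i held out, then predict the held-out point.
brute = []
for i in range(n):
    keep = np.arange(n) != i
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    brute.append((y[i] - X[i] @ beta) ** 2)
brute = np.mean(brute)

# Single fit plus the hat-matrix shortcut.
H = X @ np.linalg.inv(X.T @ X) @ X.T
shortcut = np.mean(((y - H @ y) / (1 - np.diag(H))) ** 2)

print(brute, shortcut)   # identical up to floating-point error
```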
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as a way to obtain a robust neural network that is insensitive to noise. Since the size of the hidden layers in a NN is usually decided by domain knowledge, the network can easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function is approximately linear, and the NN then collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights close to zero.<br />
<br />
Formally, we penalize nonlinear weights by adding a penalty term in the error function. Now the regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
Usually, a <math>\,\lambda</math> that is too large will make the weights <math>\,w_i</math> and <math>\,u_{jk}</math> too small. We can use cross-validation to estimate <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and linearity; in practice we may set <math>\,\lambda</math> by cross-validation. The starting weights also matter: weights of exactly zero give zero derivatives, so the algorithm never moves, while weights that are too large start from a highly nonlinear model, which often leads to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
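The update rule <math>w \leftarrow w - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math> can be sketched on a single linear output layer, standing in for the network's output weights; the data, <math>\,\rho</math>, and <math>\,\lambda</math> values are hypothetical:<br />

```python
import numpy as np

# Gradient descent with weight decay: w <- w - rho*(d err/d w + 2*lambda*w).
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 100)

def train(lam, rho=0.01, steps=2000):
    w = np.zeros(3)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # d err/d w for squared error
        w -= rho * (grad + 2 * lam * w)          # decay term shrinks w toward 0
    return w

w_free  = train(lam=0.0)
w_decay = train(lam=1.0)
print(np.linalg.norm(w_free), np.linalg.norm(w_decay))  # decay gives smaller weights
```

With <math>\,\lambda = 0</math> the weights converge to the unpenalized least-squares solution; with <math>\,\lambda > 0</math> they are shrunk toward zero, which is exactly the push toward the near-linear regime discussed above.<br />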
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with an output layer and a single hidden layer, with weights from the hidden layer to the output layer, and can be trained without back propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data. The radii of the RBF functions may differ, and the centers and radii can be determined through clustering or an EM algorithm. When the vector x arrives from the input layer, each hidden neuron computes the radial distance from its center point and applies the RBF function to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
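A sketch of this closed-form training for a one-dimensional RBF fit; the target function, the number of centers, and the width <math>\,\sigma</math> are arbitrary choices made for illustration:<br />

```python
import numpy as np

# RBF network trained in closed form: build Phi from Gaussian basis
# functions, then W = (Phi^T Phi)^{-1} Phi^T Y as derived above.
rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, 120)
y = np.sin(x) + rng.normal(0, 0.1, 120)          # hypothetical target curve

centers = np.linspace(-3, 3, 8)                  # centers mu_j, chosen by hand
sigma = 1.0
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

W = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)      # closed-form least squares
y_hat = Phi @ W

mse = np.mean((y - y_hat) ** 2)
print(mse)   # small training MSE, with no iterative training needed
```

In practice the centers and widths would come from clustering or an EM algorithm, as noted earlier; only the weights need this least-squares step.<br />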
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the matrix (<math>n \times k</math>) of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} & \phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & \vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n0} & \phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the matrix (<math>n \times (M+1)</math>) of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the matrix (<math>(M+1) \times k</math>) of weights.<br />
<br />
where the extra basis function <math>\Phi_{0}</math> is set to 1.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function. Giving the familiar form,<br /><br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
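A small sketch of the normalization step, using arbitrary inputs and centers; the normalized basis values at each input sum to one, giving a soft assignment over the centers:<br />

```python
import numpy as np

# Normalized RBF: each basis value is divided by the sum over all bases.
x = np.array([-1.0, 0.0, 2.0])          # illustrative 1-D inputs
centers = np.array([-2.0, 0.0, 1.0])    # illustrative centers mu_j
sigma = 1.0

Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))
Phi_star = Phi / Phi.sum(axis=1, keepdims=True)   # Phi*_j(x)

print(Phi_star.sum(axis=1))   # each row sums to 1
```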
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit, then, can be thought of as representing a feature; if there are more hidden units than input units, we essentially project to a higher-dimensional space, as we did in our earlier trick. This does not mean that an RBF network will literally do this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Nevertheless, precisely because of this expressive power, overfitting becomes a more pressing concern: we have to control the network's complexity so that it fits a general pattern rather than an arbitrary training set.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can replace the class conditional density in the above conditional probability expression by marginalizing <math>\,x</math> over <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When using an RBF network for classification, we usually treat it as a regression problem and set a threshold to decide each data point's class membership. However, to gain some insight into what we are doing in terms of the RBF network, we often think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as shown in the graph on the right, that we have three random variables: <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. Each class can trigger a different hidden random variable <math>\displaystyle j</math>. For instance, suppose each <math>\displaystyle j</math> corresponds to a Gaussian distribution (it could be any other distribution as well), all of the same form but with different parameters. From each Gaussian triggered by a class, we sample some data points. In the end, the data set as a whole is not strictly Gaussian but a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
We multiply each term of the summation by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then it becomes<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})}{Pr(x)} \sum_{j} Pr(x | j)*Pr(j | y_{k})*\frac {Pr(j)}{Pr(j)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, this is just a sum over <math>\displaystyle j</math> of the product of two posterior probabilities.<br />
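This identity relies on the generative chain <math>\displaystyle y \rightarrow j \rightarrow x</math>, so that <math>\displaystyle x</math> is independent of <math>\displaystyle y_k</math> given <math>\displaystyle j</math>. It can be checked numerically with small, randomly generated discrete probability tables (the table sizes here are arbitrary):<br />

```python
import numpy as np

# Numeric check of Pr(y_k|x) = sum_j Pr(j|x) Pr(y_k|j) under the chain
# y -> j -> x, where x is conditionally independent of y given j.
rng = np.random.default_rng(7)
nY, nJ, nX = 2, 3, 4

p_y  = np.array([0.4, 0.6])                      # Pr(y)
p_jy = rng.dirichlet(np.ones(nJ), size=nY)       # rows: Pr(j | y)
p_xj = rng.dirichlet(np.ones(nX), size=nJ)       # rows: Pr(x | j)

# Full joint Pr(y, j, x) factorizes along the chain.
joint = p_y[:, None, None] * p_jy[:, :, None] * p_xj[None, :, :]

p_x  = joint.sum(axis=(0, 1))
p_yx = joint.sum(axis=1) / p_x                   # Pr(y | x), computed directly
p_jx = joint.sum(axis=0) / p_x                   # Pr(j | x)
p_yj = (p_y[:, None] * p_jy) / (p_y[:, None] * p_jy).sum(axis=0)  # Pr(y | j)

rhs = p_yj @ p_jx                                # sum_j Pr(y|j) Pr(j|x)
print(np.allclose(p_yx, rhs))  # True
```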
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results derived above to our RBF network. In an RBF network, as shown on the right, we have a set of inputs, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, the hidden basis functions <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and outputs <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>, with weights from the hidden layer to the output layer. The output is just a linear combination of the <math>\displaystyle \phi</math>’s. <br />
<br />
Now identify the probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> with <math>\displaystyle \phi_{j}(x)</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> with the weight <math>\displaystyle w_{jk}</math>; then the posterior can be written as<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in the one-dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It is as if we place a Gaussian over the data, and for each Gaussian we consider its center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
We can see the graph on the left, which plots the densities of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance: as a point gets far from the center <math>\displaystyle \mu_{1}</math>, <math>\displaystyle \phi_{1}</math> falls to nearly zero. Remember that we can usually find a non-linear regression or classification of the input space by doing a linear one in some extended space or feature space (more details in the Aside). Here, the <math>\displaystyle \phi</math>’s actually produce that feature space. <br />
<br />
So, one way to look at this is that <math>\displaystyle \phi</math> tells us, given an input, how likely a particular feature is to be present. Say, for example, we define the features as the centers of these Gaussian distributions. Then this <math>\displaystyle \phi</math> function computes, for a given data point, the possibility of that feature appearing. If the data point is right at the center, the value of <math>\displaystyle \phi</math> is one, i.e. the probability is 1. If the point is far from the center, the probability (the <math>\displaystyle \phi</math> value) is close to zero, i.e. it is less likely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given the data. <br />
<br />
When we have those features, <math>\displaystyle y</math> is a linear combination of them. Hence, each weight <math>\displaystyle w_{jk}</math>, which equals <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y_k</math> is given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> is the probability of class membership given a feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>. Each data point has d features, and the n data points form the columns of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we do not perform least squares on this <math>\displaystyle X</math> matrix; we perform least squares on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point with its center. Hence, the new set of features are actually representing M centers in our data set, and for each data point, its new features check how this point is similar to the first center; how it is similar to the second center; and how it is similar to the <math>\displaystyle M^{th}</math> center. And this checking process will apply to all data points. Therefore, feature space gives another representation of our data set. <br />
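The construction of <math>\displaystyle \Phi^{T}</math> can be sketched as follows; the data points, centers and width below are made up for illustration:<br />

```python
import numpy as np

# Gaussian radial basis function: similarity of x to the center mu.
def phi(x, mu, sigma):
    return np.exp(-np.linalg.norm(x - mu) ** 2 / (2 * sigma ** 2))

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # n = 4 points, d = 1 feature
centers = np.array([[0.5], [2.5]])           # M = 2 made-up centers
sigma = 1.0

# Phi^T is M x n: column i re-describes point i by its similarity
# to each of the M centers.
PhiT = np.array([[phi(x, mu, sigma) for x in X] for mu in centers])
print(PhiT.shape)                            # (2, 4)
```

Each entry lies in (0, 1], equalling 1 only when the point sits exactly on a center.<br />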
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit for the training data, we often want to increase the complexity of our RBF Network. By its construction, the only way to change the complexity of an RBF Network is to add or remove basis functions. A large number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF Network can fit any training set; however, this does not mean the model will generalize well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For the model selection, what we usually do is estimate the training error. After working through the training error, we’ll see that the training error in fact can be decomposed, and one component of training error is called Mean Squared Error (MSE). In the later notes, we will find that our final goal is to get a good estimate of MSE. Moreover, in order to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn’t always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>~N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model fits only the training data, and is useless for prediction when new data points are introduced.<br />
<br />
> n<-20;<br />
> x<-seq(1,10,length=n);<br />
> alpha<-2.5;<br />
> beta<-1.75;<br />
> y<-alpha+beta*x+rnorm(n);<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');<br />
> lines(spline(x, y), col = 2);<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on training data, although our goal really is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
But training error and testing error do not have a simple relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error can increase dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, the model does not represent the features of the true underlying data source, and this is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error rate will be less than the testing error on new data. A model adapts to the training data, and hence the training error is an overly optimistic estimate of the testing error. An obvious way to estimate the testing error well is to add a penalty term to the training error to compensate. SURE is developed from this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the sum of squared errors over the <math>\displaystyle N</math> training samples.<br />
*<math>\displaystyle Err=\sum_{i=1}^m (\hat y_i-y_i)^2 </math> denote the ''test error'', the sum of squared errors over an independent test sample of size <math>\displaystyle m</math>.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}_{i=1}^N</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>, then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, where <math>\displaystyle y_i</math> and <math>\hat f_i</math> both have the same mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and if <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E(g'(Z))</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>:<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model, not the function of the observations <math>\displaystyle y_i</math>, then <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
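Stein's lemma is easy to check by Monte Carlo; here is a minimal sketch using the arbitrary choice <math>\displaystyle g(z)=z^2</math> (so <math>\displaystyle g'(z)=2z</math>), for which both sides equal <math>\displaystyle 2\mu\sigma^2</math>:<br />

```python
import numpy as np

# Monte Carlo check of Stein's lemma: E[g(Z)(Z - mu)] = sigma^2 E[g'(Z)],
# with g(z) = z^2, mu = 1, sigma = 2; both sides should be near 2*mu*sigma^2 = 8.
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
Z = rng.normal(mu, sigma, size=1_000_000)

lhs = np.mean(Z ** 2 * (Z - mu))          # E[g(Z)(Z - mu)]
rhs = sigma ** 2 * np.mean(2 * Z)         # sigma^2 E[g'(Z)]
print(lhs, rhs)                           # both close to 8
```

The agreement up to Monte Carlo noise illustrates why the cross term <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be traded for a derivative term.<br />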
<br />
<br />
<br />
====Two Different Cases====<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point has been introduced to the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent. Therefore <math>\displaystyle cov(y_i,\hat f)=0</math> (equivalently, consider <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, it has no influence on <math>\hat f</math>, because <math>\hat f</math> is estimated from the training data alone, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>), and <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Based on the notation defined above, we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross validation, to avoid overfitting or underfitting, the validation data set is kept independent of the estimated model.<br />
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing a model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can, however, be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator, in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the optimum number of basis functions should be chosen so as to minimize the generalization error. For the Radial Basis Function Network, by setting <math>\frac{\partial err}{\partial W}</math> equal to zero, we get the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, we can tell that <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Thus, we obtain the further simplification <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=M</math>, since <math>\displaystyle \Phi</math> has one column per basis function: it is a projection of the input matrix <math>\,X</math> onto the space spanned by the <math>\,M</math> basis functions. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
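The whole SURE computation for a linear smoother can be sketched as follows; the polynomial basis standing in for <math>\displaystyle \Phi</math>, and the data and noise level, are made up for the demo:<br />

```python
import numpy as np

# SURE for a linear smoother y_hat = H y, with a made-up polynomial design
# matrix playing the role of Phi (M basis functions plus an intercept).
rng = np.random.default_rng(1)
N, M = 50, 4
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, N)

Phi = np.vander(x, M + 1)                        # columns x^M, ..., x^0
H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)    # hat matrix
y_hat = H @ y

err = np.sum((y_hat - y) ** 2)                   # training error
sigma2 = 0.1 ** 2                                # noise variance, known here
sure = err - N * sigma2 + 2 * sigma2 * np.trace(H)
print(np.trace(H))                               # ~ M + 1 = 5, the effective
                                                 # number of parameters
```

The trace of the hat matrix recovers <math>\displaystyle M+1</math> exactly, which is what makes the final formula above so convenient.<br />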
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimum number of basis functions by choosing the model with the smallest MSE over the set of models considered. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\displaystyle M</math>, each with training error <math>\displaystyle err(M)</math>: <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples and the noise variance, <math>\sigma^2</math>, can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Please see Figure 27.1 on the right, which shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math>.<br />
<br />
sure_Err = error - num_data_point * sigma .^ 2 + 2 * sigma .^2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated using the error.<br />
<br />
error = sum((output - expected_output) .^ 2);    % training error (sum of squares)<br />
sigma = sqrt(error / (num_data_point - 1));      % estimate of the noise std dev<br />
sure_Err = error - num_data_point * sigma .^ 2 + 2 * sigma .^2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) = <math>\sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) = <math>\sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units plus the intercept) = <math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then the term <math>\displaystyle N\sigma^2</math> is a constant with no impact on the minimization,<br />
and we can ignore it. We only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and it changes when <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math>.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
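The algebra from (28.1) to (28.3) can be checked numerically for arbitrary (made-up) values of <math>\displaystyle err</math>, <math>\displaystyle N</math> and <math>\displaystyle M</math>:<br />

```python
# Check that err*(1 - N/(N-1) + 2(M+1)/(N-1)) simplifies to err*(2M+1)/(N-1).
err, N, M = 3.7, 100, 6
lhs = err * (1 - N / (N - 1) + 2 * (M + 1) / (N - 1))
rhs = err * (2 * M + 1) / (N - 1)
print(abs(lhs - rhs) < 1e-12)   # True
```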
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error decreases and the MSE increases as the number of hidden units grows (i.e. as the model becomes more complex).<br />
<br />
<br />
When the number of hidden units gets larger and larger, the training error will decrease until it approaches <math>\displaystyle 0 </math>. If the training error approaches <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) the MSE would appear to approach <math>\displaystyle 0 </math> as well. In fact, this does not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the MSE increases instead of approaching <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
We can see that <math>\displaystyle \sigma^2 </math> is estimated as an average of <math>\displaystyle err </math>. To deal with this problem, we can take the average of <math>\displaystyle err</math> separately for each number of hidden units; for example, we can first take 1 hidden unit, and then 10 hidden units.<br />
<br />
We can also see that, unlike the classical Cross Validation (CV) or Leave One Out (LOO) techniques, the SURE technique does not need a validation step to find the optimal model. Hence, SURE uses less data than CV or LOO, and is suitable when there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering K-means clustering] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units (the basis functions <math>\displaystyle \phi_j </math>) is the same as the number of clusters.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, with the same form for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. For each existing center, re-identify its cluster (every point in this cluster should be closer to this center than to any other center).<br />
<br />
2. Compute the mean of each cluster and make it the new center for that cluster.<br />
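The two alternating steps can be sketched as a minimal K-means implementation (this is an illustration, not the built-in kmeans routine used in the example below; the toy data and the deterministic initialization, one seed point per group, are made up):<br />

```python
import numpy as np

def kmeans(X, k, init_idx, iters=100):
    """Minimal Lloyd's algorithm; init_idx picks the initial centers."""
    centers = X[list(init_idx)]
    for _ in range(iters):
        # Step 1: assign every point to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Step 2: move each center to the mean of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated toy groups; one initial center seeded in each group.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])
labels, centers = kmeans(X, 2, init_idx=[0, 39])
print(np.bincount(labels))   # 20 points in each cluster
```

In practice the initial centers are chosen at random from the data, as the notes describe; the fixed indices here just keep the demo deterministic.<br />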
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden units)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
>> 30 1<br />
>> size(C) <br />
>> 2 80<br />
>> size(sumD) <br />
>> 2 1<br />
>> c1=sum(IDX==1)<br />
>> 14<br />
>> c2=sum(IDX==2)<br />
>> 16<br />
>> sumD<br />
>> 85.6643<br />
>> 101.0419<br />
>> v1=sumD(1,1)/c1 <br />
>> 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
>> 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points in 80 dimensions (MATLAB's kmeans treats each row as one observation), and then apply the kmeans method to separate X into 2 clusters. IDX is a vector containing 1 or 2, indicating the 2 clusters, and its size is 30*1. <math>\displaystyle C </math> holds the center (mean) of each cluster, with size 2*80; sumD is the sum of squared distances between the data points and the center of their cluster. The <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> indicate the number of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the variance of the first cluster <math>\displaystyle (v1=\sigma_1^2)</math>; <math>\displaystyle v2 </math> is the variance of the second cluster <math>\displaystyle (v2=\sigma_2^2)</math>. Now we can get <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> by the following equations. Finally, we will get the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
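The least squares step can be sketched as follows; the toy data, centers and width are made up, and the final check confirms that the three expressions for <math>\hat Y</math> agree:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                 # toy inputs: 30 points, 2 features
Y = rng.normal(size=(30, 1))                 # toy responses
mu = np.array([[-1.0, 0.0], [1.0, 0.0]])     # M = 2 made-up centers
sigma = 1.0

# Phi is n x M: similarity of each point to each center.
Phi = np.exp(-((X[:, None] - mu[None]) ** 2).sum(-1) / (2 * sigma ** 2))
W = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)      # W = (Phi^T Phi)^{-1} Phi^T Y
H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)    # hat matrix H
print(np.allclose(H @ Y, Phi @ W))               # True: Y_hat = Phi W = H Y
```

Note also that the trace of <math>\displaystyle H </math> equals the number of basis functions (here 2), consistent with the SURE derivation earlier.<br />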
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is EM algorithm with respect to Gaussian mixture model. Generally speaking, the Gaussian mixture model is referred to as a soft clustering while <math>K</math>-means is hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
In the E-step, each point is assigned a weight (responsibility) for each cluster based on the likelihood under the corresponding Gaussian. Unlike the hard assignments of <math>K</math>-means, where a point receives 1 for its closest cluster and 0 for the others, these weights are values between 0 and 1. <br />
<br />
In the M-step, compute the weighted means and covariances and make them the new means and covariances for every cluster.<br />
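The E- and M-steps can be sketched for a one-dimensional two-component mixture; the toy data and initial parameters below are made up for illustration:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])

mu = np.array([-1.0, 7.0])       # made-up initial means
var = np.array([1.0, 1.0])       # initial variances
w = np.array([0.5, 0.5])         # initial mixing weights

for _ in range(50):
    # E-step: soft responsibilities proportional to w_j * N(x | mu_j, var_j).
    dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted means, variances and mixing weights.
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    w = nk / len(x)

print(np.sort(mu))   # close to the true means 0 and 6
```

This is the soft counterpart of the two K-means steps above: responsibilities replace the hard assignments, and weighted statistics replace the plain cluster means.<br />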
<br />
>>[P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier, which constructs linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized into what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separable by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
In Figure 28.2 the data points belong to two classes in <math>\displaystyle \mathbb{R}^{2} </math> and can be separated by a linear boundary. If a dataset is indeed linearly separable, then there exist infinitely many possible separating hyperplanes, the black lines in the figure being two of them for the training data. The question is which solution is best when new data are introduced. <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers such as (28.4) that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: if a data point lies on the <math>\displaystyle +1 </math> side, the signed distance <math>\displaystyle d_i</math> is positive; if it lies on the <math>\displaystyle -1 </math> side, <math>\displaystyle d_i</math> is negative. Hence <math>\displaystyle y_id_i>0</math> for correctly classified points.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>. We can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers====
Choose the line farthest from both classes; that is, the line whose distance to the closest point is maximal (i.e., maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since the points are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Take two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle (\beta^{T}x_1+\beta_0)-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence,<math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and<math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
<br />
<br />
3. The signed distance of any point <math>\displaystyle x </math> to the hyperplane is found by projecting <math>\displaystyle x-x_0 </math> onto the unit normal <math>\displaystyle \frac{\beta}{\|\beta\|} </math> (since the length of <math>\displaystyle \beta </math> is arbitrary, we normalize it to a unit vector):<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
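The two forms of the signed distance can be verified numerically. This Python sketch uses a made-up hyperplane <math>\displaystyle 3x_1+4x_2-5=0 </math> (so <math>\displaystyle \|\beta\|=5 </math>) and an arbitrary test point:

```python
import math

# Made-up hyperplane in R^2: 3*x1 + 4*x2 - 5 = 0, so ||beta|| = 5.
beta = [3.0, 4.0]
beta_0 = -5.0
norm = math.sqrt(beta[0]**2 + beta[1]**2)

x0 = [1.0, 0.5]        # a point on the hyperplane: 3*1 + 4*0.5 - 5 = 0
xi = [3.0, 4.0]        # an arbitrary point off the hyperplane

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

# Distance via projection onto the unit normal (first form above)
d_proj = dot(beta, [xi[k] - x0[k] for k in range(2)]) / norm

# Distance via the simplified form, using beta^T x0 = -beta_0 (property 2)
d_formula = (dot(beta, xi) + beta_0) / norm

print(d_proj, d_formula)   # both equal 4.0
```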
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
<math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane. Since every training point is correctly classified,<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
Only the direction of <math>\displaystyle \beta </math> matters, and rescaling to <math>\displaystyle \frac{\beta}{c} </math> does not change that direction, so the hyperplane stays the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math>, attained by the closest points,<br />

so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
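The canonical rescaling and the resulting <math>\displaystyle \frac{1}{\|\beta\|} </math> margin can be illustrated on a toy example (all values below are made up: one point per class, separated by the hyperplane <math>\displaystyle x_2=0 </math>):

```python
import math

# Toy data: one point from each class, with a made-up separating hyperplane.
X = [(0.0, 2.0), (0.0, -2.0)]
y = [1, -1]
beta, beta_0 = (0.0, 1.0), 0.0          # hyperplane x2 = 0

def functional_margin(beta, beta_0):
    return min(yi * (beta[0]*xi[0] + beta[1]*xi[1] + beta_0)
               for xi, yi in zip(X, y))

def geometric_margin(beta, beta_0):
    return functional_margin(beta, beta_0) / math.hypot(*beta)

c = functional_margin(beta, beta_0)      # c = 2 for this data
beta_c = (beta[0] / c, beta[1] / c)      # canonical rescaling
beta_0c = beta_0 / c

print(functional_margin(beta_c, beta_0c))   # 1.0, by construction
print(geometric_margin(beta, beta_0))       # 2.0, unchanged by rescaling
print(1 / math.hypot(*beta_c))              # 2.0 = 1/||beta|| after rescaling
```

Rescaling <math>\displaystyle (\beta,\beta_0) </math> by <math>\displaystyle c </math> leaves the geometric margin unchanged, and after the rescaling the margin is exactly <math>\displaystyle \frac{1}{\|\beta\|} </math>.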
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie, T., Tibshirani, R., Friedman, J. (2008). The Elements of Statistical Learning, pp. 129-130.<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
The Support Vector Machine is used to find a maximum margin hyperplane, assuming the two classes are separable. The margin can be written as <math>\,min\{y_id_i\}</math>, the smallest signed distance of any training point from the hyperplane, where <math>\,d_i</math> is the distance and <math>\,y_i</math> supplies the sign.<br />
===Maximizing the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, but <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane (and is correctly classified).<br />
<br />
This implies <math>\,\exists C > 0</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
The pair <math>\,\beta, \beta_0</math> defines the hyperplane only up to scale - we care only about the direction, and dividing through by a constant does not change the hyperplane. Thus, by rescaling <math>\,\beta, \beta_0</math> we can eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>, making <math>\displaystyle 1</math> the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math>.<br />
<br />
Now, since the margin equals <math>\,\frac{1}{\|\beta\|}</math>, maximizing the margin is equivalent to minimizing <math>\,\|\beta\|</math>. <br />

In other words, find the minimum <math>\,\|\beta\|</math>, s.t. <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. The 1-norm of a vector is the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance); it can be preferable in some settings, but it has a discontinuity in its derivative. The 2-norm, the Euclidean norm (the intuitive length of the vector), is easier to work with - that is <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>.<br />
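As a quick numerical illustration of the two norms and of the objective <math>\,\frac{1}{2}\beta^T\beta</math> (the vector below is arbitrary):

```python
import math

# Arbitrary example vector.
beta = [3.0, -4.0]

one_norm = sum(abs(b) for b in beta)            # taxicab: |3| + |-4| = 7
two_norm = math.sqrt(sum(b * b for b in beta))  # Euclidean: sqrt(9 + 16) = 5
objective = 0.5 * sum(b * b for b in beta)      # (1/2) * beta' * beta = 12.5

print(one_norm, two_norm, objective)            # 7.0 5.0 12.5
```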
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to enforce the constraints while finding an optimal solution; the <math>\,\alpha_i</math> are introduced as Lagrange multipliers (dual variables). A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left(y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right)}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is a dual representation of the maximum-margin problem.<br />
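As a sanity check, the dual expression can be compared numerically against the original Lagrangian: if <math>\,\beta = \sum_i{\alpha_iy_ix_i}</math> and <math>\,\sum_i{\alpha_iy_i} = 0</math>, the two agree for any <math>\,\beta_0</math>. The data and multipliers in this Python sketch are made up:

```python
import random

random.seed(0)
n, d = 4, 2
X = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
y = [1, 1, -1, -1]
alpha = [0.5, 0.3, 0.6, 0.2]      # chosen so that sum(alpha_i * y_i) = 0
beta_0 = 0.7                      # arbitrary: it drops out when sum(alpha_i y_i) = 0

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

# beta from the stationarity condition
beta = [sum(alpha[i] * y[i] * X[i][k] for i in range(n)) for k in range(d)]

# Primal Lagrangian L(beta, beta_0, alpha)
primal = 0.5 * dot(beta, beta) - sum(
    alpha[i] * (y[i] * (dot(beta, X[i]) + beta_0) - 1) for i in range(n))

# Dual form L(alpha)
dual = sum(alpha) - 0.5 * sum(
    alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
    for i in range(n) for j in range(n))

print(abs(primal - dual) < 1e-9)   # True: the two forms coincide
```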
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \,\forall\, i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the dual objective to optimize, along with two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are solving for <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> is <math>\,n \times n</math> with entries <math>\,S_{ij} = y_iy_jx_i^Tx_j</math>; writing <math>\,z_i = y_ix_i</math>, we have <math>\,S_{ij} = z_i^Tz_j</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's built-in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use quadprog to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's quadprog function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. Since <code>quadprog</code> minimizes rather than maximizes, we minimize the negative of <math>\,L(\alpha)</math>, namely <math>\,\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{1}^T\underline{\alpha}</math>; hence <code>H = S</code> and <code>f</code> is a vector of <math>\,-1</math>s.<br />
<br />
We'll use a simple one-dimensional data set, which is essentially x = -1 or +1 plus Gaussian noise, with matching labels. (Note: you could easily put the values straight into the quadprog call; they are separated for clarity)<br />
<br />
x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]';<br />
y = [-ones(100,1); ones(100,1)];<br />
z = x' .* y;        % z(i) = y_i * x_i<br />
S = z * z';         % S(i,j) = y_i * y_j * x_i' * x_j<br />
f = -ones(200,1);   % minimizing (1/2)*a'*S*a + f'*a = (1/2)*a'*S*a - sum(a)<br />
A = [];             % alpha >= 0 is enforced through lb instead<br />
b = [];<br />
Aeq = y';<br />
beq = 0;<br />
lb = zeros(200,1);<br />
ub = []; % There is no upper bound<br />
alpha = quadprog(S,f,A,b,Aeq,beq,lb,ub);<br />
<br />
This gives us the optimal <math>\,\alpha</math>. Most entries of <math>\,\alpha</math> should come back (numerically) zero; the strictly positive entries correspond to the support vectors, discussed below.<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{i}</math>. If <math>\,f</math> and <math>\,g_i</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial \hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum_i{\alpha_ig_i'(\hat{x})}=0</math><br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> cannot be an optimal solution.<br />
<br />
These are all straightforward except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that actually determine the optimal separating hyperplane which we are looking for. They are also the points closest to the boundary - the most difficult to classify, and hence the most informative for classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br />In order for this condition to be satisfied, either <br /><math>\,\alpha_i= 0</math> or <br /><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
Every training point <math>x_i</math> satisfies <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1</math>, with equality exactly for points on the margin.<br />
<br />
'''Case 1: a point <math>\displaystyle x_i</math> strictly outside the margin, i.e. <math>\,y_i(\beta^Tx_i+\beta_0) > 1</math>'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point <math>\displaystyle x_i</math> on the margin, i.e. <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math>'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
<br />If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
<br />
Points on the margin, points with corresponding <math>\,\alpha_i > 0</math>, are called '''''support vectors'''''.<br />
<br />
===Using support vectors===<br />
<br />
Support vectors are important because the solution depends only on them: the support vector machine is insensitive to points far from the boundary. If <math>\,\alpha_i = 0</math>, the corresponding point contributes nothing to <math>\,\beta</math> and hence nothing to the solution of the SVM problem; only points on the margin -- support vectors -- contribute.<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
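The three steps can be sketched end-to-end in Python on made-up separable data. Since no QP library is assumed here, the dual is solved with a crude projected-gradient ascent rather than a proper quadratic programming routine such as quadprog; this is an illustration under those assumptions, not the method used in the lecture:

```python
# A made-up, linearly separable 1-D data set (cf. the quadprog example above).
X = [[-1.0], [-0.8], [0.8], [1.0]]
y = [-1, -1, 1, 1]
n, d = len(X), len(X[0])

dot = lambda u, v: sum(a * b for a, b in zip(u, v))
S = [[y[i] * y[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]

# Step 1: maximize L(alpha) = sum(alpha) - (1/2) alpha' S alpha
# subject to alpha >= 0 and sum(alpha_i y_i) = 0, here via a crude
# projected-gradient ascent (a stand-in for a real QP solver).
alpha = [0.0] * n
eta = 0.02
for _ in range(20000):
    grad = [1.0 - sum(S[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    alpha = [alpha[i] + eta * grad[i] for i in range(n)]
    shift = sum(alpha[i] * y[i] for i in range(n)) / n  # project onto sum(a*y)=0
    alpha = [max(0.0, alpha[i] - shift * y[i]) for i in range(n)]

# Step 2: beta from the stationarity condition beta = sum(alpha_i y_i x_i)
beta = [sum(alpha[i] * y[i] * X[i][k] for i in range(n)) for k in range(d)]

# Step 3: beta_0 from a support vector (largest alpha): y_i(beta'x_i+beta_0)=1
sv = max(range(n), key=lambda i: alpha[i])
beta_0 = y[sv] - dot(beta, X[sv])

print([1 if dot(beta, xi) + beta_0 > 0 else -1 for xi in X])   # [-1, -1, 1, 1]
```

For this symmetric data set the support vectors are the points at <math>\,x=\pm 0.8</math>, giving <math>\,\beta=1.25</math> and <math>\,\beta_0=0</math>.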
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab's built-in SVM routines (found in the Bioinformatics Toolbox) to do classification through support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]</div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=5451stat8412009-11-21T19:46:53Z<p>Ipargaru: /* Support Vectors */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to learn from data. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing, and so on. <br />
<br />
'''Definition''': The problem of predicting a discrete random variable <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math> is called '''classification'''.<br />
<br />
In classification, we attempt to approximate a function <math>\,h</math> using a training data set, so that <math>\,h</math> can accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a '''classification rule''' <math>\,h</math> such that<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, drawn independently from identical distributions (i.i.d.), <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>, <math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, color, diameter, and weight.<br />Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> gives the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the fraction of training points that <math>\,h</math> does not correctly classify, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
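The empirical error rate is simply the fraction of misclassified training points. A small Python sketch with made-up labels and hypothetical classifier outputs:

```python
# Made-up labels y_true and hypothetical classifier outputs y_pred = h(X_i).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]

n = len(y_true)
# Average of the indicator I(h(X_i) != Y_i) over the training set.
empirical_error = sum(1 for yt, yp in zip(y_true, y_pred) if yt != yp) / n
print(empirical_error)   # 2 misclassifications out of 5 -> 0.4
```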
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximal over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case in which <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and let <math>\,r(x)=P\{Y=1|X=x\}</math>. Given <math>\,X=x</math>, by ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<sub></sub><br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> (\hat r) </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimensional cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of its being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & if P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remark:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since in practice the prior probability cannot usually be specified.<br />
<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0; that is, we predict that he will fail the course.<br />
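The computation above can be reproduced in a few lines of Python. The likelihood values below are the ones implied by the arithmetic shown (<math>\,P(X=(0,1,0)|Y=1)=0.05</math> and <math>\,P(X=(0,1,0)|Y=0)=0.2</math>, recovered from <math>\,0.025/0.125</math>; the actual values come from the table in the figure):

```python
# Values implied by the computation above: P(Y=1) = P(Y=0) = 0.5, and the
# likelihoods recovered from 0.025/0.125 (assumed from the table in the figure).
prior_pass, prior_fail = 0.5, 0.5
lik_pass, lik_fail = 0.05, 0.2   # P(X=(0,1,0)|Y=1), P(X=(0,1,0)|Y=0)

posterior_pass = (lik_pass * prior_pass) / (
    lik_pass * prior_pass + lik_fail * prior_fail)

print(posterior_pass)                      # ~0.2, below 1/2
print(1 if posterior_pass > 0.5 else 0)    # Bayes rule assigns class 0 (fail)
```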
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN), and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first treats probability as something updated in light of observation, while the second treats probability as having objective existence. They actually represent two different schools of statistics.<br />
<br />
Throughout the history of statistics, there have been two major schools: Bayesian and frequentist. They represent two different ways of thinking and define probability differently. The following are the main differences between the Bayesian and frequentist views.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample(there is a frequency).<br />
#Parameters are fixed and unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a unique event that cannot be tied to a frequency over many repeated samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a <math>\,50\%</math> chance of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In Bayesian method, at first, one can see this man (object), and then judge whether his name is Jack (label). On the other hand, in Frequentist method, one doesn’t see the man (object), but can see the photos (label) of this man to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier is optimal, it cannot be used in most practical situations, since the prior probability (and the class conditional densities) are usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2 Regression: Find an estimate <math> (\hat r) </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density estimation, estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in dimensions greater than 2.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of this third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the class covariances <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(G=k|X=x)=Pr(G=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>p</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
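The boundary just derived can be sketched numerically. The following Python/NumPy snippet is an illustration only (the means, covariance, and priors are invented): it builds the linear rule <math>\,a^\top x+b=0</math> with <math>\,a=\Sigma^{-1}(\mu_k-\mu_l)</math> and <math>\,b=\log(\pi_k/\pi_l)-\frac{1}{2}(\mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l)</math>.<br />

```python
import numpy as np

# Sketch of the LDA boundary derived above (all numbers are made up).
def lda_boundary(mu_k, mu_l, Sigma, pi_k, pi_l):
    Sinv = np.linalg.inv(Sigma)
    a = Sinv @ (mu_k - mu_l)                    # linear coefficient
    b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)
    return a, b

mu_k = np.array([0.0, 0.0])
mu_l = np.array([4.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
a, b = lda_boundary(mu_k, mu_l, Sigma, 0.5, 0.5)

# With equal priors, the midpoint between the two means lies on the boundary:
midpoint = (mu_k + mu_l) / 2
print(a @ midpoint + b)  # ~ 0
```

This confirms the special case noted above: with <math>\,\pi_k=\pi_l</math> the boundary passes through the midpoint of the means.<br />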
<br />
===QDA===<br />
The concept is the same: find a boundary where the classification error rates between classes are equal, except that the assumption that each cluster has the same covariance <math>\,\Sigma</math> (equal to the mean of <math>\Sigma_k \forall k</math>) is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because the <math>x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x</math> term does not cancel when <math>\Sigma_k \neq \Sigma_l</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
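The theorem's rule can be sketched directly. A minimal Python/NumPy illustration of <math>\,h(x) = \arg\max_{k} \delta_k(x)</math> with the quadratic <math>\,\delta_k</math> (the parameters below are invented, not from any real data set):<br />

```python
import numpy as np

# delta_k from the theorem: -0.5*log|Sigma_k| - 0.5*(x-mu_k)' Sigma_k^{-1} (x-mu_k) + log(pi_k)
def delta(x, mu, Sigma, pi):
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * (x - mu) @ Sinv @ (x - mu)
            + np.log(pi))

# h(x) = argmax_k delta_k(x)
def h(x, mus, Sigmas, pis):
    return int(np.argmax([delta(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]))

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2 * np.eye(2)]
pis = [0.5, 0.5]
print(h(np.array([0.2, -0.1]), mus, Sigmas, pis))  # -> 0
print(h(np.array([2.9, 3.2]), mus, Sigmas, pis))   # -> 1
```

Dropping the <math>\,\log|\Sigma_k|</math> term and expanding gives the linear <math>\,\delta_k</math> when all classes share one <math>\,\Sigma</math>.<br />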
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
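These estimates can be computed directly from labeled data. A minimal Python/NumPy sketch (the toy data below are invented) of <math>\,\hat{\pi_k}</math>, <math>\,\hat{\mu_k}</math>, <math>\,\hat{\Sigma_k}</math>, and the pooled covariance:<br />

```python
import numpy as np

# Toy labeled data: three points in class 0, four in class 1 (illustrative only).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0], [5.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1, 1])

n = len(y)
pis, mus, Sigmas, ns = [], [], [], []
for k in [0, 1]:
    Xk = X[y == k]
    ns.append(len(Xk))
    pis.append(len(Xk) / n)                            # pi_hat_k = n_k / n
    mu = Xk.mean(axis=0)                               # mu_hat_k
    mus.append(mu)
    Sigmas.append((Xk - mu).T @ (Xk - mu) / len(Xk))   # ML estimate: divide by n_k

# Pooled (common) covariance: weighted average of the class covariances
Sigma = sum(nk * Sk for nk, Sk in zip(ns, Sigmas)) / sum(ns)
print(pis, Sigma.shape)
```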
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
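A minimal Python/NumPy sketch of this special case (the means and test points are invented): with <math>\, \Sigma_k = I </math> the rule reduces to picking the nearest class mean, adjusted by the log prior.<br />

```python
import numpy as np

# Case 1 sketch: Sigma_k = I, so delta_k = -0.5*||x - mu_k||^2 + log(pi_k).
def classify_identity_cov(x, mus, priors):
    deltas = [-0.5 * np.sum((x - mu) ** 2) + np.log(p)
              for mu, p in zip(mus, priors)]
    return int(np.argmax(deltas))

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
print(classify_identity_cov(np.array([0.5, 0.2]), mus, [0.5, 0.5]))  # -> 0
print(classify_identity_cov(np.array([2.5, 3.1]), mus, [0.5, 0.5]))  # -> 1
```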
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
<br />
Note that when we have multiple classes, they must all use the same transformation; otherwise, we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape (covariance) for this method to be applicable, so this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose we have two classes with different shapes and we transform each to a spherical shape. To classify a new data point we must first choose which transformation to apply, but choosing the transformation of class A already assumes the point belongs to class A.<br />
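A minimal Python/NumPy sketch of the transformation <math> \, x^* = S^{-\frac{1}{2}}U^\top x </math> (the covariance below is invented): after applying it, the covariance of the data becomes the identity, reducing Case 2 to Case 1.<br />

```python
import numpy as np

# Whitening sketch: decompose Sigma = U diag(S) U^T (symmetric, so SVD and
# eigendecomposition agree), then map x -> x* = S^{-1/2} U^T x.
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
S, U = np.linalg.eigh(Sigma)
W = np.diag(S ** -0.5) @ U.T        # the whitening map

# The transformed covariance W Sigma W^T is the identity:
print(W @ Sigma @ W.T)
```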
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more parameters we have to estimate, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class against the remaining <math>\,K-1</math> classes, giving <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
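These counts can be written as simple functions of <math>K</math> and <math>d</math> (an illustrative sketch; the example values are arbitrary):<br />

```python
# Parameter counts from the text above.
def lda_params(K, d):
    return (K - 1) * (d + 1)

def qda_params(K, d):
    return (K - 1) * (d * (d + 3) // 2 + 1)

# QDA's count grows quadratically in d, LDA's only linearly:
print(lda_params(3, 10), qda_params(3, 10))  # 22 132
```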
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform principal component analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently; the Matlab help file on <code>princomp</code> gives the details. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
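The same equivalence between the SVD route and the eigendecomposition of the covariance can be sketched in Python/NumPy (synthetic data; this is an illustration, not part of the assignment):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))              # rows = observations, like princomp
Xc = X - X.mean(axis=0)                   # center by subtracting column means

# SVD route (what princomp does internally): Xc = U d V^T, scores = Xc V
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = Xc @ Vt.T

# Eigen route: eigenvectors of the sample covariance span the same directions
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]           # eigh sorts ascending; reverse it
scores_eig = Xc @ evecs[:, order]

# The score columns agree up to sign:
print(np.allclose(np.abs(scores_svd), np.abs(scores_eig)))
```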
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimated parameters make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>\,v</math> is a diagonal matrix, that we cannot estimate with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
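The augmentation itself is a one-liner. A minimal Python/NumPy sketch (the data matrix is invented) of forming <math>x^{*}</math> by appending squared features, so that a linear rule in the new space is quadratic in the original space:<br />

```python
import numpy as np

# Append squared features: x* = [x1, x2, x1^2, x2^2] for each row.
X = np.array([[1.0,  2.0],
              [3.0, -1.0]])
X_star = np.hstack([X, X ** 2])
print(X_star)
```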
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is for classification and FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the prior knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, with each class possibly having a different size. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math>and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices; assuming each class covariance is positive definite, <math>\,S_{W}</math> is positive definite and therefore invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{W}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So the vector <math>S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
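As a quick numerical check (not part of the original lecture), the following Python/NumPy sketch generates two synthetic Gaussian classes with made-up means and covariances, and verifies that the leading eigenvector of <math>S_{W}^{-1}S_{B}</math> is parallel to the closed-form direction <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic 2-D classes (means and covariances are illustrative values)
X1 = rng.multivariate_normal([1, 1], [[1, 1.5], [1.5, 3]], size=300)
X2 = rng.multivariate_normal([5, 3], [[1, 1.5], [1.5, 3]], size=300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
d = (mu1 - mu2).reshape(-1, 1)
Sb = d @ d.T                            # between-class covariance S_B
Sw = np.cov(X1.T) + np.cov(X2.T)        # within-class covariance S_W

# Route 1: top eigenvector of Sw^{-1} Sb
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
w_eig = vecs[:, np.argmax(vals.real)].real

# Route 2: closed form, w proportional to Sw^{-1} (mu1 - mu2)
w_closed = np.linalg.solve(Sw, mu1 - mu2)

# The two directions agree up to scale, so the cosine of their angle is 1
cos_sim = abs(w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
print(cos_sim)
```

Because <math>S_{B}</math> has rank one, <math>S_{W}^{-1}S_{B}</math> has a single nonzero eigenvalue, and its eigenvector is exactly the closed-form direction.<br />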
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the first principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathworks.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(With more than two classes, a single direction is no longer enough; we need up to <math>k-1</math> directions.)<br />
<br />
The within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> is the scatter matrix of class <math>i</math> and <math>\mathbf{\mu}_{i} = \frac{1}{n_{i}}\sum_{j: y_{j}=i}\mathbf{x}_{j}</math>. (Including a factor of <math>\frac{1}{n_{i}}</math> would turn each scatter matrix into a covariance matrix; the scaling does not change the optimal directions.)<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not as easy to obtain directly. One simplification is to note that the total covariance <math>\mathbf{S}_{T}</math> of the data is fixed; since <math>\mathbf{S}_{W}</math> is easy to compute, we can get <math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
where the cross terms vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i}) = \mathbf{0}</math> for each class <math>i</math>.<br />
<br />
The first term in this decomposition is the within class covariance <math>\mathbf{S}_{W}</math>; we define the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
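This decomposition can be checked numerically; below is a small Python/NumPy sketch (not part of the lecture; the class sizes and means are made-up values) that builds the unnormalized scatter matrices, matching the derivation above, and verifies <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(1)

# Three synthetic classes in 2-D (class means and sizes are made-up values)
classes = [rng.normal(loc=m, scale=1.0, size=(n, 2))
           for m, n in [((0, 0), 50), ((4, 1), 80), ((2, 5), 70)]]

X = np.vstack(classes)
mu = X.mean(axis=0)

# Total scatter S_T (unnormalized, as in the derivation)
St = sum(np.outer(x - mu, x - mu) for x in X)

# Within-class scatter S_W: sum of per-class scatter matrices
Sw = sum(sum(np.outer(x - c.mean(axis=0), x - c.mean(axis=0)) for x in c)
         for c in classes)

# Between-class scatter S_B: sum of n_i (mu_i - mu)(mu_i - mu)^T
Sb = sum(len(c) * np.outer(c.mean(axis=0) - mu, c.mean(axis=0) - mu)
         for c in classes)

print(np.allclose(St, Sw + Sb))
```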
<br />
Recall that in the two class problem we defined<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>. Substituting into the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \left(\frac{n_{1}n_{2}^{2}}{n^{2}}+\frac{n_{2}n_{1}^{2}}{n^{2}}\right)(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}\,\mathbf{S}_{B^{\ast}}<br />
\end{align}<br />
</math><br />
So the general <math>\mathbf{S}_{B}</math> is proportional to the two class <math>\mathbf{S}_{B^{\ast}}</math>, and both lead to the same optimal directions.<br />
<br />
Now we try to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
\quad i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the <math>k-1</math> largest<br />
eigenvalues in<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem, we impose the constraint column-wise, <math>\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} = \mathbf{I}</math> (which implies the trace constraint above), and introduce a Lagrange multiplier <math>\Lambda</math>, a <math>(k-1) \times (k-1)</math> diagonal matrix with one multiplier per column of <math>\mathbf{W}</math>:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - Tr\left[\Lambda\left( \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} - \mathbf{I} \right)\right]<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\Lambda<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \mathbf{S}_{W}\mathbf{W}\Lambda=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \mathbf{S}_{W}\mathbf{W}\Lambda<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \leq k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the <math>k-1</math> largest<br />
eigenvalues in<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
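The multi-class recipe can be sketched numerically; the following Python/NumPy snippet (added as an illustration, with made-up class means and sizes) builds the scatter matrices, takes the eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> for the <math>k-1</math> largest eigenvalues, and projects the data:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3  # number of classes

# Synthetic 4-D data, 60 points per class (illustrative values only)
means = [np.array([0., 0., 0., 0.]),
         np.array([5., 1., 0., 2.]),
         np.array([1., 4., 3., 0.])]
Xs = [rng.normal(m, 1.0, size=(60, 4)) for m in means]
X = np.vstack(Xs)
mu = X.mean(axis=0)

# Within- and between-class scatter matrices
Sw = sum(sum(np.outer(x - c.mean(axis=0), x - c.mean(axis=0)) for x in c)
         for c in Xs)
Sb = sum(len(c) * np.outer(c.mean(axis=0) - mu, c.mean(axis=0) - mu)
         for c in Xs)

# Columns of W: eigenvectors of Sw^{-1} Sb for the k-1 largest eigenvalues
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(vals.real)[::-1]
W = vecs[:, order[:k - 1]].real      # d x (k-1) transformation matrix

Z = X @ W                            # projected data, one (k-1)-vector per row
print(Z.shape)
```

Since <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \le k-1</math>, only <math>k-1</math> eigenvalues are (numerically) nonzero here.<br />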
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the <math>d</math> inputs<br />
<math>\,x_{1}, ..., x_{d}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> with labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
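The closed-form solution and the hat matrix can be illustrated with a short Python/NumPy sketch (added here as an illustration; the data and coefficients are made-up values, not from the lecture):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 3

# Design matrix with an intercept column of ones (synthetic inputs)
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
beta_true = np.array([2.0, 1.0, -3.0, 0.5])   # made-up coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)  # outputs with small noise

# Closed-form least-squares solution: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values via the hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# H y and X beta_hat are the same fitted values
print(np.allclose(y_hat, X @ beta_hat))
```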
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by transposing the reduced data and appending a row of ones (for the intercept), giving a 3-by-400 matrix.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| The figure shows the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== Logistic Function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y\, dx = \ln(1 + e^{x})</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
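These properties are easy to verify numerically; below is a small Python/NumPy sketch (added as an illustration) that checks <math>y(0)=\frac{1}{2}</math> and <math>\frac{dy}{dx}=y(1-y)</math> against a finite-difference derivative:<br />

```python
import numpy as np

x = np.linspace(-5, 5, 1001)          # symmetric grid; x[500] is (numerically) 0
y = 1.0 / (1.0 + np.exp(-x))          # the logistic function

# Property 2: y(0) = 1/2
print(y[500])

# Property 1: dy/dx = y(1 - y), checked against a finite-difference derivative
dy_num = np.gradient(y, x)
dy_exact = y * (1.0 - y)
print(np.max(np.abs(dy_num - dy_exact)))
```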
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression fits a conditional distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood, using <math>Pr(Y|X)</math>. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, taking the log of both sides gives<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T \frac{\exp(\underline{\beta}^T\underline{x}_i)}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first rewriting the gradient using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>, so that <math>\underline{\beta}</math> appears only once, and then differentiating<br />
<br />
<math>\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math> with respect to <math>\underline{\beta}^T</math>.<br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^T W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Now perform a weighted linear regression, with weights <math>w_{i}=P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math>, of the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
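The final identity above, <math>\hat\beta^{WLS}=\beta^{old}+\left[\mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})</math>, can be checked numerically. The following is a sketch with made-up data, assuming NumPy (rather than the MATLAB used elsewhere in these notes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 20
X = rng.normal(size=(d, n))                 # d x n input matrix, as in the notes
beta_old = rng.normal(size=d)
p = 1.0 / (1.0 + np.exp(-beta_old @ X))     # p_i = P(x_i; beta_old)
y = rng.integers(0, 2, size=n).astype(float)
w = p * (1.0 - p)                           # diagonal entries of W
W = np.diag(w)

z = X.T @ beta_old + (y - p) / w            # adjusted response z
lhs = np.linalg.solve(X @ W @ X.T, X @ W @ z)               # (XWX^T)^{-1} XWz
rhs = beta_old + np.linalg.solve(X @ W @ X.T, X @ (y - p))  # beta_old + (XWX^T)^{-1} X(y - p)
```

Both expressions should agree to machine precision, confirming the algebra above.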
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, similar to linear regression, the parameter vector (including <math>\beta_0</math>) will be <math>(d+1)\times{1}</math>.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case, though it does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. When it does not, note that in general only local convergence of Newton's method can be proved: the iteration converges provided the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem; it is uncommon for a starting point to be so far from the solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving will resolve such failures. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
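The pseudo code above can be sketched in a few lines. This is a sketch in NumPy rather than the MATLAB used elsewhere in these notes; as in the notes, <math>X</math> is stored <math>d\times n</math> with one observation per column:

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=100):
    """Fit logistic regression by iteratively reweighted least squares.
    X is d x n (columns are observations), y is a length-n 0/1 vector,
    matching the beta^T x model used in the notes."""
    d, n = X.shape
    beta = np.zeros(d)                                # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-beta @ X))           # step 3: P(x_i; beta)
        w = p * (1.0 - p)                             # step 4: diagonal of W
        z = X.T @ beta + (y - p) / w                  # step 5: adjusted response Z
        XW = X * w                                    # X W without forming the n x n W
        beta_new = np.linalg.solve(XW @ X.T, XW @ z)  # step 6: (XWX^T)^{-1} XWZ
        if np.linalg.norm(beta_new - beta) < tol:     # step 7: stop when beta stabilizes
            return beta_new
        beta = beta_new
    return beta
```

At convergence the score <math>X(\underline{Y}-\underline{P})</math> is (numerically) zero, which is a convenient correctness check. Note that on perfectly separable data the likelihood has no finite maximizer and the iteration will not converge, as discussed below.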
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. the dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. the dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to apply logistic regression and classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2\geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Viewing the fitting of these equations as an iteratively reweighted least squares problem, as in the two-class case, makes the estimates easier to derive.<br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem since we don't have the same simplification.<br />
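The K-class posterior formulas above can be computed directly from the K−1 coefficient vectors. This is a sketch in NumPy (not the MATLAB used elsewhere in these notes), with made-up coefficient values for illustration:

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """Posteriors for K classes given the K-1 coefficient vectors
    beta_1, ..., beta_{K-1}; class K is the reference class in the denominator."""
    scores = np.exp([b @ x for b in betas])        # exp(beta_i^T x), i = 1..K-1
    denom = 1.0 + scores.sum()                     # 1 + sum_k exp(beta_k^T x)
    return np.append(scores / denom, 1.0 / denom)  # last entry is P(Y=K | X=x)

# K = 3 classes, d = 2 features; the coefficient values are arbitrary.
betas = [np.array([0.5, -1.0]), np.array([-0.2, 0.8])]
p = multiclass_posteriors(betas, np.array([1.0, 2.0]))
```

As claimed in the text, the returned posteriors are positive and sum to 1.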
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
Separating hyperplane classifiers try to separate the data using linear decision boundaries. When the classes overlap, the approach can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged and transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>\mathrm{sign}(\underline{\beta}^T \underline{x} + \beta_0) = \mathrm{sign}(\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{d}x_{d})</math><br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem does not have a unique solution. The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between the classes. If the classes are separable, then the algorithm is guaranteed to converge to some separating hyperplane. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br />
<br />
<br />
If the two classes are linearly separable, the separating hyperplane is not unique, and the perceptron algorithm may return any of infinitely many solutions.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
 y=[1;1;1;-1;-1;-1];                        % labels
 x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';  % one column per example
 b_0=0;                                     % intercept
 b=[1;1;1];                                 % initial weight vector
 rho=.5;                                    % learning rate
 for j=1:100
     changed=0;
     for i=1:6
         d=(b'*x(:,i)+b_0)*y(i);            % positive iff point i is classified correctly
         if d<0
             b=b+rho*x(:,i)*y(i);           % perceptron update for a misclassified point
             b_0=b_0+rho*y(i);
             changed=1;
         end
     end
     if changed==0                          % no misclassified points left: stop
         break;
     end
 end
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of the weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> takes the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to the scaling factor <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is proportional to the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points. <br />
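The update rule above can be sketched in code. This is a NumPy sketch (not the MATLAB used in the earlier example), cycling through the points and updating on each misclassified one in turn:

```python
import numpy as np

def perceptron(X, y, rho=0.5, max_epochs=1000):
    """Perceptron algorithm: X is d x n (one point per column), y holds +1/-1 labels.
    Returns (beta, beta_0) such that sign(beta^T x + beta_0) classifies x,
    provided the classes are linearly separable."""
    d, n = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(max_epochs):
        changed = False
        for i in range(n):
            # y_i (beta^T x_i + beta_0) <= 0 means point i is misclassified
            # (or sits exactly on the boundary).
            if y[i] * (beta @ X[:, i] + beta0) <= 0:
                beta = beta + rho * y[i] * X[:, i]   # beta    <- beta   + rho * y_i * x_i
                beta0 = beta0 + rho * y[i]           # beta_0  <- beta_0 + rho * y_i
                changed = True
        if not changed:                              # no misclassified points: converged
            break
    return beta, beta0
```

On the six training examples from the table in the previous lecture, the returned boundary classifies every point correctly, consistent with the perceptron convergence theorem for separable data.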
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, the algorithm converges quickly; if the gap is small, it converges slowly. This problem can be mitigated by the basis expansion technique: specifically, we try to find a hyperplane not in the original space, but in an enlarged space obtained by using some basis functions.<br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#A perfect separation is not always achievable, or even desirable. If observations from different classes share the same input, a model that separates them perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Consider yourself on a peak, wanting to get to the ground as fast as possible. In which direction should you step? Intuitively it should be the direction in which the height decreases fastest, which is given by the gradient. However, depending on the shape of the surface and where you start, you may end up stuck at a local minimum (or a saddle point) rather than the global one.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we get rid of the summation over <math>\,i</math> (all data points). Actually, this is an alternative of the original gradient descent algorithm (sometimes called batch gradient descent) known as Stochastic gradient descent, where we approximate the true gradient by only evaluating on a single training example. This means that <math>\,{\beta}</math> gets improved by computation of only one sample. When there is a large data set, say, population database, it's very time-consuming to do summation over millions of samples. By Stochastic gradient descent, we can treat the problem sample by sample and still get decent result in practice.<br />
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network, but it can take many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer that each represent the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1).<br />
<br />
===Activation Function===<br />
The activation function is a term frequently used in classification by neural networks. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of the input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
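A convenient property of this activation (used later in back-propagation) is that its derivative can be written in terms of its value, <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>. A quick finite-difference check, as a Python sketch:

```python
import math

def sigma(a):
    """Logistic (inverse-logit) activation."""
    return 1.0 / (1.0 + math.exp(-a))

# Central-difference estimate of sigma'(a) at an arbitrary point a = 0.7,
# compared against the closed form sigma(a) * (1 - sigma(a)).
a, h = 0.7, 1e-6
numeric = (sigma(a + h) - sigma(a - h)) / (2 * h)
analytic = sigma(a) * (1 - sigma(a))
```

The two values agree to roughly the precision of the finite-difference step.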
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates, i.e. it has maximum and minimum output values. This property ensures that the weights are bounded and therefore the search time is limited. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; for instance, RBF networks use non-monotonic activations and are also a powerful model. <br />
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output units and thus turn into a classification problem.<br />
<br />
For simplicity, there is only 1 unit at the end and assume for the moment we are doing regression.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: the indices <math>i</math> and <math>j</math> in the figure are reversed relative to the derivation below.)''<br />
<br />
Notice that the weighted sum of the outputs of the perceptrons at layer <math>\displaystyle l</math> is the input into the perceptrons at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the target output and <math>\hat{y_k}</math> is the output actually produced by the network.<br />
#By propagating back through the layers, evaluate all <math>\,\delta_j</math>s for the hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math>, where <math>i</math> ranges over the units in the next layer (the one closer to the output), whose <math>\,\delta_i</math> have already been computed.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
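The steps above can be sketched in code. The following is a minimal illustration (Python with NumPy for concreteness, since the algorithm itself is language-agnostic; names such as <code>backprop_step</code> are ours, not from the course) for a network with one hidden layer of sigmoid units and a single linear output unit, trained repeatedly on one point:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, U, w):
    """Forward pass: a = U x, z = sigma(a), yhat = sum_i w_i z_i."""
    a = U @ x
    z = sigmoid(a)
    return a, z, w @ z

def backprop_step(x, y, U, w, rho=0.1):
    """One stochastic-gradient step on err = (y - yhat)^2."""
    a, z, yhat = forward(x, U, w)
    d_yhat = -2.0 * (y - yhat)                 # d err / d yhat
    grad_w = d_yhat * z                        # d err / d w_i = -2 (y - yhat) z_i
    # delta_j = sigma'(a_j) * w_j * (d err / d yhat), propagated back from the output
    delta = d_yhat * w * sigmoid(a) * (1.0 - sigmoid(a))
    grad_U = np.outer(delta, x)                # d err / d u_{jl} = delta_j * (input)_l
    return U - rho * grad_U, w - rho * grad_w

rng = np.random.default_rng(0)
U = rng.uniform(-1.0, 1.0, size=(3, 2))        # hidden weights, initialized near zero
w = rng.uniform(-1.0, 1.0, size=3)             # output weights
x, y = np.array([0.5, -1.0]), 0.8
for _ in range(500):                           # repeatedly feed the training point
    U, w = backprop_step(x, y, U, w)
_, _, yhat = forward(x, U, w)                  # yhat is now close to y
```

For a real data set one would loop over all training points (step 6 above) rather than a single point.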
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method mentioned in class is to randomize the weights before the first step. This is unlikely to be near the optimal solution, but it is simple to implement. To be more specific, random values near zero (usually drawn from <math>[-1,1]</math>) are a good choice for the initial weights: the model then evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output using a linear approximation of <math>\,\sigma(a_i)</math>, which finds optimal weights in the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the number of iterations increases.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate may lead to a very slow convergence rate (a very long learning phase). However, the advantage of a small learning rate is that it can guarantee convergence. Thus it is generally better to choose a relatively small learning rate to ensure stability; a common choice of <math>\,\rho</math> is between 0.01 and 0.7.<br />
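For on-line learning, one common decaying schedule is <math>\rho_t = \rho_0/(1+t/\tau)</math>, sketched below; the constants <math>\rho_0</math> and <math>\tau</math> are illustrative choices, not values prescribed in the notes:

```python
def learning_rate(t, rho0=0.5, tau=100.0):
    """Decaying learning rate for on-line learning: rho_t = rho0 / (1 + t / tau).

    rho0 is the initial rate; tau controls how quickly the rate decays
    (both are illustrative values)."""
    return rho0 / (1.0 + t / tau)

rates = [learning_rate(t) for t in (0, 100, 1000)]  # 0.5, 0.25, then about 0.045
```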
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we mainly discuss how to estimate the number of hidden units at the outset. Obviously, we should then adjust it using cross-validation (CV), leave-one-out (LOO), or other complexity-control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from a highly complicated mixture model, more hidden units are needed. <br />
<br />
Actually, the number of hidden units determines the size of the model, and therefore the total number of weights in the model. Typically, the number of weights should not be larger than the number of training points, say <math>N</math>; thus <math>N/10</math> is sometimes a good choice. However, in practice, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this Neural Network, the number of nodes is reduced until we reach a layer whose number of nodes equals the desired dimensionality. (In the first few layers the number of nodes need not be strictly decreasing, as long as a layer with fewer nodes is eventually reached.) From this bottleneck layer onward, the previous layers are mirrored, so the output layer has the same number of nodes as the input layer. Now if we feed the network with each point and get an output approximately equal to that input, the input has been reconstructed from the middle-layer units alone, so the output of the middle-layer units can represent the input with fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
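A minimal numerical sketch of this bottleneck idea, using a purely linear network for simplicity (real bottleneck networks use nonlinear units; all names and constants here are illustrative): 3-dimensional points lying near a 1-dimensional line are pushed through a one-unit middle layer, and both weight matrices are adjusted by gradient descent on the reconstruction error between input and output.

```python
import numpy as np

rng = np.random.default_rng(1)
# 3-D points lying near a 1-D line: ideal for a one-unit bottleneck.
t = rng.normal(size=(200, 1))
X = t @ np.array([[1.0, 2.0, -1.0]]) + 0.01 * rng.normal(size=(200, 3))

k = 1                              # bottleneck width = desired dimensionality
W_enc = np.full((3, k), 0.1)       # input -> middle layer (small constant init for reproducibility)
W_dec = np.full((k, 3), 0.1)       # middle layer -> mirrored output
rho = 0.05
for _ in range(500):
    Z = X @ W_enc                  # low-dimensional codes (middle-layer output)
    R = Z @ W_dec                  # reconstruction at the output layer
    E = R - X                      # reconstruction error, propagated back
    W_dec -= rho * Z.T @ E / len(X)            # gradient step on the decoder
    W_enc -= rho * X.T @ (E @ W_dec.T) / len(X)  # gradient step on the encoder

mse = np.mean((X - (X @ W_enc) @ W_dec) ** 2)  # near zero: the 1-D code suffices
```

The codes `Z` are the low-dimensional mapping, and multiplying by `W_dec` (the mirrored half of the network) reconstructs the data, exactly as described above.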
<br />
=== Deep Neural Network ===<br />
Back-propagation may not work well in practice when there are too many hidden layers, since the <math>\,\delta</math>s can become negligible as they are propagated backward and the error signal vanishes. This is a numerical problem in which it is difficult to estimate the errors, so in practice configuring a<br />
Neural Network with back-propagation faces some subtleties.<br />
Deep Neural Networks became popular a few years ago, when effective training methods for them were introduced, notably by Geoffrey Hinton and his collaborators. Deep Neural Network training algorithms deal with training a Neural Network that has a large number of layers.<br />
<br />
The approach to training a deep network is to first treat the network as if it had only two layers and train those; then train the next two layers, and so on.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, an idea inspired by statistical physics, where the most stable configuration is the one of lowest energy; or somehow finding the output of the second layer that is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case as an example. A feed-forward neural network was trained using back-propagation to assist in the marketing control of airline seat allocations. Unlike a fixed rule-based system, the neural approach was adaptive. The system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to model human brains, hence the fancy name "Neural Network". But now we know that they are essentially layers of logistic regressions stacked on top of each other, and have nothing to do with how the brain actually functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. As a result of these kinds of claims, it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since the objective is not convex, we still face the problem of local minima, although techniques have been devised to mitigate this dilemma.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their empirical success, which makes them hard to apply and tune wisely. Partly for that reason, they are not an especially active research area in machine learning; nevertheless, NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model will have very high accuracy on the training set but very poor ability to predict outcomes for new instances, especially outside the domain of the training set.<br />
<br />
In a Neural Network, if the depth is too great, the network will have many degrees of freedom and will learn every characteristic of the training data set. It will then fit the training set very precisely, but it will not be able to generalize from the commonality of the training set to predict the outcomes of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model which fits it best. We can find a polynomial of high degree which passes through almost all the points in the training set. But in fact the training set comes from a line model. Although the complex model has less error on the training set, it diverges from the line in the ranges where we have no training points. Because of this, the high-degree polynomial has very poor predictive power on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier. The reason is that just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas, but a new banana of slightly different shape, for example, cannot be detected. This is the trade-off: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough, neither of which is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data. For example, we may assume the model comes from a family of polynomials, or that it is a neural network of a given architecture; there are other choices as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average number of misclassifications; this empirical error rate estimates the probability that <math>h(x_{i}) \neq y_{i}</math>.<br />
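In code, the empirical error rate is a one-liner; the threshold classifier and labelled points below are purely illustrative:

```python
def empirical_error_rate(h, X, y):
    """L_hat = (1/n) * sum of I(h(x_i) != y_i) over the labelled pairs."""
    return sum(h(xi) != yi for xi, yi in zip(X, y)) / len(y)

# A toy threshold classifier and labelled points (illustrative only):
h = lambda x: 1 if x > 0 else 0
X = [-2.0, -1.0, 0.5, 3.0, -0.3]
y = [0, 0, 1, 1, 1]
rate = empirical_error_rate(h, X, y)   # one of five points misclassified -> 0.2
```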
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
There is a downward bias to this estimate, meaning that it tends to be less than the true error rate. <br />
<br />
As model complexity increases from low to high, the training error rate always decreases. When we apply the model to test data, the error rate decreases at first, but beyond a certain complexity it increases, since the model has not seen the test data before. This is because the training error decreases as we fit the model better by increasing its complexity, but as we have seen, such a complex model does not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is where the error rate on the test data is minimized; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in R</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a property more important for an estimator than unbiasedness: a small mean squared error. In statistics, there are problems for which it may be good to use an estimator with a small bias. An estimator with a small bias may have a smaller mean squared error, or be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under transformations, while mean-unbiasedness may be lost under nonlinear transformations. For example, using an unbiased estimator with large mean squared error, we run a high risk of a large error; in contrast, a biased estimator with small mean squared error can improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a fixed MSE, lower bias means higher variance, and vice versa.<br />
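This decomposition can be checked numerically. Below, a deliberately biased estimator (the sample mean of 10 draws, shrunk by a factor of 0.8; all constants are illustrative) is simulated many times, and the simulated MSE matches the simulated variance plus squared bias:

```python
import numpy as np

rng = np.random.default_rng(0)
f = 2.0                                   # true parameter: the mean of the distribution
# A deliberately biased estimator: shrink the sample mean of 10 draws by 0.8.
estimates = np.array([0.8 * rng.normal(f, 1.0, 10).mean() for _ in range(100_000)])

bias = estimates.mean() - f               # about -0.4, since E(0.8 * xbar) = 1.6
variance = estimates.var()                # about 0.8^2 * (1/10) = 0.064
mse = np.mean((estimates - f) ** 2)       # about 0.064 + 0.4^2 = 0.224
```

Here the shrunken estimator is biased, yet its MSE (0.224) is smaller than that of the unbiased sample mean (variance 0.1, bias 0, MSE 0.1)? No: 0.224 > 0.1 in this toy setting, which illustrates that shrinkage only pays off when it reduces variance by more than the squared bias it introduces.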
<br />
The test error is a good estimate of the MSE. We want somewhat balanced bias and variance (neither one too high), even though the estimator will then carry some bias.<br />
<br />
<br />
Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
> x <- rnorm(200,0,1)<br />
> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
> xtest <- rnorm(50,1,1)<br />
> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
> p1 <- lm(y~x)<br />
> p2 <- lm(y ~ poly(x,2))<br />
> pn <- lm(y ~ poly(x,10))<br />
> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are drawn from a <math>\,N(0,1)</math> distribution, while the test set uses <math>\,N(1,1)</math>. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random variation. Polynomial least-squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
> # calculate the mean squared error of degree 1 poly<br />
> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
[1] 1.576042<br />
> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 7.727615<br />
: Training and test mean squared errors for the linear fit. These are both quite high, and since the data is non-linear, the different mean value of the test data increases the error considerably.<br />
> # calculate the mean squared error of degree 2 poly<br />
> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
[1] 0.08608467<br />
> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 0.08407432<br />
: This fit is far better, and there is little difference between the training and test error, either.<br />
> # calculate the mean squared error of degree 10 poly<br />
> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
[1] 0.07967558<br />
> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but not by much, while the test-set error has risen sharply. The overfitting makes it a poor predictor. As the degree of the polynomial rises further, numerical precision becomes an issue, and a good fit is not even consistently produced for the training data.<br />
> # calculate mse of sin/cos fit<br />
> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
[1] 0.1105446<br />
> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 1.320404<br />
: Fitting a function of the form sin(x)+cos(x) works fairly well on the training set, but because it is not the true underlying function, it fails on test data that does not lie in the same range.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V: since the model has not seen V, and we know the true labels of all elements in V, we can count how many were misclassified.<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although the training error always decreases with increasing model complexity, the test error starts to increase beyond a certain point, which marks overfitting (see [[#prediction-error|figure 2]] above). Since the test error estimates the MSE (mean squared error) best, the idea arose of dividing the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and select the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all selected at random from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{(X_i,Y_i) \in \nu}I(h(X_i) \neq Y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set <math>\,\nu</math>.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
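A sketch of this procedure, mirroring the earlier R example (Python here for illustration; the 50%/25% split sizes follow the classical division mentioned above, and all names are ours): the validation error is computed for polynomial fits of several degrees, and the degree with the lowest validation error is selected.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 200)
y = x**2 - 0.5 * x + rng.normal(0.0, 0.3, 200)   # quadratic truth plus noise

x_tr, y_tr = x[:100], y[:100]          # 50% training set
x_va, y_va = x[100:150], y[100:150]    # 25% validation set (remaining 25%: test set)

def validation_error(degree):
    coeffs = np.polyfit(x_tr, y_tr, degree)            # fit on the training set only
    return np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)

errors = {d: validation_error(d) for d in (1, 2, 10)}  # degree 1 underfits; 10 overfits
best_degree = min(errors, key=errors.get)
```

The untouched final 25% would then give an unbiased estimate of the chosen model's performance.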
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in reality data may be hard to collect (and we usually also suffer from the curse of dimensionality), so a larger data set may be hard to come by. Consequently, we may not be able to afford to sacrifice part of our limited data. In this case we use another method that addresses this problem, K-fold cross-validation. We divide the data set into <math>\,K</math> subsets of roughly equal size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
if <math>\,K=n</math> (leave-one-out): low bias, high variance. Each subset contains a single element, so the model is trained with all points except one, and then validated on that point.<br />
<br />
if <math>\,K</math> is small, say <math>\,K=2</math> (2-fold) or <math>\,K=5</math> (5-fold): higher bias, lower variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th <math>( \,k \in [ 1, K ] )</math> part, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>; the overall cross-validation estimate is then<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets, as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. We use the first three subsets as training sets and the last as the validation set; then the 1st, 2nd, and 4th subsets as training sets and the 3rd as the validation set; and so on, until every subset has served as the validation set once (so all observations are used for both training and validation). After we get <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for each higher-degree model and plot the estimated error against the degree. Now we are able to choose the degree corresponding to the minimum error. We can also use this method to find the optimal number of hidden units in a neural network: begin with 1 unit, then 2, 3, and so on, and pick the number of hidden units with the lowest average error.<br />
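The polynomial-degree example can be sketched as follows (illustrative Python; <code>kfold_cv_error</code> is our name, and <math>\,K=4</math> matches Figure 2):

```python
import numpy as np

def kfold_cv_error(x, y, degree, K=4):
    """Average validation MSE over K folds for a polynomial fit of the given degree."""
    folds = np.array_split(np.arange(len(x)), K)
    errs = []
    for k in range(K):
        val = folds[k]                                     # k-th part: validation
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(x[train], y[train], degree)    # fit on the other K-1 parts
        errs.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return np.mean(errs)       # L_hat = (1/K) * sum of the L_hat_k

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 120)
y = x**2 - 0.5 * x + rng.normal(0.0, 0.3, 120)
cv = {d: kfold_cv_error(x, y, d) for d in range(1, 6)}   # try degrees 1..5
best_degree = min(cv, key=cv.get)                        # expect a low degree, at least 2
```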
<br />
=== Generalized Cross-validation ===<br />
If the vector of observed values is denoted by <math>\mathbf{y}</math> and the vector of fitted values by <math>\hat{\mathbf{y}}</math>, then<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
For such a linear fit, the leave-one-out cross-validation error can be computed without refitting, since <math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{i}-\hat f(\mathbf{x}_{i})}{1-\mathbf{H}_{ii}}\right]^{2}</math>, where <math>\hat f^{-i}</math> denotes the fit computed with the <math>i</math>th observation left out.<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that the trace of <math>\mathbf{H}</math> is often easier to compute than its individual diagonal elements.<br />
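A small numerical check of these formulas for ordinary least squares (illustrative data; note that <math>trace(\mathbf{H})</math> here equals the number of fitted parameters):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # design matrix with intercept
y = 1.0 + 2.0 * X[:, 1] + rng.normal(0.0, 0.5, N)

H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix; trace(H) = number of parameters (2)
resid = y - H @ y                        # residuals of the fit y_hat = H y

loo = np.mean((resid / (1.0 - np.diag(H))) ** 2)       # exact leave-one-out error
gcv = np.mean((resid / (1.0 - np.trace(H) / N)) ** 2)  # GCV: replace H_ii by trace(H)/N
```

Since the leverages <math>\mathbf{H}_{ii}</math> are all close to their average <math>trace(\mathbf{H})/N</math> for this design, the two estimates nearly coincide.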
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point of the original training set to train the model, then using the left-out point to estimate the true error. By repeating this process for every data point in the original data set, we can obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we saw that k-fold cross-validation can be computationally expensive: for every candidate value of the parameter, we must train the model <math>\,K</math> times. This cost is even more pronounced in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, where <math>\,n</math> is the number of data points in the data set.<br />
<br />
Fortunately, when adding data points to the classifier is reversible, calculating the difference between two classifiers is computationally cheaper than calculating the two classifiers separately. So, if the classifier trained on all the data points is known, we can simply undo the changes due to one data point at a time, <math>\,n</math> times, to calculate the leave-one-out cross-validation error rate.<br />
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay is suggested as a way to obtain a robust neural network that is insensitive to noise. Since the architecture of a NN (e.g. the number of hidden layers) is usually decided by domain knowledge, the network can easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function is approximately linear, and the NN then collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights close to zero.<br />
<br />
Formally, we penalize nonlinear weights by adding a penalty term in the error function. Now the regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
If <math>\,\lambda</math> is too large, the weights <math>\,w_i</math> and <math>\,u_{jk}</math> will be made too small. We can use cross-validation to estimate <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the linearity. The initialization of the weights also matters: weights of exactly zero give zero derivatives, so the algorithm never moves, while weights that are too large start the search at a highly nonlinear model, which often leads to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
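The update rules above can be sketched in code. The following numpy sketch is illustrative only (the toy error function and all names are assumptions, not from the lecture): one gradient step on the regularized error <math>\,REG = err + \lambda|w|^2</math>.<br />

```python
import numpy as np

def weight_decay_step(w, grad_err, rho=0.1, lam=0.01):
    """One gradient-descent step on REG = err + lambda * ||w||^2."""
    # d(REG)/dw = d(err)/dw + 2 * lambda * w
    return w - rho * (grad_err + 2 * lam * w)

# toy example: err = ||w - 1||^2, so d(err)/dw = 2 * (w - 1)
w = np.array([0.05, -0.05])          # initialize near zero, as suggested above
for _ in range(1000):
    w = weight_decay_step(w, 2 * (w - 1.0))
# the decay term shrinks the minimizer from 1 to 1 / (1 + lambda)
```

With <math>\,\lambda=0.01</math> the iteration converges to <math>1/(1+\lambda)\approx 0.99</math> rather than 1, showing how the penalty biases the weights toward zero.<br />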
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with a single hidden layer and an output layer, with weights only between the hidden and output layers; it can be trained without back-propagation since the weights have a closed-form solution. The neurons in the hidden layer contain basis functions. One widely used choice is radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process); as usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data. The radii of the RBF functions may differ, and the centers and radii can be determined through clustering or an EM algorithm. When the vector x arrives from the input layer, each hidden neuron computes the radial distance from its center point and then applies the RBF function to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
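To make the hidden layer concrete, here is a short illustrative sketch (the function name and toy data are assumptions) that builds the matrix of Gaussian basis-function activations <math>\phi_{j}(x)</math> defined above:<br />

```python
import numpy as np

def rbf_design_matrix(X, centers, sigma):
    """Phi[i, j] = exp(-||x_i - mu_j||^2 / (2 * sigma^2)).

    X       : (n, d) input data
    centers : (M, d) basis-function centers mu_j
    sigma   : width of the Gaussian basis functions
    """
    # squared Euclidean distance from every point to every center
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

X = np.array([[0.0], [1.0], [2.0]])
mu = np.array([[0.0], [2.0]])
Phi = rbf_design_matrix(X, mu, sigma=1.0)   # shape (n, M) = (3, 2)
# a point sitting on a center gives phi = 1; distant points decay toward 0
```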
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
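These identities can be verified numerically with a finite-difference sketch (illustrative code, not part of the derivation):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))

def num_grad(f, X, eps=1e-6):
    """Central finite-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

g1 = num_grad(lambda M: np.trace(A @ M), X)        # should equal A^T
g2 = num_grad(lambda M: np.trace(M.T @ A), X)      # should equal A
g3 = num_grad(lambda M: np.trace(M.T @ A @ M), X)  # should equal (A^T + A) X
```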
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
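As an illustrative sketch (the toy data, widths, and names are all assumptions), the closed-form weights and hat matrix can be computed directly:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, sigma = 50, 10, 0.1
x = np.linspace(0, 1, n)[:, None]                    # (n, 1) inputs
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal((n, 1))

mu = np.linspace(0, 1, M)[:, None]                   # (M, 1) centers
Phi = np.exp(-((x - mu.T) ** 2) / (2 * sigma ** 2))  # (n, M) basis matrix

# W = (Phi^T Phi)^{-1} Phi^T Y  -- the closed-form least-squares weights
W = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)        # hat matrix
y_hat = H @ y                                        # identical to Phi @ W
```

Using `np.linalg.solve` rather than forming <math>(\Phi^{T}\Phi)^{-1}</math> explicitly is the standard, numerically stabler way to evaluate this formula.<br />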
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the (n by k) matrix of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} & \phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & \vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n0} & \phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the (n by M+1) matrix of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the (M+1 by k) matrix of weights.<br />
<br />
where the extra basis function <math>\phi_{0}</math> is set to 1 to accommodate a bias term.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function. Giving the familiar form,<br /><br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit can then be thought of as representing a feature; if there are more hidden units than input units, we essentially project into a higher-dimensional space, as we did in our earlier trick. This does not mean that an RBF network will actually do this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. With such power, however, the problem of overfitting becomes more pressing: we have to control the model's complexity so that it fits a general model rather than an arbitrary training set.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and that the data is generated by a mixture model. According to Bayes' Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can expand the class-conditional density in the above expression by marginalizing over the hidden variable <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
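A minimal sketch of evaluating this mixture density in one dimension (the component parameters below are made up for illustration):<br />

```python
import numpy as np

def gmm_density(x, alphas, mus, sigmas):
    """f(x) = sum_m alpha_m * N(x; mu_m, sigma_m^2), a 1-D Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    dens = np.zeros_like(x)
    for a, m, s in zip(alphas, mus, sigmas):
        dens += a * np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
    return dens

# two components; the mixing proportions alpha_m sum to 1
xs = np.linspace(-6.0, 9.0, 20001)
f = gmm_density(xs, alphas=[0.4, 0.6], mus=[0.0, 3.0], sigmas=[1.0, 1.0])

dx = xs[1] - xs[0]
integral = f.sum() * dx      # Riemann sum; a proper density integrates to 1
```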
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When using an RBF network for classification, we usually treat it as a regression problem and set a threshold to decide each data point's class membership. However, to gain insight into what we are doing in terms of the RBF network for the classification problem, we often think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as shown in the graph on the right-hand side, that we have three random variables, <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is the observation, and <math>\displaystyle j</math> is a hidden random variable. Each class can trigger a different hidden random variable <math>\displaystyle j</math>. To understand this, assume for instance that each <math>\displaystyle j</math> has a Gaussian distribution (it could have any other distribution as well), with the same form but different parameters. From each Gaussian distribution triggered by a class, we sample some data points. In the end, we obtain a set of data which is not strictly Gaussian but is actually a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
We multiply the summand by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})*\frac{Pr(j)}{Pr(j)}}{Pr(x)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, each term in this sum is just the product of two posterior probabilities.<br />
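The identity <math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math> relies on the chain structure <math>\,y \rightarrow j \rightarrow x</math> (so <math>\,y</math> and <math>\,x</math> are independent given <math>\,j</math>). It can be checked numerically on a small discrete model (all distributions below are randomly generated for illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(2)

def random_dist(shape):
    """Random stochastic array: the last axis sums to 1 (a conditional dist.)."""
    p = rng.random(shape)
    return p / p.sum(axis=-1, keepdims=True)

K, J, X = 3, 4, 5                  # classes, hidden states, discrete x values
p_y = random_dist((K,))            # Pr(y_k)
p_j_given_y = random_dist((K, J))  # Pr(j | y_k)
p_x_given_j = random_dist((J, X))  # Pr(x | j)

# joint over (y, j, x) under the chain y -> j -> x
joint = p_y[:, None, None] * p_j_given_y[:, :, None] * p_x_given_j[None, :, :]

# direct posterior Pr(y | x) from the joint
p_yx = joint.sum(axis=1)                       # Pr(y, x)
post_direct = p_yx / p_yx.sum(axis=0)          # Pr(y | x), shape (K, X)

# decomposition: sum_j Pr(y | j) Pr(j | x)
p_jx = joint.sum(axis=0)                       # Pr(j, x)
p_j_given_x = p_jx / p_jx.sum(axis=0)          # Pr(j | x)
p_yj = joint.sum(axis=2)                       # Pr(y, j)
p_y_given_j = p_yj / p_yj.sum(axis=0)          # Pr(y | j)
post_decomp = p_y_given_j @ p_j_given_x        # shape (K, X)
```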
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results derived above to our RBF network. In an RBF network, as we can see on the right-hand side, we have a set of inputs, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, the hidden basis functions, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and the outputs, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. We also have weights from the hidden layer to the output layer, and each output is just a linear sum of the <math>\displaystyle \phi</math>'s. <br />
<br />
Now, if we take the probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi_{j}(x)</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weight <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in one dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It's as if we put some Gaussian over data. And for each Gaussian, we consider the center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
We can see the graph on the left, which plots the densities of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance: as a point moves far from the center <math>\displaystyle \mu_{1}</math>, <math>\displaystyle \phi_{1}</math> decays to nearly zero. Remember that we can usually perform a non-linear regression or classification of the input space by doing a linear one in some extended space or feature space (more details in the Aside). Here, the <math>\displaystyle \phi</math>'s actually produce that feature space. <br />
<br />
So, one way to look at this is that <math>\displaystyle \phi</math> tells us, given an input, how likely a particular feature is to be present. Say, for example, we define the features as the centers of these Gaussian distributions. Then <math>\displaystyle \phi</math> computes, for a given data point, the probability of that feature appearing. If the data point is right at the center, the value of <math>\displaystyle \phi</math> is one, i.e. the probability is 1. If the point is far from the center, the probability (the value of <math>\displaystyle \phi</math>) is close to zero, that is, the feature is less likely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given the data. <br />
<br />
When we have those features, <math>\displaystyle y</math> is a linear combination of the features. Hence each weight <math>\displaystyle w_{jk}</math>, which is equal to <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y_k</math> is to appear given feature <math>\displaystyle j</math>. Therefore, the weight <math>\displaystyle w_{jk}</math> represents the probability of class membership given the feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>, each with d features. These n data points form the columns of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we don’t perform Least Square on this <math>\displaystyle X</math> matrix, we do Least Square on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point with its center. Hence, the new set of features are actually representing M centers in our data set, and for each data point, its new features check how this point is similar to the first center; how it is similar to the second center; and how it is similar to the <math>\displaystyle M^{th}</math> center. And this checking process will apply to all data points. Therefore, feature space gives another representation of our data set. <br />
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit to the training data, we often want to increase the complexity of our RBF network. By construction, the only way to change the complexity of an RBF network is to increase or decrease the number of basis functions: a larger number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF network can fit any training set exactly; however, this does not mean the model generalizes well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is analyze the training error. Working through the training error, we will see that it can be decomposed, and one component of this decomposition is the Mean Squared Error (MSE). In the notes that follow, our final goal is to obtain a good estimate of the MSE; to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn't always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left-hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>\,N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model only fits the training data and is useless for predicting new data points.<br />
<br />
> n<-20;<br />
> x<-seq(1,10,length=n);<br />
> alpha<-2.5;<br />
> beta<-1.75;<br />
> y<-alpha+beta*x+rnorm(n);<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');<br />
> lines(spline(x, y), col = 2);<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on the training data, although our real goal is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
But training error and testing error do not have a simple monotone relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error can increase dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to the training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, it does not represent the features of the true underlying data source, which is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model typically adapts to the training data, and hence the training error is an overly optimistic estimate of the testing error. An obvious way to estimate the testing error well is to add a compensating penalty term to the training error; SURE is developed from this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the sum of squared errors over the <math>\,N</math> training points.<br />
*<math>\displaystyle Err=\sum_{i=1}^M (\hat y_i-y_i)^2 </math> denote the ''test error'', the sum of squared errors over an independent test sample of <math>\,M</math> points.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations are <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>\,N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}_{i=1}^N</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>, then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, where <math>\displaystyle y_i</math> and <math>\hat f_i</math> both have the same mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and if <math>\displaystyle g(Z)</math> is weakly differentiable,such that<math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E(g'(Z))</math>.<br />
<br />
<br />
According to Stein's Lemma, applied with <math>\,Z=y_i</math>, <math>\,\mu=f_i</math>, and <math>\,g(y_i)=\hat f_i-f_i</math>, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math>, can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>:<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model and not a function of the observations <math>\displaystyle y_i</math>, we have <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
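As a quick sanity check of Stein's lemma (a Monte Carlo sketch with an arbitrarily chosen <math>\,g</math>; not part of the derivation): for <math>\,g(z)=z^2</math>, both sides of the lemma equal <math>\,2\mu\sigma^2</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.0, 0.5
Z = rng.normal(mu, sigma, size=1_000_000)

# g(Z) = Z^2, so g'(Z) = 2Z
lhs = np.mean(Z ** 2 * (Z - mu))        # Monte Carlo E[g(Z)(Z - mu)]
rhs = sigma ** 2 * np.mean(2 * Z)       # sigma^2 E[g'(Z)]
# both approximate the exact value 2 * mu * sigma^2 = 0.5
```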
<br />
<br />
<br />
====Two Different Cases====<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is used to assess the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; the new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent, and therefore <math>\displaystyle cov(y_i,\hat f)=0</math>. (Equivalently, think about <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, it has nothing to do with <math>\hat f</math>, because <math>\hat f</math> is estimated from the training data alone, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>.) In this case, <math>\displaystyle (1)</math> can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Based on the notation we denote above, then we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross-validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross-validation, to avoid overfitting or underfitting, the validation data set is kept independent of the estimated model.<br />
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing the model <math>\hat f_i</math>. In this case, the cross term in <math>\displaystyle (1)</math> cannot be ignored because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can instead be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
Averaging over the <math>\,N</math> training points, this means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter and thus cannot be determined completely.<br />
<br />
<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the optimal number of basis functions is the one that minimizes the generalization error. For the Radial Basis Function Network, setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least-squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, we can tell that <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. We then obtain the further simplification <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=d</math>, where <math>\displaystyle d</math> is the number of columns of <math>\displaystyle \Phi</math>, i.e. the number <math>\,M</math> of basis functions onto which the input matrix <math>\,X</math> is projected. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
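The trace identity used here can be checked numerically. Below is a small Python/NumPy sketch (the data, centers and width are arbitrary illustrative choices):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 4
X = rng.standard_normal((N, 1))
centers = np.linspace(-2.0, 2.0, M)          # M Gaussian basis centers
s = 1.0                                      # common width

Phi = np.exp(-(X - centers) ** 2 / (2 * s ** 2))   # N x M design matrix
Phi = np.hstack([np.ones((N, 1)), Phi])            # add intercept -> M+1 columns
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T       # hat matrix

print(round(np.trace(H), 6))   # M + 1 = 5, independent of N
```

The trace equals the number of columns of <math>\displaystyle \Phi </math> whenever <math>\displaystyle \Phi </math> has full column rank.<br />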
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimal number of basis functions by choosing, among the models considered, the one with the smallest MSE. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\,M</math>, compute the training error <math>\displaystyle err(M)</math> for each.<br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples and the noise,<math>\sigma^2</math>, can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
Applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Figure 27.1 on the right shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math>.<br />
<br />
sure_err = err - num_data_point * sigma ^ 2 + 2 * sigma ^ 2 * (num_basis_functions + 1);  % err is the training error<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated using the error.<br />
<br />
err = sum((output - expected_output) .^ 2);   % training error<br />
sigma2 = err / (num_data_point - 1);          % estimated noise variance
sure_err = err - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) = <math>\sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) = <math>\sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units) = <math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then <math>\displaystyle N\sigma^2 </math> is a constant with no impact on the choice of model, so we can ignore it; minimizing <math>\displaystyle MSE</math> reduces to minimizing <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and it changes when <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math>.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>\sim N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
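The simplification above can be spot-checked numerically; this small Python snippet just confirms that the intermediate and final expressions for <math>\displaystyle MSE </math> agree for arbitrary values of <math>\displaystyle err </math>, <math>\displaystyle N </math> and <math>\displaystyle M </math>:<br />

```python
# verify: err*(1 - N/(N-1) + 2*(M+1)/(N-1)) == err*(2M+1)/(N-1)
for err in (0.5, 2.0, 10.0):
    for N in (10, 100, 1000):
        for M in (1, 5, 20):
            lhs = err * (1 - N / (N - 1) + 2 * (M + 1) / (N - 1))
            rhs = err * (2 * M + 1) / (N - 1)
            assert abs(lhs - rhs) < 1e-9
print("28.3 checks out")
```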
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error decreases and the MSE increases as the number of hidden units grows (i.e. as the model becomes more complex).<br />
<br />
<br />
As the number of hidden units grows, the training error decreases toward <math>\displaystyle 0 </math>. If the training error reached <math>\displaystyle 0 </math>, then by (28.3) the MSE would also approach <math>\displaystyle 0 </math>, no matter how large <math>\displaystyle (M+1) </math> is. In practice this does not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs and the MSE increases instead, as shown in Figure 28.1. <br />
<br />
<br />
Note that <math>\displaystyle \sigma^2 </math> is estimated as (roughly) the average of <math>\displaystyle err </math>, so the estimate changes with the model. To deal with this, we can estimate it separately for each candidate number of hidden units; for example, first with 1 hidden unit, then with 10 hidden units, and so on.<br />
<br />
We can also see that unlike the classical Cross Validation (CV) or Leave one out (LOO) techniques, the SURE technique does not need to do the validation to find the optimal model. Hence, SURE technique uses less data than CV or LOO. It is suitable for the case that there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering K-means clustering] is a method of cluster analysis which aims to partition <math>n</math> observations into <math>k</math> clusters, in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units is the same as the number of clusters; each cluster <math>j</math> corresponds to one basis function <math>\displaystyle \phi_j </math><br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, where the width <math>\displaystyle \sigma_j </math> is set the same for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. For each existing center, reassign the points to form its cluster (every point in a cluster should be closer to that cluster's center than to any other center).<br />
<br />
2. Compute the mean of each cluster and make it the new center of that cluster.<br />
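The two steps above can be sketched as a short NumPy implementation of Lloyd's algorithm (an illustrative Python version; the lecture itself uses MATLAB's built-in kmeans, shown below):<br />

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: X is (n, d); returns labels and centers.
    (No empty-cluster handling, for brevity.)"""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # step 1: assign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # step 2: recompute each center as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

On two well-separated blobs this recovers the blob structure, mirroring what the MATLAB session below does.<br />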
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden values)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
>> 30 1<br />
>> size(C) <br />
>> 2 80<br />
>> size(sumD) <br />
>> 2 1<br />
>> c1=sum(IDX==1)<br />
>> 14<br />
>> c2=sum(IDX==2)<br />
>> 16<br />
>> sumD<br />
>> 85.6643<br />
>> 101.0419<br />
>> v1=sumD(1,1)/c1 <br />
>> 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
>> 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points in 80 dimensions (MATLAB's kmeans clusters the rows of <math>X</math>), and then apply the kmeans method to separate <math>X</math> into 2 clusters. IDX is a 30*1 vector containing 1 or 2, indicating the cluster of each point. <math>\displaystyle C </math> is the center (mean) of each cluster, with size 2*80; sumD is the sum of squared distances between the data points and the center of their cluster. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the numbers of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the variance of the first cluster <math>\displaystyle (v1=\sigma_1^2)</math>; <math>\displaystyle v2 </math> is the variance of the second cluster <math>\displaystyle (v2=\sigma_2^2)</math>. Now we can get <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> from the following equations. Finally, we obtain the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
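Putting these equations together, the fit can be sketched as follows in Python/NumPy (the centers and width below are hypothetical stand-ins for the k-means output):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (40, 1))                       # training inputs
Y = np.sin(3 * X) + 0.1 * rng.standard_normal((40, 1))

mu = np.array([-0.5, 0.0, 0.5])       # cluster centers (as k-means would give)
s = 0.5                               # common width sigma_j

Phi = np.exp(-(X - mu) ** 2 / (2 * s ** 2))           # N x M design matrix
W = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)           # W = (Phi'Phi)^-1 Phi'Y
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T          # hat matrix
Y_hat = H @ Y                                         # equals Phi @ W

print(np.allclose(Y_hat, Phi @ W))   # True
```

Note that <math>\displaystyle Trace(H) </math> equals the number of basis functions (3 here, with no intercept column), consistent with the SURE derivation.<br />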
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is the EM algorithm for the Gaussian mixture model. Generally speaking, the Gaussian mixture model gives a soft clustering while <math>K</math>-means gives a hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
E-step: each point is assigned a weight (responsibility) for each cluster, based on its likelihood under the corresponding Gaussian. (In hard clustering, by contrast, an observation would be assigned 1 for the cluster whose center it is closest to and 0 for the others.) <br />
<br />
M-step: compute the weighted means and covariances and make them the new means and covariances of each cluster.<br />
<br />
>>[P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
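The mdgEM routine above is course-provided code; as a reference sketch of what EM does, here is a minimal, self-contained implementation for a two-component one-dimensional Gaussian mixture, written in Python/NumPy rather than MATLAB (all names are illustrative):<br />

```python
import numpy as np

def gmm_em_1d(x, n_iter=200):
    """Minimal EM for a 2-component 1-D Gaussian mixture (soft clustering)."""
    k = 2
    pi = np.full(k, 1.0 / k)                   # mixing weights
    mu = np.array([x.min(), x.max()])          # crude but deterministic init
    var = np.full(k, x.var())
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each point
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r = r / r.sum(axis=1, keepdims=True)
        # M-step: weighted means, variances and mixing weights
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return pi, mu, var
```

Unlike <math>K</math>-means, the responsibilities <math>r</math> are fractional, which is what makes the clustering "soft".<br />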
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier: they construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized into what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separable by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
In the figure, the data points lie in two classes in <math>\displaystyle \mathbb{R}^{2} </math> and can be separated by a linear boundary. When a dataset is linearly separable, there exist infinitely many separating hyperplanes for the training data, including the two black lines shown. The question is which solution will perform best on new data. <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers such as (28.4) that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: to make the quantity positive for correctly classified points, if the point is on the <math>\displaystyle +1 </math> side we use <math>\displaystyle d_i(+1)</math>, and if it is on the <math>\displaystyle -1 </math> side we use <math>\displaystyle d_i(-1)</math>; i.e. we work with <math>\displaystyle y_id_i</math>.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>; we can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers====<br />
Choose the line farthest from both classes, i.e. the line with the maximum distance from the closest point (maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is the label and <math>\displaystyle d_i </math> is the distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since we are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Take two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane. Then:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle (\beta^{T}x_1+\beta_0)-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence,<math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and<math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
<br />
<br />
3. The signed distance of any point <math>\displaystyle x_i </math> to the hyperplane is obtained by projecting <math>\displaystyle x_i-x_0 </math> onto the unit normal <math>\displaystyle \frac{\beta}{\|\beta\|} </math>:<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
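As a quick numerical check of this formula (the numbers below are chosen by hand so the distances are easy to verify):<br />

```python
import numpy as np

beta = np.array([3.0, 4.0])     # normal vector, ||beta|| = 5
beta0 = -5.0                    # hyperplane: 3*x1 + 4*x2 - 5 = 0

X = np.array([[3.0, 4.0],       # 25 - 5 = 20  -> signed distance  4
              [0.0, 0.0],       # 0  - 5 = -5  -> signed distance -1
              [0.6, 0.8]])      # 5  - 5 = 0   -> on the hyperplane

d = (X @ beta + beta0) / np.linalg.norm(beta)
print(d)   # [ 4. -1.  0.]
```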
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
<math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane and is correctly classified. Then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta </math> only the direction is important, and <math>\displaystyle \frac{\beta}{c} </math> does not change that direction, so the hyperplane remains the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie, T., Tibshirani, R., Friedman, J. (2008). ''The Elements of Statistical Learning'', pp. 129-130.<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
The Support Vector Machine is used to find a maximum margin hyperplane, assuming the two classes are separable. The margin can be written as <math>\,min\{y_id_i\}</math>, the smallest signed distance of any training point from the hyperplane, where <math>\,d_i</math> is the distance and <math>\,y_i</math> supplies the sign.<br />
===Maximizing the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, but <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane.<br />
<br />
This implies <math>\,\exists C</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> define the hyperplane, and only their direction matters; dividing through by a constant does not change the hyperplane. Thus, by rescaling <math>\,\beta, \beta_0</math> we can eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>, implying that the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math>.<br />
<br />
Now, in order to maximize the margin, we need to maximize <math>\,\frac{1}{\|\beta\|}</math>. <br />
<br />
In other words, find the minimum <math>\,\|\beta\|</math>, s.t. <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. The 1-norm of a vector is simply the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance); it is sometimes preferred, but has a discontinuity in its derivative. The 2-norm, the Euclidean norm (the intuitive length of the vector), is easier to work with - that is <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>.<br />
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the conditions are satisfied, as well as finding an optimal solution. <math>\,\alpha_i</math> are introduced as dual constraints. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left(y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right)}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is a dual representation of the maximum margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \,\forall i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the equation that we need to maximize, as well as two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are solving for <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>\max_{\underline{\alpha}}\ L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> with entries <math>\,S_{ij} = y_iy_jx_i^Tx_j</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's built-in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use quadprog to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's quadprog function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. One caveat: quadprog ''minimizes'' its objective, while the dual <math>\,L(\alpha)</math> is ''maximized''; we therefore minimize the negative, <math>\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{1}^T\underline{\alpha}</math>, which corresponds to passing <math>\,H=S</math> and <math>\,f=-\underline{1}</math>.<br />
<br />
We'll use a simple one-dimensional data set, which is essentially y = -1 or 1 plus Gaussian noise. (Note: you could put the values straight into the quadprog call; they are separated for clarity.)<br />
<br />
x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]'; % 1 x 200<br />
y = [-ones(100,1); ones(100,1)];                        % 200 x 1<br />
z = y .* x';              % z(i) = y(i) * x(i)<br />
S = z * z';               % S(i,j) = y(i)*y(j)*x(i)*x(j)<br />
f = -ones(200,1);         % quadprog minimizes (1/2)*a'*S*a + f'*a<br />
Aeq = y';                 % equality constraint: sum(alpha .* y) = 0<br />
beq = 0;<br />
lb = zeros(200,1);        % alpha_i >= 0, element-wise<br />
ub = [];                  % there is no upper bound<br />
alpha = quadprog(S,f,[],[],Aeq,beq,lb,ub);<br />
<br />
This gives us the optimal <math>\,\alpha</math>. Note that the lower bound must be a vector with one entry per component of <math>\,\alpha</math>; with a scalar lower bound, quadprog may leave some components negative.<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{i}</math>. If <math>\,f</math> and <math>\,g</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial \hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum{\alpha_ig_i'(\hat{x})}=0</math><br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> is not an optimal solution.<br />
<br />
These are all fairly direct except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that actually determine the optimal separating hyperplane we are looking for. They are also the most difficult points to classify, and the most informative ones for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br />In order for this condition to be satisfied, either <br /><math>\,\alpha_i= 0</math> or <br /><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
All points <math>x_i</math> are at a (canonical) distance of either exactly 1 or more than 1 from the hyperplane.<br />
<br />
'''Case 1: a point <math>\displaystyle x_i</math> more than 1 away from the hyperplane'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point <math>\displaystyle x_i</math> exactly 1 away from the hyperplane'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
<br />If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
<br />
Points on the margin, i.e. points with corresponding <math>\,\alpha_i > 0</math>, are called ''support vectors''.<br />
<br />
===Using support vectors===<br />
<br />
Support vectors are important because they are the only points that determine the solution. If <math>\,\alpha_i = 0</math>, then the corresponding term contributes nothing to <math>\,\beta = \sum_i{\alpha_iy_ix_i}</math>; only the points on the margin, the support vectors, contribute to the solution of the SVM problem.<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> (equivalently, minimize <math>\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i}</math>) such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math> for <math>\,\beta_0</math><br />
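The three steps above can be sketched numerically. The following is a minimal pure-Python illustration (not the lecture's Matlab/quadprog route) on a hypothetical two-point data set <math>\,x_1=(-1,0), y_1=-1</math> and <math>\,x_2=(1,0), y_2=+1</math>; all names and step sizes here are our own. For this data the equality constraint forces <math>\,\alpha_1=\alpha_2=a</math>, so the dual reduces to maximizing <math>\,L(a)=2a-2a^2</math> over <math>\,a \geq 0</math>.<br />

```python
def solve_toy_dual(lr=0.1, steps=200):
    """Projected gradient ascent on the reduced dual L(a) = 2a - 2a^2."""
    a = 0.0
    for _ in range(steps):
        a += lr * (2 - 4 * a)   # gradient of L(a)
        a = max(a, 0.0)         # project onto the constraint a >= 0
    return a

xs = [(-1.0, 0.0), (1.0, 0.0)]  # hypothetical training points
ys = [-1, 1]

a = solve_toy_dual()            # step 1: optimal alpha_1 = alpha_2 = a
alphas = [a, a]

# Step 2: beta = sum_i alpha_i * y_i * x_i
beta = [sum(al * y * x[d] for al, y, x in zip(alphas, ys, xs))
        for d in range(2)]

# Step 3: pick a support vector and solve y_2*(beta . x_2 + beta_0) = 1
dot = sum(b * xi for b, xi in zip(beta, xs[1]))
beta0 = 1 / ys[1] - dot
```

The result matches the analytic solution <math>\,\beta=(1,0)</math>, <math>\,\beta_0=0</math>: the separating hyperplane is the vertical axis halfway between the two points.<br />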
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab's built-in SVM routines (found in the Bioinformatics Toolbox) to do classification with support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]</div>
<div>
<br />
<br />
Thus, given a new input <math>\,X \in \mathcal{X} </math>,<br />
we can use the classification rule to predict the corresponding label <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented with its features <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> gives the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:The ''''true error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:The ''''empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
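The empirical error rate is straightforward to compute. Here is a short sketch in Python rather than the course's Matlab, with a made-up threshold rule and data:<br />

```python
def empirical_error(h, X, Y):
    """Fraction of points where the rule h disagrees with the label."""
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(Y)

h = lambda x: 1 if x > 0 else 0      # a toy classification rule
X = [-2.0, -1.0, 1.0, 2.0, -0.5]     # hypothetical training inputs
Y = [0, 0, 1, 1, 1]                  # the last point is misclassified by h
```

Here <code>empirical_error(h, X, Y)</code> is <math>\,1/5 = 0.2</math>.<br />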
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and consider the probability <math>\,r(x)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimension cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of its being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remarks:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since in practice we cannot determine the prior probability.<br />
<br />
<br />
'''Example''':<br /><br />
We're going to predict whether a particular student will pass STAT 441/841.<br />
We have data on past student performance. For each student we know:<br />
whether the student's GPA > 3.0 (G),<br />
whether the student has a strong math background (M),<br />
whether the student is a hard worker (H),<br />
and whether the student passed or failed the course.<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
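The arithmetic of this example can be checked with a short Python sketch. The class-conditional values below are back-solved from the numbers in the text (<math>\,0.025 = P(X|Y=1)\cdot 0.5</math> and <math>\,0.125-0.025 = P(X|Y=0)\cdot 0.5</math>); the original counts are in the table image above.<br />

```python
prior = {1: 0.5, 0: 0.5}          # P(Y = y), as assumed in the text
likelihood = {1: 0.05, 0: 0.20}   # P(X=(0,1,0) | Y=y), back-solved values

# Bayes formula: posterior = likelihood * prior / evidence
evidence = sum(likelihood[y] * prior[y] for y in (0, 1))
r = likelihood[1] * prior[1] / evidence
prediction = 1 if r > 0.5 else 0  # Bayes classification rule
```

This reproduces <math>\,r(X)=0.2<\frac{1}{2}</math> and hence the prediction of class 0 (fail).<br />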
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian-network-augmented naive Bayes (BAN), and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as having an objective existence. In fact, they represent two different schools of thought in statistics.<br />
<br />
In the history of statistics there have been two major schools: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between Bayesians and frequentists.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data are a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g. a <math>\,50\%</math> probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian view, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist view, one does not see the man himself, but judges from photos (a sample) of him whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier represents the optimal method, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2 Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density estimation, estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of this third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the <math>\Sigma_k</math> over all <math>k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math><br />
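To make the derivation concrete, here is a one-dimensional Python sketch (our own illustrative numbers, with shared unit variance) of the linear discriminant <math>\,\delta_k(x) = x\mu_k - \frac{1}{2}\mu_k^2 + \log(\pi_k)</math>; with equal priors the two discriminants agree exactly halfway between the means, as claimed above.<br />

```python
import math

def delta(x, mu, pi):
    """1-D LDA discriminant with shared unit variance."""
    return x * mu - mu**2 / 2 + math.log(pi)

mu_k, mu_l = 0.0, 4.0     # illustrative class means
pi_k = pi_l = 0.5         # equal priors

mid = (mu_k + mu_l) / 2   # the boundary lies at the midpoint, x = 2
```

At <code>mid</code> the two discriminants are equal; a point such as <math>\,x=1</math> has the larger <math>\,\delta_k</math> and is therefore classified to class <math>k</math>.<br />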
<br />
===QDA===<br />
The concept is the same: find a boundary where the classification error rates between classes are equal, except that the assumption that each cluster has the same covariance <math>\,\Sigma</math> (equal to the mean of the <math>\Sigma_k</math>) is dropped.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic rather than linear because the covariance matrices differ, so the quadratic terms <math>\,x^\top\Sigma_k^{-1}x</math> and <math>\,x^\top\Sigma_l^{-1}x</math> no longer cancel.<br />
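A one-dimensional Python sketch (with illustrative numbers of our own) shows the effect of unequal variances: the discriminant <math>\,\delta_k(x) = -\frac{1}{2}\log(\sigma_k^2) - \frac{(x-\mu_k)^2}{2\sigma_k^2} + \log(\pi_k)</math> is quadratic in <math>x</math>, so the wide class can win on ''both'' sides of the narrow one.<br />

```python
import math

def delta(x, mu, var, pi):
    """1-D QDA discriminant (quadratic in x)."""
    return -0.5 * math.log(var) - (x - mu)**2 / (2 * var) + math.log(pi)

classes = {'k': (0.0, 1.0, 0.5),   # (mean, variance, prior)
           'l': (3.0, 4.0, 0.5)}   # class l has the larger variance

def classify(x):
    return max(classes, key=lambda c: delta(x, *classes[c]))
```

Points near 0 go to class <code>k</code> and points near 3 to class <code>l</code>, but so does a point far to the ''left'' (e.g. <math>\,x=-10</math>), because the curved boundary encloses the narrow class from both sides.<br />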
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice we do not know the true parameter values, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
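These estimates are easy to compute. A pure-Python sketch for the one-dimensional case, on made-up data:<br />

```python
# Hypothetical 1-D training data, grouped by class label
data = {1: [1.0, 2.0, 3.0], 2: [8.0, 10.0]}
n = sum(len(v) for v in data.values())

pi_hat  = {k: len(v) / n for k, v in data.items()}          # priors
mu_hat  = {k: sum(v) / len(v) for k, v in data.items()}     # class means
var_hat = {k: sum((x - mu_hat[k])**2 for x in v) / len(v)   # ML variances
           for k, v in data.items()}

# Pooled (common) variance, weighted by class sizes as in the formula above
pooled = sum(len(data[k]) * var_hat[k] for k in data) / n
```

Here the priors are 0.6 and 0.4, the class means are 2 and 9, and the pooled variance is 0.8.<br />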
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
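A small Python sketch of this case (with our own 2-D numbers): when <math>\, \Sigma_k = I</math>, classification reduces to choosing the class whose mean is closest in squared Euclidean distance, after adjusting by the log prior.<br />

```python
import math

means  = {'a': (0.0, 0.0), 'b': (4.0, 0.0)}   # illustrative class centers
priors = {'a': 0.5, 'b': 0.5}

def delta(x, k):
    """delta_k for Sigma = I: -(1/2)||x - mu_k||^2 + log(pi_k)."""
    sq = sum((xi - mi)**2 for xi, mi in zip(x, means[k]))
    return -0.5 * sq + math.log(priors[k])

def classify(x):
    return max(means, key=lambda k: delta(x, k))
```

For example, <code>classify((1, 0))</code> gives <code>'a'</code> and <code>classify((3, 1))</code> gives <code>'b'</code>.<br />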
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
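The transformation can be sketched in Python. To keep the example short we pick a ''diagonal'' <math>\,\Sigma</math> (so <math>\,U=I</math> and <math>\,S=\Sigma</math>; the values are illustrative): after whitening, the Mahalanobis distance in the original space equals the plain Euclidean distance in the transformed space.<br />

```python
import math

S = (4.0, 1.0)   # eigenvalues of the diagonal Sigma = diag(4, 1)

def whiten(x):
    """x* = S^(-1/2) U^T x, with U = I for a diagonal Sigma."""
    return tuple(xi / math.sqrt(si) for xi, si in zip(x, S))

def mahalanobis_sq(x, mu):
    """(x - mu)^T Sigma^(-1) (x - mu) for the diagonal Sigma."""
    return sum((xi - mi)**2 / si for xi, mi, si in zip(x, mu, S))

def euclid_sq(a, b):
    return sum((ai - bi)**2 for ai, bi in zip(a, b))

x, mu = (2.0, 3.0), (0.0, 1.0)   # a point and a class mean
```

Both distances come out to 5 for this pair, confirming that Case 1 applies after the transformation.<br />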
<br />
Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose you have two classes with different shapes and you want to transform them to the same shape before deciding which class a given data point belongs to. Which transformation should you use? If, for example, you use the transformation of class A, then you have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need the <math>\,K-1</math> differences between one reference class and the remaining classes. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
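The two counting formulas above can be written directly (a trivial Python sketch):<br />

```python
def lda_params(K, d):
    """(K-1) linear discriminants, each with d + 1 parameters."""
    return (K - 1) * (d + 1)

def qda_params(K, d):
    """(K-1) quadratic discriminants, each with d(d+3)/2 + 1 parameters."""
    return (K - 1) * (d * (d + 3) // 2 + 1)
```

For instance, with <math>\,K=2</math> classes and <math>\,d=64</math> dimensions (as in the 2_3 data), LDA needs 65 parameters while QDA needs 2145 -- which is why QDA is far less robust in high dimensions.<br />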
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of LDA's line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the <code>princomp</code> function in Matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file gives full details of this function, but here we analyze the source code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code> with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using princomp and SVD respectively, obtaining the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
We can then verify that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code>.<br />
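The same equivalence can be checked outside Matlab. Below is a NumPy sketch of the princomp pipeline (center the columns, take the SVD, use <math>\,V</math> as the coefficients); the data and variable names are illustrative:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))   # rows are observations, as princomp expects

# princomp-style PCA: center the columns, then take the SVD of the centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
coeff = Vt.T                    # principal component coefficients (Matlab's pc)
score = Xc @ coeff              # data in the principal component space (Matlab's score)
latent = s ** 2 / (len(X) - 1)  # eigenvalues of the covariance matrix of X

# The scores are just the left singular vectors scaled by the singular values.
assert np.allclose(score, U * s)
```

This confirms that projecting the centered data onto <math>\,V</math> gives exactly the scores, which is why the SVD method and princomp agree.<br />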
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the free parameters of a second symmetric <math>\,d \times d</math> covariance matrix) make QDA less robust with fewer data points.<br />
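The parameter counting is easy to verify; here is a small Python sketch (the dimension <math>\,d = 8</math> is arbitrary, and a symmetric <math>\,d \times d</math> matrix has <math>\,d(d+1)/2</math> free parameters):<br />

```python
# Two-class parameter counts in d dimensions: LDA estimates two means and one
# shared symmetric covariance matrix; QDA adds a second covariance matrix,
# i.e. another d*(d+1)//2 free parameters.
d = 8
cov_params = d * (d + 1) // 2      # free parameters of a symmetric d x d matrix
lda_params = 2 * d + cov_params
qda_params = 2 * d + 2 * cov_params
print(lda_params, qda_params)      # 52 88
```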
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^{d}</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>v</math> is a diagonal <math>d \times d</math> matrix (so only squared terms appear), that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
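As a concrete illustration of the trick, here is a small Python sketch on hypothetical 1-D toy data (not the 2_3 set); for simplicity it uses a nearest-class-mean rule on the augmented features as a stand-in for full LDA:<br />

```python
import numpy as np

# Toy 1-D data that no single threshold separates: class 1 in the middle,
# class 2 on both sides.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([2, 2, 1, 1, 1, 2, 2])

# The trick: append x^2 as a second feature, so a linear rule in (x, x^2)
# becomes a quadratic rule in x alone.
X_star = np.column_stack([x, x ** 2])

# Simple linear rule on the augmented data: assign each point to the class
# with the nearest class mean (LDA with an identity covariance).
mu1 = X_star[y == 1].mean(axis=0)
mu2 = X_star[y == 2].mean(axis=0)
d1 = np.linalg.norm(X_star - mu1, axis=1)
d2 = np.linalg.norm(X_star - mu2, axis=1)
pred = np.where(d1 < d2, 1, 2)
print((pred == y).all())   # True: the quadratic boundary separates the classes
```

In the original 1-D space no linear rule can succeed here, but in the augmented <math>(x, x^2)</math> space the classes become linearly separable.<br />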
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function (from the MASS package), given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the prior knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, with each class possibly having a different size. To separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive semi-definite matrices, and so it has an inverse provided at least one of the covariance matrices is positive definite.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
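This closed form is easy to verify numerically. The NumPy sketch below (hypothetical Gaussian data, using the class means and covariance from the examples in these notes) checks that the top eigenvector of <math>S_{w}^{-1} S_{B}</math> is parallel to <math>S_{w}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
cov = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1, 1], cov, size=200)   # class 1
X2 = rng.multivariate_normal([5, 3], cov, size=200)   # class 2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)  # within-class covariance
Sb = np.outer(mu1 - mu2, mu1 - mu2)                       # between-class covariance

# Top eigenvector of Sw^{-1} Sb ...
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = eigvecs[:, np.argmax(eigvals.real)].real
w_eig /= np.linalg.norm(w_eig)

# ... equals (up to sign) the closed-form direction Sw^{-1}(mu1 - mu2).
w_direct = np.linalg.inv(Sw) @ (mu1 - mu2)
w_direct /= np.linalg.norm(w_direct)
print(abs(w_eig @ w_direct))   # close to 1: the directions coincide
```

Since <math>S_{B}</math> is rank one, <math>S_{w}^{-1} S_{B}</math> has a single nonzero eigenvalue, and the closed form avoids the eigendecomposition entirely.<br />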
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation for each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathworks.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(With more than two classes it is reasonable to retain at least 2 directions.)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (The <math>\mathbf{S}_{W,i}</math> are left unnormalized so that the decomposition of <math>\mathbf{S}_{T}</math> below holds exactly.)<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant; since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Clearly, the two expressions have a very similar form.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
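This identity is just the squared Frobenius norm; a quick numerical check in Python (the matrix is an arbitrary example):<br />

```python
import numpy as np

# ||X||^2 = Tr(X^T X): the sum of squared entries equals the trace form.
X = np.arange(6.0).reshape(3, 2)
frob_sq = float(np.sum(X ** 2))       # 0+1+4+9+16+25 = 55
trace_form = float(np.trace(X.T @ X))
print(frob_sq, trace_form)            # 55.0 55.0
```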
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem, a Lagrange multiplier <math>\Lambda</math>, which is actually a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B})\le k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the <math>k-1</math><br />
largest eigenvalues of the eigenvalue problem<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
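The eigen-solution above is easy to exercise numerically. The course examples use Matlab; the following Python/NumPy sketch (the function name and test data are our own) builds <math>\mathbf{S}_B</math> and <math>\mathbf{S}_W</math> from labelled data and keeps the eigenvectors of <math>\mathbf{S}_W^{-1}\mathbf{S}_B</math> with the <math>k-1</math> largest eigenvalues:<br />

```python
import numpy as np

def fda_directions(X, y):
    """Return the k-1 FDA directions: top eigenvectors of S_W^{-1} S_B.

    X : (n, d) data matrix, y : (n,) integer class labels."""
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += Xc.shape[0] * diff @ diff.T         # between-class scatter
        S_W += (Xc - mu_c).T @ (Xc - mu_c)         # within-class scatter
    # eigenvectors of S_W^{-1} S_B; at most k-1 eigenvalues are nonzero
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-evals.real)
    return evecs[:, order[:len(classes) - 1]].real  # d x (k-1)
```

With <math>k</math> classes this returns a <math>d\times(k-1)</math> matrix, matching the dimension of <math>\mathbf{W}</math> above.<br />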
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(x) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
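The closed form above can be sanity-checked directly. A minimal Python/NumPy sketch (the course examples use Matlab; the data here is made up):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # 1 in first column
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed via a solve for stability
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# hat matrix H = X (X^T X)^{-1} X^T maps y to the fitted values
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
assert np.allclose(y_hat, X @ beta_hat)
```

Note that <math>\mathbf{H}</math> is a projection, so applying it twice gives the same fitted values.<br />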
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by adding a row of ones to the transposed data, giving a 3-by-400 matrix.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== Logistic function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y dx = ln(1 + e^{x})</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
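Properties 1&ndash;3 above are easy to confirm numerically; a small Python check (not part of the original notes):<br />

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 1001)
y = logistic(x)

# Property 1: dy/dx = y(1 - y), compared against a finite difference
dy = np.gradient(y, x)
assert np.allclose(dy, y * (1 - y), atol=1e-3)

# Property 2: y(0) = 1/2
assert np.isclose(logistic(0.0), 0.5)

# Property 3: the antiderivative is log(1 + e^x); check the area on [-5, 5]
area = np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))   # trapezoid rule
exact = np.log(1 + np.exp(5)) - np.log(1 + np.exp(-5))
assert np.isclose(area, exact, atol=1e-3)
```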
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood, using <math>\,P(Y|X)</math>. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, taking the log of both sides, we get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained if we first reduce the occurrences of <math>\underline{\beta}</math> to one using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiate <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right]</math>.<br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
A weighted linear regression is then performed on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, similar to linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> seems to be a suitable starting value for the Newton-Raphson iteration procedure in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. When it does not, we can only prove local convergence of the method, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem; it is uncommon for an initial point to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009),121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
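The pseudo code above translates almost line for line into code. A Python/NumPy sketch (hedged: the course uses Matlab, and <math>X</math> is stored <math>d\times n</math> to match the lecture's convention; the function name is our own). Steps 5 and 6 are combined using the simplification <math>\underline{\beta}^{new}=\underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math> derived above:<br />

```python
import numpy as np

def logistic_irls(X, y, tol=1e-8, max_iter=100):
    """Fit logistic regression by Newton-Raphson / IRLS.

    X : (d, n) input matrix (one column per observation),
    y : (n,) labels in {0, 1}.  Returns beta of shape (d,)."""
    d, n = X.shape
    beta = np.zeros(d)                            # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X.T @ beta))     # step 3: P(x_i; beta)
        W = np.diag(p * (1 - p))                  # step 4: diagonal weights
        # steps 5-6 combined: (X W X^T)^{-1} X W Z simplifies to
        # beta + (X W X^T)^{-1} X (y - p)
        beta_new = beta + np.linalg.solve(X @ W @ X.T, X @ (y - p))
        if np.linalg.norm(beta_new - beta) < tol: # step 7: convergence
            return beta_new
        beta = beta_new
    return beta
```

Note that on perfectly separable data the iterates diverge (<math>\|\underline{\beta}\|\to\infty</math>), so the sketch is meant for overlapping classes.<br />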
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far considered only the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to perform logistic regression to classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Viewing the fitting of these equations as a weighted least squares problem makes the estimates easier to derive.<br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.<br />
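The K-class posteriors above can be computed compactly. A small Python sketch (illustrative only; the coefficient vectors are made up, and class K serves as the reference class in the denominator):<br />

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """Posteriors for the K-class logistic model.

    betas : (K-1, d) matrix whose rows are beta_1 .. beta_{K-1},
    x     : (d,) input.  The last returned entry is class K."""
    scores = np.exp(betas @ x)            # exp(beta_i^T x), i = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores, 1.0) / denom

betas = np.array([[1.0, -0.5],
                  [0.2, 0.3]])            # K = 3 classes, d = 2
p = multiclass_posteriors(betas, np.array([1.0, 2.0]))
assert np.isclose(p.sum(), 1.0)           # posteriors sum to one
```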
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
A separating hyperplane classifier tries to separate the data using linear decision boundaries. When the classes overlap, the idea can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged and transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>sign(\underline{\beta}^T \underline{x} + {\beta}_0) = sign(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math><br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Due to the iterative nature of the solution, the problem has no global minimum (the objective is not convex). The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between the classes. If the classes are separable, then the algorithm is shown to converge in a finite number of steps. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br />
<br />
<br />
If the separating hyperplane between the 2 classes is not unique, the perceptron algorithm may return any one of infinitely many solutions, depending on its starting point.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line. We just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
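For comparison, here is an illustrative Python transcription of the loop above (a sketch, not part of the original notes); the only deliberate change is that a point lying exactly on the boundary is also counted as misclassified, so the final classifier is strictly correct on every training point:<br />

```python
import numpy as np

# Training data from the table above: rows are (x1, x2, x3) with labels +1/-1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

beta = np.ones(3)   # initial guess for the weight vector
beta0 = 0.0         # initial intercept
rho = 0.5           # learning rate

for epoch in range(100):
    changed = False
    for xi, yi in zip(X, y):
        # y_i * (beta^T x_i + beta_0) <= 0 means x_i is misclassified
        # (points exactly on the boundary are counted as misclassified).
        if yi * (beta @ xi + beta0) <= 0:
            beta += rho * yi * xi
            beta0 += rho * yi
            changed = True
    if not changed:   # no misclassified points left: the algorithm has converged
        break

predictions = np.sign(X @ beta + beta0)
print(predictions)    # agrees with y once the algorithm has converged
```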
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> takes the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\underline{\beta}</math> to be normalized so that <math>\|\underline{\beta}\|=1</math>; in general the distance is scaled by <math>1/\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step of predetermined size in the direction of the negative gradient (the direction of steepest descent), getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, the algorithm converges quickly; if the gap is small, it converges slowly. This problem can be mitigated by using a basis expansion technique: instead of searching for a separating hyperplane in the original space, we search in an enlarged space obtained by applying basis functions to the inputs.<br />
#If the classes are separable, there exist infinitely many solutions to the Perceptron problem, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if the value is too large the algorithm may "skip over" the minimum it is trying to find and oscillate indefinitely between points on either side of it.<br />
#A perfect separation is not always achievable, or even desirable. If observations from different classes share the same input, a model that separates the training data perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref>Christopher M. Bishop, Pattern Recognition and Machine Learning (Springer 2006), 194.</ref>.<br />
====Comment on gradient descent algorithm====<br />
Imagine standing on a mountain peak and wanting to get down as fast as possible. Which direction should you step? Intuitively it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and you start in the middle, you may end up at the saddle point, where the gradient is zero even though it is not a true minimum, and get stuck there.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we dropped the summation over <math>\,i</math> (all data points). This is actually an alternative to the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, where we approximate the true gradient by evaluating it on a single training example. This means that <math>\,{\beta}</math> is improved using the computation for only one sample. When there is a very large data set, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent, we can process the data sample by sample and still get decent results in practice.<br />
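As a toy illustration of this point (hypothetical, not from the lecture), consider finding the value <math>\,\mu</math> that minimizes <math>\sum_i (y_i-\mu)^2</math>, whose minimizer is the sample mean. A stochastic gradient step uses one sample at a time instead of summing over all of them:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(5.0, 1.0, size=10_000)   # a large "population" of samples

# Batch gradient descent would sum -2*(y_i - mu) over all 10,000 samples per step.
# Stochastic gradient descent instead takes one step per single sample:
mu = 0.0
for t, yi in enumerate(y, start=1):
    rho = 1.0 / t            # decreasing learning rate, as for on-line learning
    mu += rho * (yi - mu)    # step along the negative gradient for ONE sample

print(mu, y.mean())          # with rho = 1/t this recovers the sample mean exactly
```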
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <ref>R. Hecht-Nielsen, Theory of the Backpropagation Neural Network.</ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network, but it can take many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer, one per class, each representing the probability of that class, and each <math>\displaystyle y_k</math> is coded as 0 or 1.<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates: it has a maximum and a minimum output value. This property ensures that the weights remain bounded and therefore limits the search time. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; RBF networks, for example, use non-monotonic activation functions and are also powerful models. <br />
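A small numerical sketch (illustrative, not from the notes) of the logistic activation and the properties above: it is smooth, monotonic, and saturates at 0 and 1:<br />

```python
import numpy as np

def sigma(a):
    """Logistic (inverse-logit) activation: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def dsigma(a):
    """Its derivative has the closed form sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = sigma(a)
    return s * (1.0 - s)

a = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(sigma(a))      # saturates near 0 and 1 at the extremes; sigma(0) = 0.5
print(dsigma(0.0))   # 0.25, the maximum slope, used later by back-propagation
```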
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to multiple output units and thus turned into a classification problem.<br />
<br />
For simplicity, there is only 1 unit at the end and assume for the moment we are doing regression.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math>'s are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find the derivative of the model error with respect to the output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to the hidden weights <math>\displaystyle u_{jl}</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: the indices <math>i</math> and <math>j</math> are reversed in this figure.)''<br />
<br />
Notice that the weighted sums of the outputs of the units at layer <math>\displaystyle l</math> are the inputs into the units at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the target output and <math>\hat{y_k}</math> is the network's actual output.<br />
#By propagating to the previous layers evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math> where <math>i</math> is associated to the previous layer.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
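The derivative formulas above can be verified numerically. The sketch below (a hypothetical one-hidden-layer network with a single linear output unit; the data values are made up, and this is not code from the lecture) computes <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j z_l</math> by back-propagation and compares it against a central finite-difference gradient:<br />

```python
import numpy as np

rng = np.random.default_rng(1)

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

# One hidden layer of 3 units feeding a single linear output unit (regression).
U = rng.uniform(-1, 1, size=(3, 2))   # hidden weights u_{jl}
w = rng.uniform(-1, 1, size=3)        # output weights w_i
x = np.array([0.5, -1.2])             # one training point
y = 0.7                               # its target value

def err(U, w):
    return (y - w @ sigma(U @ x)) ** 2

# Back-propagation, following the steps above.
a = U @ x                              # hidden pre-activations a_j
z = sigma(a)                           # hidden outputs z_j
yhat = w @ z                           # network output
d_out = -2.0 * (y - yhat)              # derivative of err w.r.t. yhat
grad_w = d_out * z                     # d err / d w_i
delta = sigma(a) * (1 - sigma(a)) * w * d_out   # delta_j = sigma'(a_j) * delta_out * w_j
grad_U = np.outer(delta, x)            # d err / d u_{jl} = delta_j * z_l (here z_l = x_l)

# Numerical check with central finite differences.
eps = 1e-6
num_U = np.zeros_like(U)
for j in range(3):
    for l in range(2):
        Up, Um = U.copy(), U.copy()
        Up[j, l] += eps
        Um[j, l] -= eps
        num_U[j, l] = (err(Up, w) - err(Um, w)) / (2 * eps)

print(np.max(np.abs(grad_U - num_U)))  # tiny: back-propagation matches the true gradient
```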
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method of choosing weights mentioned in class is to randomize the weights before the first step. This is not likely to be near the optimal solution in every case, but is simple to implement. To be more specific, random values near zero (usually from [-1,1]) are a good choice for the initial weights. In this case, the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output by using a linear approximation of <math>\,\sigma(a_i)</math>, which finds optimal weights in the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the iterations increase.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate may lead to a very slow convergence rate (a very long learning phase). The advantage of a small learning rate, however, is that it can guarantee convergence. Thus it is generally better to choose a relatively small learning rate to ensure stability; usually <math>\,\rho</math> is chosen between 0.01 and 0.7.<br />
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we mainly discuss how to estimate the number of hidden units at the very beginning. Naturally, we should then adjust it to be more precise using cross-validation (CV), leave-one-out (LOO) or other complexity-control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from some highly complicated mixture model, more hidden units are needed. <br />
<br />
Actually, the number of hidden units determines the size of the model, and therefore the total number of weights in the model. Generally speaking, the number of weights should not be larger than the number of training points, say N; thus N/10 is sometimes a good choice. In practice, however, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this Neural Network, the number of nodes is reduced until we reach a layer whose number of nodes matches the desired dimensionality. (In the first few layers the number of nodes need not be strictly decreasing, as long as the network eventually reaches a layer with fewer nodes.) From this middle layer onward, the previous layers are mirrored, so the output layer has the same number of units as the input layer. Now note that if we feed the network with each point and get an output approximately equal to the fed input, then the input has been reconstructed at the output from the middle-layer units alone. So the output of the middle-layer units can represent the input with fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
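A minimal numerical sketch of this idea (illustrative and deliberately simplified: linear units and a single bottleneck unit trained by gradient descent on the reconstruction error; all names and values are made up):<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# 50 points in 3-D that lie near a 1-D line, so a 3-1-3 bottleneck network
# should be able to reconstruct them from the single middle unit.
t = rng.uniform(-1, 1, size=(50, 1))
X = t @ np.array([[2.0, -1.0, 0.5]]) + 0.01 * rng.normal(size=(50, 3))

W_enc = rng.uniform(-0.5, 0.5, size=(3, 1))   # input layer -> bottleneck (middle) layer
W_dec = rng.uniform(-0.5, 0.5, size=(1, 3))   # bottleneck -> mirrored output layer
rho = 0.05                                    # learning rate

def recon_error():
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

before = recon_error()
for epoch in range(500):
    Z = X @ W_enc                     # low-dimensional codes (middle-layer outputs)
    R = Z @ W_dec                     # reconstruction at the output layer
    G = 2.0 * (R - X) / len(X)        # gradient of the mean squared error w.r.t. R
    grad_dec = Z.T @ G                # back-propagate to the decoder weights...
    grad_enc = X.T @ (G @ W_dec.T)    # ...and then to the encoder weights
    W_dec -= rho * grad_dec
    W_enc -= rho * grad_enc

print(before, recon_error())          # the reconstruction error drops during training
```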
<br />
=== Deep Neural Network ===<br />
Back-propagation may not work well in practice when there are too many hidden layers, since the <math>\,\delta</math>s may become negligible and the errors vanish as they are propagated backward. This is a numerical problem in which the errors become difficult to estimate, so in practice configuring a Neural Network with back-propagation involves some subtleties.<br />
Deep Neural Networks became popular a few years ago, following work by Geoffrey Hinton and his collaborators on training networks with many layers. Deep Neural Network training algorithms deal with training a Neural Network that has a large number of layers.<br />
<br />
The approach to training the deep network is to first assume the network has only two layers and train those two layers; after that we train the next two layers, and so on.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, an idea inspired by statistical physics, where a system settles into its most stable, lowest-energy state; or finding the output of the second layer that is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case as an example. A feed-forward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach could adapt as the booking rules changed, and the system is used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to be modeling human brains, hence the fancy name "Neural Network". But now we know that they are just logistic regression layers stacked on top of each other and have nothing to do with how the brain actually functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. As a result of these kinds of claims it is important to keep the right perspective on what this field of study is trying to accomplish: the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", which makes them hard to apply and tune wisely. Partly for this reason, they are no longer a particularly active research area in machine learning, though NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In contrast, the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model will have very high precision on the training set but will show very poor ability to predict outcomes for new instances, especially outside the domain of the training set.<br />
<br />
In a Neural Network, if the network is too deep it will have many degrees of freedom and will learn every characteristic of the training data set. That means it will be very accurate on the training set but will not be able to generalize the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model which fits it best. We can find a polynomial of high degree which passes through almost all the points in the training set. But in fact the training set comes from a line model. Although the complex model has less error on the training set, it diverges from the line in ranges where we have no training points. Because of this, the high-degree polynomial has very poor predictive results on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes from yellowness alone that it is a banana, then it is not a good classifier: just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas; but if presented with a new banana of slightly different shape than the existing ones, the model cannot detect it. This is the tradeoff: what is the right level of complexity?<br />
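The first example above can be reproduced numerically (an illustrative sketch; the data and degrees are made up): a high-degree polynomial nearly interpolates the noisy training points but extrapolates badly, while the simple line generalizes well.<br />

```python
import numpy as np

rng = np.random.default_rng(3)

# Training data actually comes from a LINE, y = 2x + 1, plus a little noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + 1 + 0.1 * rng.normal(size=10)
x_test = np.linspace(0, 1.5, 50)   # includes points outside the training range
y_test = 2 * x_test + 1

def fit_and_errors(degree):
    """Fit a polynomial of the given degree; return (training MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

print("degree 1:", fit_and_errors(1))
print("degree 9:", fit_and_errors(9))
# The degree-9 fit has (almost) zero training error but a far larger test error
# than the simple line: it has overfit the training set.
```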
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data. For example, we may assume the model is a polynomial of a given degree, or a neural network with a fixed architecture; there are other possibilities as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to each of <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and average the misclassifications to obtain an empirical estimate of the probability that <math>h(x_{i}) \neq y_{i}</math>.<br />
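As a concrete illustration (the classifier and data below are made up), the empirical error rate is just the fraction of misclassified training points:<br />

```python
import numpy as np

def h(x):
    """A hypothetical classifier: label +1 when x >= 0.5, otherwise -1."""
    return 1 if x >= 0.5 else -1

x = np.array([0.1, 0.4, 0.6, 0.9, 0.3, 0.7])
y = np.array([-1, 1, 1, 1, -1, -1])

# L_h = (1/n) * sum_i I(h(x_i) != y_i)
L_h = np.mean([h(xi) != yi for xi, yi in zip(x, y)])
print(L_h)   # 2 of the 6 points are misclassified here, so 1/3
```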
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
This estimate is biased downward, meaning that on average it is less than the true error rate when computed on the training data. <br />
<br />
As model complexity increases from low to high, the training error rate always decreases. When we apply the model to test data, the error rate first decreases to a point but then increases, since the model has not seen that data before. This happens because the training error decreases as we fit the model better by increasing its complexity, but as we have seen, an overly complex model does not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
The right complexity is defined as the point where the error rate on the test data is at its minimum; this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number, <math>f\in \mathbb{R}</math>. To estimate it, we use an estimator, which is a<br />
function of our observations: <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of an estimator is that it be correct on average, that is, unbiased: <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, a more important criterion for an estimator than unbiasedness is its mean squared error. In statistics there are problems for which it may be good to use an estimator with a small bias: an estimator with a small bias may have a smaller mean squared error, or may be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). Median-unbiasedness is invariant under monotone transformations, while mean-unbiasedness may be lost under nonlinear transformations. For example, using an unbiased estimator with a large mean squared error to estimate the parameter risks a large error, whereas a slightly biased estimator with a small mean squared error can improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a fixed MSE, lower bias implies higher variance and vice versa; this is the bias-variance tradeoff.<br />
<br />
The test error is a good estimate of the MSE. We want a balance between bias and variance (neither too high), even though the resulting estimator will then carry some bias.<br />
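The decomposition <math>MSE = Variance + Bias^2</math> can be verified numerically; a minimal Monte Carlo sketch in Python (the shrunken-mean estimator is a made-up example, chosen only because it is deliberately biased):<br />

```python
import random

random.seed(0)
true_mean = 2.0            # the parameter f we want to estimate
n, trials = 20, 20000

# Biased estimator: shrink the sample mean toward zero by a factor 0.8
estimates = []
for _ in range(trials):
    sample = [random.gauss(true_mean, 1.0) for _ in range(n)]
    estimates.append(0.8 * sum(sample) / n)

mean_est = sum(estimates) / trials
bias = mean_est - true_mean                                   # E(f_hat) - f
variance = sum((e - mean_est) ** 2 for e in estimates) / trials
mse = sum((e - true_mean) ** 2 for e in estimates) / trials

# MSE = Variance + Bias^2 holds exactly for these sample quantities
print(bias, variance, mse, variance + bias ** 2)
```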
<br />
<br />
Referring to Figure 2, overfitting happens after the point where the training error (training sample line) keeps decreasing while the test error (test sample line) starts to increase. There are two main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
> x <- rnorm(200,0,1)<br />
> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
> xtest <- rnorm(50,1,1)<br />
> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
> p1 <- lm(y~x)<br />
> p2 <- lm(y ~ poly(x,2))<br />
> pn <- lm(y ~ poly(x,10))<br />
> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are drawn from a <math>\,N(0,1)</math> distribution, while the test set has a <math>\,N(1,1)</math> distribution. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + N(0,0.3)</math>, a quadratic function with some random noise. Polynomial least-squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
> # calculate the mean squared error of the degree-1 polynomial<br />
> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
[1] 1.576042<br />
> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 7.727615<br />
: Training and test mean squared errors for the linear fit. These are both quite high; since the data is non-linear, the different mean value of the test data increases the error considerably.<br />
> # calculate the mean squared error of the degree-2 polynomial<br />
> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
[1] 0.08608467<br />
> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 0.08407432<br />
: This fit is far better - and there is not much difference between the training and test error, either.<br />
> # calculate the mean squared error of the degree-10 polynomial<br />
> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
[1] 0.07967558<br />
> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but not by much; the test-set error, however, has risen again. The overfitting makes it a poor predictor. As the degree of the polynomial rises further, numerical precision becomes an issue, and a good fit is not even consistently produced for the training data.<br />
> # calculate the mean squared error of the sin/cos fit<br />
> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
[1] 0.1105446<br />
> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 1.320404<br />
: Fitting a function of the form <math>\,\sin(x)+\cos(x)</math> works fairly well on the training set, but because it is not the true underlying function, it fails on test data drawn from a different region of the input space.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V: since the model has not seen V, and since we know the true label of every element in V, we can count how many were misclassified<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although training error always decreases with increasing model complexity, the test error starts to increase from a certain point, which is known as overfitting (see [[#prediction-error|figure 2]] above). Since the test error best estimates the MSE (mean squared error), the idea is to divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{X_i \in \nu} I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set and <math>\,I</math> is the indicator function.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. However, in reality, data may be hard to collect (and we often suffer from the curse of dimensionality), so a larger data set may be hard to come by. Consequently, we may not be able to afford to set aside part of a limited data set. In this case we use another method that addresses this problem, K-fold cross-validation. We divide the data set into <math>\,K</math> subsets of roughly equal size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
if <math>\,K=n</math> (leave-one-out): low bias, high variance. Each subset contains a single element, so the model is trained with all points except one, and then validated using that point.<br />
<br />
if <math>\,K</math> is small (say 2-fold or 5-fold): high bias, low variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th part <math>( \,k \in [ 1, K ] )</math>, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>. The overall cross-validation estimate is then<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. We use the first three subsets as the training set and the last as the validation set; then the 1st, 2nd, and 4th subsets as the training set and the 3rd as the validation set; and so on, until each subset has served as the validation set once (so every observation is used for both training and validation). After obtaining <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for a degree-n model and plot the error against the degree. We then choose the degree corresponding to the minimum error. The same method can be used to find the optimal number of hidden units in a neural network: we try 1 unit, then 2, 3, and so on, and pick the number of hidden units with the lowest average error.<br />
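The procedure above can be sketched as follows, in Python with numpy for illustration (mirroring the quadratic data of the earlier R example; <code>numpy.polyfit</code> plays the role of <code>lm</code> with <code>poly</code>, and the choice of four folds follows the figure):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 200)
y = x**2 - 0.5 * x + rng.normal(0, 0.3, 200)   # same model as the R example

def kfold_mse(x, y, degree, K=4):
    """Average validation MSE over K folds for a polynomial of a given degree."""
    idx = np.arange(len(x))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(x[train], y[train], degree)   # fit on K-1 parts
        pred = np.polyval(coeffs, x[val])                 # validate on the k-th
        errors.append(np.mean((y[val] - pred) ** 2))
    return np.mean(errors)

# Choose the degree with the smallest cross-validated error
cv_errors = {d: kfold_mse(x, y, d) for d in range(1, 6)}
best = min(cv_errors, key=cv_errors.get)
print(best, cv_errors[best])
```

On this data the quadratic fit should beat the linear one by a wide margin, reproducing the pattern seen in the R example.<br />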
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be denoted by <math>\mathbf{y}</math> and the vector of fitted values by <math>\hat{\mathbf{y}}</math>. Then<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
For such a linear fit, the leave-one-out prediction error satisfies the identity<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}]^{2}</math>,<br />
<br />
where <math>\hat f^{-i}</math> denotes the fit computed with the <math>i</math>th observation left out.<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that the trace of <math>\mathbf{H}</math> is often easier to compute than its individual diagonal elements <math>\mathbf{H}_{ii}</math>.<br />
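Both the leave-one-out identity and the GCV approximation can be checked numerically for ordinary least squares; a minimal sketch in Python with numpy (the toy data is an assumption for the example):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # design matrix with intercept
y = 1.0 + 2.0 * X[:, 1] + rng.normal(0, 0.5, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
yhat = H @ y

# Shortcut form: (1/N) * sum( ((y_i - yhat_i) / (1 - H_ii))^2 )
h_ii = np.diag(H)
loo_shortcut = np.mean(((y - yhat) / (1 - h_ii)) ** 2)

# Brute force: refit the model with each point left out in turn
loo_brute = 0.0
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    loo_brute += (y[i] - X[i] @ beta) ** 2
loo_brute /= n

# GCV replaces each H_ii by trace(H)/N; here trace(H) = number of parameters
gcv = np.mean(((y - yhat) / (1 - np.trace(H) / n)) ** 2)
print(loo_shortcut, loo_brute, gcv)
```

The shortcut and the brute-force leave-one-out errors agree exactly (up to floating point), while GCV gives a close approximation without needing the individual <math>\mathbf{H}_{ii}</math>.<br />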
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point in the original training data set to train the model, then using the data point that was left out to estimate the true error. By repeating this process for every data point in the original data set, we obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every possible value of the parameter, we must train the model <math>\,K</math> times. This deficiency is even more pronounced in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, where <math>\,n</math> is the number of data points in the data set.<br />
<br />
Fortunately, when adding data points to the classifier is reversible, calculating the difference between two classifiers is computationally cheaper than calculating the two classifiers separately. So, if the classifier trained on all the data points is known, we can simply undo the change contributed by each data point in turn (<math>\,n</math> times) to calculate the leave-one-out cross-validation error rate.<br />
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight-decay training is suggested as a way of achieving a robust neural network that is insensitive to noise. Since the number of hidden units in a neural network is usually decided by domain knowledge, the model can easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function is approximately linear, and the neural network then collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights close to zero.<br />
<br />
Formally, we penalize nonlinear weights by adding a penalty term in the error function. Now the regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
Usually, too large a <math>\,\lambda</math> will make the weights <math>\,w_i</math> and <math>\,u_{jk}</math> too small. We can use cross-validation to estimate <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the degree of linearity; in practice we may set <math>\,\lambda</math> by cross-validation. The choice of starting weights also matters: weights of exactly zero lead to zero derivatives, so the algorithm does not move, while starting with weights that are too large means starting with a highly nonlinear model, which often leads to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
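The update rule above can be sketched in a few lines; a minimal illustration in Python (the scalar weight and the stand-in gradient <code>grad_err</code> are hypothetical, just to show the shrinkage effect of the penalty):<br />

```python
def weight_decay_step(w, grad_err, lam, rho):
    """One gradient step on REG = err + lam * w^2 for a single weight:
    w_new = w - rho * (d err / d w + 2 * lam * w)."""
    return w - rho * (grad_err + 2 * lam * w)

# With a zero data gradient, the penalty term alone shrinks the weight
# toward zero by a constant factor (1 - 2 * rho * lam) each step.
w = 1.0
for _ in range(10):
    w = weight_decay_step(w, grad_err=0.0, lam=0.1, rho=0.5)
print(w)  # (1 - 0.1)^10 = 0.9^10 ~ 0.3487
```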
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with an output layer and a single hidden layer, with weights from the hidden layer to the output layer; it can be trained without back-propagation since it has a closed-form solution. The neurons in the hidden layer contain basis functions. One widely used choice is radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensionality as the input data. The radii of the RBFs may differ. The centers and radii can be determined through clustering or an EM algorithm. When the input vector <math>\,x</math> is presented to the input layer, each hidden neuron computes the radial distance from its center point and then applies the RBF to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
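This closed-form training can be sketched directly; a minimal illustration in Python with numpy (the toy 1-D data, the evenly spaced centers, and the width <code>sigma</code> are assumptions for the example; in general the centers could come from clustering, as noted above):<br />

```python
import numpy as np

def rbf_design(x, centers, sigma):
    """Phi[i, j] = exp(-||x_i - mu_j||^2 / (2 sigma^2)), the Gaussian basis."""
    d2 = (x[:, None] - centers[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 100)
y = np.sin(x) + rng.normal(0, 0.05, 100)     # toy 1-D regression target

centers = np.linspace(-3, 3, 10)             # hypothetical choice of centers
Phi = rbf_design(x, centers, sigma=0.8)

# Closed-form weights: W = (Phi^T Phi)^{-1} Phi^T Y
W = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
yhat = Phi @ W                               # equivalently H @ y
print(np.mean((y - yhat) ** 2))              # small training error
```

There is no iterative optimization here: one linear solve gives the weights, which is exactly the advantage over back-propagation noted above.<br />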
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the matrix (n by k) of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the matrix (n by M+1) of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the matrix (M+1 by k) of weights.<br />
<br />
where the extra basis function <math>\phi_{0}</math> is set to 1.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function. Giving the familiar form,<br /><br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit, then, can be thought of as representing a feature; if there are more hidden units than input units, we can essentially project to a higher-dimensional space, as we did in our earlier trick. However, this does not mean that an RBF network will actually do this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Nevertheless, precisely because of this power, the problem of overfitting becomes more important: we have to control the complexity so that the network fits a general model rather than an arbitrary training set.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can replace the class-conditional density in the above conditional probability expression by marginalizing over the hidden variable <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
Using an RBF network to do classification, we usually treat it as a regression problem: we set a threshold to decide the class membership of the data. However, to gain insight into what the RBF network is doing in classification, we often think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as shown in the graph on the right-hand side, that we have three random variables: <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that there are different classes, and each class can trigger different values of the hidden variable <math>\displaystyle j</math>. To understand this, we can assume, for instance, that the data generated by each hidden component is Gaussian (it could be any other distribution as well), with the same family of distribution for every component but different parameters. From the Gaussian triggered by each class, we sample some data points. Therefore, in the end, we get a set of data which is not strictly Gaussian, but is actually a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
Inside the summation, we multiply each term by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})}{Pr(x)}\sum_{j} Pr(x | j)*Pr(j | y_{k})*\frac {Pr(j)}{Pr(j)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, this is just a sum over <math>\displaystyle j</math> of the product of the two posteriors.<br />
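This identity can be checked numerically on a toy discrete mixture; a minimal sketch in Python, where every probability table is a made-up number chosen only for illustration (the identity requires, as in the derivation above, that <math>\displaystyle x</math> depends on <math>\displaystyle y_k</math> only through <math>\displaystyle j</math>):<br />

```python
# Toy discrete mixture: 2 classes y_k, 2 hidden components j, 3 values of x.
# All probability tables below are made-up numbers for illustration.
p_y = [0.4, 0.6]                        # Pr(y_k)
p_j_given_y = [[0.7, 0.3],              # Pr(j | y_0)
               [0.2, 0.8]]              # Pr(j | y_1)
p_x_given_j = [[0.5, 0.3, 0.2],         # Pr(x | j=0)
               [0.1, 0.4, 0.5]]         # Pr(x | j=1)

def posterior_direct(k, x):
    """Pr(y_k | x) via Bayes' rule with Pr(x|y_k) = sum_j Pr(x|j) Pr(j|y_k)."""
    def lik(kk):
        return sum(p_x_given_j[j][x] * p_j_given_y[kk][j] for j in range(2))
    num = lik(k) * p_y[k]
    den = sum(lik(kk) * p_y[kk] for kk in range(2))
    return num / den

def posterior_rbf_form(k, x):
    """Pr(y_k | x) = sum_j Pr(j | x) * Pr(y_k | j), the RBF-style form."""
    p_j = [sum(p_j_given_y[kk][j] * p_y[kk] for kk in range(2)) for j in range(2)]
    p_x = sum(p_x_given_j[j][x] * p_j[j] for j in range(2))
    p_j_given_x = [p_x_given_j[j][x] * p_j[j] / p_x for j in range(2)]
    p_y_given_j = [[p_j_given_y[kk][j] * p_y[kk] / p_j[j] for j in range(2)]
                   for kk in range(2)]
    return sum(p_j_given_x[j] * p_y_given_j[k][j] for j in range(2))

print(posterior_direct(0, 1), posterior_rbf_form(0, 1))  # identical values
```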
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results derived above to our RBF network. In an RBF network, as we can see on the right-hand side, we have a set of inputs, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, the hidden basis functions, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and some outputs, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. We also have weights from the hidden layer to the output layer. The output is just a linear sum of the <math>\displaystyle \phi</math>'s. <br />
<br />
Now, if we take the probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi_{j}(x)</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weight <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in the one-dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, where <math>\displaystyle j</math> ranges from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It is as if we place a Gaussian over the data, and for each Gaussian we consider its center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
We can see the graph on the left, which plots the density of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance: as a point moves far from the center <math>\displaystyle \mu_{1}</math>, <math>\displaystyle \phi_{1}</math> drops to nearly zero. Recall that we can usually obtain a non-linear regression or classification of the input space by performing a linear one in some extended space or feature space (more details in the Aside). Here, the <math>\displaystyle \phi</math>’s actually produce that feature space. <br />
<br />
So, one way to look at this is that <math>\displaystyle \phi</math> tells us, given an input, how likely a particular feature is to be present. Say, for example, we define the features as the centers of these Gaussian distributions. Then, the <math>\displaystyle \phi</math> function computes, for a given data point, the probability of that feature appearing. If the data point is right at the center, then the value of that <math>\displaystyle \phi</math> is one, i.e. the probability is 1. If the point is far from the center, then the probability (the <math>\displaystyle \phi</math> function value) is close to zero, that is, it is less likely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given the data. <br />
<br />
When we have those features, <math>\displaystyle y</math> is a linear combination of them. Hence, each weight <math>\displaystyle w</math>, which is equal to <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> is to appear given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> represents the probability of class membership given the feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>, each with d features. These n data points form the columns of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we do not perform least squares on this <math>\displaystyle X</math> matrix; we perform least squares on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point with its center. Hence, the new set of features are actually representing M centers in our data set, and for each data point, its new features check how this point is similar to the first center; how it is similar to the second center; and how it is similar to the <math>\displaystyle M^{th}</math> center. And this checking process will apply to all data points. Therefore, feature space gives another representation of our data set. <br />
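As a small sketch of this construction (the data points, centers, and width are arbitrary toy values), the <math>\displaystyle \Phi^{T}</math> matrix can be built directly from the Gaussian basis functions:

```python
import math

# Build the M x n feature matrix Phi^T described above.
# Each row j holds phi_j applied to every data point.

data = [0.0, 1.0, 2.0, 5.0]        # n = 4 one-dimensional points
centers = [0.0, 5.0]               # M = 2 centers mu_j
sigma = 1.0

def phi(x, mu):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

Phi_T = [[phi(x, mu) for x in data] for mu in centers]

# The first column re-describes x_1 = 0.0 by its similarity to each center:
# it sits exactly at the first center (phi = 1) and far from the second (phi ~ 0).
first_col = [row[0] for row in Phi_T]
print(first_col)
```

Each column of <math>\displaystyle \Phi^{T}</math> is the new M-dimensional representation of one data point, exactly as described in the text.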
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit to the training data, we often want to increase the complexity of our RBF Network. By construction, the only way to change the complexity of an RBF Network is to add or remove basis functions; a larger number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF Network can fit any training set exactly; however, this does not mean the model generalizes well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is analyze the training error. Working through the training error, we will see that it can in fact be decomposed, and one component of the decomposition is the Mean Squared Error (MSE). In the notes that follow, our final goal is to obtain a good estimate of the MSE, and to find an optimal model for our data we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex is not always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>~N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model only works for the training data, and is useless for prediction when new data points are introduced.<br />
<br />
> n<-20;<br />
> x<-seq(1,10,length=n);<br />
> alpha<-2.5;<br />
> beta<-1.75;<br />
> y<-alpha+beta*x+rnorm(n);<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');<br />
> lines(spline(x, y), col = 2);<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection (Stein's Unbiased Risk Estimate) - November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on training data, although our goal really is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
But training error and testing error do not have a simple linear relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error can increase dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to the training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, the model does not represent the features of the true underlying data source, and this is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model typically adapts to the training data, and hence the training error will be an overly optimistic estimate of the testing error. An obvious way to estimate the testing error well is to add a penalty term to the training error to compensate. SURE is developed based on this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the sum of squared errors over the <math>\displaystyle N</math> training points.<br />
*<math>\displaystyle Err=\sum_{i=1}^M (\hat y_i-y_i)^2 </math> denote the ''test error'', the sum of squared errors over an independent test sample of <math>\displaystyle M</math> points.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations are <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}^N_{i=1}</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>, then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, where <math>\displaystyle y_i</math> and <math>\hat f_i</math> both have the same mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E(g'(Z))</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>:<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model, not the function of the observations <math>\displaystyle y_i</math>, then <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
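Stein's lemma itself is easy to check by simulation. The sketch below uses <math>g(z)=z^2</math>, so <math>g'(z)=2z</math> and both sides of the lemma should approach <math>2\mu\sigma^2</math>; the parameters and sample size are arbitrary choices:

```python
import random

# Monte Carlo sanity check of Stein's lemma with g(z) = z^2:
# E[g(Z)(Z - mu)] should equal sigma^2 * E[g'(Z)] = 2 * mu * sigma^2.
# Parameters and sample size are arbitrary toy values.

random.seed(0)
mu, sigma = 1.0, 0.5
n = 200_000
zs = [random.gauss(mu, sigma) for _ in range(n)]

lhs = sum(z * z * (z - mu) for z in zs) / n          # E[g(Z)(Z - mu)]
rhs = sigma ** 2 * sum(2 * z for z in zs) / n        # sigma^2 * E[g'(Z)]

print(round(lhs, 3), round(rhs, 3))                  # both approach 2*mu*sigma^2 = 0.5
```

With these parameters the exact value is <math>2\mu\sigma^2 = 0.5</math>, and both Monte Carlo estimates land close to it.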
<br />
<br />
<br />
====Two Different Cases====<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is used to assess the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent. Therefore <math>\displaystyle cov(y_i,\hat f)=0</math> (or think about <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, it has nothing to do with <math>\hat f</math>, because <math>\hat f</math> is estimated from the training data, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>), and <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Based on the notation defined above, we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross validation, to avoid overfitting or underfitting, a validation data set independent of the estimated model is used.<br />
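The Case 1 identity can be checked by a quick Monte Carlo sketch, using a toy truth <math>f(x)=2x</math> and a fixed, slightly wrong estimate <math>\hat f(x)=2.1x</math> that is independent of the validation noise (all values are invented for illustration):

```python
import random

# Monte Carlo check of E[Err] = sum_i (f_hat_i - f_i)^2 + m * sigma^2 (Case 1).
# f and f_hat are fixed toy functions; f_hat does not depend on the validation y's.

random.seed(1)

def f(x):
    return 2.0 * x            # true model

def f_hat(x):
    return 2.1 * x            # fixed, slightly-off estimate

xs = [0.0, 1.0, 2.0, 3.0, 4.0]   # m = 5 validation inputs
sigma = 0.3
trials = 50_000

avg_err = 0.0
for _ in range(trials):
    ys = [f(x) + random.gauss(0.0, sigma) for x in xs]
    avg_err += sum((f_hat(x) - y) ** 2 for x, y in zip(xs, ys)) / trials

mse_term = sum((f_hat(x) - f(x)) ** 2 for x in xs)   # sum_i (f_hat_i - f_i)^2
print(round(avg_err, 3), round(mse_term + len(xs) * sigma ** 2, 3))
```

Here <math>\sum_i(\hat f_i-f_i)^2 = 0.30</math> and <math>m\sigma^2 = 0.45</math>, so the average validation error approaches 0.75, matching the identity.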
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing the model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored, because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can instead be estimated by Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, this is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)], an unbiased estimator of the mean-squared error of a given estimator in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the optimal number of basis functions should be the one that minimizes the estimated <math>\displaystyle MSE</math>. For the Radial Basis Function Network, setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math> <math>\displaystyle (4)</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, we can tell that <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Thus, we obtain the further simplification <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=d</math>, where <math>\displaystyle d</math> is the number of columns of <math>\displaystyle \Phi</math>, i.e. <math>\,M</math>, the number of basis functions, since <math>\displaystyle \Phi</math> is a projection of the input matrix <math>\,X</math> onto the space spanned by the <math>\,M</math> basis functions. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimal number of basis functions by choosing the model with the smallest MSE over the set of models considered. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions, each has a training error <math>\displaystyle err(M)</math>. <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples, and the noise variance <math>\sigma^2</math> can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Please take a look at Figure 27.1 on the right, which shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math>.<br />
<br />
err = sum((output - expected_output) .^ 2);  % total squared training error<br />
sure_Err = err - num_data_point * sigma .^ 2 + 2 * sigma .^ 2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated from the training error as <math>\hat \sigma^2 = err/(N-1)</math>.<br />
<br />
err = sum((output - expected_output) .^ 2);  % total squared training error<br />
sigma2 = err / (num_data_point - 1);         % estimate of sigma^2<br />
sure_Err = err - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) = <math>\sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) = <math>\sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units, including the intercept) = <math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then the term <math>\displaystyle N\sigma^2</math> has no impact (it is a constant),<br />
and we can ignore it. We only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and its estimate changes when <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math>.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
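Given training errors <math>\displaystyle err(M)</math> for a family of models, selecting <math>\displaystyle M</math> by (28.3) is straightforward. The <math>\displaystyle err</math> values below are hypothetical numbers that shrink as <math>\displaystyle M</math> grows, mimicking the behaviour in Figure 28.1:

```python
# Selecting the number of hidden units with (28.3): MSE = err * (2M + 1) / (N - 1).
# The training errors below are hypothetical illustrative values, not real data.

N = 100
err = {1: 80.0, 2: 40.0, 3: 22.0, 4: 15.0, 5: 13.0, 6: 12.5}

mse = {M: e * (2 * M + 1) / (N - 1) for M, e in err.items()}
best_M = min(mse, key=mse.get)
print(best_M)   # 4
```

Although the training error keeps decreasing, the penalty factor <math>\displaystyle (2M+1)</math> eventually outweighs the gain, so the estimated MSE bottoms out at an intermediate <math>\displaystyle M</math>.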
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error decreases while the MSE increases as the number of hidden units increases (i.e. as the model becomes more complex).<br />
<br />
<br />
When the number of hidden units gets larger and larger, the training error decreases until it approaches <math>\displaystyle 0 </math>. If the training error approaches <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) we would conclude that the MSE approaches <math>\displaystyle 0 </math> as well. However, in fact this does not happen, since when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the MSE increases instead of approaching <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
Note that the estimate of <math>\displaystyle \sigma^2 </math> in (28.2) is essentially an average of <math>\displaystyle err </math>, so it shrinks along with the training error as hidden units are added. To deal with this problem, we can estimate <math>\displaystyle \sigma^2</math> by averaging <math>\displaystyle err</math> over models with different numbers of hidden units, for example models with 1 hidden unit up to 10 hidden units.<br />
<br />
We can also see that, unlike the classical Cross Validation (CV) or Leave-One-Out (LOO) techniques, the SURE technique does not need a validation set to find the optimal model. Hence, the SURE technique uses less data than CV or LOO, which makes it suitable for cases where there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====K-means Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering K-means clustering] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units (basis functions <math>\displaystyle \phi </math>) is the same as the number of clusters.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and we use the same form for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. For each existing center, reidentify its cluster: every point in the cluster should be closer to this center than to any other.<br />
<br />
2. Compute the mean of each cluster and make it the new center of that cluster.<br />
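The two steps above can be sketched as a minimal one-dimensional K-means (the data and initial centers are toy values chosen so the answer is obvious):

```python
# Minimal one-dimensional K-means following the two steps above.
# Data and initial centers are toy values for illustration.

def kmeans_1d(data, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Step 1: assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            clusters[j].append(x)
        # Step 2: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([0.0, 1.0, 10.0, 11.0], [0.0, 10.0])
print(centers)   # [0.5, 10.5]
```

After one iteration the assignments stabilize and each center sits at the mean of its cluster.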
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden units)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
>> 30 1<br />
>> size(C) <br />
>> 2 80<br />
>> size(sumD) <br />
>> 2 1<br />
>> c1=sum(IDX==1)<br />
>> 14<br />
>> c2=sum(IDX==2)<br />
>> 16<br />
>> sumD<br />
>> 85.6643<br />
>> 101.0419<br />
>> v1=sumD(1,1)/c1 <br />
>> 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
>> 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points (rows) in 80 dimensions, and then apply the "kmeans" function to separate <math>X</math> into 2 clusters. IDX is a vector of 1s and 2s indicating the 2 clusters, and its size is 30*1. <math>\displaystyle C </math> holds the center (mean) of each cluster, with size 2*80; sumD is the sum of squared distances between the data points and the center of their cluster. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the numbers of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the average squared distance within the first cluster, used as its variance <math>\displaystyle (v1=\sigma_1^2)</math>; <math>\displaystyle v2 </math> is the same for the second cluster <math>\displaystyle (v2=\sigma_2^2)</math>. Now we can get <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> by the following equations. Finally, we will get the <math>\displaystyle MSE </math> and predict the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
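A small pure-Python sketch of these least-squares equations, with M = 2 Gaussian basis functions and toy data (all values are illustrative); it also confirms numerically that Trace(H) = M, as used above in the SURE derivation:

```python
import math

# Sketch of W = (Phi^T Phi)^{-1} Phi^T Y and H = Phi (Phi^T Phi)^{-1} Phi^T
# with M = 2 Gaussian basis functions.  Data, centers and width are toy values.

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
Y  = [1.0, 0.8, 0.1, -0.5, -1.0]
centers, sigma = [0.5, 3.5], 1.0

Phi = [[math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in centers] for x in xs]

# Phi^T Phi is 2 x 2; invert it directly.
a = sum(r[0] * r[0] for r in Phi)
b = sum(r[0] * r[1] for r in Phi)
d = sum(r[1] * r[1] for r in Phi)
det = a * d - b * b
inv = [[d / det, -b / det], [-b / det, a / det]]

# W = (Phi^T Phi)^{-1} Phi^T Y, and fitted values Y_hat = Phi W.
PtY = [sum(Phi[i][j] * Y[i] for i in range(5)) for j in range(2)]
W = [inv[j][0] * PtY[0] + inv[j][1] * PtY[1] for j in range(2)]
Y_hat = [sum(Phi[i][j] * W[j] for j in range(2)) for i in range(5)]

# trace(H) = trace(Phi (Phi^T Phi)^{-1} Phi^T) = M = 2, matching the text.
trace_H = sum(sum(Phi[i][j] * inv[j][k] * Phi[i][k]
                  for j in range(2) for k in range(2)) for i in range(5))
print(round(trace_H, 6))
```

The trace of the hat matrix equals the number of basis functions, which is exactly the effective number of parameters counted by the SURE penalty term.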
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is the EM algorithm for the Gaussian mixture model. Generally speaking, the Gaussian mixture model performs soft clustering while <math>K</math>-means performs hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
E-step: each point is assigned a weight for each cluster based on the likelihood under the corresponding Gaussian. Unlike in <math>K</math>-means, where a point is assigned 1 for its closest cluster and 0 for all others, these weights are soft. <br />
<br />
M-step: compute the weighted means and covariances and make them the new means and covariances of each cluster.<br />
<br />
>>[P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier. They construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized into what is known as the support vector machine. It produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite dimensional space, which can be used for classification, regression or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separable by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
We can see that the data points, which fall into two classes in <math>\displaystyle \mathbb{R}^{2} </math>, can be separated by a linear boundary. When a dataset is indeed linearly separable, there exist infinitely many possible separating hyperplanes for the training data, including the two black lines shown in the figure. However, which solution is the best when new data are introduced? <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: the signed quantity <math>\displaystyle d_i</math> is positive on the <math>\displaystyle +1 </math> side and negative on the <math>\displaystyle -1 </math> side, so for correctly classified points <math>\displaystyle y_id_i>0</math>.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>. We can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers====
Choose the line farthest from both classes, i.e., the line whose distance to the closest training point is maximal (maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since the points lie in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Consider two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane. Then:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence, <math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and <math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the unit vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
<br />
<br />
3. The signed distance of any point <math>\displaystyle x_i </math> to the hyperplane is computed as follows; since the length of <math>\displaystyle \beta </math> is not fixed, we normalize by <math>\displaystyle \|\beta\| </math> so the normal vector has unit length.<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
<math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane. Then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta^{T} </math> only the direction is important, so rescaling to <math>\displaystyle \frac{\beta^{T}}{c} </math> does not change the direction, and the hyperplane stays the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />

so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
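A small numeric check of the canonical representation (Python; the hyperplane and points are invented for illustration): when <math>\displaystyle \min_i y_i(\beta^{T}x_i+\beta_0)=1</math>, the margin equals <math>\displaystyle 1/\|\beta\|</math>.

```python
# Verify margin = 1/||beta|| under the canonical scaling.
# The hyperplane (beta, beta0) and labelled points are made-up values.
beta = [2.0, 0.0]
beta0 = 0.0
points = [([1.0, 0.3], 1), ([1.5, -1.0], 1), ([-1.0, 0.2], -1), ([-0.5, 2.0], -1)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

norm = dot(beta, beta) ** 0.5
m = min(y * (dot(beta, x) + beta0) for x, y in points)  # canonical => m == 1
margin = m / norm                                       # equals 1/||beta||
```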
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie, T., Tibshirani, R., Friedman, J. (2008). The Elements of Statistical Learning, pp. 129-130.<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
The Support Vector Machine is used to find a maximum margin hyperplane, assuming the two classes are separable. The margin can be written as <math>\,min\{y_id_i\}</math>, the smallest signed distance of a training point from the hyperplane, where <math>\,d_i</math> is the distance and <math>\,y_i</math> supplies the sign.<br />
===Maximizing the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, but <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane.<br />
<br />
This implies <math>\,\exists C</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> define the hyperplane only up to scale: since only the direction matters, dividing through by a constant does not change the hyperplane. Thus, by rescaling <math>\,\beta, \beta_0</math> we can eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>. This makes <math>\displaystyle 1</math> the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math>.<br />
<br />
Now, in order to maximize the margin <math>\,\frac{1}{\|\beta\|}</math>, we simply need to find <math>\,min\left\{\|\beta\|\right\}</math>. <br />

In other words, find the minimum <math>\,\|\beta\|</math>, s.t. <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. The 1-norm of a vector is the sum of the absolute values of its elements (also known as the taxicab or Manhattan norm); it is sometimes preferred for the sparse solutions it encourages, but its derivative is discontinuous. The 2-norm, the Euclidean norm (the intuitive length of the vector), is easier to work with: <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>, which has the same minimizer as <math>\,\|\beta\|_2</math>.<br />
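The norms in question, computed directly (Python; <math>\,\beta</math> here is just an example vector):

```python
# 1-norm (sum of absolute entries) vs 2-norm (Euclidean length) of a vector,
# plus the (1/2)||beta||_2^2 objective used in the optimization.
beta = [3.0, -4.0]
l1 = sum(abs(b) for b in beta)              # taxicab / Manhattan norm
l2 = sum(b * b for b in beta) ** 0.5        # Euclidean norm
objective = 0.5 * sum(b * b for b in beta)  # (1/2)||beta||_2^2
```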
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the constraints are satisfied while an optimal solution is found; the <math>\,\alpha_i</math> are introduced as Lagrange multipliers (dual variables). A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left(y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right)}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is the dual representation of the maximum-margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \,\forall i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
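To see the dual concretely, here is the objective evaluated in plain Python for the smallest possible data set (two 1-D points; the candidate <math>\,\alpha</math> values are illustrative, not from the lecture):

```python
# Evaluate L(alpha) = sum_i alpha_i - (1/2) sum_ij alpha_i alpha_j y_i y_j x_i x_j
# and recover beta = sum_i alpha_i y_i x_i for a tiny 1-D data set.
x = [-1.0, 1.0]
y = [-1, 1]
alpha = [0.5, 0.5]
n = len(x)

L = sum(alpha) - 0.5 * sum(alpha[i] * alpha[j] * y[i] * y[j] * x[i] * x[j]
                           for i in range(n) for j in range(n))
beta = sum(alpha[i] * y[i] * x[i] for i in range(n))
constraint = sum(alpha[i] * y[i] for i in range(n))  # must equal 0
```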
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the objective that we need to optimize, as well as two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> (equivalently, minimize <math>\,-L(\alpha)</math>)<br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We solve for <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> with entries <math>S_{ij} = y_iy_jx_i^Tx_j = (y_ix_i)^T(y_jx_j)</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's built-in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use quadprog to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's quadprog function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. Note the sign: quadprog ''minimizes'' its objective, while we want to ''maximize'' <math>\,L(\alpha)</math>; the two agree if we pass <math>\,H = S</math> and <math>\,f = -\underline{1}</math>, i.e., minimize <math>\,\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{\alpha}^T\underline{1}</math>, the negation of the objective.<br />
<br />
We'll use a simple one-dimensional data set: two clusters of points centred at <math>\,-1</math> and <math>\,+1</math> with small Gaussian noise, labelled <math>\,-1</math> and <math>\,+1</math> respectively. (Note: you could easily put the values straight into the quadprog call; they are separated for clarity.)<br />
<br />
x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]'; % 1-by-200 row of inputs
y = [-ones(100,1); ones(100,1)];                        % 200-by-1 labels
S = (y.*x') * (y.*x')';  % S(i,j) = y_i y_j x_i x_j
f = -ones(200,1);        % quadprog minimizes (1/2)a'Sa + f'a, so f = -1 maximizes L(alpha)
A = [];                  % no extra inequality constraints;
b = [];                  % alpha >= 0 is handled by the lower bound instead
Aeq = y';
beq = 0;
lb = zeros(200,1);       % lower bound must be a vector, one entry per alpha
ub = [];                 % there is no upper bound
alpha = quadprog(S,f,A,b,Aeq,beq,lb,ub);
<br />
This gives us the optimal <math>\,\alpha</math>. Two common pitfalls with this call are passing <math>\,f = +\underline{1}</math> instead of <math>\,f = -\underline{1}</math> (quadprog minimizes, so the linear term must be negated) and passing a scalar lower bound instead of a vector of zeros; either mistake can yield negative <math>\,\alpha</math> values.<br />
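For readers without the Optimization Toolbox, the same kind of solve can be mimicked in plain Python on the smallest possible example. With two points <math>\,x = -1, +1</math> and labels <math>\,y = -1, +1</math>, the equality constraint <math>\,\sum_i \alpha_i y_i = 0</math> forces <math>\,\alpha_1 = \alpha_2 = a</math>, so the dual becomes a one-variable maximization that even a coarse grid search handles. This is a sketch, not a substitute for a real QP solver.

```python
# One-variable dual for two 1-D points: alpha_1 = alpha_2 = a by the equality
# constraint, so L(a) = 2a - (1/2) a^2 sum_ij y_i y_j x_i x_j = 2a - 2a^2.
x = [-1.0, 1.0]
y = [-1, 1]

def dual(a):
    alpha = [a, a]
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * x[i] * x[j]
               for i in range(2) for j in range(2))
    return sum(alpha) - 0.5 * quad

best_a = max((i * 0.001 for i in range(2001)), key=dual)  # grid over a >= 0
beta = sum(best_a * y[i] * x[i] for i in range(2))
beta0 = y[1] - beta * x[1]  # from a support vector: y_i (beta x_i + beta_0) = 1
```

The maximizer is a = 1/2, giving beta = 1 and beta_0 = 0, so the separating hyperplane is x = 0 with margin 1, as symmetry suggests.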
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{i}</math>. If <math>\,f</math> and the <math>\,g_i</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial x}\bigg|_{\hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum{\alpha_ig_i'(\hat{x})}=0</math> (Stationarity)<br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> cannot be a local minimum.<br />

Conditions 1, 2 and 4 are straightforward; the interesting one is condition 3 (complementary slackness). Let's examine it further in our support vector machine problem.<br />
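The four conditions can be checked numerically. Below is a Python sketch using a small solution worked out by hand (<math>\,\beta = 1, \beta_0 = 0, \alpha = (1/2, 1/2)</math> for the two 1-D points <math>\,x = -1, +1</math>; these values are illustrative, not from the lecture):

```python
# Numerically verify the KKT conditions for a hand-computed SVM solution.
x = [-1.0, 1.0]
y = [-1, 1]
alpha = [0.5, 0.5]
beta, beta0 = 1.0, 0.0

g = [y[i] * (beta * x[i] + beta0) - 1 for i in range(2)]           # g_i(x)
stationary = beta - sum(alpha[i] * y[i] * x[i] for i in range(2))  # condition 1
dual_feasible = all(a >= 0 for a in alpha)                         # condition 2
comp_slack = all(alpha[i] * g[i] == 0 for i in range(2))           # condition 3
primal_feasible = all(gi >= 0 for gi in g)                         # condition 4
```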
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that actually determine the optimal separating hyperplane we are looking for. They are also the most difficult points to classify, and the most informative for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math>. <br />In order for this condition to be satisfied, either <br /><math>\,\alpha_i= 0</math> or <br /><math>\,y_i(\beta^Tx_i+\beta_0)=1</math><br />
<br />
In the canonical representation, every point <math>x_i</math> satisfies <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1</math>: it lies either exactly on the margin or farther than the margin from the hyperplane.<br />
<br />
'''Case 1: a point <math>\displaystyle x_i</math> strictly outside the margin'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) > 1 \Rightarrow \alpha_i = 0</math>.<br />
<br />
If point <math>\, x_i</math> is not on the margin, then the corresponding <math>\,\alpha_i=0</math>.<br />
<br />
'''Case 2: a point <math>\displaystyle x_i</math> exactly on the margin'''<br />
<br />
If <math>\,\alpha_i > 0 \Rightarrow y_i(\beta^Tx_i+\beta_0) = 1</math> <br />
<br />If point <math>\, x_i</math> is on the margin, then the corresponding <math>\,\alpha_i>0</math>.<br />
<br />
<br />
<br />
Points on the margin -- points with corresponding <math>\,\alpha_i > 0</math> -- are called support vectors of that margin.<br />
<br />
===Using support vectors===<br />
<br />
Support vectors are important because they make the support vector machine insensitive to points far from the boundary. If <math>\,\alpha_i = 0</math>, the corresponding point contributes nothing to <math>\,\beta = \sum_i{\alpha_iy_ix_i}</math> and so does not affect the solution of the SVM problem; only points on the margin - the support vectors - contribute.<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
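Steps 2 and 3 are straightforward once step 1 is done. A Python sketch, assuming the optimal multipliers from step 1 are already known (here <math>\,\alpha = (1/2, 1/2)</math> for a symmetric two-point data set; the numbers are illustrative):

```python
# Recover beta and beta_0 from solved Lagrange multipliers alpha.
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]
alpha = [0.5, 0.5]
d = len(X[0])

# Step 2: beta = sum_i alpha_i y_i x_i
beta = [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(d)]
# Step 3: pick a support vector (alpha_i > 0) and solve y_i (beta^T x_i + beta_0) = 1
sv = next(i for i in range(len(X)) if alpha[i] > 1e-8)
beta0 = y[sv] - sum(beta[k] * X[sv][k] for k in range(d))  # since y_i = +/-1
```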
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab built-in SVM routines (found in the Bioinformatics toolkit) to do classification through support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]</div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=5449stat8412009-11-21T19:24:23Z<p>Ipargaru: /* Examining K.K.T. conditions */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to learn from data. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
'''Definition''': The problem of predicting a discrete random variable <math> \mathcal{Y} </math> from another random variable <math> \mathcal{X} </math> is called classification.<br />
<br />
In classification, we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a '''classification rule''' <math>\,h</math> such that<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of independent and identically distributed (i.i.d.) training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>, <math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input <math>\,X \in \mathcal{X} </math>, we can use the classification rule to predict the corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X{_\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> be the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:'''True error rate''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:'''Empirical error rate (training error rate)''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
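Computed directly from this definition (Python; the classifier <math>\,h</math> and the labelled points are made up for illustration):

```python
# Empirical (training) error rate: fraction of training points h misclassifies.
def h(x):
    return 1 if x > 0 else 0

data = [(-2.0, 0), (-0.5, 1), (0.3, 1), (1.7, 1), (-1.1, 0)]
errors = sum(1 for x, y in data if h(x) != y)  # the indicator I(h(X_i) != Y_i)
rate = errors / len(data)
```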
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we find <math>\,y\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider <math>\,r(x)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
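This two-class formula is a one-liner to compute. A Python sketch (the likelihood values are made up, chosen to match the pass/fail worked example later in these notes):

```python
# r(x) = P(Y=1 | X=x) for two classes via Bayes' formula.
def posterior(lik1, lik0, prior1=0.5, prior0=0.5):
    return lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

r = posterior(lik1=0.05, lik0=0.2)  # assumed P(X=x|Y=1) and P(X=x|Y=0)
label = 1 if r > 0.5 else 0         # the Bayes classification rule
```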
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimension cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says that we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> exceeds the probability of its being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & \mathrm{if}\ P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remarks:<br />

1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />

2) We still need other methods, because in practice the prior probabilities cannot realistically be specified.<br />
<br />
<br />
'''Example''':<br /><br />
We're going to predict if a particular student will pass STAT441/841.
We have data on past student performance. For each student we know:<br />
*If the student's GPA > 3.0 (G)
*If the student had a strong math background (M)
*If the student is a hard worker (H)
*If the student passed or failed the course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0; that is, we predict that he will fail the course.<br />
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior <math>\,P(Y=1)</math>, and class conditional density <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN) and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first one treats probability as changing based on observation, while the second one treats probability as having an objective existence. In fact, they represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics there have been two major schools: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of belief. For example, a Bayesian can predict tomorrow's weather, e.g. assigning a probability of <math>\,50\%</math> to rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (object) and then judges whether his name is Jack (label). In the frequentist method, by contrast, one does not see the man (object), but judges from photos (label) of this man whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
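The theorem's rule is just an argmax over the class posteriors. A minimal sketch (Python; the posterior values are arbitrary placeholders):

```python
# h*(x) = argmax_k P(Y = k | X = x)
posteriors = {1: 0.2, 2: 0.5, 3: 0.3}  # assumed P(Y=k | X=x) for k = 1, 2, 3
h_star = max(posteriors, key=posteriors.get)
```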
<br />
====Approaches====<br />
<br />
Although it represents the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2 Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density estimation: estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the covariances <math>\Sigma_k, \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, the decision boundary is always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math><br />
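As a check on this derivation, the boundary coefficients can be computed numerically. This Python sketch (an added illustration; the class parameters are invented) reads off <math>\,a</math> and <math>\,b</math> from the last line above and verifies that with equal priors the boundary passes through the midpoint of the two means:<br />

```python
import numpy as np

# Hypothetical class parameters with a shared covariance matrix.
mu_k = np.array([0.0, 0.0])
mu_l = np.array([4.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
pi_k = pi_l = 0.5

Sinv = np.linalg.inv(Sigma)
# Boundary a.x + b = 0, read off from the final line of the derivation.
a = Sinv @ (mu_k - mu_l)
b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)

midpoint = (mu_k + mu_l) / 2
print(a @ midpoint + b)  # ~0: with equal priors the boundary lies halfway
```

Because <math>\,\pi_k=\pi_l</math> here, the log-prior term vanishes and the midpoint satisfies the boundary equation exactly, as claimed above.<br />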
<br />
===QDA===<br />
The concept is the same: find a boundary where the error rates for classification between the classes are equal, except that the assumption that each cluster has the same covariance matrix <math>\,\Sigma</math> (the mean of the <math>\Sigma_k, \forall k</math>) is removed.<br />
<br />
<br />
Continuing from the point where QDA diverges from LDA:<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because, with unequal covariances, the quadratic terms in <math>x</math> no longer cancel.<br />
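The quadratic discriminant can be evaluated directly. Below is a small Python sketch (an added illustration with invented parameters, not course code) that scores a point against two Gaussian classes with different covariances and assigns the class with the larger score:<br />

```python
import numpy as np

def delta(x, mu, Sigma, pi):
    """Quadratic discriminant score:
    -0.5*log|Sigma| - 0.5*(x-mu)' Sigma^{-1} (x-mu) + log(pi)."""
    d = x - mu
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ Sinv @ d + np.log(pi))

# Two hypothetical classes with different covariance matrices.
mu = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
Sig = [np.eye(2), np.array([[4.0, 0.0], [0.0, 0.25]])]
pi = [0.5, 0.5]

def classify(x):
    """Assign x to the class whose discriminant score is largest."""
    scores = [delta(x, mu[k], Sig[k], pi[k]) for k in range(2)]
    return int(np.argmax(scores))
```

The resulting decision boundary between the two classes is a quadratic curve in <math>x</math>, consistent with the general form <math>\,ax^2+bx+c=0</math>.<br />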
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
We need to estimate the unknown parameters, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{r=1}^{k}n_r} </math><br />
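These sample estimates are straightforward to compute. A Python sketch with synthetic data (an added illustration only):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic two-class data: 150 points of class 0, 50 of class 1.
X = np.vstack([rng.normal([0, 0], 1, (150, 2)), rng.normal([3, 1], 1, (50, 2))])
y = np.array([0] * 150 + [1] * 50)

n = len(y)
pi_hat, mu_hat, Sigma_hat = {}, {}, {}
for k in (0, 1):
    Xk = X[y == k]
    nk = len(Xk)
    pi_hat[k] = nk / n              # prior estimate n_k / n
    mu_hat[k] = Xk.mean(axis=0)     # class mean
    D = Xk - mu_hat[k]
    Sigma_hat[k] = D.T @ D / nk     # ML covariance (divide by n_k)

# Pooled common covariance: weighted average of the class estimates.
Sigma_pooled = sum(len(X[y == k]) * Sigma_hat[k] for k in (0, 1)) / n
```

The pooled estimate corresponds to the common-covariance formula above, with the class counts <math>\,n_k</math> as weights.<br />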
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
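Case 1 reduces to comparing squared Euclidean distances to the class means, shifted by the log priors. A minimal Python sketch of this rule (the means and priors are invented for illustration):<br />

```python
import numpy as np

# Hypothetical class means and priors for the identity-covariance case.
mu = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
pi = [0.8, 0.2]

def classify_identity_cov(x):
    """With Sigma_k = I, delta_k reduces to -||x - mu_k||^2 / 2 + log(pi_k);
    pick the class with the largest score."""
    scores = [-0.5 * np.sum((x - m) ** 2) + np.log(p) for m, p in zip(mu, pi)]
    return int(np.argmax(scores))
```

A point equidistant from both means is assigned to the class with the larger prior, since the distance terms cancel and only the log priors differ.<br />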
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> contains the eigenvectors of <math>\,XX^\top</math> and <math>\,V</math> contains the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
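The transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be verified numerically: after this whitening, the data's covariance is (approximately) the identity, so Case 1 applies. A Python sketch, with an invented covariance matrix:<br />

```python
import numpy as np

Sigma = np.array([[3.0, 1.0], [1.0, 2.0]])  # hypothetical common covariance
S, U = np.linalg.eigh(Sigma)                # Sigma = U diag(S) U^T

W = np.diag(S ** -0.5) @ U.T                # x* = S^{-1/2} U^T x

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], Sigma, size=20000)
X_star = X @ W.T                            # transform every data point
print(np.cov(X_star.T))  # approximately the identity: the data is now spherical
```

The sample covariance of the transformed points is close to <math>\,I</math>, so distances in the <math>\,x^*</math> space can be compared exactly as in Case 1.<br />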
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume, ahead of time, which class a data point belongs to in order to choose its transformation. All classes therefore need to have the same shape for classification to be applicable using this method, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Consider that you have two classes with different shapes, then consider transforming them to the same shape. Given a data point, justify which class this point belongs to. The question is, which transformation can you use? For example, if you use the transformation of class A, then you have assumed that this data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare a given class with the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
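These two counting formulas are easy to tabulate. A small Python sketch (added for illustration):<br />

```python
def lda_params(K, d):
    # One linear function a^T x + b per pairwise difference: (K-1)(d+1).
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # Each difference needs a symmetric quadratic form, a linear term,
    # and a constant: (K-1)(d(d+3)/2 + 1).
    return (K - 1) * (d * (d + 3) // 2 + 1)

print(lda_params(3, 10), qda_params(3, 10))  # 22 132
```

Even for a modest <math>\,d=10</math> and <math>\,K=3</math>, QDA requires 132 parameter estimates against LDA's 22, which is why QDA is less robust in high dimensions.<br />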
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function, but here we analyze its source code to see how it differs from the SVD method. The following is the code of <code>princomp</code> with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the lengh of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
 latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
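The same equivalence can be checked outside Matlab. The following Python sketch (an added illustration using NumPy in place of <code>princomp</code>) centers the data, runs the SVD, and confirms that the right singular vectors match the eigenvectors of the sample covariance matrix up to sign:<br />

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))           # rows = observations, as princomp expects
Xc = X - X.mean(axis=0)                 # center by subtracting column means

# SVD route: the columns of V are the principal component coefficients.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
score_svd = Xc @ Vt.T                   # representation in principal-component space

# Eigendecomposition route: same directions from the sample covariance matrix.
evals, evecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
order = np.argsort(evals)[::-1]         # eigh returns eigenvalues in ascending order
evecs = evecs[:, order]

# The two sets of components agree up to sign.
agree = [np.allclose(Vt.T[:, j], evecs[:, j], atol=1e-6)
         or np.allclose(Vt.T[:, j], -evecs[:, j], atol=1e-6)
         for j in range(5)]
```

This mirrors the Matlab comparison above: the SVD scores and the eigenvector projections span the same principal-component space.<br />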
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
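The augmentation trick is easy to try on synthetic data. The Python sketch below (an added illustration; it uses a least-squares fit of the labels as a crude stand-in for LDA) shows a 1-dimensional problem that no linear rule in <math>x</math> can solve, but that a linear rule in <math>x^* = (x, x^2)</math> handles well:<br />

```python
import numpy as np

rng = np.random.default_rng(4)
# Class 0 near the origin, class 1 split between two outer clusters,
# so no single threshold on x separates the classes.
x = np.concatenate([rng.normal(0, 0.5, 200),
                    rng.choice([-3.0, 3.0], 200) + rng.normal(0, 0.5, 200)])
y = np.array([0] * 200 + [1] * 200)

# Augment: x* = [x, x^2]; a linear rule in x* is quadratic in x.
X_star = np.column_stack([x, x ** 2])

# Fit a linear rule on the augmented features by least squares on the labels.
A = np.column_stack([np.ones_like(x), X_star])
w = np.linalg.lstsq(A, y, rcond=None)[0]
pred = (A @ w > 0.5).astype(int)
print((pred == y).mean())  # high accuracy despite the rule being linear in x*
```

In the augmented space the classes are linearly separable along the <math>x^2</math> axis, so the linear fit recovers a quadratic decision rule in the original coordinate.<br />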
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is for classification and FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fischer's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the pre-knowledge that data points belong to two classes. Intuitively speaking points of each class form a cloud around the mean of the class, with each class having possibly different size. To be able to separate the two classes we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the squared Euclidean distance between the projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
The covariance of class 1 is <math>\,\Sigma_{1}</math> and the covariance of class 2 is <math>\,\Sigma_{2}</math>,<br />
so the variances of the projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math>.<br />
<br />
If we sum these two quantities we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite covariance matrices and is therefore invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
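The closed-form direction above can be checked numerically. A minimal sketch in Python/NumPy (the lecture's own examples use Matlab; the two Gaussian classes here are made up for illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical Gaussian classes with a shared covariance structure
mu1, mu2 = np.array([1.0, 1.0]), np.array([5.0, 3.0])
Sigma = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal(mu1, Sigma, 300)
X2 = rng.multivariate_normal(mu2, Sigma, 300)

# Within- and between-class covariance, as defined above
Sw = np.cov(X1.T) + np.cov(X2.T)
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sb = np.outer(m1 - m2, m1 - m2)

# Direction 1: eigenvector of Sw^{-1} Sb for the largest eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = eigvecs[:, np.argmax(eigvals.real)].real

# Direction 2: the proportional closed form Sw^{-1}(mu1 - mu2)
w_closed = np.linalg.solve(Sw, m1 - m2)

# The two unit vectors agree up to sign
u1 = w_eig / np.linalg.norm(w_eig)
u2 = w_closed / np.linalg.norm(w_closed)
print(abs(u1 @ u2))  # close to 1
```

Since <math>S_{B}</math> is rank one, <math>S_{w}^{-1}S_{B}</math> has a single nonzero eigenvalue, which is why the eigenvector route and the closed form coincide.<br />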
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
 >> scatter(X_hat(1,1:200),X_hat(2,1:200))<br />
 >> hold on<br />
 >> scatter(X_hat(1,201:400),X_hat(2,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(With more than two classes, a single projection direction is generally not enough to separate all the classes, so it is more reasonable to have at least two directions.)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> is the scatter matrix of class <math>i</math> (left unnormalized so that the decomposition below holds exactly) and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is constant; since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get <math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is another way to derive <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain (the cross terms vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i}) = \mathbf{0}</math> for each class <math>i</math>)<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the first term is the within class covariance <math>\mathbf{S}_{W}</math>, we can define the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
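The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically. A small sketch in Python/NumPy (made-up three-class data; the scatter matrices are the unnormalized sums from the derivation above):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: three classes of 50 points in 4 dimensions
data = [rng.normal(loc=c, size=(50, 4)) for c in (0.0, 2.0, 5.0)]
n = sum(len(Xi) for Xi in data)
mu = sum(Xi.sum(axis=0) for Xi in data) / n  # total mean

# Unnormalized within- and between-class scatter matrices
Sw = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in data)
Sb = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in data)

# Total scatter about the overall mean
St = sum((Xi - mu).T @ (Xi - mu) for Xi in data)

print(np.allclose(St, Sw + Sb))  # True
```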
<br />
Recall that in the two-class problem, we used<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
With <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math> we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>, so the general form gives<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} & =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \left(\frac{n_{1}n_{2}^{2}}{n^{2}}+\frac{n_{2}n_{1}^{2}}{n^{2}}\right)(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B^{\ast}}<br />
\end{align}<br />
</math><br />
Thus the general <math>\mathbf{S}_{B}</math> is a scalar multiple of the two-class <math>\mathbf{S}_{B^{\ast}}</math>, and both lead to the same optimal direction.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},\qquad<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this optimization problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that the <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, thus set the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\leq k-1</math> (the <math>k</math> mean differences <math>\mathbf{\mu}_{i}-\mathbf{\mu}</math> satisfy one linear constraint).<br />
<br />
Therefore, the solution for this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
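The multi-class procedure above can be sketched end to end. A minimal sketch in Python/NumPy (the notes' examples use Matlab; the three-class data here is made up for illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical k=3 classes of 40 points each in d=5 dimensions
k, d = 3, 5
data = [rng.normal(loc=3.0 * i, size=(40, d)) for i in range(k)]
n = sum(len(Xi) for Xi in data)
mu = sum(Xi.sum(axis=0) for Xi in data) / n  # total mean

# Unnormalized within- and between-class scatter matrices
Sw = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in data)
Sb = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in data)

# Columns of W: eigenvectors of Sw^{-1} Sb for the k-1 largest eigenvalues
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:k - 1]].real   # d x (k-1) transformation matrix

# Project every class into the (k-1)-dimensional space
Z = [Xi @ W for Xi in data]
print(W.shape)  # (5, 2)
```

Note that <math>\mathbf{S}_{B}</math> has rank at most <math>k-1</math>, so only <math>k-1</math> eigenvalues are nonzero, matching the dimension of the projection.<br />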
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math> and <math>\,y_{1}, ..., y_{p}</math> our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing sum of squared errors using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
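The closed-form solution and the hat matrix can be checked numerically. A minimal sketch in Python/NumPy (synthetic data made up for illustration; the notes' examples use Matlab):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical design: n=50 points, d=2 inputs, plus an intercept column of ones
n, d = 50, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # n x (d+1)
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T y  (solve is preferred over an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The hat matrix H maps y to the fitted values y_hat
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

print(np.allclose(y_hat, X @ beta_hat))  # True
print(np.allclose(H @ H, H))             # H is idempotent: True
```

The idempotence of <math>\mathbf{H}</math> reflects the fact that it is the orthogonal projection onto the column space of <math>\mathbf{X}</math>.<br />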
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
The following example in Matlab shows how linear regression works; each step of the code is explained below.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
 >>x=[sample';ones(1,400)];<br />
Construct x by adding a row of ones to the transposed data, so that x is 3-by-400.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the data points, coloured according to whether the fitted value is above or below the threshold 0.5.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== Logistic Function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y \,dx = \ln(1 + e^{x}) + C</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
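Several of these properties are easy to confirm numerically. A small sketch in Python/NumPy (the notes' other examples use Matlab):<br />

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
y = logistic(x)

# Property 1: dy/dx = y(1 - y), checked by central finite differences
h = 1e-6
dy_num = (logistic(x + h) - logistic(x - h)) / (2 * h)
print(np.allclose(dy_num, y * (1 - y), atol=1e-6))  # True

# Property 2: y(0) = 1/2
print(logistic(0.0))  # 0.5

# Symmetry about (0, 1/2): y(-x) = 1 - y(x)
print(np.allclose(logistic(-x), 1 - y))  # True
```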
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
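The step from the log odds model to the logistic model is a short calculation. Writing <math>\,p=P(Y=1|X=x)</math> and exponentiating both sides of the log odds model:<br />
:<math><br />
\begin{align}<br />
\frac{p}{1-p}&=\exp(\beta^Tx)\\<br />
\Rightarrow\ p&=\exp(\beta^Tx)(1-p)\\<br />
\Rightarrow\ p\left(1+\exp(\beta^Tx)\right)&=\exp(\beta^Tx)\\<br />
\Rightarrow\ p&=\frac{\exp(\beta^Tx)}{1+\exp(\beta^Tx)}<br />
\end{align}<br />
</math><br />
so that <math>\,P(Y=0|X=x)=1-p=\frac{1}{1+\exp(\beta^Tx)}</math>; these are exactly the class probabilities defined in the next section.<br />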
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression fits a probability model to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood, using <math>\Pr(Y|X)</math>. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{(exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math> — you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website including a Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math>, where <math>X</math> is the <math>d\times n</math> input matrix as above.<br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
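A minimal NumPy sketch of this closed-form estimator (the notes' examples use Matlab; here the design matrix stores one observation per row, so <math>\mathbf{XWX}^{T}</math> above becomes <code>X.T @ W @ X</code>, and the data and weights below are made up for illustration):<br />

```python
import numpy as np

# Weighted least squares: beta_hat = (sum_i w_i x_i x_i^T)^{-1} (sum_i w_i x_i y_i).
# Synthetic data; rows of X are the observations x_i^T, and the positive
# weights w_i are chosen arbitrarily here.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)
w = rng.uniform(0.5, 2.0, size=n)          # w_i > 0

W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_wls)  # should be close to beta_true
```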
<br />
This corresponds to a weighted linear regression on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here <math>\underline{\beta}</math> is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we instead construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> (with the intercept included) will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case; however, convergence is not guaranteed. The procedure will usually converge, since the log-likelihood function is concave. When it does not, only local convergence of the method can be proven, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, choosing an appropriate initial value is rarely a problem: it is uncommon for an initial point to lie so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> In addition, step-size halving will resolve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
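The pseudo code above can be sketched in a few lines of NumPy (the notes use Matlab; the toy data and the convergence tolerance below are made up, and <math>X</math> is <math>d\times n</math> with one observation per column, as in the notes):<br />

```python
import numpy as np

# IRLS for two-class logistic regression, following the pseudo code above.
# X is d x n with one observation per column, so each update is
# beta <- (X W X^T)^{-1} X W z.
rng = np.random.default_rng(1)
n = 200
X = np.vstack([np.ones(n), rng.normal(size=n)])        # d=2, intercept row included
beta_true = np.array([-1.0, 2.0])
p_true = 1.0 / (1.0 + np.exp(-(beta_true @ X)))
y = (rng.uniform(size=n) < p_true).astype(float)       # labels in {0,1}

beta = np.zeros(2)                                     # step 1: beta <- 0
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(beta @ X)))              # step 3: P(x_i; beta)
    W = np.diag(p * (1 - p))                           # step 4: diagonal weights
    z = X.T @ beta + (y - p) / (p * (1 - p))           # step 5: adjusted response
    beta_new = np.linalg.solve(X @ W @ X.T, X @ W @ z) # step 6: weighted LS
    if np.allclose(beta_new, beta, atol=1e-8):         # step 7: convergence check
        beta = beta_new
        break
    beta = beta_new
print(beta)  # roughly beta_true
```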
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far considered only the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1, nor to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>; it is guaranteed to range from 0 to 1 and to sum to 1.<br />
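A small numerical illustration of these two differences (the coefficients below are arbitrary, chosen only to show the ranges):<br />

```python
import numpy as np

# The linear model's estimate of P(Y=k|X=x) is unbounded, while the
# logistic posterior always lies strictly inside (0, 1).
beta0, beta = 0.5, 0.8        # made-up coefficients
x = np.linspace(-5, 5, 11)

p_linear = beta0 + beta * x                                  # linear regression estimate
p_logistic = np.exp(beta0 + beta * x) / (1 + np.exp(beta0 + beta * x))

print(p_linear.min(), p_linear.max())      # escapes [0, 1]
print(p_logistic.min(), p_logistic.max())  # stays inside (0, 1)
```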
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#Since logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and logistic regression with an example, again using the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to classify the data with logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Viewing these equations as a weighted least squares problem makes the fitting easier to derive.<br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem, since we don't have the same simplification.<br />
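A quick sketch of the posterior formulas above for <math>K=3</math> (the coefficient vectors are made up; class <math>K</math> serves as the reference in the denominator):<br />

```python
import numpy as np

# K-class logistic posteriors with K = 3 and arbitrary coefficient
# vectors beta_1, beta_2 (class K is the reference class).
x = np.array([1.0, 0.5])                 # a single input point
betas = np.array([[0.2, -1.0],           # beta_1
                  [1.5,  0.3]])          # beta_2
scores = np.exp(betas @ x)               # exp(beta_i^T x), i = 1..K-1
denom = 1.0 + scores.sum()

posteriors = np.append(scores / denom, 1.0 / denom)  # classes 1..K-1, then K
print(posteriors, posteriors.sum())      # the K posteriors sum to 1
```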
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
Separating hyperplane classifiers try to separate the data using linear decision boundaries. When the classes overlap, this approach can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged, transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>\,sign(\underline{\beta}^T \underline{x} + \beta_0) = sign(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math><br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like least squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem is not convex and has no unique global minimum. The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between the classes. If the classes are separable, the algorithm can be shown to converge to some separating hyperplane; the proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes, convergence cannot be guaranteed.<br />
<br />
<br />
If the separating hyperplane between the 2 classes is not unique, the perceptron algorithm may return any of infinitely many solutions.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line ourselves; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
The perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1, but the right answer is +1. So the perceptron adjusts its line, and we try the next example. Eventually the perceptron will get all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1]; % labels<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]'; % one sample per column<br />
b_0=0;<br />
b=[1;1;1]; % initial guess for the weights<br />
rho=.5; % learning rate<br />
for j=1:100<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i); % negative iff sample i is misclassified<br />
if d<0<br />
b=b+rho*x(:,i)*y(i); % move the boundary toward the misclassified point<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0 % no misclassified points: done<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of some weights of these inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (an initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, terminating when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\underline{\beta}</math> to be normalized so that <math>\|\underline{\beta}\|=1</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points. <br />
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. This problem can be eliminated by using basis expansions technique. To be specific, we try to find a hyperplane not in the original space, but in the enlarged space obtained by using some basis functions.<br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#A perfect separation is not always available, or even desirable. If observations from different classes share the same input, a model that separates them perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Consider yourself on a peak, wanting to get down to the land as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the gradient. However, if the mountain has a saddle shape and you initially stand in the middle, you may end up at the saddle point (a stationary point that is not the global minimum) and get stuck there.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over <math>\,i</math> (all data points). This is an alternative to the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, where we approximate the true gradient by evaluating it on only a single training example. This means that <math>\,{\beta}</math> is improved using the computation of only one sample. When there is a large data set, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent, we can treat the problem sample by sample and still get decent results in practice.<br />
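The single-sample update described here is essentially what the earlier perceptron loop already does; a NumPy sketch of that toy example (same data and learning rate, translated from the Matlab snippet) might look like:<br />

```python
import numpy as np

# Stochastic perceptron updates: beta moves after each misclassified sample,
# not after a full pass over the data.  Toy data from the earlier example.
y = np.array([1, 1, 1, -1, -1, -1])
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
beta, beta0, rho = np.ones(3), 0.0, 0.5

for epoch in range(100):
    changed = False
    for xi, yi in zip(X, y):
        if yi * (beta @ xi + beta0) <= 0:   # misclassified (or on the boundary)
            beta += rho * yi * xi           # single-sample gradient step
            beta0 += rho * yi
            changed = True
    if not changed:                         # a full clean pass: converged
        break

print(beta, beta0)
```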
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired, each carrying the same signal - the processing element output signal. <ref><br />
Theory of the Backpropagation Neural Network, R. Necht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network but it can have many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer that each represent the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1).<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
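A short numerical sketch of this activation; it also verifies the standard identity <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>, which back-propagation relies on below:<br />

```python
import numpy as np

# The logistic activation sigma(a) = 1 / (1 + e^{-a}) and a numerical check
# of the identity sigma'(a) = sigma(a) * (1 - sigma(a)).
def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-6, 6, 13)
h = 1e-6
numeric = (sigma(a + h) - sigma(a - h)) / (2 * h)   # central difference
analytic = sigma(a) * (1 - sigma(a))
print(np.max(np.abs(numeric - analytic)))            # ~ 0; sigma saturates in (0, 1)
```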
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates, meaning it has maximum and minimum output values. This property ensures that the weights are bounded, and therefore the search time is limited. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; RBF networks, whose activations are not monotonic, are also powerful models. <br />
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to multiple output units and thus turn into a classification problem.<br />
<br />
For simplicity, there is only 1 unit at the end and assume for the moment we are doing regression.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the output weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i's</math>. The <math>\displaystyle z_i's</math> are the inputs into the final output of the model <math>\Rightarrow \hat y_i=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First, find the derivative of the model error with respect to the output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to the hidden weights <math>\displaystyle u_{jl}</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: the indices i and j are reversed in this diagram.)''<br />
<br />
Notice that the weighted sums of the outputs of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math>, using the activation function <math>\,\sigma(\cdot)</math>.<br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is to first feed an input from the training set to the Neural Network and find the error at the output. We then propagate the error back to the previous layers and, for each edge with weight <math>\,u_{ij}</math>, find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. With these gradients at hand, we adjust each edge weight by taking a step proportional to the negative of the gradient, so as to decrease the output error. The next step is to apply the next input from the training set and repeat the adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the expected output and <math>\hat{y_k}</math> is the network's actual output.<br />
#By propagating back through the layers, evaluate <math>\,\delta_j</math> for all hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math>, where <math>i</math> ranges over the units in the next layer (the layer closer to the output).<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
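The steps above can be sketched numerically. The following is an illustrative toy implementation (not from the lecture) of back-propagation for a network with one hidden layer of logistic units and a single linear output unit; the exact derivatives of the squared error are used, so the signs and factors of 2 appear explicitly:<br />

```python
import numpy as np

# Toy back-propagation for one hidden layer of logistic units and a
# single linear output unit (regression). Names and sizes are illustrative.

def sigma(a):
    # logistic activation
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, U, w):
    a = U @ x          # inputs a_j to the hidden units
    z = sigma(a)       # hidden-unit outputs z_j
    y_hat = w @ z      # single linear output unit
    return a, z, y_hat

def backprop(x, y, U, w):
    """Gradients of err = (y - y_hat)^2 with respect to w and U."""
    a, z, y_hat = forward(x, U, w)
    # output layer: d err / d w_i = -2 (y - y_hat) z_i
    grad_w = -2.0 * (y - y_hat) * z
    # hidden layer: delta_j = sigma'(a_j) * sum_i delta_i u_{ij};
    # with one output unit this is delta_j = -2 (y - y_hat) w_j sigma'(a_j)
    delta = sigma(a) * (1.0 - sigma(a)) * (-2.0 * (y - y_hat) * w)
    # d err / d u_{jl} = delta_j * z_l, where z_l = x_l at the input layer
    grad_U = np.outer(delta, x)
    return grad_w, grad_U

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = 1.5
U = rng.normal(scale=0.1, size=(4, 3))   # hidden weights u_{jl}
w = rng.normal(scale=0.1, size=4)        # output weights w_i
grad_w, grad_U = backprop(x, y, U, w)
```

Comparing these analytic gradients against finite differences of the error is a standard way to validate such an implementation.<br />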
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method mentioned in class is to randomize the weights before the first step. This is unlikely to be near the optimal solution, but it is simple to implement. More specifically, random values near zero (usually from [-1, 1]) are a good choice for the initial weights. In this case, the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output using a linear approximation of <math>\,\sigma(a_i)</math>, which finds the optimal weights for the linear model. Back-propagation is then used to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the number of iterations increases.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate leads to a very slow convergence rate (a very long learning phase). The advantage of a small learning rate, however, is that it guarantees convergence. Thus it is generally better to choose a relatively small learning rate to ensure stability; usually <math>\,\rho</math> is chosen between 0.01 and 0.7.<br />
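As a small illustration of a decreasing learning rate for on-line learning (the schedule and the constants below are arbitrary illustrative choices, not from the lecture):<br />

```python
# Gradient descent on a toy error err(w) = (w - 3)^2 with a step size
# that decays with the iteration count, as suggested for on-line learning.
# rho0 and the decay constant are arbitrary illustrative choices.
def rho(t, rho0=0.1, decay=0.01):
    return rho0 / (1.0 + decay * t)

w = 0.0
for t in range(2000):
    w -= rho(t) * 2.0 * (w - 3.0)   # gradient of (w - 3)^2 is 2 (w - 3)
# w ends up very close to the minimizer 3
```

Because the step sizes shrink but their sum still diverges, the iterates settle down near the minimizer instead of oscillating around it.<br />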
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we mainly discuss how to estimate the number of hidden units at the very beginning. This estimate should then be refined using cross-validation, leave-one-out, or other complexity-control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are enough. If the patterns are drawn from a highly complicated mixture model, more hidden units are needed. <br />
<br />
The number of hidden units determines the size of the model, and therefore the total number of weights in the model. Generally, the number of weights should not be larger than the number of training points, say <math>\,N</math>; thus <math>\,N/10</math> weights is often a good choice. In practice, however, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this Neural Network, the number of nodes is reduced, until we reach a layer whose number of nodes equals the desired dimensionality. (In the first few layers the number of nodes need not be strictly decreasing, as long as the network eventually reaches a layer with fewer nodes.) From that point on, the previous layers are mirrored, so the output layer has the same number of nodes as the input layer. Now note that if we feed the network with each point and get an output approximately equal to the input, then the input has been reconstructed at the output from the middle-layer units alone. Hence the outputs of the middle-layer units represent the input in fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
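This can be sketched with a deliberately tiny example. Below, a linear "bottleneck" network compresses 3-D inputs to a 1-D code at the middle layer and reconstructs them at the mirrored output layer, with the weights trained by gradient descent on the reconstruction error (a real network would use nonlinear units and back-propagation through several layers; this toy linear setup is an illustrative assumption):<br />

```python
import numpy as np

# Minimal linear bottleneck autoencoder: encode R^3 -> R^1 -> R^3 and
# train on the reconstruction error ||x - x_hat||^2. Illustrative only.
rng = np.random.default_rng(1)
t = rng.normal(size=200)                  # hidden 1-D coordinate
X = np.column_stack([t, 2 * t, -t])       # data lying on a line in R^3

W_enc = rng.normal(scale=0.1, size=3)     # 3 -> 1 encoder weights
W_dec = rng.normal(scale=0.1, size=3)     # 1 -> 3 decoder weights
rho = 0.01
for _ in range(500):
    Z = X @ W_enc                         # middle-layer (low-dimensional) codes
    X_hat = np.outer(Z, W_dec)            # mirrored reconstruction
    R = X_hat - X
    g_dec = 2 * R.T @ Z / len(X)          # d err / d W_dec
    g_enc = 2 * (R @ W_dec) @ X / len(X)  # d err / d W_enc (chain rule)
    W_dec -= rho * g_dec
    W_enc -= rho * g_enc

# mean squared reconstruction error after training (near zero: the data
# really are one-dimensional, so a 1-unit bottleneck suffices)
recon_err = np.mean(np.sum((np.outer(X @ W_enc, W_dec) - X) ** 2, axis=1))
```

Here `X @ W_enc` plays the role of the middle layer's low-dimensional output, and the decoder weights reconstruct the input from it.<br />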
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math> values may become negligible and the error signal vanishes. This is a numerical problem in which the errors become difficult to estimate, so in practice configuring a Neural Network with back-propagation involves some subtleties.<br />
Deep Neural Networks became popular two or three years ago, when effective layer-wise training procedures were introduced. A deep-network training algorithm deals with training a Neural Network that has a large number of layers.<br />
<br />
The approach to training a deep network is to first treat the network as if it had only two layers and train those; after that we train the next two layers, and so on.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques for resolving this problem: using a Boltzmann machine to minimize an energy function, an approach inspired by statistical physics, where the most stable configuration has minimal energy; or somehow finding the output of the second layer that is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case as an example. A feed-forward neural network was trained using back-propagation to assist in the marketing control of airline seat allocations. Unlike a fixed rule-based approach, the neural approach could adapt as booking behaviour changed, and the system was used to monitor and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When Neural Networks were first introduced, they were thought to model the human brain, hence the fancy name "Neural Network". But now we know that they are just logistic regression layers stacked on top of each other, and have nothing to do with how the brain actually functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brain, but this is unfounded. Given such claims, it is important to keep the right perspective on what this field of study is trying to accomplish: the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, Neural Networks lack a strong learning theory to back up their "success", so it is hard to apply and tune them wisely. Partly for this reason, they are no longer a very active research area in machine learning, although NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model will have very high accuracy on the training set but very poor ability to predict outcomes for new instances, especially outside the domain of the training set.<br />
<br />
In a Neural Network, if the network is too deep it will have many degrees of freedom and will learn every characteristic of the training data set. It will then produce very accurate outputs on the training set, but will not be able to generalize the commonality of the training set to predict the outcomes of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set.<br />
There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model which fits it best. We can find a polynomial of high degree which passes through almost all the points in the training set. But in fact the training set comes from a line model. Although the complex model has smaller error on the training set, it diverges from the line in regions where we have no training points. Because of this, the high-degree polynomial has very poor predictive power on test cases. This is an example of an overfitted model.<br />
#Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes from yellow that it is a banana, then it is not a good classifier: just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve the classifier, we eventually reach a point where its quality no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas, but a new banana of slightly different shape, for example, cannot be detected. This is the tradeoff: what is the right level of complexity?<br />
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data, for example by restricting it to a family of polynomials or a particular neural network architecture. There are other ways as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, \hat{L}(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to <math>\displaystyle x_1</math> through <math>\displaystyle x_n</math> and take the average of the misclassification indicators; this empirical error rate estimates the probability that <math>h(x_{i}) \neq y_{i}</math>.<br />
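As a concrete sketch, the empirical error rate can be computed directly from its definition; the threshold classifier and the toy data below are hypothetical, purely for illustration:<br />

```python
import numpy as np

# Empirical error rate: the fraction of points the classifier h mislabels.
# The classifier and the toy data are hypothetical.
def empirical_error(h, X, y):
    return np.mean([h(x) != yi for x, yi in zip(X, y)])

X = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1])
h = lambda x: int(x > 0)            # classify as 1 when x > 0
L_hat = empirical_error(h, X, y)    # 0.2: one point of five is mislabelled
```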
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
When evaluated on the training data, this estimate has a downward bias: it tends to be smaller than the true error rate. <br />
<br />
As the complexity increases from low to high, the training error always decreases. On the test data, however, the error decreases only up to a point and then starts to increase, since the model has never seen that data. This is because the training error decreases as we fit the model better by increasing its complexity, but, as we have seen, an overly complex model does not generalize well, resulting in a larger test error. <br />
<br />
We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate.<br />
Right complexity is defined as where error rate of the test data is minimum; and this is one idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number <math>f\in R</math>. To estimate it, we use an estimator which is a<br />
function of our observations, <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of the estimator is that it is correct on average, that is, it is unbiased. <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a more important property for an estimator than unbiasedness: the mean squared error. There are problems for which an estimator with a small bias may be preferable. In some cases, an estimator with a small bias may have a smaller mean squared error, or may be median-unbiased (rather than mean-unbiased, the standard unbiasedness property). The property of median-unbiasedness is invariant under monotone transformations, while mean-unbiasedness may be lost under nonlinear transformations. For example, using an unbiased estimator with a large mean squared error to estimate a parameter runs a high risk of a large error, whereas a biased estimator with a small mean squared error may well improve the precision of our prediction.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a given Mean Squared Error (MSE), lower bias means higher variance and vice versa.<br />
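The decomposition <math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f})</math> can be checked numerically. Below is an illustrative Monte-Carlo sketch with a deliberately biased estimator, <math>\hat f = 0.8\,\bar{X}</math> for the mean of a <math>N(1,1)</math> sample (the shrinkage factor 0.8 is an arbitrary choice):<br />

```python
import numpy as np

# Monte-Carlo check of MSE = Variance + Bias^2 for the deliberately
# biased estimator f_hat = 0.8 * (sample mean of 20 draws from N(1, 1)).
rng = np.random.default_rng(0)
f = 1.0                                          # true parameter
samples = rng.normal(f, 1.0, size=(200_000, 20)) # 200,000 repeated samples
estimates = 0.8 * samples.mean(axis=1)

mse = np.mean((estimates - f) ** 2)              # approx 0.072
bias = estimates.mean() - f                      # approx 0.8 - 1 = -0.2
var = estimates.var()                            # approx 0.64 / 20 = 0.032
# mse equals var + bias**2 exactly for these sample moments
```

Shrinking toward zero introduces bias (-0.2) but also reduces the variance by a factor of 0.64, illustrating the trade-off discussed above.<br />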
<br />
Test error is a good estimate of the MSE. We want a model with a reasonably balanced bias and variance (neither too high), even though such a model will have some bias.<br />
<br />
<br />
Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-Validation is fast<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
>> x <- rnorm(200,0,1)<br />
>> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
>> xtest <- rnorm(50,1,1)<br />
>> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
>> p1 <- lm(y~x)<br />
>> p2 <- lm(y ~ poly(x,2))<br />
>> pn <- lm(y ~ poly(x,10))<br />
>> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are based on a <math>\,N(0,1)</math> distribution, while the test set has a <math>\,N(1,1)</math> distribution. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x</math> plus <math>\,N(0,0.3^2)</math> noise, a quadratic function with some random variation. Polynomial least-squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
>> # calculate the mean squared error of degree 1 poly<br />
>> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
>> [1] 1.576042<br />
>> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 7.727615<br />
: Training and test mean squared errors for the linear fit. Both are quite high; since the data is non-linear, the different mean value of the test data increases the error considerably.<br />
>> # calculate the mean squared error of degree 2 poly<br />
>> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
>> [1] 0.08608467<br />
>> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 0.08407432<br />
: This fit is far better, and there is little difference between the training and test errors.<br />
>> # calculate the mean squared error of degree 10 poly<br />
>> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
>> [1] 0.07967558<br />
>> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but only slightly, while the test error has risen sharply: the overfitting makes it a poor predictor. As the degree rises further, numerical precision becomes an issue, and a good fit is not even consistently produced for the training data.<br />
>> # calculate mse of sin/cos fit<br />
>> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
>> [1] 0.1105446<br />
>> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
>> [1] 1.320404<br />
: Fitting a function of the form <math>\,\sin(x)+\cos(x)</math> works reasonably well on the training set, but because it is not the true underlying function, it fails on the test data, which lies outside the range of the training points.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V: the model has not seen V during training, and since we know the proper labels of all elements in V, we can count how many were misclassified.<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although training error always decreases with increasing model complexity, the test error starts to increase from a certain point, which signals overfitting (see [[#prediction-error|figure 2]] above). Since test error best estimates the MSE (mean squared error), the idea is to divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and select the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{(x_i, y_i) \in \nu} I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set <math>\,\nu</math>.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. In reality, however, data may be hard to collect and a larger data set may be hard to come by, so we may not be able to afford to sacrifice part of our limited data as a separate validation set. In this case we use another method that addresses this problem: K-fold cross-validation. We divide the data set into <math>\,K</math> subsets of roughly equal size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
if <math>\,K=n</math> (leave-one-out): low bias, high variance. Each subset contains a single element, so the model is trained with all but one point and then validated on that point.<br />
<br />
if <math>\,K</math> is small (e.g., 2-fold or 5-fold): higher bias, lower variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th <math>( \,k \in [ 1, K ] )</math> part, we use the other <math>\,K-1</math> parts to fit the model and test on the <math>\,k</math>th part to estimate the prediction error <math>\hat L_k</math>; the overall estimate is<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets as shown in Figure 2. First we choose the degree to be 1, i.e. a linear model. We use the first three subsets as the training set and the last as the validation set; then the 1st, 2nd, and 4th subsets as the training set and the 3rd as the validation set; and so on, until every subset has served as the validation set once (so all observations are used for both training and validation). After obtaining <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we can calculate the average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for a degree-n model and plot the estimated error against the degree; we then choose the degree corresponding to the minimum estimated error. Likewise, we can use this method to find the optimal number of hidden units of a neural network: begin with 1 unit, then 2, 3, and so on, and pick the number with the lowest average error.<br />
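The K-fold procedure just described can be sketched as follows, here selecting a polynomial degree by 4-fold cross-validation on synthetic data; the quadratic data-generating model and the constants are illustrative assumptions:<br />

```python
import numpy as np

# Select a polynomial degree by 4-fold cross-validation. Data are drawn
# from an assumed quadratic model with Gaussian noise (illustrative).
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = x ** 2 - 0.5 * x + rng.normal(scale=0.3, size=200)

def cv_error(degree, K=4):
    idx = np.arange(len(x))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]                              # k-th part: validation
        train = np.setdiff1d(idx, val)              # other K-1 parts: training
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[val])
        errs.append(np.mean((y[val] - pred) ** 2))  # L_hat_k on fold k
    return np.mean(errs)                            # average over the K folds

errors = {d: cv_error(d) for d in range(1, 6)}
best_degree = min(errors, key=errors.get)   # the quadratic should win or nearly win
```

The degree-1 model shows a much larger cross-validated error than degree 2, mirroring the underfitting/overfitting discussion above.<br />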
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be denoted by <math>\mathbf{y}</math>, and the vector of fitted values by <math>\hat{\mathbf{y}}</math>. Then<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
and the leave-one-out prediction error satisfies <math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}\right]^{2}</math>, where <math>\hat f^{-i}</math> denotes the fit computed with the <math>i</math>th observation left out.<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
One of the biggest advantages of GCV is that the trace of <math>\mathbf{H}</math> is often easier to compute than its individual diagonal elements.<br />
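The leave-one-out identity and the GCV approximation above can be checked numerically for ordinary least squares, where <math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math> (the toy data below are illustrative):<br />

```python
import numpy as np

# Check the leave-one-out shortcut (y_i - f_hat(x_i)) / (1 - H_ii)
# against brute-force refitting, and compute GCV, for a toy OLS problem.
rng = np.random.default_rng(3)
N = 30
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept + covariate
y = 1.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=N)

H = X @ np.linalg.inv(X.T @ X) @ X.T                    # hat matrix
fit = H @ y
loo_short = (y - fit) / (1.0 - np.diag(H))              # shortcut LOO residuals

loo_brute = np.empty(N)                                 # brute force: refit N times
for i in range(N):
    mask = np.arange(N) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    loo_brute[i] = y[i] - X[i] @ beta

# GCV replaces each H_ii by the average trace(H) / N
gcv = np.mean(((y - fit) / (1.0 - np.trace(H) / N)) ** 2)
```

The shortcut residuals match the brute-force ones exactly, and GCV comes out close to the exact leave-one-out mean squared error since the <math>\mathbf{H}_{ii}</math> here are all small.<br />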
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation involves using all but one data point of the original training data set to train the model, then using the left-out data point to estimate the true error. By repeating this process for every data point in the original data set, we obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we can see that k-fold cross-validation can be computationally expensive: for every possible value of the parameter, we must train the model <math>\,K</math> times. This deficiency is even more obvious in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, where <math>\,n</math> is the number of data points in the data set.<br />
<br />
Fortunately, when adding data points to the classifier is reversible, calculating the difference between two classifiers is computationally cheaper than calculating the two classifiers separately. So, if the classifier trained on all the data points is known, we can simply undo the changes contributed by each single data point in turn, <math>\,n</math> times, to calculate the leave-one-out cross-validation error rate.<br />
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight-decay training is suggested as a way to obtain a robust neural network which is insensitive to noise. Since the size of the hidden layers in a NN is usually decided by domain knowledge, the network may easily run into the problem of overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function shows linear behavior, and the NN then collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights close to zero.<br />
<br />
Formally, we penalize nonlinear weights by adding a penalty term in the error function. Now the regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
If <math>\,\lambda</math> is too large, the weights <math>\,w_i</math> and <math>\,u_{jk}</math> will be driven too close to zero. We can use cross-validation to estimate a good <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the linearity. We may also set <math>\,\lambda</math> by cross-validation. The choice of starting weights is important: weights of exactly zero give zero derivatives, and the algorithm never moves away from them. On the other hand, starting with large weights means starting with a highly nonlinear model, which often leads to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
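The update rules above can be sketched numerically. The following is a minimal illustration of gradient descent with the weight-decay term <math>\,2\lambda w</math>; it uses a single linear unit with squared error rather than the full network, and the data and the function name <code>train_weight_decay</code> are made up for illustration:<br />

```python
# Sketch (not the full network): gradient descent on a single linear unit
# with an L2 weight-decay penalty, REG = err + lambda * |w|^2.
def train_weight_decay(xs, ys, lam, rho=0.01, steps=2000):
    w = 0.01  # start near zero: the operative region is roughly linear
    for _ in range(steps):
        # d err / d w for squared error err = sum_i (w*x_i - y_i)^2
        grad_err = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
        # the weight-decay penalty contributes 2 * lambda * w
        w -= rho * (grad_err + 2 * lam * w)
    return w

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by y = 2x
w_unreg = train_weight_decay(xs, ys, lam=0.0)   # converges to 2
w_reg = train_weight_decay(xs, ys, lam=10.0)    # shrunk toward zero
```

With <math>\,\lambda=0</math> the unit recovers the data-generating slope; a large <math>\,\lambda</math> shrinks the weight toward zero, trading training error for linearity.<br />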
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with a single hidden layer and an output layer, with weights from the hidden layer to the output layer; these weights can be trained without back-propagation, since they have a closed-form solution. The neurons in the hidden layer contain basis functions. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''Note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process); as usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data, and the radii of the basis functions may differ. The centers and radii can be determined through clustering or an EM algorithm. When the vector <math>\,x</math> is given from the input layer, each hidden neuron computes the radial distance from its center and applies the basis function to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
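The forward pass <math>\hat Y = \Phi W</math> can be sketched as follows; the centers, widths, and weights below are toy values, not from the lecture:<br />

```python
import math

# Sketch of the RBF forward pass: phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma_j^2)),
# then y_hat = Phi W, with Phi (n x m) and W (m x k).
def phi(x, mu, sigma):
    d2 = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return math.exp(-d2 / (2 * sigma ** 2))

def rbf_forward(X, centers, sigmas, W):
    Phi = [[phi(x, mu, s) for mu, s in zip(centers, sigmas)] for x in X]
    return [[sum(Phi[i][j] * W[j][c] for j in range(len(W)))
             for c in range(len(W[0]))] for i in range(len(Phi))]

X = [[0.0, 0.0], [1.0, 1.0]]
centers = [[0.0, 0.0], [1.0, 1.0]]   # one toy center per data point
sigmas = [1.0, 1.0]
W = [[1.0], [2.0]]                   # m = 2 hidden units, k = 1 output
Y_hat = rbf_forward(X, centers, sigmas, W)
# at x = mu_j, the j-th basis function equals exactly 1
```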
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
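The closed-form solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math> can be sketched in code by solving the normal equations <math>(\Phi^{T}\Phi)W = \Phi^{T}Y</math> directly. The 1-D data, centers, and width below are illustrative only:<br />

```python
import math

# Sketch: least-squares RBF weights via the normal equations,
# solved with Gauss-Jordan elimination (partial pivoting).
def solve(A, b):
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[p] = M[p], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_rbf(xs, ys, centers, sigma):
    Phi = [[math.exp(-(x - c) ** 2 / (2 * sigma ** 2)) for c in centers]
           for x in xs]
    m = len(centers)
    PtP = [[sum(Phi[i][a] * Phi[i][b] for i in range(len(xs)))
            for b in range(m)] for a in range(m)]           # Phi^T Phi
    PtY = [sum(Phi[i][a] * ys[i] for i in range(len(xs))) for a in range(m)]
    return Phi, solve(PtP, PtY)

xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 2.0]
Phi, w = fit_rbf(xs, ys, centers=xs, sigma=0.7)
y_hat = [sum(Phi[i][j] * w[j] for j in range(3)) for i in range(3)]
# with one basis function per data point, the fit interpolates the data
```

With <math>m = n</math> Gaussian basis functions centered at the data points, <math>\,\Phi</math> is nonsingular, so the fitted values reproduce the training targets exactly; this is also an illustration of how easily an RBF network can over-fit.<br />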
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the (n by k) matrix of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & \vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the (n by M+1) matrix of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the (M+1 by k) matrix of weights.<br />
<br />
where the extra basis function <math>\,\phi_{0}</math> is set to 1.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
In fact, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function, giving the familiar form<br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
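A small numeric sketch of the normalized basis functions (toy centers and width, not from the lecture): by construction they sum to one at every input, i.e. they form a partition of unity.<br />

```python
import math

# Sketch: normalized RBFs  phi*_j(x) = phi_j(x) / sum_r phi_r(x).
def normalized_phis(x, centers, sigma):
    raw = [math.exp(-(x - c) ** 2 / (2 * sigma ** 2)) for c in centers]
    total = sum(raw)
    return [p / total for p in raw]

phis = normalized_phis(0.3, centers=[0.0, 1.0, 2.0], sigma=1.0)
# sum(phis) equals 1 up to rounding, and each phi*_j lies in (0, 1)
```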
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit, then, can be thought to represent a feature; we can see that, if there are more hidden units than input units, we can essentially project to a higher-dimensional space, as we did in our earlier trick. However, this does not mean that an RBF network will actually do this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. With this power, however, the problem of overfitting becomes more important: we have to control the model's complexity so that it fits a general model rather than an arbitrary training set.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can expand the class-conditional density in the above expression by summing over the hidden variable <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
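The mixture density above can be sketched in one dimension; the mixing proportions, means, and variances below are toy values. A Riemann sum checks that the density integrates to (approximately) one:<br />

```python
import math

# Sketch of f(x) = sum_m alpha_m * N(x; mu_m, sigma_m^2) in 1-D.
def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, alphas, mus, sigmas):
    return sum(a * normal_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas))

alphas, mus, sigmas = [0.3, 0.7], [-1.0, 2.0], [0.5, 1.0]  # sum(alphas) = 1
# Riemann-sum check that the density integrates to about 1 over [-10, 10]
grid = [-10 + 0.01 * i for i in range(2001)]
area = sum(mixture_pdf(x, alphas, mus, sigmas) * 0.01 for x in grid)
```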
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When using an RBF network for classification, we usually treat it as a regression problem and set a threshold to decide each data point's class membership. However, to gain insight into what the RBF network is doing in classification, we often think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as we can see in the graph on the right-hand side, that we have three random variables: <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that there are different classes, and each class can trigger a different hidden random variable <math>\displaystyle j</math>. To understand this, we can assume that, for instance, <math>\displaystyle j</math> indexes Gaussian distributions (it could be any other family as well), all of the same form but with different parameters. From each Gaussian distribution triggered by a class, we sample some data points. Therefore, in the end, we obtain a data set that is not strictly Gaussian, but is actually a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not involve the index <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
We multiply each term of the sum by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math> and regroup. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = \sum_{j} \frac {Pr(x | j)*Pr(j)}{Pr(x)} * \frac {Pr(j | y_{k})*Pr(y_{k})}{Pr(j)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, the posterior is just a sum over <math>\displaystyle j</math> of the product of two posteriors, <math>\displaystyle Pr(j | x)</math> and <math>\displaystyle Pr(y_k | j)</math>.<br />
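The identity <math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math> can be checked numerically on a toy discrete model <math>y \rightarrow j \rightarrow x</math>; all distributions below are made up for illustration:<br />

```python
# Sketch: numeric check of Pr(y | x) = sum_j Pr(j | x) Pr(y | j)
# for a toy chain y -> j -> x (x independent of y given j).
p_y = {0: 0.4, 1: 0.6}                      # class prior Pr(y)
p_j_given_y = {0: {0: 0.8, 1: 0.2},         # Pr(j | y)
               1: {0: 0.3, 1: 0.7}}
p_x_given_j = {0: {'a': 0.9, 'b': 0.1},     # Pr(x | j)
               1: {'a': 0.2, 'b': 0.8}}

def joint(y, j, x):
    return p_y[y] * p_j_given_y[y][j] * p_x_given_j[j][x]

x = 'a'
p_x = sum(joint(y, j, x) for y in p_y for j in (0, 1))
# direct posterior via Bayes' rule
p_y_given_x = {y: sum(joint(y, j, x) for j in (0, 1)) / p_x for y in p_y}
# posterior via the mixture decomposition
p_j = {j: sum(p_y[y] * p_j_given_y[y][j] for y in p_y) for j in (0, 1)}
p_j_given_x = {j: sum(joint(y, j, x) for y in p_y) / p_x for j in (0, 1)}
p_y_given_j = {j: {y: p_y[y] * p_j_given_y[y][j] / p_j[j] for y in p_y}
               for j in (0, 1)}
decomposed = {y: sum(p_j_given_x[j] * p_y_given_j[j][y] for j in (0, 1))
              for y in p_y}
# decomposed agrees with the direct Bayes' rule posterior
```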
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]]<br />
<br />
We want to relate the results that we derived above to our RBF Network. In a RBF Network, as we can see on the right hand side, we have a set of data, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, and the hidden basis function, <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and then we have some output, <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. Also, we have weights from the hidden layer to output layer. The output is just the linear sum of <math>\displaystyle \phi</math>’s. <br />
<br />
Now, if we take the probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi_{j}(x)</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weight <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in one dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It's as if we put some Gaussian over data. And for each Gaussian, we consider the center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
The graph on the left plots the densities of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance: as a point moves away from the center <math>\displaystyle \mu_{1}</math>, <math>\displaystyle \phi_{1}</math> shrinks toward zero. Remember that we can usually achieve a non-linear regression or classification of the input space by doing a linear one in some extended space or feature space (more details in the Aside). Here, the <math>\displaystyle \phi</math>’s actually produce that feature space. <br />
<br />
So, one way to look at this is that <math>\displaystyle \phi</math> tells us, given an input, how likely a particular feature is to be present. Say, for example, we define the features as the centers of these Gaussian distributions. Then the <math>\displaystyle \phi</math> function computes, for a given data point, the possibility of that feature appearing. If the data point is right at the center, the value of that <math>\displaystyle \phi</math> is one, i.e. the probability is 1. If the point is far from the center, the <math>\displaystyle \phi</math> value is close to zero, i.e. the feature is unlikely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given the data. <br />
<br />
When we have those features, <math>\displaystyle y</math> is a linear combination of them. Hence each weight <math>\displaystyle w_{jk}</math>, which equals <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> is to appear given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> represents the probability of class membership given the feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose, we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>. Each data point has d features. And these n data points consist of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we do not perform least squares on this <math>\displaystyle X</math> matrix; we perform least squares on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define these n data points in terms of a new set of features. So, originally, we define our data points by d features, but now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point with its center. Hence, the new set of features are actually representing M centers in our data set, and for each data point, its new features check how this point is similar to the first center; how it is similar to the second center; and how it is similar to the <math>\displaystyle M^{th}</math> center. And this checking process will apply to all data points. Therefore, feature space gives another representation of our data set. <br />
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit to the training data, we often want to increase the complexity of our RBF network. By its construction, the only way to change the complexity of an RBF network is to increase or decrease the number of basis functions: a larger number of basis functions yields a more complex network. In theory, with enough basis functions the RBF network can fit any training set exactly; however, this does not mean the network generalizes well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is estimate the training error. Working through the training error, we will see that it can in fact be decomposed, and one component of the decomposition is the Mean Squared Error (MSE). In the notes that follow, our final goal is to obtain a good estimate of the MSE; to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn’t always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left-hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>~N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model works only for this particular training data, and is useless for prediction when new data points are introduced.<br />
<br />
n <- 20                                  # sample size<br />
x <- seq(1, 10, length = n)<br />
alpha <- 2.5                             # true intercept<br />
beta <- 1.75                             # true slope<br />
y <- alpha + beta * x + rnorm(n)         # linear model plus Gaussian noise<br />
plot(y ~ x, pch = 16, lwd = 3, cex = 0.5, main = 'Overfitting')<br />
abline(alpha, beta, col = 'blue')        # the true linear model<br />
lines(spline(x, y), col = 2)             # over-fitted interpolating curve<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is the task of selecting a model of optimal complexity for given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on training data, although our real goal is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficiently large number of basis functions.<br />
<br />
<br />
However, training error and testing error are not related in a simple way. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error can increase dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to the training data at the expense of generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly; however, it does not represent the true underlying data source, which is why it fails to model new data points correctly.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model typically adapts to the training data, and hence the training error is an overly optimistic estimate of the testing error. An obvious way to estimate the testing error well is to add a penalty term to the training error to compensate; SURE is developed from this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the sum of squared errors over the <math>N</math> training points.<br />
*<math>\displaystyle Err=\sum_{i=1}^M (\hat y_i-y_i)^2 </math> denote the ''test error'', the sum of squared errors over an independent test sample of <math>M</math> points.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}^N_{i=1}</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>, then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, where <math>\displaystyle y_i</math> and <math>\hat f_i</math> both have the same mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and if <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E(g'(Z))</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>: take <math>\,Z=y_i</math>, <math>\,\mu=f_i</math>, and <math>\,g(Z)=\hat f_i-f_i</math>. Then<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[g(Z)(Z-\mu)]=\sigma^2E[g'(Z)]=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model, not the function of the observations <math>\displaystyle y_i</math>, then <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
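Stein's lemma can be verified in closed form for a simple choice of <math>\displaystyle g</math>. The sketch below takes <math>\,g(z)=z^2</math> and uses the standard Gaussian moments <math>E[Z^2]=\mu^2+\sigma^2</math> and <math>E[Z^3]=\mu^3+3\mu\sigma^2</math>; the particular values of <math>\,\mu</math> and <math>\,\sigma^2</math> are arbitrary:<br />

```python
# Sketch: checking Stein's lemma  E[g(Z)(Z - mu)] = sigma^2 E[g'(Z)]
# in closed form for g(z) = z^2 and Z ~ N(mu, sigma^2).
mu, sigma2 = 1.5, 0.8
EZ2 = mu ** 2 + sigma2            # E[Z^2]
EZ3 = mu ** 3 + 3 * mu * sigma2   # E[Z^3]
lhs = EZ3 - mu * EZ2              # E[Z^2 (Z - mu)]
rhs = sigma2 * 2 * mu             # sigma^2 E[g'(Z)] = sigma^2 E[2Z]
# both sides equal 2 * mu * sigma^2
```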
<br />
<br />
<br />
====Two Different Cases====<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point has been introduced to the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent, so <math>\displaystyle cov(y_i,\hat f)=0</math> (equivalently, consider <math>\frac{\partial \hat f}{\partial y_i}</math>: since <math>\hat f</math> is estimated from the training data alone, a new point <math>\,y_i</math> has no influence on it, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>). In this case <math>\displaystyle (1)</math> becomes:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Using the notation above, we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross-validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing the <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross-validation, to avoid overfitting or underfitting, the validation data set is independent of the data used to estimate the model.<br />
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is that in which we do not use new data points to assess the performance of the estimated model, and the training data is used both to estimate and to assess the model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored, because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can instead be estimated via Stein's lemma, which was originally proposed to estimate the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, <math>\displaystyle (3)</math> is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)]: an unbiased estimator of the mean-squared error of a given estimator in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
<br />
<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the number of basis functions should be chosen to minimize the generalization error <math>\displaystyle err</math>. For the Radial Basis Function Network, setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, we can tell that <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Thus we obtain the further simplification <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace((\Phi^{T}\Phi)^{-1}\Phi^{T}\Phi)=d</math>, where <math>\displaystyle d</math> is the number of columns of <math>\displaystyle \Phi</math>, since <math>\displaystyle \Phi</math> is a projection of the input matrix <math>\,X</math> onto a set spanned by <math>\,M</math> basis functions. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
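The trace identity above is easy to check numerically. A sketch (Python/NumPy; the Gaussian basis, its width, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
x = rng.uniform(-1, 1, N)
centers = np.linspace(-1, 1, M)

# design matrix: M Gaussian basis functions plus an intercept column
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * 0.3 ** 2))
Phi = np.column_stack([np.ones(N), Phi])

H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T   # hat matrix
print(np.trace(H))                             # M + 1 = 5, up to rounding
```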
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimal number of basis functions by choosing the model with the smallest MSE over the set of models considered. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\,M</math>, we compute the training error <math>\displaystyle err(M)</math> for each. <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples, and the noise variance <math>\sigma^2</math> can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Figure 27.1 on the right shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math>.<br />
<br />
sure_Err = error - num_data_point * sigma .^ 2 + 2 * sigma .^2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated using the error.<br />
<br />
error = (output - expected_output) .^ 2;    % vector of squared residuals<br />
sigma2 = sum(error) / (num_data_point - 1); % estimated noise variance<br />
sure_Err = sum(error) - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean square error) = <math>\sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) = <math>\sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units) = <math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then <math>\displaystyle N\sigma^2 </math> is a constant with no impact on the minimization,<br />
and we can ignore it. We only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and it changes when <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math>.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>\sim N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
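To make (28.3) concrete, the model-selection loop can be sketched as follows (Python/NumPy rather than MATLAB; a polynomial basis stands in for the RBF basis, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-1, 1, N)
y = np.sin(3 * x) + rng.normal(0, 0.2, N)     # noisy samples of a smooth function

est_mse = {}
for M in range(1, 13):
    # design matrix with an intercept plus M basis functions: 1, x, ..., x^M
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    err = np.sum((Phi @ w - y) ** 2)          # training error err(M)
    est_mse[M] = err * (2 * M + 1) / (N - 1)  # equation (28.3)

best_M = min(est_mse, key=est_mse.get)
print(best_M)
```

The penalty factor <math>\,(2M+1)</math> grows with model size, so the estimated MSE stops decreasing once extra basis functions only chase noise.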
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error will decrease and the MSE will increase when increasing the number of hidden units (i.e. the model is more complex).<br />
<br />
<br />
When the number of hidden units gets larger and larger, the training error decreases and approaches <math>\displaystyle 0 </math>. If the training error approaches <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, (28.3) suggests that the MSE approaches <math>\displaystyle 0 </math> as well. In fact this does not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the MSE increases instead of approaching <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
Note that <math>\displaystyle \hat\sigma^2 </math> is essentially the average of <math>\displaystyle err </math>, so the noise estimate changes with the model. To deal with this, we can estimate <math>\displaystyle err</math> separately for each number of hidden units; for example, first with 1 hidden unit, and then with 10 hidden units.<br />
<br />
We can also see that, unlike the classical Cross-Validation (CV) or Leave-One-Out (LOO) techniques, the SURE technique does not need a separate validation step to find the optimal model. Hence SURE uses less data than CV or LOO, which makes it suitable when there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering K-means clustering] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units equals the number of clusters; each cluster <math>\displaystyle j </math> contributes one basis function <math>\displaystyle \phi_j </math>.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, with the same functional form for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. Assign each point to the cluster with the nearest center (every point in a cluster should be closer to that center than to any other center).<br />
<br />
2. Compute the mean of each cluster and make it the new center of that cluster.<br />
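The two steps above can be sketched directly (a minimal Python/NumPy illustration, not the MATLAB kmeans routine used in the example below; all names are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means. X is (n, d); returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # step 1: assign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 2: recompute each center as the mean of its cluster
        # (an empty cluster keeps its old center)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```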
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden values)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
ans = 30 1<br />
>> size(C) <br />
ans = 2 80<br />
>> size(sumD) <br />
ans = 2 1<br />
>> c1=sum(IDX==1)<br />
c1 = 14<br />
>> c2=sum(IDX==2)<br />
c2 = 16<br />
>> sumD<br />
sumD = 85.6643<br />
101.0419<br />
>> v1=sumD(1,1)/c1 <br />
v1 = 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
v2 = 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points in 80 dimensions (MATLAB's kmeans treats each row as an observation), and then apply the kmeans function to separate <math>X</math> into 2 clusters. IDX is a 30*1 vector containing 1 or 2, indicating the cluster of each point. <math>\displaystyle C </math> holds the center (mean) of each cluster, with size 2*80; sumD is the sum of squared distances between the data points and the center of their cluster. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the numbers of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the variance of the first cluster <math>\displaystyle (v1=\sigma_1^2)</math>; <math>\displaystyle v2 </math> is the variance of the second cluster <math>\displaystyle (v2=\sigma_2^2)</math>. Now we can get <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math> and <math>\displaystyle \hat Y </math> from the following equations. Finally, we obtain the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is the EM algorithm for the Gaussian mixture model. Generally speaking, the Gaussian mixture model gives a soft clustering while <math>K</math>-means gives a hard clustering.<br />
<br />
Similar to <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
In the E-step, each point is assigned a weight (responsibility) for each cluster, based on the likelihood of the point under the corresponding Gaussian. Unlike <math>K</math>-means, which gives a point weight 1 for the cluster whose center is closest and 0 for all other clusters, these weights can take any value between 0 and 1. <br />
<br />
In the M-step, compute the weighted mean and covariance for every cluster and use them as the new means and covariances.<br />
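The soft assignment of the E-step can be sketched as follows (Python/NumPy, for spherical Gaussians; this is an illustration, not the mdgEM routine called below, and all names are assumptions):

```python
import numpy as np

def e_step(X, means, sigmas, priors):
    """Responsibilities r[i, j] = P(cluster j | x_i) for spherical Gaussians."""
    n, d = X.shape
    k = len(means)
    r = np.zeros((n, k))
    for j in range(k):
        d2 = ((X - means[j]) ** 2).sum(axis=1)
        # prior times the Gaussian density (constants that cancel are omitted)
        r[:, j] = priors[j] * np.exp(-d2 / (2 * sigmas[j] ** 2)) / sigmas[j] ** d
    return r / r.sum(axis=1, keepdims=True)  # each row sums to 1
```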
<br />
>>[P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier: they construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized into what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVMs)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a hyperplane, or set of hyperplanes, in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separable by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
We can see that the data points fall into two classes in <math>\displaystyle \mathbb{R}^{2} </math> and can be separated by a linear boundary. When a dataset is indeed linearly separable, there exist infinitely many possible separating hyperplanes for the training data, including the two black lines in the figure. However, which solution is best when new data are introduced? <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem, obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point.<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: since distance is positive, if the data is on <math>\displaystyle +1 </math> side the distance is <math>\displaystyle d_i(+1)</math>. If the data is on the <math>\displaystyle -1 </math> side the distance is <math>\displaystyle d_i(-1)</math>.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>; we can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers====<br />
Choose the line farthest from both classes; that is, choose the line with the maximum distance from the closest point (i.e. maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=min\{y_id_i\}</math> <math>\displaystyle i=1,2,....,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since we are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Let <math>\displaystyle x_1,x_2</math> be two points lying on the hyperplane. Then<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
Subtracting these,<br />
<br />
<math>\displaystyle (\beta^{T}x_1+\beta_0)-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence <math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and <math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the unit vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
<br />
<br />
3. The signed distance of any point <math>\displaystyle x_i </math> to the hyperplane: projecting <math>\displaystyle (x_i-x_0)</math> onto the unit normal <math>\displaystyle \frac{\beta}{\|\beta\|} </math> (since only the direction of <math>\displaystyle \beta </math> matters, we normalize it to unit length) gives<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
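This formula is simple to compute. A sketch (Python/NumPy; the function name and test hyperplane are illustrative):

```python
import numpy as np

def signed_distance(x, beta, beta0):
    # d = (beta^T x + beta0) / ||beta||, from property 3
    return (beta @ x + beta0) / np.linalg.norm(beta)

# hyperplane x1 + x2 - 1 = 0; the point (1, 1) lies on the positive side
beta = np.array([1.0, 1.0])
print(signed_distance(np.array([1.0, 1.0]), beta, -1.0))  # 1/sqrt(2) ≈ 0.7071
```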
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
<math>\displaystyle Margin=min(y_id_i)</math> <math>\displaystyle i=1,2,....,N </math><br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane. Then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta </math> only the direction is important, and <math>\displaystyle \frac{\beta}{c} </math> has the same direction as <math>\displaystyle \beta </math>, so the hyperplane is unchanged.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie,T.,Tibshirani,R., Friedman,J.,(2008).The Elements of Statistical Learning:129-130<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
The Support Vector Machine is used to find a maximum margin hyperplane, assuming the two classes are separable. This margin can be written as <math>\,min\{y_id_i\}</math>, or the distance of each point from the hyperplane, where <math>\,d_i</math> is the distance and <math>\,y_i</math> is used as the sign.<br />
===Maximizing the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that the term <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, but <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane.<br />
<br />
This implies <math>\,\exists C</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> define the hyperplane, and only their direction matters: dividing through by a constant does not change the hyperplane. Thus, by scaling <math>\,\beta, \beta_0</math> we can eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>, implying that the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math>.<br />
<br />
Now, in order to maximize the margin, we simply need to maximize <math>\,\frac{1}{\|\beta\|}</math>. <br />
<br />
In other words, find the minimum <math>\,\|\beta\|</math>, s.t. <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. The 1-norm of a vector is simply the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance); it is sometimes preferred, but it has a discontinuity in its derivative. The 2-norm, the Euclidean norm (the intuitive length of the vector), is easier to work with - that is, <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>.<br />
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the constraints are satisfied while finding an optimal solution; the <math>\,\alpha_i</math> are introduced as dual variables (Lagrange multipliers). A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left(y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right)}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is the dual representation of the maximum margin problem; it is maximized with respect to <math>\,\alpha</math>.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \,\,\forall i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the equation that we need to minimize, as well as two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are looking to solve for <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> has entries <math>S_{ij} = y_iy_jx_i^Tx_j = (y_ix_i)^T(y_jx_j)</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
<br />
Using this matrix notation, we can use Matlab's built-in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use quadprog to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's quadprog function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. One subtlety: quadprog ''minimizes'' its objective, while we want to ''maximize'' <math>\,L(\alpha)</math>. Maximizing <math>\,L(\alpha)</math> is the same as minimizing <math>\,\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{\alpha}^T\underline{1}</math>, so we pass <math>\,H = S</math> and <math>\,f</math> equal to a vector of <math>\,-1</math>s.<br />
<br />
We'll use a simple one-dimensional data set, which is essentially y = -1 or 1 + Gaussian noise. (Note: you could easily put the values straight into the quadprog call; they are separated for clarity)<br />
<br />
x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]'; % 1 x 200 row of inputs<br />
y = [-ones(100,1); ones(100,1)];                        % 200 x 1 labels<br />
z = x' .* y;        % z_i = y_i * x_i, so S = z*z' has entries y_i*y_j*x_i*x_j<br />
S = z * z';<br />
f = -ones(200,1);   % minus sign because quadprog minimizes<br />
Aeq = y';           % equality constraint: sum of alpha_i * y_i = 0<br />
beq = 0;<br />
lb = zeros(200,1);  % alpha_i >= 0, one bound per variable<br />
ub = [];            % there is no upper bound<br />
alpha = quadprog(S,f,[],[],Aeq,beq,lb,ub);<br />
<br />
This gives us the optimal <math>\,\alpha</math>. With the lower bound supplied as a full vector of zeros, the solution respects <math>\,\alpha_i \geq 0</math>; for well-separated data, most entries of <math>\,\alpha</math> come out (numerically) zero, and the few nonzero entries correspond to the support vectors.<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{i}</math>. If <math>\,f</math> and <math>\,g_i</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial \hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum{\alpha_ig_i'(\hat{x})}=0</math><br />
# <math>\alpha_i \geq 0</math>. (Dual Feasibility) <br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math> (Complementary Slackness)<br />
# <math>g_i(\hat{x}) \geq 0</math> (Primal Feasibility)<br />
<br />
If any of these conditions is violated, then <math>\hat{x}</math> cannot be a local minimum.<br />
<br />
These are all straightforward except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that actually determine the optimal separating hyperplane we are looking for. They are also the most difficult points to classify, and the most informative ones for the classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math><br />
<br />
Every point is either exactly 1 or more than 1 away from the hyperplane, in the sense that <math>\,y_i(\beta^Tx_i+\beta_0) \geq 1</math>.<br />
<br />
'''Case 1: a point > 1 away'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) - 1 > 0</math> then <math>\,\alpha_i = 0</math>.<br />
<br />
In other words, if the point isn't on the margin, then the corresponding <math>\,\alpha</math> value is 0.<br />
<br />
'''Case 2: a point 1 away'''<br />
<br />
Conversely, an <math>\,\alpha</math> value can either be 0 or <math>\, > 0</math>. If <math>\,\alpha_i > 0</math>, then that point is on the margin.<br />
<br />
That is, if <math>\,\alpha_i > 0</math> then <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math>.<br />
<br />
Points on the margin -- points with corresponding <math>\,\alpha_i > 0</math> -- are called support vectors of that margin.<br />
<br />
===Using support vectors===<br />
<br />
Support vectors are important because they make the support vector machine solution depend only on the training points closest to the boundary. If <math>\,\alpha_i = 0</math>, the corresponding term contributes nothing to the solution of the SVM problem; only points on the margin -- the support vectors -- contribute.<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's <code>quadprog</code> (which minimizes, so pass it the negated objective) to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
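To make the three steps concrete, here is a short Python/NumPy sketch (rather than the Matlab <code>quadprog</code> route) for a hypothetical two-point toy problem. With one point per class, the constraint <math>\sum_i{\alpha_i y_i} = 0</math> forces <math>\,\alpha_1=\alpha_2</math>, so the dual reduces to a one-dimensional concave quadratic that can be maximized in closed form.<br />

```python
import numpy as np

# Toy linearly separable data: one point per class (a hypothetical example).
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# Dual objective: L(alpha) = sum(alpha) - 0.5 * alpha^T H alpha,
# where H_ij = y_i y_j x_i^T x_j.
H = (y[:, None] * y[None, :]) * (X @ X.T)

# The constraint sum_i alpha_i y_i = 0 forces alpha = a * u with u = (1, 1),
# so L reduces to a concave parabola in a, maximized at a* = u^T 1 / (u^T H u).
u = np.ones(2)
a_star = (u @ np.ones(2)) / (u @ H @ u)
alpha = a_star * u                       # step 1: optimal dual variables

beta = (alpha * y) @ X                   # step 2: beta = sum_i alpha_i y_i x_i
sv = 0                                   # step 3: pick a support vector ...
beta0 = y[sv] - beta @ X[sv]             # ... and solve y_i(beta^T x_i + beta0) = 1

print(alpha, beta, beta0)
```

Both training points end up with <math>\,\alpha_i > 0</math>, i.e. both are support vectors, and each satisfies <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math> exactly.<br />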
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab's built-in SVM routines (found in the Bioinformatics Toolbox) to do classification with support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% crossvalind returns logical index vectors: train(i) is 1 where point i should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]</div>
<hr />
<div>
Thus, given a new input <math>\,X \in \mathcal{X}</math>, we can use the classification rule to predict the corresponding label <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented with its features <math>(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> is the predicted fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:'''True error rate''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:'''Empirical error rate (training error rate)''' of a classifier <math>\,h</math> is defined as the proportion of points in the training set that <math>\,h</math> does not correctly classify, i.e.,<br />
::<math>\, \hat{L}(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is the indicator function <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
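As a minimal sketch (in Python rather than the course's Matlab), the empirical error rate is just the fraction of mislabelled training points; the helper function and the threshold rule below are hypothetical.<br />

```python
import numpy as np

def empirical_error_rate(h, X, Y):
    """Fraction of training points (X_i, Y_i) with h(X_i) != Y_i."""
    predictions = np.array([h(x) for x in X])
    return float(np.mean(predictions != Y))

# Hypothetical 1-D example: classify by thresholding at zero.
X = np.array([-2.0, -1.0, 0.5, 2.0])
Y = np.array([0, 0, 1, 1])
h = lambda x: int(x > 0)
print(empirical_error_rate(h, X, Y))   # this rule makes no training errors
```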
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object via Bayes' formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximal over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y\in \mathcal{Y}</math> that <br />
maximizes <math>\,P(Y=y|X=x)</math>, and classify <math>\,X</math> into class <math>\,y</math>. In order to calculate the value of <math>\,P(Y=y|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two classes, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>. Define <math>\,r(x)=P(Y=1|X=x)</math>. By ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<sub></sub><br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''3 different approaches to classification''':<br />
<br />
1) Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math><br />
<br />
2) Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3) Density Estimation: Estimate <math>\,P(X=x|Y=0)</math> and <math>\,P(X=x|Y=1)</math> (less popular in high-dimensional cases)<br />
<br />
<br />
<br />
'''Bayes Classification Rule Optimality Theorem''': The Bayes rule is optimal with respect to the true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
<br />
:<math>\, h^*(X)= \left\{\begin{matrix} <br />
1 & P(Y=1|X=x)>P(Y=0|X=x) \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
Remarks:<br />
<br />
1) The Bayes classification rule is optimal. Proof: [http://www.ee.columbia.edu/~vittorio/BayesProof.pdf]<br />
<br />
2) We still need other methods, since in practice we cannot determine the prior probability.<br />
<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
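The calculation above can be checked with a few lines of Python. The class-conditional likelihoods below (<code>lik1 = 0.05</code>, <code>lik0 = 0.2</code>) are assumptions chosen only to reproduce the 0.025/0.125 numbers quoted above, since the underlying table is in the image.<br />

```python
def bayes_posterior(lik1, lik0, prior1=0.5, prior0=0.5):
    """r(x) = P(Y=1|X=x) computed via Bayes' formula."""
    return lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

# Assumed likelihoods for X = (G=0, M=1, H=0); priors are 0.5 as stated.
r = bayes_posterior(lik1=0.05, lik0=0.2)
label = 1 if r > 0.5 else 0        # the Bayes classification rule
print(r, label)                    # posterior 0.2 < 1/2, so predict "fail"
```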
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, since it is generally impossible to know the prior <math>\,P(Y=1)</math> and the class conditional density <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>; this makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], tree-augmented naive Bayes (TAN), Bayesian-network-augmented naive Bayes (BAN), and the general Bayesian network (GBN).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayesian vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first treats probability as changing based on observation, while the second treats probability as having an objective existence. They represent two different schools of statistics.<br />
<br />
In the history of statistics, there have been two major schools of thought: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample(there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayesian'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events, based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g. a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, one does not see the man directly, but judges from photos (the data) whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
'''Multi-class Classification''':<br />
<br />
Y takes on more than two values.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
''Theorem'': Suppose that <math>\,Y \in \mathcal{Y}= \{1,\dots,k\}</math>, the optimal rule is :<math>\,h^*(X) = \arg\max_{k}{P(Y = k|X = x)}</math><br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier is optimal, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
1 Empirical Risk Minimization: Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^*\in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
<br />
2 Regression: Find an estimate <math> \hat r </math> of the function <math> r </math> and define <br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & \hat r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
3 Density estimation, estimate <math>P(X = x|Y = 0)</math> and <math>P(X = x|Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
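The final line of the derivation gives the boundary in the form <math>\,a^\top x+b=0</math> with <math>\,a=\Sigma^{-1}(\mu_k-\mu_l)</math> and <math>\,b=\log(\frac{\pi_k}{\pi_l})-\frac{1}{2}(\mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l)</math>. A short Python sketch (with hypothetical means and covariance) checks the halfway-point claim:<br />

```python
import numpy as np

def lda_boundary(mu_k, mu_l, Sigma, pi_k=0.5, pi_l=0.5):
    """Coefficients (a, b) of the linear LDA boundary a^T x + b = 0."""
    Sinv = np.linalg.inv(Sigma)
    a = Sinv @ (mu_k - mu_l)
    b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)
    return a, b

# Hypothetical classes: equal priors, shared identity covariance.
mu_k, mu_l = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
a, b = lda_boundary(mu_k, mu_l, np.eye(2))
midpoint = 0.5 * (mu_k + mu_l)
print(a, b, a @ midpoint + b)   # the midpoint of the means lies on the boundary
```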
<br />
===QDA===<br />
The concept is the same idea of finding a boundary where the error rate for classification between classes are equal, except the assumption that each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math> is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
The boundary is quadratic because, when the covariances differ, the quadratic terms <math>\,x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x</math> no longer cancel.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on [http://academicearth.org/lectures/advice-for-applying-machine-learning LDA and QDA] so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
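The two discriminant functions in the theorem translate directly into code; below is a Python sketch (with hypothetical toy parameters), where classification is just an argmax over classes:<br />

```python
import numpy as np

def delta_quadratic(x, mu, Sigma, pi):
    """QDA discriminant: -log|Sigma|/2 - (x-mu)^T Sigma^{-1} (x-mu)/2 + log pi."""
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(pi))

def delta_linear(x, mu, Sigma, pi):
    """LDA discriminant (shared covariance): x^T S^-1 mu - mu^T S^-1 mu / 2 + log pi."""
    Sinv = np.linalg.inv(Sigma)
    return x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(pi)

# Hypothetical two-class problem with a shared identity covariance.
mus = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
Sigma = np.eye(2)
x = np.array([0.9, 0.1])
k_hat = int(np.argmax([delta_linear(x, mu, Sigma, 0.5) for mu in mus]))
print(k_hat)   # x is classified to the class whose mean it is nearer
```

With equal covariances and equal priors, the quadratic and linear discriminants give the same classification.<br />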
<br />
===In practice===<br />
In practice we do not know the true priors, means, or covariances, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
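These estimators are easy to implement. Below is a Python sketch (a hypothetical helper, using the ML versions above that divide by <math>\,n_k</math>) that computes the priors, class means, and pooled covariance from labelled data:<br />

```python
import numpy as np

def lda_estimates(X, y):
    """Sample priors, class means, and pooled (common) covariance."""
    classes = np.unique(y)
    n = len(y)
    pi, mu, Sigma_k, n_k = {}, {}, {}, {}
    for k in classes:
        Xk = X[y == k]
        n_k[k] = len(Xk)
        pi[k] = n_k[k] / n
        mu[k] = Xk.mean(axis=0)
        centered = Xk - mu[k]
        Sigma_k[k] = centered.T @ centered / n_k[k]    # ML estimate: divide by n_k
    Sigma = sum(n_k[k] * Sigma_k[k] for k in classes) / n  # pooled covariance
    return pi, mu, Sigma

# Hypothetical toy data: two classes of two points each.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
y = np.array([0, 0, 1, 1])
pi, mu, Sigma = lda_estimates(X, y)
print(pi[0], mu[0], Sigma)
```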
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
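The transformation <math>\, x^* = S^{-\frac{1}{2}}U^\top x </math> can be verified numerically: after this whitening step, the Mahalanobis distance under <math>\,\Sigma</math> becomes an ordinary Euclidean distance. The covariance below is a hypothetical example.<br />

```python
import numpy as np

# Hypothetical symmetric positive-definite covariance.
Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
U, S, _ = np.linalg.svd(Sigma)     # Sigma symmetric => Sigma = U S U^T
W = np.diag(S ** -0.5) @ U.T       # whitening map: x* = S^{-1/2} U^T x

x = np.array([1.0, -1.0])
mu = np.array([0.0, 0.0])

mahalanobis = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
euclidean = float(np.sum((W @ x - W @ mu) ** 2))
print(mahalanobis, euclidean)      # the two distances agree
```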
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise, we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape (covariance) for this method to be applicable, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose you have two classes with different shapes and you try to transform them to the same shape. Given a data point, which transformation should you use to decide its class? If, for example, you use the transformation of class A, then you have already assumed that the point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class against the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
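The two counts are easy to tabulate; the helper functions below are hypothetical, but implement exactly the formulas above.<br />

```python
def lda_params(K, d):
    """(K-1) linear differences a^T x + b, each with d+1 parameters."""
    return (K - 1) * (d + 1)

def qda_params(K, d):
    """(K-1) quadratic differences, each with d(d+3)/2 + 1 parameters."""
    return (K - 1) * (d * (d + 3) // 2 + 1)

# QDA's parameter count grows quadratically with the dimension d.
for d in (2, 10, 64):
    print(d, lda_params(3, d), qda_params(3, d))
```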
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently; the Matlab help file on <code>princomp</code> gives the full details. Here we analyze the code of <code>princomp</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
    latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should note the following differences from the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we must take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers <math>\,X</math> by subtracting off the column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using princomp and SVD respectively, obtaining the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
We can then verify that <code>y = score</code> and <code>v = U</code>.<br />
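The same comparison can be sketched in Python with NumPy — an illustrative translation only, using random data in place of the 2_3 set:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 64))     # rows = observations, as princomp expects (cf. X')

# princomp-style PCA via an SVD of the centered data
Xc = X - X.mean(axis=0)            # center by subtracting column means
U, s, Vt = np.linalg.svd(Xc / np.sqrt(len(X) - 1), full_matrices=False)
pc = Vt.T                          # coefficients: the V of X = U d V'
score = Xc @ pc                    # representation of X in the PC space
latent = s**2                      # eigenvalues of the covariance matrix

# sanity check: the scores reconstruct the centered data exactly
assert np.allclose(score @ pc.T, Xc)
```

As in the Matlab session, the coefficients come from <math>\,V</math> of the SVD and the scores are the centered data projected onto them.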
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus a second <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> parameter estimates make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^{d}</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> (where <math>\,v</math> is a diagonal <math>d \times d</math> matrix) that we cannot estimate directly.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
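This feature-lifting step can be sketched in a few lines of Python/NumPy. The function name <code>augment</code> is our own, not from any library:

```python
import numpy as np

def augment(X, fn=np.square):
    """Append fn(X) as extra columns, lifting the data to a higher
    dimension before applying a linear method such as LDA."""
    return np.hstack([X, fn(X)])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(augment(X))                # columns: x1, x2, x1^2, x2^2
print(augment(X, np.sin).shape)  # any elementwise function works, e.g. sin
```

Passing a different <code>fn</code> gives the cubic or <math>\,\sin</math> extensions mentioned above.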
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
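:The trick can also be sketched end-to-end in Python with a minimal NumPy-only LDA. This is our own toy implementation (not Matlab's <code>classify</code>), run on synthetic two-class data rather than 2_3, so the accuracies are illustrative only:

```python
import numpy as np

def lda_fit_predict(X, y):
    """Minimal two-class LDA: pooled covariance, equal priors assumed."""
    X1, X2 = X[y == 1], X[y == 2]
    mu1, mu2 = X1.mean(0), X2.mean(0)
    Sw = (np.cov(X1.T) * (len(X1) - 1) + np.cov(X2.T) * (len(X2) - 1)) / (len(X) - 2)
    w = np.linalg.solve(Sw, mu1 - mu2)       # discriminant direction
    c = w @ (mu1 + mu2) / 2                  # midpoint threshold
    return np.where(X @ w > c, 1, 2)

rng = np.random.default_rng(2)
# a blob inside a ring: no line separates them, but a quadric does
X1 = rng.normal(size=(200, 2)) * 0.5
X2 = rng.normal(size=(200, 2))
X2 *= (2.5 / np.linalg.norm(X2, axis=1))[:, None]
X = np.vstack([X1, X2])
y = np.r_[np.ones(200, int), 2 * np.ones(200, int)]

plain = np.mean(lda_fit_predict(X, y) == y)
lifted = np.mean(lda_fit_predict(np.hstack([X, X**2]), y) == y)
print(plain, lifted)  # the lifted (quadratic) features should classify far better
```

On data like this, plain LDA is near chance, while the same linear algorithm on the squared-feature lift separates the classes almost perfectly.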
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
LDA is used for classification, while FDA is used for feature extraction.<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the prior knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To separate the two classes we must determine which class's mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, and so it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So <math>\underline{w}</math> is proportional to <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />
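This proportionality is easy to check numerically. The NumPy sketch below uses sample means and covariances in place of the population quantities:

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([1, 1], [[1, 1.5], [1.5, 3]], 300)
X2 = rng.multivariate_normal([5, 3], [[1, 1.5], [1.5, 3]], 300)

mu1, mu2 = X1.mean(0), X2.mean(0)
Sw = np.cov(X1.T) + np.cov(X2.T)            # within class covariance
Sb = np.outer(mu1 - mu2, mu1 - mu2)         # between class covariance

# top eigenvector of Sw^{-1} Sb ...
vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = vecs[:, np.argmax(vals.real)].real

# ... is proportional to Sw^{-1} (mu1 - mu2)
w_direct = np.linalg.solve(Sw, mu1 - mu2)
w_eig /= np.linalg.norm(w_eig)
w_direct /= np.linalg.norm(w_direct)
print(np.allclose(abs(w_eig @ w_direct), 1.0))
```

Since <math>S_{B}</math> is rank one, the only eigenvector with a nonzero eigenvalue is exactly this direction, so solving one linear system replaces the eigendecomposition.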
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with explanations of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
(When there are more than two classes, it is more reasonable to have at least two projection directions.)<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> is the scatter matrix of class <math>\,i</math> (with no <math>\frac{1}{n_{i}}</math> factor, so that the decomposition below holds) and <math>\mathbf{\mu}_{i} = \frac{1}{n_{i}}\sum_{j: y_{j}=i}\mathbf{x}_{j}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not easy to obtain. One of the simplifications<br />
is that we may assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant, since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is a more general derivation of <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
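The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> holds exactly for the unnormalized scatter matrices; a NumPy check on random multi-class data:

```python
import numpy as np

rng = np.random.default_rng(4)
k, d = 3, 5
Xs = [rng.normal(loc=i, size=(50 + 10 * i, d)) for i in range(k)]  # k classes
X = np.vstack(Xs)
mu = X.mean(axis=0)                       # total mean

# scatter-matrix forms of S_W, S_B and S_T (sums, not averages)
Sw = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in Xs)
Sb = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in Xs)
St = (X - mu).T @ (X - mu)

print(np.allclose(St, Sw + Sb))
```

The cross terms in the algebraic expansion vanish because the deviations within each class sum to zero, which is exactly what the numerical check confirms.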
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form, with <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>, so<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
Thus the general <math>\mathbf{S}_{B}</math> is simply a scalar multiple of <math>\mathbf{S}_{B^{\ast}}</math>, so the two formulations yield the same discriminant directions.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,k-1<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
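A NumPy sketch of this multi-class solution, built from the scatter matrices defined above on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
k, d = 4, 6
mus = rng.normal(scale=4.0, size=(k, d))                    # class means
Xs = [rng.normal(size=(80, d)) + mus[i] for i in range(k)]
X = np.vstack(Xs)
mu = X.mean(axis=0)

Sw = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in Xs)
Sb = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in Xs)

# columns of W: eigenvectors of Sw^{-1} Sb for the k-1 largest eigenvalues
vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(vals.real)[::-1]
W = vecs[:, order[:k - 1]].real        # d x (k-1) transformation matrix

Z = X @ W                              # data projected to k-1 dimensions
print(W.shape, Z.shape)
```

Since <math>\mathbf{S}_{B}</math> has rank at most <math>k-1</math>, only the first <math>k-1</math> eigenvalues are (generically) nonzero, which is why the projection has <math>k-1</math> columns.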
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
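The norm-trace identity used in this derivation is easy to spot-check numerically:

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(3, 4)
# squared Frobenius norm equals the trace of X^T X
assert np.isclose((X**2).sum(), np.trace(X.T @ X))
print(np.trace(X.T @ X))  # prints 506.0
```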
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> can have at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \leq k-1</math>.<br />
<br />
Therefore, the solution for this question is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
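To make this concrete, here is a minimal numerical sketch (in Python with NumPy, as an illustration; the course examples use Matlab) with made-up class means: we build <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math> from synthetic data for <math>k=3</math> classes and keep the eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> with the <math>k-1</math> largest eigenvalues:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: k = 3 classes in d = 4 dimensions (class means are made up
# for illustration only).
k, d, n_i = 3, 4, 50
means = np.array([[0., 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]])
classes = [rng.normal(m, 1.0, size=(n_i, d)) for m in means]

mu = np.vstack(classes).mean(axis=0)  # overall mean
# Within-class scatter S_W and between-class scatter S_B.
S_W = sum((C - C.mean(axis=0)).T @ (C - C.mean(axis=0)) for C in classes)
S_B = sum(n_i * np.outer(C.mean(axis=0) - mu, C.mean(axis=0) - mu)
          for C in classes)

# Eigenvectors of S_W^{-1} S_B; keep the k-1 with the largest eigenvalues.
vals, vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(vals.real)[::-1]
W = vecs.real[:, order[:k - 1]]       # d x (k-1) transformation matrix
```

Since <math>\mathbf{S}_{B}</math> is a sum of <math>k</math> rank-one matrices constrained by the overall mean, the remaining eigenvalues come out numerically zero.<br />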
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(x) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given training data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the least squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Setting the first derivative to zero,<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
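As a quick sketch of these formulas (a Python/NumPy illustration on random data; the values are not from the course), we can verify that <math>\hat\beta</math> and the hat matrix behave as derived: <math>\mathbf{H}</math> is a projection (symmetric and idempotent) and the residuals are orthogonal to the columns of <math>\mathbf{X}</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 2))])  # n x (d+1)
y = rng.normal(size=30)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
y_hat = H @ y                                  # fitted values

# The hat matrix is a projection: symmetric and idempotent.
assert np.allclose(H, H.T) and np.allclose(H @ H, H)
# Residuals are orthogonal to the column space of X.
assert np.allclose(X.T @ (y - y_hat), 0)
```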
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by appending a row of ones to the data.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
<br />
=== Logistic Function ===<br />
A logistic function or logistic curve is the most common sigmoid curve. <br />
<br />
:<math>y = \frac{1}{1+e^{-x}}</math><br />
<br />
1. <math>\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}</math><br />
<br />
2. <math>y(0) = \frac{1}{2}</math><br />
<br />
3. <math> \int y\, dx = \ln(1 + e^{x})</math><br />
<br />
4. <math> y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots </math> <br />
<br />
5. The logistic curve shows early exponential growth for negative <math>x</math>, which slows to linear growth of slope 1/4 near <math>x = 0</math>, then approaches <math>y = 1</math> with an exponentially decaying gap.<br />
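Properties 1 and 2 are easy to check numerically; the following short Python sketch (an illustration; the course examples use Matlab) compares the closed form <math>y(1-y)</math> of the derivative with a central finite difference:<br />

```python
import math

def sigmoid(x):
    """The logistic function y = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

# Property 2: y(0) = 1/2.
print(sigmoid(0.0))

# Property 1: dy/dx = y(1 - y), checked against a central difference.
h = 1e-6
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    exact = sigmoid(x) * (1 - sigmoid(x))
    assert abs(numeric - exact) < 1e-8
```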
<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
<br />
'''Class 1'''<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})</math> <br />
<br />
<br />
Then we have that<br />
<br />
'''Class 0'''<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood, using Pr(Y|X). The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful Matrix Reference Manual where you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained if we first use the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math> to reduce the occurrences of <math>\underline{\beta}</math> to one,<br />
<br />
and then solve <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Here we perform a weighted linear regression of the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
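The WLS closed form above can be sanity-checked numerically: it must agree with ordinary least squares applied to <math>\sqrt{w_i}</math>-scaled data. A small Python/NumPy sketch with random data (values are illustrative only):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 40, 3
X = rng.normal(size=(d, n))          # d x n, one observation per column
y = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)    # positive weights w_i
W = np.diag(w)

# Closed form: beta = (X W X^T)^{-1} X W y
beta_wls = np.linalg.solve(X @ W @ X.T, X @ W @ y)

# Same answer from ordinary least squares on sqrt(w)-scaled data.
Xs = (X * np.sqrt(w)).T
ys = np.sqrt(w) * y
beta_ols, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
assert np.allclose(beta_wls, beta_ols)
```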
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, the full coefficient vector will be <math>(d+1)\times{1}</math>.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, convergence is not guaranteed in general. The procedure usually converges, since the log-likelihood function is concave. When it does not, only local convergence can be proven: the iteration converges if the initial point is close enough to the exact solution. In practice, though, choosing an appropriate initial value is rarely a problem; it is uncommon for the starting point to be so far from the solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will resolve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009),121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
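The pseudo code above translates almost line-for-line into a short program. A Python/NumPy sketch (an illustration; the course examples use Matlab, and the synthetic data and its coefficients are made up):<br />

```python
import numpy as np

def logistic_irls(X, y, tol=1e-8, max_iter=50):
    """Newton-Raphson / IRLS for logistic regression, following the notes'
    convention: X is d x n (one observation per column), y is a 0/1 vector."""
    d, n = X.shape
    beta = np.zeros(d)                                   # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X.T @ beta))            # step 3: P(x_i; beta)
        W = np.diag(p * (1.0 - p))                       # step 4
        z = X.T @ beta + np.linalg.solve(W, y - p)       # step 5: Z
        beta_new = np.linalg.solve(X @ W @ X.T, X @ W @ z)  # step 6
        if np.linalg.norm(beta_new - beta) < tol:        # step 7
            return beta_new
        beta = beta_new
    return beta

# Synthetic data with a known coefficient vector (values are made up).
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
X = np.vstack([np.ones(n), x1])       # first row of ones gives the intercept
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x1)))
y = (rng.random(n) < p_true).astype(float)

beta_hat = logistic_irls(X, y)        # should be near (0.5, 2.0)
```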
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far considered only the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to apply logistic regression to classify the data. This function returns B, a <math>(d+1)\times(k-1)</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
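A small Python sketch of these two posterior formulas (illustrative only; the coefficient values are made up), confirming that the <math>K</math> probabilities are positive and sum to one:<br />

```python
import numpy as np

def posteriors(x, betas):
    """Posterior probabilities for K classes given the K-1 coefficient
    vectors beta_1..beta_{K-1}; class K is the reference class.
    betas has shape (K-1, d)."""
    scores = np.exp(betas @ x)             # exp(beta_i^T x), i = 1..K-1
    denom = 1.0 + scores.sum()
    # Classes 1..K-1, then the reference class K.
    return np.append(scores / denom, 1.0 / denom)

# Made-up coefficients for K = 3 classes in d = 2.
betas = np.array([[1.0, -0.5], [0.2, 0.3]])
p = posteriors(np.array([0.5, 1.0]), betas)
assert abs(p.sum() - 1.0) < 1e-12 and (p > 0).all()
```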
<br />
Viewing these equations as a weighted least squares problem makes them easier to derive.<br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.<br />
<br />
=== Perceptron (Foundation of Neural Network) ===<br />
<br />
==== Separating Hyperplane Classifiers ====<br />
Separating hyperplane classifiers try to separate the data using linear decision boundaries. When the classes overlap, this approach can be generalized to the support vector machine, which constructs nonlinear boundaries by constructing a linear boundary in an enlarged and transformed feature space.<br />
<br />
==== Perceptron ====<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
Least Squares returns the sign of a linear combination of data points as the class label<br />
<br />
<math>sign(\underline{\beta}^T \underline{x} + \beta_{0}) = sign(\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2})</math><br />
<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Because of the iterative nature of the solution, the problem has no global minimum (it is not convex). The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between the classes. If the classes are separable then the algorithm is shown to converge to a separating hyperplane. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br />
<br />
<br />
If the separating hyperplane between the 2 classes is not unique, the perceptron algorithm can return infinitely many solutions.<br />
<br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line ourselves; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1]; % class labels from the table above<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]'; % one feature vector per column<br />
b_0=0; % intercept (initial guess)<br />
b=[1;1;1]; % weight vector (initial guess)<br />
rho=.5; % learning rate<br />
for j=1:100 % at most 100 passes over the data<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i); % positive iff point i is correctly classified<br />
if d<=0 % misclassified (or exactly on the boundary)<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0 % no misclassified points: a separating plane was found<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
A Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> takes the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (an initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points, as illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, terminating when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\underline{\beta}</math> to have unit length, so that this inner product gives the signed length of the projection). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'', then this product is positive: either both <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> are positive, or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'', then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative, so the product is negative for a misclassified point. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
====Problems with the Algorithm and Issues Affecting Convergence====<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly; if the gap is small, it converges slowly. This problem can be mitigated by using the basis expansion technique: rather than finding a hyperplane in the original space, we look for one in an enlarged space obtained by applying basis functions to the features.<br />
#If the classes are separable, there exist infinitely many solutions to the Perceptron problem, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if this value is too large the algorithm may "skip over" the minimum it is trying to find, possibly oscillating forever between two points on either side of the minimum.<br />
#A perfect separation is not always available, or even desirable. If observations from different classes share the same input, a model that separates them perfectly is overfitting and will generally have poor predictive performance.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
<br />
</ref>.<br />
====Comment on gradient descent algorithm====<br />
Imagine standing on a mountain peak and wanting to reach the valley floor as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the gradient. However, if the terrain has several basins, then depending on where you start you may end up at the bottom of a shallow basin (a local minimum) and get stuck there.<br />
<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over <math>\,i</math> (all data points). This is a variant of the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, where we approximate the true gradient by evaluating it on a single training example, so that <math>\,{\beta}</math> is improved by the computation for only one sample. When the data set is large, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent, we can process the problem sample by sample and still get decent results in practice.<br />
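The sample-by-sample update just described can be sketched in Python. This is a minimal illustration using the six-point data set from the earlier example (the variable names are ours, not from the course code):<br />

```python
# Stochastic perceptron: update the weights after each misclassified
# sample, rather than summing the gradient over all misclassified points.
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
y = [1, 1, 1, -1, -1, -1]

rho = 0.5                # learning rate
beta = [0.0, 0.0, 0.0]   # weight vector (initial guess)
beta0 = 0.0              # intercept

for epoch in range(200):
    changed = False
    for xi, yi in zip(X, y):
        d = (sum(b * x for b, x in zip(beta, xi)) + beta0) * yi
        if d <= 0:  # misclassified (or exactly on the boundary)
            beta = [b + rho * yi * x for b, x in zip(beta, xi)]
            beta0 += rho * yi
            changed = True
    if not changed:  # no misclassified points: a separating plane was found
        break

errors = sum((sum(b * x for b, x in zip(beta, xi)) + beta0) * yi <= 0
             for xi, yi in zip(X, y))
print(errors)  # 0: the classes are separable, so the algorithm converges
```

Because this data set is linearly separable, the perceptron convergence theorem guarantees the loop terminates with zero misclassified points.<br />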
<br />
<br/><br />
*A Perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected through signal channels called connections. Each processing element has a single output connection with branches that "fan out" into as many connections as desired, each carrying the same signal - the processing element's output signal. <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network, but many different forms exist.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
A regression problem typically has only one unit in the output layer. In a <math>k</math>-class classification problem, there are usually <math>k</math> units in the output layer, each representing the probability of membership in one class, with each <math>\displaystyle y_k</math> coded (0,1).<br />
<br />
===Activation Function===<br />
Activation Function is a term that is frequently used in classification by NN. <br />
<br />
In perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] and is not continuous at 0. Thus, we replace it by a smooth function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The choice of this function <math>\displaystyle \sigma </math> is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classification problems the logistic function, also known as inverse-logit, is often used: <br />
<math>\sigma(a)=\frac {1}{1+e^{-a}}</math><br />
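A quick numerical sketch of the logistic activation and the identity <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>, which back-propagation uses later (the function names here are ours, chosen for the illustration):<br />

```python
import math

def sigma(a):
    """Logistic (inverse-logit) activation function."""
    return 1.0 / (1.0 + math.exp(-a))

def sigma_prime(a):
    """Derivative of the logistic function: sigma'(a) = sigma(a)(1 - sigma(a))."""
    s = sigma(a)
    return s * (1.0 - s)

print(sigma(0.0))        # 0.5: the function is centred at 0
print(sigma_prime(0.0))  # 0.25: the slope is steepest at 0
# Saturation: large |a| gives outputs near 0 or 1 with near-zero slope
print(sigma(10.0), sigma_prime(10.0))
```

The saturation visible for large inputs is exactly property 2 in the list that follows.<br />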
<br />
There are some important properties for the activation function.<br />
<br />
# The activation function is nonlinear. It can be shown that if the activation function of the hidden units is linear, a three-layer neural network is equivalent to a two-layer one. <br />
# The activation function saturates, meaning it has maximum and minimum output values. This property ensures that the weights are bounded and therefore the search time is limited. <br />
# The activation function is continuous and smooth.<br />
# The activation function is monotonic. This property is not strictly necessary; RBF networks, for example, are also a widely used model despite using non-monotonic activations. <br />
<br />
'''Note:''' A key difference between a perceptron and a neural network is that a neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, a perceptron is a linear classifier, whereas a neural network, by combining layers of perceptrons, is able to classify non-linear problems through proper training.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called [http://en.wikipedia.org/wiki/Feedforward_neural_network Feed-Forward Neural Network]. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref><br />
<br />
===Back-propagation===<br />
For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> devised an algorithm called '''back-propagation''' [http://en.wikipedia.org/wiki/Backpropagation#Algorithm]. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied a gradient descent algorithm for optimizing weights. Back-propagation uses this idea of gradient descent to train a neural network based on the chain rule in calculus. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output units, turning it into a classification problem.<br />
<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each unit we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find the derivative of the model error with respect to the output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
''(Note: the indices <math>i</math> and <math>j</math> are reversed in this diagram.)''<br />
<br />
Notice that the weighted sums of the outputs of the units at layer <math>\displaystyle l</math> are the inputs into the units at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Note that a change in <math>\,a_j</math> causes changes in all <math>\,a_i</math> in the next layer on which the error is based, so we need to sum over i in the chain:<br />
<math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math> Using the activation function <math>\,\sigma(\cdot)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
We can propagate the error calculated in the output back through the previous layers and adjust weights to minimize error.<br />
<br />
==Neural Networks (NN) - October 30, 2009 ==<br />
<br />
=== Back-propagation ===<br />
The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight <math>\,u_{ij}</math> we find <math>\frac{\partial \mathrm{err}}{\partial u_{ij}}</math>. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure.<br />
The overview of Back-propagation algorithm:<br />
#Feed a point <math>\,x</math> in the training set to the network, and find the output of all the nodes.<br />
#Evaluate <math>\,\delta_k=y_k-\hat{y_k}</math> for all output units, where <math>y_k</math> is the expected output and <math>\hat{y_k}</math> is the output actually produced by the network.<br />
#By propagating to the previous layers evaluate all <math>\,\delta_j</math>s for hidden units: <math>\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}</math> where <math>i</math> is associated to the previous layer.<br />
#Using <math>\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l</math> find all the derivatives.<br />
#Adjust each weight by taking steps proportional to the negative of the gradient: <math>u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{\partial u_{jl}}</math><br />
#Feed the next point in the training set and repeat the above steps.<br />
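The steps above can be sketched for a network with one hidden layer and one output unit (the regression setting of the derivation). This is a minimal illustration, not the course code; the data set, network size, and learning rate are invented for the example, and bias units are omitted for brevity:<br />

```python
import numpy as np

def sigma(a):
    # Logistic activation function
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
# Invented training set: inputs in R^2, a real-valued target for each.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([0., 1., 1., 0.])

h = 3                                # number of hidden units
U = rng.uniform(-1, 1, size=(h, 2))  # hidden weights, initialised near zero
w = rng.uniform(-1, 1, size=h)       # output weights
rho = 0.1                            # learning rate

def total_error():
    return sum((y - w @ sigma(U @ x)) ** 2 for x, y in zip(X, Y))

err_before = total_error()
for epoch in range(5000):
    for x, y in zip(X, Y):
        a = U @ x                    # step 1: forward pass
        z = sigma(a)
        y_hat = w @ z
        # steps 2-4: back-propagate the error to get the gradients
        d_out = -2.0 * (y - y_hat)                    # d err / d y_hat
        grad_w = d_out * z                            # d err / d w
        delta = sigma(a) * (1 - sigma(a)) * w * d_out # hidden-unit deltas
        grad_U = np.outer(delta, x)                   # d err / d U
        # step 5: gradient descent step
        w -= rho * grad_w
        U -= rho * grad_U
err_after = total_error()
print(err_before, err_after)  # the training error decreases
```

The per-sample updates here are the stochastic form discussed in the previous lecture; the hidden deltas follow the recursion <math>\delta_j = \sigma'(a_j)\sum_i \delta_i u_{ij}</math> derived earlier.<br />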
==== How to initialize the weights ====<br />
This still leaves the question of how to initialize the weights <math>\,u_{ij}, w_i</math>. The method of choosing weights mentioned in class was to randomize the weights before the first step. This is not likely to be near the optimal solution, but it is simple to implement. To be more specific, random values near zero (usually drawn from <math>[-1,1]</math>) are a good choice for the initial weights; in this case, the model evolves from a nearly linear one to a nonlinear one, as desired. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output by using a linear approximation of <math>\,\sigma(a_i)</math>, which finds optimal weights in the linear model. Back-propagation is used afterward to find the optimal solution, since the NN is non-linear.<br />
<br />
==== How to set learning rates ====<br />
The learning rate <math>\,\rho</math> is usually a constant. <br />
<br />
If we use on-line learning, as a form of stochastic approximation, <math>\,\rho</math> should decrease as the number of iterations increases.<br />
<br />
<br />
Choosing too large a learning rate may make the system unstable, while too small a learning rate leads to a very slow convergence rate (a very long learning phase). The advantage of a small learning rate, however, is that it can guarantee convergence. Thus it is generally better to choose a relatively small learning rate to ensure stability; usually <math>\,\rho</math> is chosen between 0.01 and 0.7.<br />
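The effect of the learning rate can be seen on a toy one-dimensional problem (a sketch of ours, not part of the course material). Minimizing <math>f(x)=x^2</math> by gradient descent, the update is <math>x \leftarrow x - \rho \cdot 2x = (1-2\rho)x</math>, so the iterates shrink toward the minimum only when <math>|1-2\rho|<1</math>:<br />

```python
def descend(rho, steps=50, x=1.0):
    """Gradient descent on f(x) = x^2, whose gradient is 2x."""
    for _ in range(steps):
        x = x - rho * 2 * x  # each step multiplies x by (1 - 2*rho)
    return x

print(abs(descend(0.1)))  # small rate: converges toward the minimum at 0
print(abs(descend(1.1)))  # too-large rate: iterates "skip over" the minimum and diverge
```

With <math>\rho=0.1</math> each step multiplies <math>x</math> by 0.8, while with <math>\rho=1.1</math> it multiplies <math>x</math> by -1.2, oscillating across the minimum with growing magnitude.<br />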
<br />
==== How to determine the number of hidden units ====<br />
<br />
Here we mainly discuss how to estimate the number of hidden units at the outset. This initial choice should then be refined using cross-validation (CV), leave-one-out (LOO), or other complexity control methods. <br />
<br />
Basically, if the patterns are well separated, a few hidden units are sufficient. If the patterns are drawn from a highly complicated mixture model, more hidden units are needed. <br />
<br />
The number of hidden units determines the size of the model, and therefore the total number of weights in the model. Generally speaking, the number of weights should not be larger than the number of training points, say <math>N</math>; thus <math>N/10</math> is sometimes a good choice. In practice, however, many well-performing models use more hidden units.<br />
<br />
=== Dimensionality reduction application ===<br />
[[File:NN-bottelneck.png|350px|thumb|right|Figure 1: Bottleneck configuration for applying dimensionality reduction.]]<br />
One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.<br />
<br />
Consider the following configuration as shown in figure 1:<br />
As we go forward through the layers of this neural network, the number of nodes is reduced until we reach a layer whose number of nodes equals the desired dimensionality. (The number of nodes in the first few layers need not be strictly decreasing, as long as the network eventually reaches a layer with fewer nodes.) From this bottleneck layer onward, the previous layers are mirrored, so the output layer has the same number of units as the input layer. Now note that if we feed the network with each point and get an output approximately equal to that input, the input has been reconstructed from the middle-layer units alone, so the output of the middle-layer units can represent the input in fewer dimensions.<br />
<br />
To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.<br />
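A minimal numerical sketch of this bottleneck idea, using a linear network that maps 3-dimensional inputs through a 1-dimensional middle layer and back (the data, layer sizes, and learning rate are invented for the illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
# Invented data lying near a line in R^3, so one dimension suffices.
t = rng.uniform(-1, 1, size=20)
X = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(20, 3))

W1 = rng.uniform(-0.5, 0.5, size=(1, 3))  # encoder: 3 -> 1 (the bottleneck)
W2 = rng.uniform(-0.5, 0.5, size=(3, 1))  # decoder: 1 -> 3 (mirrored layer)
rho = 0.01

def reconstruction_error():
    return sum(np.sum((x - (W2 @ (W1 @ x))) ** 2) for x in X)

err_before = reconstruction_error()
for epoch in range(500):
    for x in X:
        h = W1 @ x       # low-dimensional code (middle-layer output)
        x_hat = W2 @ h   # reconstruction at the output layer
        e = x_hat - x    # reconstruction error, back-propagated
        grad_W2 = 2.0 * np.outer(e, h)
        grad_W1 = 2.0 * np.outer(W2.T @ e, x)
        W2 -= rho * grad_W2
        W1 -= rho * grad_W1
err_after = reconstruction_error()
print(err_before, err_after)  # error drops: the 1-D code captures the data
```

After training, `W1 @ x` is the low-dimensional mapping and `W2` performs the reconstruction, mirroring the description above.<br />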
<br />
=== Deep Neural Network ===<br />
Back-propagation in practice may not work well when there are too many hidden layers, since the <math>\,\delta</math> values may become negligible and the errors vanish. This is a numerical problem in which it is difficult to estimate the errors, so in practice configuring a<br />
neural network with back-propagation involves some subtleties.<br />
Deep Neural Networks became popular two or three years ago, when introduced by Bradford Nill in his PhD thesis. Deep neural network training algorithms deal with training a neural network with a large number of layers.<br />
<br />
The approach of training the deep network is to assume the network has only two layers first and train these two layers. After that we train the next two layers, so on and so forth.<br />
<br />
Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with.<br />
There are two major techniques to resolve this problem: using a Boltzmann machine to minimize an energy function, an approach inspired by the theory in physics concerning the most stable (lowest-energy) state of a system; or somehow finding out which output of the second layer is most likely to lead to the expected output at the output layer.<br />
<br />
===Neural Networks in Practice===<br />
Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries. <br />
<br />
Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.<br />
<br />
Take a specific marketing case, for example. A feedforward neural network was trained using back-propagation to assist the marketing control of airline seat allocations; unlike a fixed allocation rule, the neural approach was adaptive. The system is used to monitor bookings and recommend booking advice for each departure.<br />
<br />
=== Issues with Neural Network ===<br />
When neural networks were first introduced, they were thought to model human brains, hence the fancy name "Neural Network". But now we know that they are just logistic regression layers stacked on top of each other, and have nothing to do with how the brain actually functions.<br />
<br />
We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic human brains, but this is unfounded. As a result of such claims, it is important to keep the right perspective on what this field of study is trying to accomplish: the goal of machine learning may be to mimic the 'learning' function of the brain, but not necessarily the processes the brain uses to learn.<br />
<br />
As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.<br />
<br />
In sum, the neural network lacks a strong learning theory to back up its "success", which makes it hard to apply and tune wisely. Partly for this reason, it is no longer a very active research area in machine learning; however, NNs still have wide applications in engineering fields such as control.<br />
<br />
== Complexity Control October 30, 2009 ==<br />
<br />
[[File:overfitting-model.png|500px|thumb|right|Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points.<br />
In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.]]<br />
There are [http://academicearth.org/lectures/underfitting-and-overfitting two issues] that we have to avoid in Machine Learning:<br />
#[http://en.wikipedia.org/wiki/Overfitting Overfitting]<br />
#Underfitting<br />
<br />
Overfitting occurs when our model is so complex, with so many degrees of freedom, that it can learn every detail of the training set. Such a model will have very high precision on the training set but very poor ability to predict outcomes for new instances, especially outside the domain of the training set.<br />
<br />
In a neural network, if there are too many layers, the network will have many degrees of freedom and will learn every characteristic of the training data set. It will then give very precise outcomes on the training set but will be unable to generalize the commonality of the training set to predict the outcome of new cases.<br />
<br />
Underfitting occurs when the model we picked to describe the data is not complex enough, so it has a high error rate on the training set.<br />
There is always a trade-off: if our model is too simple, underfitting can occur, and if it is too complex, overfitting can occur.<br />
<br />
'''Example'''<br />
#Consider the example shown in the figure. We have a training set and we want to find the model which fits it best. We can find a polynomial of high degree which passes through almost all the points in the training set, but in fact the training set comes from a line model. Although the complex model has less error on the training set, it diverges from the line in ranges where we have no training points, so the high-degree polynomial has very poor predictive performance on test cases. This is an example of an overfitting model.<br />
#Now consider a training set which comes from a polynomial of degree two model. If we model this training set with a polynomial of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem.<br />
#Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier: just because a fruit is yellow does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we eventually reach a point where the quality of our classifier no longer improves, i.e., we have overfit the data. This occurs when we have considered so many features that we perfectly describe the existing bananas, but a new banana of slightly different shape, for example, cannot be detected. This is the trade-off: what is the right level of complexity?<br />
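The first example in the list can be reproduced numerically. A high-degree polynomial fit to ten noisy points from a line drives the training error close to zero, while the simple line model generalizes better; the data here are simulated just for the illustration:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
# Training data from a line model y = 2x + 1 plus noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + 1 + 0.1 * rng.normal(size=10)

def train_error(degree):
    # Least-squares polynomial fit, then mean squared error on the training set.
    coeffs = np.polyfit(x_train, y_train, degree)
    return np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

train_err_line = train_error(1)  # the underlying (correct) model class
train_err_poly = train_error(7)  # enough freedom to chase the noise

# Fresh test points from the same line model.
x_test = np.linspace(0.05, 0.95, 50)
y_test = 2 * x_test + 1 + 0.1 * rng.normal(size=50)
test_err_line = np.mean((np.polyval(np.polyfit(x_train, y_train, 1), x_test) - y_test) ** 2)
test_err_poly = np.mean((np.polyval(np.polyfit(x_train, y_train, 7), x_test) - y_test) ** 2)

print(train_err_poly, train_err_line)  # the high-degree fit has lower training error
print(test_err_poly, test_err_line)    # but typically predicts new points worse
```

This is the pattern in Figure 2: training error always falls as complexity grows, while test error eventually rises.<br />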
<br />
== Complexity Control - Nov 2, 2009 ==<br />
<br />
Overfitting occurs when the model becomes too complex, and underfitting occurs when it is not complex enough; neither is desirable. To control complexity, it is necessary to make assumptions about the model before fitting the data, for example assuming a family of polynomials or a class of neural networks; there are other choices as well.<br />
<br />
[[File:Family_of_polynomials.jpg|200px|thumb|right|Figure 1: An example of a model with a family of polynomials]]<br />
We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.<br />
<br />
=== '''How do we choose a good classifier?''' ===<br />
<br />
Our goal is to find a classifier that minimizes the true error rate. <br />
Recall the empirical error rate<br />
<br />
<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})</math><br />
<br />
<math>\,h</math> is a classifier and we want to minimize its error rate. So we apply <math>\displaystyle h</math> to <math>\displaystyle x_1, \ldots, x_n</math> and take the average to obtain the empirical error rate, which estimates the probability that <br />
<math>h(x_{i}) \neq y_{i}</math>.<br />
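Computed directly, the empirical error rate is just the fraction of training points the classifier gets wrong. A small sketch with an invented classifier and data set:<br />

```python
def empirical_error_rate(h, xs, ys):
    """L_h = (1/n) * sum of indicators I(h(x_i) != y_i)."""
    return sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)

# Invented example: a threshold classifier on 1-D points.
h = lambda x: 1 if x > 0 else -1
xs = [-2.0, -1.0, 0.5, 1.0, -0.5, 2.0]
ys = [-1, -1, 1, 1, 1, 1]  # the fifth point is misclassified by h

print(empirical_error_rate(h, xs, ys))  # 1/6, since one of six points is wrong
```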
<br />
<span id="prediction-error">[[File:Prediction_Error.jpg|200px|thumb|right|Figure 2]]</span><br />
There is a downward bias to this estimate meaning that it is always less than the true error rate. <br />
<br />
If there is a change in our complexity from low to high, our error rate is always decreasing. When we apply our model to the test data, our error rate will start to decrease to a point, but then it will increase since the model hasn't seen it before. This can be explained since training error will decrease when we fit the model better by increasing its complexity, but as we have seen, this complex model will not generalize well, resulting in a larger test error. <br />
<br />
We use the test data (the test sample line shown in Figure 2) to obtain an empirical estimate of the error rate.<br />
The right complexity is the one at which the test error is at its minimum; this is the basic idea behind complexity control.<br />
<br />
<br />
<br />
[[File:Bias.jpg|200px|thumb|left|Figure 3]]<br />
<br />
We assume that we have samples <math>\,X_1, . . . ,X_n</math> that follow some (possibly unknown) distribution. We want to estimate a parameter <math>\,f</math> of the unknown distribution. This parameter may be the mean <math>\,E(X_i)</math>, the variance <math>\,var(X_i)</math> or some other quantity.<br />
<br />
The unknown parameter <math>\,f</math> is a fixed real number, <math>f \in \mathbb{R}</math>. To estimate it, we use an estimator, which is a<br />
function of our observations: <math>\hat{f}(X_1,...,X_n)</math>. <br />
<br />
<math>Bias (\hat{f}) = E(\hat{f}) - f</math><br />
<br />
<math>MSE (\hat{f}) = E[(\hat{f} - f)^2]</math><br />
<br />
<math>Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]</math><br />
<br />
One property we desire of an estimator is that it is correct on average, that is, unbiased: <math>Bias (\hat{f}) = E(\hat{f}) - f=0</math>.<br />
However, there is a more important criterion than unbiasedness alone: the mean squared error. In statistics, there are problems for which it may be good to use an estimator with a small bias. In some cases, a slightly biased estimator may have a smaller mean squared error, or may be median-unbiased rather than mean-unbiased (the standard unbiasedness property). Median-unbiasedness is invariant under monotone transformations, while mean-unbiasedness may be lost under nonlinear transformations. For example, using an unbiased estimator with a large mean squared error, we run a high risk of a large error; in contrast, a biased estimator with a small mean squared error can noticeably improve the precision of our predictions.<br />
<br />
Hence, our goal is to minimize <math>MSE (\hat{f})</math>.<br />
<br />
From figure 3, we can see that the relationship of the three parameters is:<br />
<math>MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f}) </math>. Thus, for a fixed mean squared error, a lower bias forces a higher variance and vice versa; this is the bias-variance tradeoff.<br />
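The decomposition above can be checked numerically. Below is a minimal Python sketch (the shrinkage estimator and all constants are illustrative choices) that estimates a mean with a deliberately biased estimator and verifies the identity on the simulated estimates:<br />

```python
import random

random.seed(0)
f = 5.0                    # true parameter: the mean of the sampling distribution
n, trials = 20, 20000

def shrunk_mean(xs, c=0.9):
    # A deliberately biased estimator: shrink the sample mean toward zero.
    return c * sum(xs) / len(xs)

estimates = [shrunk_mean([random.gauss(f, 1.0) for _ in range(n)])
             for _ in range(trials)]

mean_est = sum(estimates) / trials
bias = mean_est - f                                          # E(f_hat) - f
variance = sum((e - mean_est) ** 2 for e in estimates) / trials
mse = sum((e - f) ** 2 for e in estimates) / trials

print(mse, variance + bias ** 2)   # the two sides of the decomposition agree
```

Note that the equality holds exactly for the empirical moments, not just in expectation, which is why the check needs no tolerance beyond floating-point rounding.<br />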
<br />
The test error is a good estimate of the MSE. We want a somewhat balanced bias and variance (neither one too high), even though this means accepting some bias.<br />
<br />
<br />
Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:<br />
<br />
1. Estimating error rate<br />
<br />
<math>\hookrightarrow</math> Empirical training error is not a good estimation<br />
<br />
<math>\hookrightarrow</math> Empirical test error is a better estimation<br />
<br />
<math>\hookrightarrow</math> Cross-validation is a practical way to estimate it<br />
<br />
<math>\hookrightarrow</math> Computing error bound (analytically) using some probability inequality.<br />
<br />
We will not discuss computing the error bound in class; however, a popular method for doing this computation is called VC Dimension (short for Vapnik–Chervonenkis Dimension). Information can be found from [http://www.autonlab.org/tutorials/vcdim.html Andrew Moore] and [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.7171&rep=rep1&type=pdf Steve Gunn].<br />
<br />
2. Regularization<br />
<br />
<math>\hookrightarrow</math> Use of shrinkage method<br />
<br />
<math>\hookrightarrow</math> Decrease the chance of overfitting by controlling the weights<br />
<br />
=== '''Example of under and overfitting in R''' ===<br />
<br />
To give further intuition of over and underfitting, consider this example. A simple quadratic data set with some random noise is generated, and then polynomials of varying degrees are fitted. The errors for the training set and a test set are calculated.<br />
[[File:Curvefitting-rex2.png|250px|thumb|right|Polynomial fits to curved data set.]]<br />
<br />
> x <- rnorm(200,0,1)<br />
> y <- x^2-0.5*x+rnorm(200,0,0.3)<br />
> xtest <- rnorm(50,1,1)<br />
> ytest <- xtest^2-0.5*xtest+rnorm(50,0,0.3)<br />
> p1 <- lm(y~x)<br />
> p2 <- lm(y ~ poly(x,2))<br />
> pn <- lm(y ~ poly(x,10))<br />
> psi <- lm(y~I(sin(x))+I(cos(x)))<br />
<br />
: <code>x</code> values for the training set are based on a <math>\,N(0,1)</math> distribution, while the test set has a <math>\,N(1,1)</math> distribution. <code>y</code> values are determined by <math>\,y = x^2 - 0.5x + \epsilon</math> with <math>\,\epsilon \sim N(0,0.3^2)</math>, a quadratic function with some random variation. Polynomial least squares fits of degree 1, 2, and 10 are calculated, as well as a fit of <math>\,\sin(x)+\cos(x)</math>.<br />
<br />
> # calculate the mean squared error of degree 1 poly<br />
> sum((y-predict(p1,data.frame(x)))^2)/length(y)<br />
[1] 1.576042<br />
> sum((ytest-predict(p1,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 7.727615<br />
: Training and test mean squared errors for the linear fit. These are both quite high - and since the data is non-linear, the different mean value of the test data increases the error quite a bit.<br />
> # calculate the mean squared error of degree 2 poly<br />
> sum((y-predict(p2,data.frame(x)))^2)/length(y)<br />
[1] 0.08608467<br />
> sum((ytest-predict(p2,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 0.08407432<br />
: This fit is far better - and there is not much difference between the training and test error, either.<br />
> # calculate the mean squared error of degree 10 poly<br />
> sum((y-predict(pn,data.frame(x)))^2)/length(y)<br />
[1] 0.07967558<br />
> sum((ytest-predict(pn,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 156.7139<br />
: With a high-degree polynomial, the training error continues to decrease, but not by much - and the test set error has risen again. The overfitting makes it a poor predictor. As the degree of the polynomial rises further, the accuracy of the computer becomes an issue - and a good fit is not even consistently produced for the training data.<br />
> # calculate mse of sin/cos fit<br />
> sum((y-predict(psi,data.frame(x)))^2)/length(y)<br />
[1] 0.1105446<br />
> sum((ytest-predict(psi,data.frame(x=xtest)))^2)/length(ytest)<br />
[1] 1.320404<br />
: Fitting a function of the form sin(x)+cos(x) works fairly well on the training set, but because it is not the true underlying function, it fails on the test data, which is drawn from a different region of the input space.<br />
<br />
== ''' Cross-Validation (CV) - Introduction ''' ==<br />
<br />
[[File:Cv.jpg|200px|thumb|right|Figure 1: Illustration of Cross-Validation]]<br />
Cross-Validation is used to estimate the error rate of a classifier with respect to test data rather than data used in the model. Here is a general introduction to CV:<br />
<br />
<math>\hookrightarrow</math> We have a set of collected data for which we know the proper labels<br />
<br />
<math>\hookrightarrow</math> We divide it into 2 parts, Training data (T) and Validation data (V)<br />
<br />
<math>\hookrightarrow</math> For our calculation, we pretend that we do not know the label of V and we use data in T to train the classifier<br />
<br />
<math>\hookrightarrow</math> We estimate an empirical error rate on V: the model has not seen V, and since we know the true label of every element in V, we can count how many were misclassified.<br />
<br />
CV has different implementations which can reduce the variance of the calculated error rate, but sometimes with a tradeoff of a higher calculation time.<br />
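The hold-out procedure above can be sketched as follows in Python (the threshold classifier and the toy data are hypothetical choices for illustration):<br />

```python
import random

random.seed(1)

# Toy labelled data: label is 1 when x plus a little noise is positive.
xs = [random.gauss(0, 1) for _ in range(200)]
data = [(x, int(x + random.gauss(0, 0.3) > 0)) for x in xs]

random.shuffle(data)
train, valid = data[:150], data[150:]   # hold out 25% as the validation set V

def err(t, pairs):
    # Empirical error rate of the rule "predict 1 iff x > t".
    return sum((x > t) != y for x, y in pairs) / len(pairs)

# "Train": pick the threshold minimizing the training error over a grid.
best_t = min([i / 10 - 2 for i in range(41)], key=lambda t: err(t, train))
print(err(best_t, train), err(best_t, valid))
```

The error computed on V, which the classifier never saw, is the honest estimate; the training error tends to be optimistic.<br />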
<br />
== ''' Complexity Control - Nov 4, 2009''' ==<br />
<br />
== Cross-validation ==<br />
[[File:Cross-validation.png|350px|thumb|right|Figure 1: Classical/Standard cross-validation]]<br />
Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although the training error always decreases with increasing model complexity, the test error starts to increase from a certain point; this is overfitting (see [[#prediction-error|figure 2]] above). Since the test error is the best estimate of the MSE (mean squared error), the idea is to divide the data set into three parts: a training set, a validation set, and a test set. The training set is used to build the model, the validation set is used to decide the parameters and select the optimal model, and the test set is used to estimate the performance of the chosen model. A classical division is 50% for the training set and 25% each for the validation and test sets, all randomly selected from the original data set. <br />
<br />
Then, we only use the part of our data marked as the "training set" to train our algorithm, while keeping the remaining marked as the "validation set" untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:<br />
<br />
<math>\hat L(h) = \frac{1}{|\nu|}\sum_{(x_i, y_i) \in \nu} I(h(x_i) \neq y_i)</math>, where <math>\,|\nu|</math> is the cardinality of the validation set.<br />
<br />
When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.<br />
<br />
<br />
=== K-fold Cross-validation ===<br />
[[File:k-fold.png|350px|thumb|right|Figure 2: K-fold cross-validation]]<br />
Above is the simplest form of complexity control. In reality, however, data may be hard to collect (and we often suffer from the curse of dimensionality), so a larger data set may be hard to come by. Consequently, we may not be able to afford to sacrifice part of our limited data as a separate validation set. K-fold cross-validation addresses this problem. We divide the data set into <math>\,K</math> subsets of roughly equal size. The usual choice is <math>\,K = 10</math>.<br />
<br />
Generally, how to choose <math>\,K</math>:<br />
<br />
If <math>\,K=n</math> (leave-one-out), we get low bias but high variance. Each subset contains a single element, so the model is trained on all but one point and then validated using that point.<br />
<br />
If <math>\,K</math> is small, say 2-fold or 5-fold, we get higher bias but lower variance. Each subset contains approximately <math>\,\frac{1}{2}</math> or <math>\,\frac{1}{5}</math> of the data.<br />
<br />
For every <math>\,k</math>th part <math>( \,k \in \{ 1, \dots, K \} )</math>, we use the other <math>\,K-1</math> parts to fit the model and validate on the <math>\,k</math>th part to obtain the prediction error estimate <math>\hat L_k</math>. The overall cross-validation estimate is the average<br />
<br />
<math>\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k</math><br />
<br />
For example, suppose we want to fit a polynomial model to the data set, and we split the set into four equal subsets as shown in Figure 2. First we set the degree to 1, i.e. a linear model. We use the first three subsets for training and the last as the validation set; then the 1st, 2nd and 4th subsets for training and the 3rd for validation; and so on, until every subset has served as the validation set once (so all observations are used for both training and validation). Having obtained <math>\hat L_1, \hat L_2, \hat L_3, \hat L_4</math>, we calculate their average <math>\hat L</math> for the degree-1 model. Similarly, we can estimate the error for a degree-<math>\,n</math> model and trace out a curve of error against degree; we then choose the degree corresponding to the minimum error. The same method can be used to find the optimal number of hidden units of a neural network: start with 1 unit, then 2, 3, and so on, and pick the number of hidden units with the lowest average error.<br />
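The four-fold procedure just described can be sketched in Python (numpy's <code>polyfit</code> stands in for the least-squares fit; the data mimics the R example earlier on this page):<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic data with noise, mimicking the R example above.
x = rng.normal(0, 1, 120)
y = x**2 - 0.5 * x + rng.normal(0, 0.3, 120)

K = 4
folds = np.array_split(rng.permutation(120), K)

def cv_error(degree):
    errs = []
    for k in range(K):
        val = folds[k]                                   # k-th part: validation
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[tr], y[tr], degree)          # fit on the other K-1 parts
        errs.append(np.mean((y[val] - np.polyval(coef, x[val])) ** 2))
    return np.mean(errs)                                 # average of the L_hat_k

scores = {d: cv_error(d) for d in (1, 2, 10)}
print(scores)   # degree 2 should achieve the lowest cross-validation error
```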
<br />
=== Generalized Cross-validation ===<br />
Let the vector of observed values be denoted by <math>\mathbf{y}</math>, and the vector of fitted values by <math>\hat{\mathbf{y}}</math>. Then<br />
<br />
<math>\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}</math>, <br />
<br />
where the hat matrix is given by<br />
<br />
<math>\mathbf{H} = \mathbf{X}( \mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T}</math>,<br />
<br />
<math> \frac{1}{N}\sum_{i=1}^{N}[y_{i} - \hat f^{-i}(\mathbf{x}_{i})]^{2}=\frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-\mathbf{H}_{ii}}]^{2}</math>,<br />
<br />
where <math>\hat f^{-i}</math> denotes the fit computed with the <math>i</math>th observation left out; the left-hand side is the leave-one-out error, which the identity lets us compute from a single fit.<br />
<br />
Then the GCV approximation is given by<br />
<br />
<math> GCV(\hat f) = \frac{1}{N}\sum_{i=1}^{N}[\frac{y_{i}-\hat f(x_{i})}{1-trace(\mathbf{H})/N}]^{2}</math>,<br />
<br />
Thus, one of the biggest advantages of GCV is that the trace of <math>\mathbf{H}</math> is often easier to compute than its individual diagonal entries <math>\mathbf{H}_{ii}</math>.<br />
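A small numerical check of the identity and the GCV approximation above, using ordinary least squares on hypothetical data (Python with numpy):<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# Ordinary least squares on hypothetical data.
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix with intercept
y = 1.0 + 2.0 * X[:, 1] + rng.normal(0, 0.5, n)

H = X @ np.linalg.inv(X.T @ X) @ X.T                    # hat matrix
resid = y - H @ y

# Left-hand side: brute-force leave-one-out squared errors.
loo = []
for i in range(n):
    keep = np.delete(np.arange(n), i)
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo.append((y[i] - X[i] @ beta) ** 2)
loo_mse = np.mean(loo)

# Right-hand side: the closed form using the diagonal of H.
identity_mse = np.mean((resid / (1 - np.diag(H))) ** 2)

# GCV replaces each H_ii by the average value trace(H)/N.
gcv = np.mean((resid / (1 - np.trace(H) / n)) ** 2)

print(loo_mse, identity_mse, gcv)   # the first two coincide
```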
<br />
=== Leave-one-out Cross-validation ===<br />
Leave-one-out cross-validation uses all but one data point of the original training set to train the model, and the point that was left out to estimate the true error. By repeating this process for every data point in the original set, we obtain a good estimate of the true error.<br />
<br />
In other words, leave-one-out cross-validation is k-fold cross-validation in which we set the subset number <math>\,K</math> to be the cardinality of the whole data set.<br />
<br />
In the above example, we saw that k-fold cross-validation can be computationally expensive: for every candidate value of the parameter, we must train the model <math>\,K</math> times. This cost is even more pronounced in leave-one-out cross-validation, where we must train the model <math>\,n</math> times, <math>\,n</math> being the number of data points in the data set.<br />
<br />
Fortunately, when adding a data point to the classifier is a reversible operation, computing the difference between two classifiers is cheaper than computing the two classifiers separately. So, if the classifier trained on all the data points is known, we can simply undo the contribution of one data point at a time, <math>\,n</math> times, to calculate the leave-one-out cross-validation error rate.<br />
<br />
== Regularization for Neural Network — Weight Decay ==<br />
[[File:figure 2.png|350px|thumb|right|Figure 1: activation function]]<br />
Weight decay training is suggested as a way to obtain a robust neural network that is insensitive to noise. Since the number of hidden units and layers in a neural network is usually decided by domain knowledge, the model can easily run into overfitting.<br />
<br />
It can be seen from Figure 1 that when the weights are in the vicinity of zero, the operative part of the activation function is approximately linear, and the network collapses to an approximately linear model. Since a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This also suggests initializing the random weights close to zero.<br />
<br />
Formally, we penalize large weights, which push the network into its nonlinear regime, by adding a penalty term to the error function. The regularized error function becomes:<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)</math>, where <math>\,err</math> is the original error in back-propagation; <math>\,w_i</math> is the weights of the output layer; <math>\,u_{jk}</math> is the weights of the hidden layers.<br />
<br />
If <math>\,\lambda</math> is too large, the weights <math>\,w_i</math> and <math>\,u_{jk}</math> will be forced to be too small. We can use cross-validation to choose <math>\,\lambda</math>.<br />
<br />
A similar penalty, weight elimination, is given by,<br />
<br />
<math>\,REG = err + \lambda(\sum_{i}\frac{|w_i|^2}{1 + |w_i|^2} + \sum_{jk}\frac{|u_{jk}|^2}{1+|u_{jk}|^2})</math>.<br />
<br />
As in back-propagation, we take partial derivative with respect to the weights:<br />
<br />
<math>\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i</math><br />
<br />
<math>\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}</math><br />
<br />
<math>w^{new} \leftarrow w^{old} - \rho\left(\frac{\partial err}{\partial w} + 2\lambda w\right)</math><br />
<br />
<math>u^{new} \leftarrow u^{old} - \rho\left(\frac{\partial err}{\partial u} + 2\lambda u\right)</math><br />
<br />
Note that here <math>\,\lambda</math> serves as a trade-off parameter, tuning between the error rate and the linearity. Actually, we may also set <math>\,\lambda</math> by cross-validation. The tuning parameter is important since weights of zero will lead to zero derivatives and the algorithm will not change. On the other hand, starting with weights that are too large means starting with a nonlinear model which can often lead to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref><br />
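The update rules above can be sketched on a toy model. For simplicity, the "network" below is linear in its weights, so only the <math>\,w</math> update appears, but the decay term <math>\,2\lambda w</math> enters exactly as in the equations above (all constants are illustrative):<br />

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data for a model that is linear in its weights.
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 50)

def train(lam, steps=500, rho=0.005):
    w = rng.normal(0, 0.01, 3)                   # start with small random weights
    for _ in range(steps):
        grad_err = -2 * X.T @ (y - X @ w)        # d err / d w for err = ||y - Xw||^2
        w = w - rho * (grad_err + 2 * lam * w)   # w_new <- w_old - rho(grad + 2*lambda*w)
    return w

w_free = train(lam=0.0)
w_decay = train(lam=50.0)
print(np.linalg.norm(w_free), np.linalg.norm(w_decay))   # decay shrinks the weights
```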
<br />
== Radial Basis Function (RBF) Networks - November 6, 2009 ==<br />
<br />
[[File:Rbf_net.png|350px|thumb|right|Figure 1: Radial Basis Function Network]]<br />
<br />
=== Introduction === <br />
<br />
A Radial Basis Function (RBF) network [http://en.wikipedia.org/wiki/Radial_basis_function_network] is a type of artificial neural network with an output layer and a single hidden layer, with weights from the hidden layer to the output layer; these weights have a closed-form solution, so the network can be trained without back-propagation. The neurons in the hidden layer contain basis functions. One choice that has been widely used is that of radial basis functions, which have the property that each basis function depends only on the radial distance (typically Euclidean) from a center <math>\displaystyle\mu_{j}</math>, so that <math>\phi_{j}(x)= h({\Vert x - \mu_{j}\Vert})</math>.<br />
<br />
<br />
The output of an RBF network can be expressed as a weighted sum of its radial basis functions as follows:<br />
<br />
<math>\hat y_{k} = \sum_{j=1}^M\phi_{j}(x) w_{jk}</math><br />
<br />
The radial basis function is: <br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br /><br />
(Gaussian without a normalization constant)<br /><br /><br />
'''note:''' The hidden layer has a variable number of neurons (the optimal number is determined by the training process). As usual, the more neurons in the hidden layer, the higher the model complexity. Each neuron consists of a radial basis function centered on a point with the same dimensions as the input data. The radii of the RBF functions may differ, and the centers and radii can be determined through clustering or an EM algorithm. When the vector <math>\,x</math> arrives from the input layer, each hidden neuron computes the radial distance from the neuron's center point and applies the RBF function to this distance. The resulting values are passed to the output layer and weighted together to form the output. <br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat{Y}_{n,k} = \left[ \begin{matrix}<br />
\hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\<br />
\hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\<br />
\vdots &\vdots & \ddots & \vdots \\<br />
\hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k}<br />
\end{matrix}\right] </math> is the matrix of output variables. <br />
<br />
:<math>\Phi_{n,m} = \left[ \begin{matrix}<br />
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,m} \\<br />
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,m} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
\phi_{n,1} & \phi_{n,2} & \cdots & \phi_{n,m}<br />
\end{matrix}\right] </math> is the matrix of Radial Basis Functions.<br />
<br />
:<math>W_{m,k} = \left[ \begin{matrix}<br />
w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\<br />
w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\<br />
\vdots & \vdots & \ddots & \vdots \\<br />
w_{m,1} & w_{m,2} & \cdots & w_{m,k}<br />
\end{matrix}\right] </math> is the matrix of weights.<br />
<br />
Here, <math>k</math> is the number of outputs, <math>n</math> is the number of data points, and <math>m</math> is the number of hidden units. If <math>k = 1</math>, <math>\hat Y</math> and <math>W</math> are column vectors.<br />
<br />
''related reading'':<br />
<br />
Introduction of the Radial Basis Function (RBF) Networks [http://axiom.anu.edu.au/~daa/courses/GSAC6017/rbf.pdf]<br />
<br />
Radial Basis Function (RBF) Networks [http://documents.wolfram.com/applications/neuralnetworks/index6.html] [http://lcn.epfl.ch/tutorial/english/rbf/html/index.html]<br />
<br />
=== Estimation of weight matrix W ===<br />
<br />
We minimize the training error, <math>\Vert Y - \hat{Y}\Vert^2</math> in order to find <math>\,W</math>.<br /><br /><br />
From a previous result in linear algebra we know that <br />
<br />
<math>\Vert A \Vert^2 = Tr(A^{T}A)</math><br />
<br />
Thus we have a problem similar to linear regression:<br />
<br />
<math>\ err = \Vert Y - \Phi W\Vert^{2} = Tr[(Y - \Phi W)^{T}(Y - \Phi W)]</math><br />
<br />
<math>\ err = Tr[Y^{T}Y - Y^{T}\Phi W - W^{T} \Phi^{T} Y + W^{T}\Phi^{T} \Phi W]</math><br />
<br />
<br />
==== Useful properties of matrix differentiation ====<br />
<br />
<br />
<math>\frac{\partial Tr(AX)}{\partial X} = A^{T}</math><br />
<br />
<math>\frac{\partial Tr(X^{T}A)}{\partial X} = A</math><br />
<br />
<math>\frac{\partial Tr(X^{T}AX)}{\partial X} = (A^{T} + A)X</math><br />
<br />
==== Solving for W ====<br />
<br />
We find the minimum over <math>\,W</math> by setting <math>\frac{\partial err}{\partial W}</math> equal to zero and using the aforementioned properties of matrix differentiation.<br />
<br />
<math>\frac{\partial err}{\partial W} = 0</math><br />
<br />
<math>\ 0 - \Phi^{T}Y - \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ -2 \Phi^{T}Y + 2\Phi^{T}\Phi W = 0</math><br />
<br />
<math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
<br />
where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\,H</math> is the hat matrix for this model. This gives us a nice result, since the solution has a closed form and we do not have to worry about convexity problems in this case.<br />
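Putting the pieces together, here is a minimal Python sketch of fitting an RBF network by the closed-form solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math> (the target function, the fixed centres, and <math>\,\sigma</math> are all hypothetical choices):<br />

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical one-dimensional target: y = sin(x) plus noise.
x = rng.uniform(-3, 3, 100)
y = np.sin(x) + rng.normal(0, 0.05, 100)

mu = np.linspace(-3, 3, 8)        # fixed centres (in practice: clustering or EM)
sigma = 1.0

def design(x):
    # Gaussian radial basis functions, one column per centre.
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma**2))

Phi = design(x)
W = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # W = (Phi^T Phi)^{-1} Phi^T Y
y_hat = Phi @ W                               # fitted values, i.e. H y
print(np.mean((y - y_hat) ** 2))              # training mean squared error
```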
<br />
=== Including an additional bias ===<br />
<br />
<math>\,y_{k}</math> can be expressed in matrix form as:<br />
<br />
<math>\hat Y = \Phi W </math><br />
<br />
where<br />
<br />
:<math>\hat Y = \left[ \begin{matrix}<br />
y_{11} & y_{12} & \cdots & y_{1k} \\<br />
y_{21} & y_{22} & \cdots & y_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
y_{n1} & y_{n2} & \cdots & y_{nk}<br />
\end{matrix}\right] </math> is the <math>\,n \times k</math> matrix of output variables.<br />
<br />
:<math>\Phi = \left[ \begin{matrix}<br />
\phi_{10} &\phi_{11} & \phi_{12} & \cdots & \phi_{1M} \\<br />
\phi_{20} & \phi_{21} & \phi_{22} & \cdots & \phi_{2M} \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{n0} &\phi_{n1} & \phi_{n2} & \cdots & \phi_{nM}<br />
\end{matrix}\right] </math> is the <math>\,n \times (M+1)</math> matrix of Radial Basis Functions.<br />
<br />
:<math>W = \left[ \begin{matrix}<br />
w_{01} & w_{02} & \cdots & w_{0k} \\<br />
w_{11} & w_{12} & \cdots & w_{1k} \\<br />
w_{21} & w_{22} & \cdots & w_{2k} \\<br />
\vdots & & \ddots & \vdots \\<br />
w_{M1} & w_{M2} & \cdots & w_{Mk}<br />
\end{matrix}\right] </math> is the <math>\,(M+1) \times k</math> matrix of weights.<br />
<br />
where the extra basis function <math>\Phi_{0}</math> is set to 1.<br />
<br />
==== Normalized RBF ====<br />
<br />
In addition to the above unnormalized architecture, the normalized RBF can be represented as:<br />
<br />
<math>\hat{y}_{k}(X) = \frac{\sum_{j=1}^{M} w_{jk}\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math><br /><br /><br />
<br />
<br />
Actually, <math>\Phi^{\ast}_{j}(X) = \frac{\Phi_{j}(X)}{\sum_{r=1}^{M}\Phi_{r}(X)}</math> is known as a normalized radial basis function. Giving the familiar form,<br /><br />
<br />
<math>\hat{y}_{k}(X) = \sum_{j=1}^{M} w_{jk}\Phi^{\ast}_{j}(X)</math><br /><br /><br />
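A minimal sketch of the normalization (the centres and <math>\,\sigma</math> are arbitrary): by construction, the normalized basis functions evaluated at any input sum to one.<br />

```python
import numpy as np

mu = np.array([-1.0, 0.0, 1.0])   # arbitrary centres
sigma = 0.5

def phi(x):
    # Unnormalized Gaussian basis functions evaluated at a scalar input x.
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2))

def phi_star(x):
    p = phi(x)
    return p / p.sum()            # normalized radial basis functions

print(phi_star(0.3), phi_star(0.3).sum())   # the components sum to 1
```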
<br />
=== Conceptualizing RBF networks ===<br />
<br />
In the past, we have classified data using models that were explicitly linear, quadratic, or otherwise definite. In RBF networks, like in Neural Networks, we can fit an arbitrary model. How can we do this without changing the equations being used?<br />
<br />
Recall a [[#Trick:_Using_LDA_to_do_QDA_-_October_7.2C_2009|trick]] that was discussed in the October 7 lecture: if we add new features to our original data set, we can project into higher dimensions, use a linear algorithm, and get a quadratic result by collapsing to a lower dimension afterward. In RBF networks, something similar can happen.<br />
<br />
Think of <math>\,\Phi</math>, our matrix of radial basis functions, as a feature space of the input. Each hidden unit, then, can be thought of as representing a feature; we can see that, if there are more hidden units than input units, we essentially project to a higher-dimensional space, as we did in our earlier trick. This does not mean that an RBF network actually does this; it is merely a way to convince yourself that RBF networks (and neural networks) can fit arbitrary models. Nevertheless, precisely because of this power, overfitting becomes a more serious concern: we have to control the model's complexity so that it learns a general pattern rather than fitting the training data arbitrarily well.<br />
<br />
=== RBF networks for classification -- a probabilistic paradigm ===<br />
<br />
[[File:Rbf_graphical_model.png|350px|thumb|left|Figure 1: RBF graphical model]]<br />
<br />
An RBF network is akin to fitting a Gaussian mixture model to data. We assume that each class can be modelled by a single function <math>\,\phi</math> and data is generated by a mixture model. According to Bayes Rule,<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(x|y_{k})*Pr(y_{k})}{Pr(x)}</math><br />
<br />
While all classifiers that we have seen thus far in the course have been in discriminative form, the RBF network is a generative model that can be represented using a directed graph.<br />
<br />
We can replace the class conditional density in the above conditional probability expression by marginalizing <math>\,x</math> over <math>\,j</math>:<br />
<math>\Pr(x|y_{k}) = \sum_{j} Pr(x|j)*Pr(j|y_{k})</math><br />
<br />
<br />
<br />
<br/><br/><br />
*'''Note''' We made the assumption that each class can be modelled by a single function <math>\displaystyle\Phi</math> and that the data was generated by a mixture model. The Gaussian mixture model has the form:<br />
<math>f(x)=\sum_{m=1}^M \alpha_m \phi(x;\mu_m,\Sigma_m)</math> where <math>\displaystyle\alpha_m</math> are mixing proportions, <math>\displaystyle\sum_m \alpha_m=1</math>, and <math>\displaystyle\mu_m</math> and <math>\displaystyle\Sigma_m</math> are the mean and covariance of each Gaussian density respectively. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), pp. 214. </ref> The generative model in Figure 1 shows graphically how each Gaussian in the mixture model is chosen to sample from.<br />
<br />
== '''Radial Basis Function (RBF) Networks - November 9th, 2009''' ==<br />
<br />
=== RBF Network for classification (A probabilistic point of view) ===<br />
When using an RBF network to do classification, we usually treat it as a regression problem and set a threshold to decide each data point's class membership. However, to gain insight into what the RBF network is doing as a classifier, we often think of mixture models and make certain assumptions.<br />
<br />
[[File:RBF.png|350px|thumb|right|Figure 26.1: RBF Network Classification Demo]] <br />
<br />
We assume, as shown in the graph on the right-hand side, that we have three random variables, <math>\displaystyle y_k</math>, <math>\displaystyle j</math>, and <math>\displaystyle x</math>, where <math>\displaystyle y_k</math> denotes class <math>\,k</math>, <math>\displaystyle x</math> is what we observe, and <math>\displaystyle j</math> is a hidden random variable. The generative process is that there are different classes, and each class can trigger a different hidden random variable <math>\displaystyle j</math>. To understand this, assume for instance that each <math>\displaystyle j</math> indexes a Gaussian distribution (any other distribution would work as well), all of the same form but with different parameters. From each Gaussian triggered by a class, we sample some data points. In the end, we obtain a data set that is not strictly Gaussian, but is in fact a mixture of Gaussians.<br />
<br />
Again, we look at the posterior distribution from [http://en.wikipedia.org/wiki/Bayes'_theorem Bayes' Rule].<br />
<br />
<math>Pr(Y = y_{k} | X = x) = \frac {Pr(X = x | Y = y_{k})*Pr(Y = y_{k})}{Pr(X = x)}</math><br />
<br />
Since we made the assumption that the data has been generated from a mixture model, we can estimate this conditional probability by<br />
<br />
<math>\Pr(X = x | Y = y_{k}) = \sum_{j} Pr(X = x | j)*Pr(j | Y = y_{k})</math>, <br />
<br />
which is the class conditional distribution (or probability) of the mixture model. Note, here, if we only have a simple model from <math>\displaystyle y_k</math> to <math>\displaystyle x</math>, then we won’t have this summation.<br />
<br />
We can substitute this class conditional distribution into Bayes' formula. We can see that the posterior of class <math>\displaystyle k</math> is the summation over <math>\displaystyle j</math> of the probability of <math>\displaystyle x</math> given <math>\displaystyle j</math> times the probability of <math>\displaystyle j</math> given <math>\displaystyle y_k</math>, times the prior distribution of class <math>\displaystyle k</math>, and lastly divided by the marginal probability of <math>\displaystyle x</math>. That is,<br />
<br />
<math>\Pr(y_k | x) = \frac {\sum_{j} Pr(x | j)*Pr(j | y_{k})*Pr(y_{k})}{Pr(x)}</math>.<br />
<br />
Since, the prior probability of class <math>\displaystyle k</math>, <math>\displaystyle Pr(y_{k})</math>, does not have an index of <math>\displaystyle j</math>, it can be taken out of the summation. This yields,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)}</math>.<br />
<br />
We multiply this by <math>\displaystyle 1 = \frac {Pr(j)}{Pr(j)}</math>. Then, it becomes,<br />
<br />
<math>\Pr(y_k | x) = \frac {Pr(y_{k})\sum_{j} Pr(x | j)*Pr(j | y_{k})}{Pr(x)} * \frac {Pr(j)}{Pr(j)}</math>.<br />
<br />
Next, note that <math>\displaystyle Pr(j | x) = \frac {Pr(x | j)*Pr(j)}{Pr(x)}</math>, and <math>\displaystyle Pr(y_k | j) = \frac {Pr(j | y_k)*Pr(y_k)}{Pr(j)}</math>. Then rearranging the terms, we finally have the posterior:<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x)Pr(y_k | j)</math>.<br />
<br />
Interestingly, the posterior is just a sum over <math>\displaystyle j</math> of the product of two posteriors, <math>\displaystyle Pr(j | x)</math> and <math>\displaystyle Pr(y_k | j)</math>.<br />
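The identity <math>\displaystyle Pr(y_k | x) = \sum_{j} Pr(j | x) Pr(y_k | j)</math> relies on <math>\displaystyle x</math> being independent of <math>\displaystyle y_k</math> given <math>\displaystyle j</math>, which holds for the chain <math>\displaystyle y \rightarrow j \rightarrow x</math>. A small discrete check in Python (all probability tables below are made up for illustration):<br />

```python
import numpy as np

# Hypothetical discrete model with the chain y -> j -> x.
p_y = np.array([0.4, 0.6])                       # Pr(y_k), two classes
p_j_given_y = np.array([[0.7, 0.3],              # rows: y, cols: j
                        [0.2, 0.8]])
p_x_given_j = np.array([[0.9, 0.1],              # rows: j, cols: x (two values)
                        [0.3, 0.7]])

# Full joint Pr(y, j, x) and the marginals we need.
joint = p_y[:, None, None] * p_j_given_y[:, :, None] * p_x_given_j[None, :, :]
p_x = joint.sum(axis=(0, 1))
p_j = joint.sum(axis=(0, 2))

x = 0
# Direct posterior Pr(y | x) from the joint.
direct = joint[:, :, x].sum(axis=1) / p_x[x]

# Via the derivation: Pr(y_k | x) = sum_j Pr(j | x) * Pr(y_k | j).
p_j_given_x = joint[:, :, x].sum(axis=0) / p_x[x]
p_y_given_j = joint.sum(axis=2).T / p_j[:, None]     # rows: j, cols: y
via_sum = (p_j_given_x[:, None] * p_y_given_j).sum(axis=0)

print(direct, via_sum)   # the two computations agree
```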
<br />
==== Interpretation of RBF Network classification ====<br />
<br />
[[File:2.png|350px|thumb|right|Figure 26.1.2(2): RBF Network ]] <br />
<br />
We want to relate the results derived above to our RBF network. In an RBF network, as we can see on the right-hand side, we have a set of inputs, <math>\displaystyle x_1</math> to <math>\displaystyle x_d</math>, the hidden basis functions <math>\displaystyle \phi_{1}</math> to <math>\displaystyle \phi_{M}</math>, and outputs <math>\displaystyle y_1</math> to <math>\displaystyle y_k</math>. We also have weights from the hidden layer to the output layer, and each output is just a linear sum of the <math>\displaystyle \phi</math>'s. <br />
<br />
Now consider probability of <math>\displaystyle j</math> given <math>\displaystyle x</math> to be <math>\displaystyle \phi</math>, and the probability of <math>\displaystyle y_k</math> given <math>\displaystyle j</math> to be the weights <math>\displaystyle w_{jk}</math>, then the posterior can be written as,<br />
<br />
<math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math>.<br />
<br />
[[File:3.png|350px|thumb|left|Figure 26.1.2(1): Gaussian mixture ]]<br />
<br />
Now, let us look at an example in the one-dimensional case. Suppose,<br />
<br />
<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, and <math>\displaystyle j</math> is from 1 to 2. <br />
<br />
We know that <math>\displaystyle \phi</math> is a radial basis function. It's as if we put some Gaussian over data. And for each Gaussian, we consider the center <math>\displaystyle \mu</math>. Then, what <math>\displaystyle \phi</math> computes is the similarity of any data point to the center. <br />
<br />
We can see the graph on the left, which plots the densities of <math>\displaystyle \phi_{1}</math> and <math>\displaystyle \phi_{2}</math>. Take <math>\displaystyle \phi_{1}</math> for instance: as a point moves far from the center <math>\displaystyle \mu_{1}</math>, <math>\displaystyle \phi_{1}</math> shrinks to nearly zero. Remember that we can usually obtain a non-linear regression or classification of the input space by doing a linear one in some extended space or feature space (more details in the Aside). Here, the <math>\displaystyle \phi</math>’s actually produce that feature space. <br />
<br />
So, one way to look at this is that <math>\displaystyle \phi</math> tells us, given an input, how likely a particular feature is to be present. Say, for example, we define the features as the centers of these Gaussian distributions. Then, the <math>\displaystyle \phi</math> function computes, for a given data point, the probability of that feature appearing. If the data point is right at the center, then the value of that <math>\displaystyle \phi</math> is one, i.e. the probability is 1. If the point is far from the center, then the probability (the <math>\displaystyle \phi</math> value) is close to zero, i.e. the feature is unlikely. Therefore, we can treat <math>\displaystyle Pr(j | x)</math> as the probability of a particular feature given the data. <br />
<br />
When we have those features, <math>\displaystyle y</math> is a linear combination of them. Hence, each weight <math>\displaystyle w_{jk}</math>, which equals <math>\displaystyle Pr(y_k | j)</math>, tells us how likely this particular <math>\displaystyle y</math> is to appear given those features. Therefore, the weight <math>\displaystyle w_{jk}</math> gives the probability of class membership given the feature. <br />
<br />
Hence, we have found a probabilistic point of view to look at RBF Network!<br />
<br />
*'''Note''' There are some inconsistencies with this probabilistic point of view. There are no restrictions that force <math>\displaystyle Pr(y_k | x) = \sum_{j} \phi_{j}(x)*w_{jk}</math> to be between 0 and 1. So if least squares is used to solve this, <math>\displaystyle w_{jk}</math> cannot be interpreted as a probability. <br />
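To make this probabilistic reading concrete, here is a small Python sketch of the one-dimensional example (Python rather than the MATLAB used elsewhere on this page). Note that the explicit normalization of <math>\displaystyle \phi</math> and the row-stochastic weight matrix are our additions, made exactly so that the outputs behave as probabilities; as the note above explains, a general least-squares RBF Network gives no such guarantee.<br />

```python
import numpy as np

def rbf_posteriors(x, centers, sigma, W):
    """Forward pass with phi normalized so it can play the role of Pr(j | x)."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))
    phi = phi / phi.sum(axis=1, keepdims=True)   # normalization: our addition
    return phi @ W                               # sum_j Pr(j|x) * w_jk

centers = np.array([-1.0, 1.0])                  # two Gaussian centers, j = 1, 2
W = np.array([[0.8, 0.2],                        # rows act as Pr(y_k | j):
              [0.1, 0.9]])                       # each row sums to 1
x = np.linspace(-3, 3, 7)
out = rbf_posteriors(x, centers, 1.0, W)         # each row is Pr(y_k | x)
```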
<br />
<br />
''' Aside '''<br />
*Feature Space:<br />
:One way to produce a feature space is LDA<br />
:Suppose, we have n data points <math>\mathbf{x}_1</math> to <math>\mathbf{x}_n </math>. Each data point has d features. And these n data points consist of the <math>X</math> matrix, <br />
:<math>X = \left[ \begin{matrix}<br />
x_{11} & x_{21} & \cdots & x_{n1} \\<br />
x_{12} & x_{22} & \cdots & x_{n2} \\<br />
\vdots & & \ddots & \vdots \\<br />
x_{1d} & x_{2d} & \cdots & x_{nd}<br />
\end{matrix}\right] </math><br />
:Also, we have feature space,<br />
:<math>\Phi^{T} = \left[ \begin{matrix}<br />
\phi_{1}(\mathbf{x_1}) & \phi_{1}(\mathbf{x_2})& \cdots & \phi_{1}(\mathbf{x_n})\\<br />
\phi_{2}(\mathbf{x_1})& \phi_{2}(\mathbf{x_2})& \cdots & \phi_{2}(\mathbf{x_n}) \\<br />
\vdots & & \ddots & \vdots \\<br />
\phi_{M}(\mathbf{x_1}) & \phi_{M}(\mathbf{x_2}) & \cdots & \phi_{M}(\mathbf{x_n})<br />
\end{matrix}\right] </math> <br />
:If we want to solve a regression problem for the input data, we don’t perform Least Squares on this <math>\displaystyle X</math> matrix; we perform Least Squares on the feature space, i.e. on the <math>\displaystyle \Phi^{T}</math> matrix. The dimensionality of <math>\displaystyle \Phi^{T}</math> is M by n.<br />
:Now, we still have n data points, but we define them in terms of a new set of features. So, originally, we defined our data points by d features; now, we define them by M features. And what are those M features telling us? <br />
:Let us look at the first column of <math>\displaystyle \Phi^{T}</math> matrix. The first entry is <math>\displaystyle \phi_1</math> applied to <math>\mathbf{x_1}</math>, and so on, until the last entry is <math>\displaystyle \phi_M</math> applied to <math>\mathbf{x_1}</math>. Suppose each of these <math>\displaystyle \phi_j</math> is defined by<br />
:<math>\phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>.<br />
:Then, each <math>\displaystyle \phi_j</math> checks the similarity of the data point with its center. Hence, the new set of features are actually representing M centers in our data set, and for each data point, its new features check how this point is similar to the first center; how it is similar to the second center; and how it is similar to the <math>\displaystyle M^{th}</math> center. And this checking process will apply to all data points. Therefore, feature space gives another representation of our data set. <br />
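:The construction of <math>\displaystyle \Phi^{T}</math> can be sketched in a few lines of Python (for illustration only; the random data, centers, and <math>\displaystyle \sigma=1</math> are made-up placeholders):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 6, 3, 2
X = rng.normal(size=(d, n))          # data matrix: one column per point, as above
centers = rng.normal(size=(d, M))    # M Gaussian centers (assumed given)
sigma = 1.0

# Phi^T is M x n: entry (j, i) = phi_j(x_i), the similarity of point i to center j
sq_dist = ((X[:, None, :] - centers[:, :, None]) ** 2).sum(axis=0)  # shape (M, n)
PhiT = np.exp(-sq_dist / (2 * sigma ** 2))
```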
<br />
</noinclude><br />
<br />
=== Model selection or complexity control for RBF Network - a brief introduction ===<br />
In order to obtain a better fit for the training data, we often want to increase the complexity of our RBF Network. By its construction, the only way to change the complexity of a RBF Network is to add or remove basis functions. A large number of basis functions yields a more complex network. In theory, if we add enough basis functions, the RBF Network can fit any training set exactly; however, this does not mean the model can generalize well. Therefore, to avoid the overfitting problem (see Notes below), we only want to increase the number of basis functions up to a certain point, i.e. its optimal level. <br />
<br />
For model selection, what we usually do is estimate the training error. After working through the training error, we will see that it can in fact be decomposed, and one component of the decomposition is the Mean Squared Error (MSE). In the later notes, we will find that our final goal is to get a good estimate of the MSE. Moreover, in order to find an optimal model for our data, we select the model with the smallest MSE.<br />
<br />
Now, let us introduce some notations that we will use in the analysis:<br />
*<math>\hat f</math> -- the prediction model estimated by a RBF network from the training data<br />
*<math>\displaystyle f</math> -- the real model (not null), and ideally, we want <math>\hat f</math> to be close to <math>\displaystyle f</math><br />
*<math>\displaystyle err</math> -- the training error<br />
*<math>\displaystyle Err</math> -- the testing error<br />
*<math>\displaystyle MSE</math> -- the Mean Squared Error<br />
<br />
''' Notes '''<br />
<br />
[[File:overfitting.png|350px|thumb|left|Figure 26.2: Overfitting]]<br />
<br />
*Being more complex isn’t always a good thing. Sometimes, [http://en.wikipedia.org/wiki/Overfitting overfitting] causes the model to lose its generality. For example, in the graph on the left hand side, the data points are sampled from the model <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle f(x_i)</math> is a linear function, shown by the blue line, and <math>\displaystyle \epsilon_i</math> is additive Gaussian noise from <math>~N(0,\sigma^2)</math>. The red curve displayed in the graph shows the over-fitted model. Clearly, this over-fitted model only works for the training data, and is useless for prediction when new data points are introduced.<br />
<br />
> n<-20;<br />
> x<-seq(1,10,length=n);<br />
> alpha<-2.5;<br />
> beta<-1.75;<br />
> y<-alpha+beta*x+rnorm(n);<br />
> plot(y~x, pch=16, lwd=3, cex=0.5, main='Overfitting');<br />
> abline(alpha, beta, col='blue');<br />
> lines(spline(x, y), col = 2);<br />
<br />
*More details on this topic later on.<br />
<br />
<br />
<br />
</noinclude><br />
<br />
<br />
<br />
<br />
<br />
<br />
== '''Model Selection(Stein's Unbiased Risk Estimate)- November 11th, 2009''' ==<br />
<br />
===Model Selection===<br />
<br />
Model selection is a task of selecting a model of optimal complexity for a given data. Learning a radial basis function network from data is a parameter estimation problem. One difficulty with this problem is selecting parameters that show good performance on both training and testing data. In principle, a model is selected to have parameters associated with the best observed performance on training data, although our goal really is to achieve good performance on unseen testing data. Not surprisingly, a model selected on the basis of training data does not necessarily exhibit comparable performance on the testing data. When squared error is used as the performance index, a zero-error model on the training data can always be achieved by using a sufficient number of basis functions.<br />
<br />
<br />
But training error and testing error do not have a simple monotone relationship. In particular, a smaller training error does not necessarily result in a smaller testing error. In practice, one often observes that, up to a certain point, the model error on testing data tends to decrease as the training error decreases. However, if one attempts to decrease the training error too far by increasing model complexity, the testing error often increases dramatically.<br />
<br />
<br />
The basic reason behind this phenomenon is that in the process of minimizing training error, after a certain point, the model begins to over-fit the training set. Over-fitting in this context means fitting the model to training data at the expense of losing generality. In the extreme form, a set of <math>\displaystyle N</math> training data points can be modeled exactly with <math>\displaystyle N</math> radial basis functions. Such a model follows the training data perfectly. However, the model does not represent the features of the true underlying data source, and this is why it fails to correctly model new data points.<br />
<br />
<br />
In general, the training error will be less than the testing error on new data. A model typically adapts to the training data, and hence the training error will be an overly optimistic estimate of the testing error. A natural way to estimate the testing error well is to add a penalty term to the training error to compensate. SURE is developed based on this idea.<br />
<br />
<br />
<br />
===Stein's unbiased risk estimate (SURE)===<br />
<br />
<br />
====Important Notation====<br />
<br />
Let:<br />
*<math>\hat f(X)</math> denote the ''prediction model'', which is estimated from a training sample by the RBF neural network model.<br />
*<math>\displaystyle f(X)</math> denote the ''true model''.<br />
*<math>\displaystyle err=\sum_{i=1}^N (\hat y_i-y_i)^2 </math> denote the ''training error'', the sum of squared errors over the training sample.<br />
*<math>\displaystyle Err=\sum_{i=1}^m (\hat y_i-y_i)^2 </math> denote the ''test error'', the sum of squared errors over an independent test sample of size <math>\displaystyle m</math>.<br />
*<math>\displaystyle MSE=E(\hat f-f)^2</math> denote the ''mean squared error'', where <math>\hat f(X)</math> is the estimated model and <math>\displaystyle f(X)</math> is the true model.<br />
<br />
<br />
<br />
Suppose the observations satisfy <math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. We need to estimate <math>\hat f</math> from the training data set <math>T=\{(x_i,y_i)\}_{i=1}^N</math>. Let <math>\hat f_i=\hat f(x_i)</math> and <math>\displaystyle f_i= f(x_i)</math>, then <br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i-\epsilon_i)^2]</math><math>=E[(\hat f_i-f_i)^2]+E[\epsilon_i^2]-2E[\epsilon_i(\hat f_i-f_i)]</math><br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]</math> <math>\displaystyle (1)</math><br />
<br />
The last term can be written as:<br />
<br />
<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=E[(y_i-f_i)(\hat f_i-f_i)]=cov(y_i,\hat f_i)</math>, since <math>\displaystyle y_i</math> has mean <math>\displaystyle f_i</math>.<br />
<br />
<br />
<br />
====Stein's Lemma====<br />
<br />
If <math>\,Z</math> is <math>\,N(\mu,\sigma^2)</math> and <math>\displaystyle g(Z)</math> is weakly differentiable, such that <math>\displaystyle E[\vert g'(Z)\vert]<\infty</math>, then <math>\displaystyle E[g(Z)(Z-\mu)]=\sigma^2E[g'(Z)]</math>.<br />
<br />
<br />
According to Stein's Lemma, the last cross term of <math>\displaystyle (1)</math>, <math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]</math> can be written as <math>\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math>. The derivation is as follows.<br />
<br />
<math>\displaystyle Proof</math>:<br />
<math>\displaystyle E[g(Z)(Z-\mu)]=E[(\hat f-f)\epsilon]=\sigma^2E(g'(Z))=\sigma^2 E[\frac {\partial (\hat f-f)}{\partial y_i}]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}-\frac {\partial f}{\partial y_i}]</math><br />
<br />
<br />
Since <math>\displaystyle f</math> is the true model, not the function of the observations <math>\displaystyle y_i</math>, then <math>\frac {\partial f}{\partial y_i}=0</math>.<br />
<br />
So,<math>\displaystyle E[\epsilon_i(\hat f_i-f_i)]=\sigma^2 E[\frac {\partial \hat f}{\partial y_i}]</math> <math>\displaystyle (2)</math><br />
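Stein's lemma itself is easy to check by Monte Carlo. The following Python sketch (the choice <math>\displaystyle g(Z)=Z^2</math> and the values of <math>\displaystyle \mu,\sigma</math> are arbitrary) compares <math>\displaystyle E[g(Z)(Z-\mu)]</math> against <math>\displaystyle \sigma^2E[g'(Z)]</math>; for this <math>\displaystyle g</math> both sides equal <math>\displaystyle 2\mu\sigma^2</math> analytically.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 0.5
Z = rng.normal(mu, sigma, size=2_000_000)

g = Z ** 2                               # g(Z) = Z^2, so g'(Z) = 2Z
lhs = np.mean(g * (Z - mu))              # Monte Carlo E[g(Z)(Z - mu)]
rhs = sigma ** 2 * np.mean(2 * Z)        # Monte Carlo sigma^2 E[g'(Z)]
```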
<br />
<br />
<br />
====Two Different Cases====<br />
<br />
=====''Case 1''=====<br />
<br />
Consider the case in which a new data point is used to assess the estimated model, i.e. <math>(x_i,y_i)\not\in\tau</math>; this new point belongs to the validation set <math>\displaystyle \nu</math>, i.e. <math>(x_i,y_i)\in\nu</math>. Since <math>\displaystyle y_i</math> is a new point, <math>\hat f</math> and <math>\displaystyle y_i</math> are independent. Therefore <math>\displaystyle cov(y_i,\hat f)=0</math> (equivalently, think about <math>\frac{\partial \hat f}{\partial y_i}</math>: when <math>\,y_i</math> is a new point, it has no influence on <math>\hat f</math>, because <math>\hat f</math> is estimated from the training data alone, so <math>\frac{\partial \hat f}{\partial y_i}=0</math>), and <math>\displaystyle (1)</math> in this case can be written as:<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2</math>. <br />
<br />
This expectation means <math>\frac {1}{m}\sum_{i=1}^m (\hat y_i-y_i)^2 = \frac {1}{m}\sum_{i=1}^m (\hat f_i-f_i)^2+ \sigma^2</math>.<br />
<br />
<math>\sum_{i=1}^m (\hat y_i-y_i)^2 = \sum_{i=1}^m (\hat f_i-f_i)^2+ m\sigma^2</math><br />
<br />
Based on the notation defined above, we obtain:<br />
<math>\displaystyle MSE=Err-m\sigma^2</math><br />
<br />
<br />
<br />
This is the justification behind the technique of cross validation. Since <math>\displaystyle \sigma^2</math> is constant, minimizing <math>\displaystyle MSE</math> is equivalent to minimizing the test error <math>\displaystyle Err</math>. In cross validation, to avoid overfitting or underfitting, the validation data set is kept independent of the estimated model.<br />
<br />
<br />
=====''Case 2''=====<br />
<br />
A more interesting case is the one in which we do not use new data points to assess the performance of the estimated model, and the training data is used both for estimating and for assessing the model <math>\hat f_i</math>. In this case the cross term in <math>\displaystyle (1)</math> cannot be ignored, because <math>\hat f_i</math> and <math>\displaystyle y_i</math> are not independent. The cross term can instead be handled by Stein's lemma, which was originally proposed for estimating the mean of a Gaussian distribution.<br />
<br />
<br />
Suppose <math>(x_i,y_i)\in\tau</math>, then by applying Stein's lemma, we obtain <math>\displaystyle (2)</math> proved above.<br />
<br />
<math>\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E[\frac {\partial \hat f}{\partial y_i}]</math>.<br />
<br />
This expectation means <math>\frac {1}{N}\sum_{i=1}^N (\hat y_i-y_i)^2 = \frac {1}{N}\sum_{i=1}^N (\hat f_i-f_i)^2+ \sigma^2-\frac {2\sigma^2}{N}\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<br />
<math>\sum_{i=1}^N (\hat y_i-y_i)^2 = \sum_{i=1}^N (\hat f_i-f_i)^2+ N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math>.<br />
<br />
<math>\displaystyle err=MSE+N\sigma^2-2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math><br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}</math> <math>\displaystyle (3)</math><br />
<br />
In statistics, <math>\displaystyle (3)</math> is known as [http://www.reference.com/browse/Stein%27s+unbiased+risk+estimate Stein's unbiased risk estimate (SURE)], an unbiased estimator of the mean squared error of a given estimator in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean squared error of an estimator generally depends on the value of the unknown parameter, and thus cannot be determined completely.<br />
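The unbiasedness can be demonstrated numerically for a simple linear smoother <math>\hat f_i=\lambda y_i</math>, a hypothetical estimator chosen only because <math>\frac {\partial \hat f_i}{\partial y_i}=\lambda</math> is known in closed form. The Python sketch below (true function, <math>\displaystyle \lambda</math>, and sample sizes are arbitrary choices) averages the SURE value over many simulated training sets and compares it with the average of <math>\sum_i (\hat f_i-f_i)^2</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma, lam = 50, 1.0, 0.8
f = np.sin(np.linspace(0, 3, N))        # true model values (arbitrary choice)

sure_vals, true_vals = [], []
for _ in range(5000):
    y = f + rng.normal(0, sigma, N)
    fhat = lam * y                      # linear smoother: d(fhat_i)/d(y_i) = lam
    err = np.sum((fhat - y) ** 2)
    # SURE: err - N*sigma^2 + 2*sigma^2 * sum_i d(fhat_i)/d(y_i)
    sure = err - N * sigma ** 2 + 2 * sigma ** 2 * (N * lam)
    sure_vals.append(sure)
    true_vals.append(np.sum((fhat - f) ** 2))
```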
<br />
<br />
<br />
===SURE for RBF Network===<br />
<br />
Based on SURE, the optimum number of basis functions should be the one with the minimum estimated <math>\displaystyle MSE</math>. For the Radial Basis Function Network, setting <math>\frac{\partial err}{\partial W}</math> equal to zero gives the least squares solution <math>\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math>. Then we have <math>\hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math>, where <math>\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math> is the hat matrix for this model.<br />
<br />
<br />
<math>\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{in}y_n</math> <math>\displaystyle (4)</math><br />
<br />
where <math>\,H</math> depends on the input vector <math>\displaystyle x_i</math> but not on <math>\displaystyle y_i</math>. <br />
<br />
By taking the derivative of <math>\hat f_i</math> with respect to <math>\displaystyle y_i</math>, we can easily obtain:<br />
<br />
<math>\sum_{i=1}^N \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Now, substituting this into <math>\displaystyle (3)</math>, we get<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2\sum_{i=1}^N \,H_{ii}</math><br />
<br />
Here, we can tell that <math>\sum_{i=1}^N \,H_{ii}= \,Trace(H)</math>, the sum of the diagonal elements of <math>\,H</math>. Thus, we obtain the further simplification <math>\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=M</math>, where <math>\,M</math> is the number of basis functions, i.e. the number of columns of <math>\displaystyle \Phi</math>. If an intercept is included, then <math>\,Trace(H)= M+1</math>.<br />
<br />
Then,<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1)</math>.<br />
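The trace identity is easy to verify numerically. In this Python sketch (the random <math>\displaystyle \Phi</math> is a stand-in for the actual basis-function outputs), the hat matrix is a projection, so its trace equals <math>\,M</math> and it is idempotent.<br />

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 40, 5
Phi = rng.normal(size=(N, M))                    # stand-in for basis outputs
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T     # hat matrix H = Phi (Phi^T Phi)^-1 Phi^T
```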
<br />
===SURE Algorithm===<br />
<br />
<br />
[[File:27.1.jpg|350px|thumb|right|Figure 27.1]]<br />
<br />
We use this method to find the optimum number of basis functions by choosing the model with the smallest MSE over the set of models considered. Given a set of models <math>\hat f_M(x)</math> indexed by the number of basis functions <math>\displaystyle M</math>, we compute the training error <math>\displaystyle err(M)</math> for each. <br />
<br />
Then, <math>\displaystyle MSE(M)=err(M)-N\sigma^2+2\sigma^2(M+1)</math><br />
<br />
where <math>\displaystyle N</math> is the number of training samples, and the noise variance <math>\sigma^2</math> can be estimated from the training data as<br />
<br />
<math>\hat \sigma^2=\frac {1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2</math>.<br />
<br />
<br />
By applying the SURE algorithm to the SPECT Heart data, we find that the optimal number of basis functions is <math>\displaystyle M=4</math>.<br />
<br />
<br />
Please see Figure 27.1 on the right, which shows that <math>\displaystyle MSE</math> is smallest when <math>\displaystyle M=4</math>.<br />
<br />
<br />
Calculating the SURE value is easy if you have access to <math>\,\sigma</math>.<br />
<br />
sure_Err = err - num_data_point * sigma .^ 2 + 2 * sigma .^ 2 * (num_basis_functions + 1);<br />
<br />
If <math>\,\sigma</math> is not known, it can be estimated using the error.<br />
<br />
err = sum((output - expected_output) .^ 2);    % training error (sum of squared residuals)<br />
sigma2 = err / (num_data_point - 1);           % estimated noise variance<br />
sure_Err = err - num_data_point * sigma2 + 2 * sigma2 * (num_basis_functions + 1);<br />
<br />
=='''SURE for RBF network & Support Vector Machine - November 13th, 2009'''==<br />
<br />
===SURE for RBF network===<br />
<br />
====Minimizing MSE====<br />
<br />
By Stein's unbiased risk estimate (SURE) for Radial Basis Function (RBF) Network<br />
we get:<br />
<br />
<math>\displaystyle MSE=err-N\sigma^2+2\sigma^2(M+1) </math> (28.1)<br />
<br />
*<math>\displaystyle MSE</math> (mean squared error) <math>= \sum_{i=1}^N (\hat f_i-f_i)^2 </math><br />
*<math>\displaystyle err</math> (training error) <math>= \sum_{i=1}^N (\hat y_i-y_i)^2 </math><br />
*<math>\displaystyle (M+1) </math> (number of hidden units) <math>= \sum_{i=1}^N \frac {\partial \hat f}{\partial y_i} </math><br />
<br />
<br />
'''Goal''': To minimize MSE<br />
<br />
1. If <math>\displaystyle \sigma </math> is known, then <math>\displaystyle N\sigma^2</math> is a constant with no impact on the choice of model, and we can ignore it. We only need to minimize <math>\displaystyle err +2\sigma^2(M+1)</math>.<br />
<br />
2. In reality, we do not know <math>\displaystyle \sigma</math>, and its estimate changes as <math>\displaystyle (M+1) </math> changes. However, we can estimate <math>\displaystyle \sigma </math>.<br />
<br />
<math>\displaystyle y_i= f(x_i)+\epsilon_i</math>, where <math>\displaystyle \epsilon_i</math> is additive Gaussian noise <math>~N(0,\sigma^2)</math>. Suppose we do not know the variance of <math>\displaystyle \epsilon</math>. Then, <br />
<br />
<math>\displaystyle \hat\sigma^2=\frac{1}{N-1}\sum_{i=1}^N (\hat y_i-y_i)^2 =\frac{1}{N-1}err</math> (28.2)<br />
<br />
Substitute (28.2) into (28.1), get<br />
<br />
<math>\displaystyle MSE=err-N\frac{1}{N-1}err+2\frac{1}{N-1}err(M+1)</math><br />
<br />
<math>\displaystyle MSE=err(1-\frac{N}{N-1}+\frac{2(M+1)}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{N-1-N+2M+2}{N-1})</math><br />
<br />
<math>\displaystyle MSE=err(\frac{2M+1}{N-1}) </math> (28.3) <br />
<br />
<br />
[[File:28.1.jpg|350px|thumb|Figure 28.1: MSE vs err]]<br />
<br />
Figure 28.1: the training error decreases and the MSE increases as the number of hidden units increases (i.e. as the model becomes more complex).<br />
<br />
<br />
When the number of hidden units gets larger and larger, the training error decreases toward <math>\displaystyle 0 </math>. If the training error approaches <math>\displaystyle 0 </math>, then no matter how large <math>\displaystyle (M+1) </math> is, from (28.3) the MSE estimate would approach <math>\displaystyle 0 </math> as well. In fact this does not happen: when the training error is close to <math>\displaystyle 0 </math>, [http://en.wikipedia.org/wiki/Overfitting overfitting] occurs, and the MSE increases instead of approaching <math>\displaystyle 0 </math>. We can see this in Figure 28.1. <br />
<br />
<br />
Note that <math>\displaystyle \hat\sigma^2 </math> in (28.2) is essentially an average of <math>\displaystyle err </math>. To deal with this problem, we can estimate <math>\displaystyle err</math> for models with different numbers of hidden units and compare them. For example, we can first take 1 hidden unit, then take 10 hidden units in the next model.<br />
<br />
We can also see that, unlike the classical Cross Validation (CV) or Leave One Out (LOO) techniques, the SURE technique does not need a validation step to find the optimal model. Hence, SURE uses less data than CV or LOO, which makes it suitable when there is not enough data for validation. However, to implement SURE we need to find <math>\frac {\partial \hat f}{\partial y_i}</math>, which may not be trivial for models that do not have a closed-form solution.<br />
<br />
====Kmeans Clustering====<br />
<br />
Description:<br /> [http://en.wikipedia.org/wiki/K-means_clustering Kmeans clustering] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.<br />
<br />
*The number of hidden units is the same as the number of clusters; each cluster <math>\displaystyle j</math> corresponds to one basis function <math>\displaystyle \phi_j </math>.<br />
<br />
*<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math>, with the same form for all clusters.<br />
<br />
The basic details for <math>K</math>-means clustering are given:<br />
<br />
The <math>K</math> initial centers are randomly chosen from the training data.<br />
<br />
Then the following two steps are iterated alternately until convergence.<br />
<br />
1. For each existing center, re-identify its cluster (every point in this cluster should be closer to this center than to any other center).<br />
<br />
2. Compute the mean of each cluster and make it the new center of that cluster.<br />
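The two steps above can be sketched directly in a few lines of Python (for illustration only; the MATLAB <code>kmeans</code> example below does the same job via the toolbox). The made-up test data consists of two well-separated blobs, which the algorithm should split cleanly.<br />

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal K-means sketch of the two alternating steps above."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Step 1: assign every point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Step 2: move each center to the mean of its cluster
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-5, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centers = kmeans(X, 2)
```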
<br />
<br />
Example:<br /><br />
Partition data into 2 clusters (2 hidden values)<br />
<br />
<br />
>> X=rand(30,80); <br />
>> [IDX,C,sumD,D]=kmeans(X,2); <br />
>> size(IDX) <br />
>> 30 1<br />
>> size(C) <br />
>> 2 80<br />
>> size(sumD) <br />
>> 2 1<br />
>> c1=sum(IDX==1)<br />
>> 14<br />
>> c2=sum(IDX==2)<br />
>> 16<br />
>> sumD<br />
>> 85.6643<br />
>> 101.0419<br />
>> v1=sumD(1,1)/c1 <br />
>> 6.1189<br />
>> v2=sumD(2,1)/c2 <br />
>> 6.3151 <br />
<br />
<br />
<br />
Comments:<br />
<br />
We create <math>X</math> randomly as a training set with 30 data points and 80 dimensions, and then apply the “kmeans” method to separate <math>X</math> into 2 clusters. IDX is a vector containing 1s and 2s indicating the 2 clusters, and its size is 30*1. <math>\displaystyle C </math> holds the centers (means) of the clusters, with size 2*80; sumD is the sum of squared distances between the data points and the centers of their clusters. <math>\displaystyle c1 </math> and <math>\displaystyle c2 </math> are the numbers of data points in clusters 1 and 2. <math>\displaystyle v1 </math> is the average squared distance within the first cluster, used as its variance estimate <math>\displaystyle (v1=\sigma_1^2)</math>; <math>\displaystyle v2 </math> is the same for the second cluster <math>\displaystyle (v2=\sigma_2^2)</math>. Now we can get <math>\displaystyle \phi </math>, <math>\displaystyle w </math>, the hat matrix <math>\displaystyle H </math>, and <math>\displaystyle \hat Y </math> by the following equations. Finally, we obtain the <math>\displaystyle MSE </math> and predict on the test set. <br />
<br />
<math>\displaystyle \phi_{j}(x) = e^{\frac{-\Vert x - \mu_{j}\Vert ^2}{2\sigma_{j}^2}}</math><br />
<br />
<math>\displaystyle W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y</math><br />
<br />
<math>\displaystyle H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}</math><br />
<br />
<math>\displaystyle \hat{Y} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY</math><br />
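The whole pipeline, from basis functions to the hat matrix and fitted values, can be sketched in Python (illustration only; the toy data is a noisy sine, and the centers and width are fixed here as if they had come from a clustering step):<br />

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
x = np.linspace(-3, 3, n)
y = np.sin(x) + 0.1 * rng.normal(size=n)

# Centers/width as if produced by a clustering step (fixed here for clarity)
mu = np.array([-2.0, 0.0, 2.0])
sigma = 1.0

Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))  # n x M design
W = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)                        # (Phi^T Phi)^-1 Phi^T Y
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T                       # hat matrix
y_hat = Phi @ W                                                    # fitted values HY
```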
<br />
<br />
<br />
Aside:<br />
<br />
Similar in spirit to <math>K</math>-means, there is the EM algorithm for the Gaussian mixture model. Generally speaking, the Gaussian mixture model performs soft clustering, while <math>K</math>-means performs hard clustering.<br />
<br />
As in <math>K</math>-means, the following two steps are iterated alternately until convergence.<br />
<br />
E-step: each point is assigned a weight (responsibility) for each cluster, based on the likelihood of the point under the corresponding Gaussian. Unlike <math>K</math>-means, these weights are soft values between 0 and 1 rather than hard 0/1 assignments.<br />
<br />
M-step: compute the weighted means and covariances and make them the new means and covariances of each cluster.<br />
<br />
>>[P,mu,phi,lPxtr]=mdgEM(X,2,200,0);<br />
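Since mdgEM is a course-provided function, here is a minimal one-dimensional EM sketch in Python showing the E- and M-steps explicitly (the bimodal toy data and the percentile-based initialization are our own choices):<br />

```python
import numpy as np

def gmm_em_1d(x, k=2, n_iter=50):
    """Minimal 1-D Gaussian-mixture EM sketch (the soft analogue of K-means)."""
    mu = np.percentile(x, np.linspace(25, 75, k))   # crude deterministic init
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility r[i, j] = Pr(component j | x_i), a soft weight
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted means, variances, and mixing proportions
        Nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return mu, var, pi, r

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-4, 1, 200), rng.normal(4, 1, 200)])
mu, var, pi, r = gmm_em_1d(x)
```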
<br />
===Support Vector Machine===<br />
<br />
====Introduction====<br />
We have seen that linear discriminant analysis and logistic regression both estimate linear decision boundaries in similar but slightly different ways. Separating hyperplane classifiers provide the basis for the support vector classifier, which constructs linear decision boundaries that explicitly try to separate the data into different classes as well as possible. The techniques that extend this to the nonseparable case, where the classes overlap, are generalized into what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space.<br />
<br />
Definition: <br /><br />
[http://en.wikipedia.org/wiki/Support_vector_machine Support Vector Machines (SVM)] are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.<br />
<br />
====Optimal Separating Hyperplane====<br />
<br />
[[File:28.2.jpg|350px|thumb|right|Figure 28.2]]<br />
<br />
Figure 28.2 An example with two classes separable by a hyperplane. The blue line is the least squares solution, which misclassifies one of the training points. Also shown are the black separating hyperplanes found by the [http://en.wikipedia.org/wiki/Perceptron perceptron] learning algorithm with different random starts.<br /><br />
<br />
We can see that the data points, which can be separated by a linear boundary, fall into two classes in <math>\displaystyle \mathbb{R}^{2} </math>. If a dataset is indeed linearly separable, then there exist infinitely many possible separating hyperplanes, two of which are shown as the black lines in the figure. However, which solution is best when new data are introduced? <br /><br />
<br />
Aside: <br /><br />
The blue line is the least squares solution to the problem,obtained by regressing the <math>\displaystyle -1/+1 </math> response <math>\displaystyle Y </math> on <math>\displaystyle X </math> (with intercept); the line is given by<br />
<math>\displaystyle {X:\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2=0}</math>.<br />
This least squares solution does not do a perfect job in separating the points, and makes one error. This is the same boundary found by linear discriminant analysis, in light of its equivalence with linear regression in the two-class case.<br />
<br />
Classifiers such as (28.4) that compute a linear combination of the input features and return the sign were called ''perceptrons'' in the engineering literature in the late 1950s. <br />
<br />
<br />
Identifications:<br />
<br />
*Hyperplane: separates the two classes <br />
<br />
<math>\displaystyle x^{T}\beta+\beta_0=0</math><br />
<br />
*Margin: the distance between the hyperplane and the closest point. For each point, define<br />
<br />
<math>\displaystyle d_i=x_i^{T}\beta+\beta_0 </math> where <math>\displaystyle i=1,....,N</math><br />
<br />
Note: to obtain a positive quantity for correctly classified points, we use <math>\displaystyle y_id_i</math>: if the point is on the <math>\displaystyle +1 </math> side this is <math>\displaystyle d_i\cdot(+1)</math>, and on the <math>\displaystyle -1 </math> side it is <math>\displaystyle d_i\cdot(-1)</math>.<br />
<br />
*Data points: <math>\displaystyle y_i\in\{-1,+1\}</math>. We can classify points as <math>\displaystyle sign\{d_i\}</math> if <math>\displaystyle \beta,\beta_0 </math> are known.<br /><br />
<br />
====Maximum Margin Classifiers====<br />
Choose the hyperplane farthest from both classes, i.e., the one with the maximum distance from the closest point (maximize the margin).<br /><br />
<br />
<math>\displaystyle Margin=\min\{y_id_i\}</math> <math>\displaystyle i=1,2,\dots,N </math> <br />
where <math>\displaystyle y_i </math> is label and <math>\displaystyle d_i </math> is distance<br /><br />
<br />
[[File:28.3.jpg|350px|thumb|right|Figure 28.3 The linear algebra of a hyperplane]]<br />
<br />
<br />
<br />
Figure 28.3 depicts a hyperplane defined by the equation <math>\displaystyle x^{T}\beta+\beta_0=0</math>. Since we are in <math>\displaystyle \mathbb{R}^{2} </math>, the hyperplane is a line.<br /><br />
<br />
<br />
Properties:<br /><br />
<br />
1. <math>\displaystyle \beta </math> is orthogonal to the hyperplane <br /><br />
<br />
Consider two points <math>\displaystyle x_1,x_2</math> lying on the hyperplane. Then:<br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_2+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_1+\beta_0-(\beta^{T}x_2+\beta_0)=0</math><br />
<br />
<math>\displaystyle \beta^{T}(x_1-x_2)=0</math><br />
<br />
Hence, <math>\displaystyle \beta </math> is orthogonal to <math>\displaystyle (x_1-x_2)</math>, and <math>\displaystyle \beta^*=\frac{\beta}{\|\beta\|} </math> is the unit vector normal to the hyperplane.<br /><br />
<br />
2. For any point <math>\displaystyle x_0 </math> on the hyperplane, <br />
<br />
<math>\displaystyle \beta^{T}x_0+\beta_0=0</math><br />
<br />
<math>\displaystyle \beta^{T}x_0=-\beta_0</math><br />
<br />
<br />
3. The signed distance of any point <math>\displaystyle x </math> to the hyperplane: since only the direction of <math>\displaystyle \beta </math> matters, we normalize by its length.<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}(x_i-x_0)}{\|\beta\|} </math> <math>\displaystyle i=1,2,\dots,N </math><br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i-\beta^{T}x_0}{\|\beta\|} </math><br />
<br />
by property 2<br />
<br />
<math>\displaystyle d_i=\frac{\beta^{T}x_i+\beta_0}{\|\beta\|} </math><br />
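This formula can be checked numerically (a Python sketch with hypothetical <math>\displaystyle \beta,\beta_0 </math>): for the line <math>\displaystyle 3x_1+4x_2-5=0</math>, the point <math>\displaystyle (3,4)</math> is at Euclidean distance 4.<br />

```python
import math

# Check property 3: (beta^T x + beta_0)/||beta|| is the signed Euclidean
# distance to the hyperplane. Hypothetical hyperplane: 3*x1 + 4*x2 - 5 = 0.
beta, beta0 = [3.0, 4.0], -5.0
norm = math.hypot(beta[0], beta[1])   # ||beta|| = 5

def signed_dist(x):
    return (beta[0] * x[0] + beta[1] * x[1] + beta0) / norm

# (3, -1) lies on the hyperplane: 3*3 + 4*(-1) - 5 = 0, so its distance is 0;
# (3, 4) is at signed distance (9 + 16 - 5)/5 = 4.
```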
<br />
<br />
<br />
[[File:4.jpg|350px|thumb|right|Figure 28.4]]<br />
<br />
<br />
<math>\displaystyle Margin=\min(y_id_i)</math> <math>\displaystyle i=1,2,\dots,N </math><br />
<br />
<math>\displaystyle Margin=min\{y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\} </math><br />
<br />
Suppose <math>\displaystyle x_i </math> is not on the hyperplane. Then<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)>0 </math><br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq c </math> for some <math>\displaystyle c>0 </math><br />
<br />
<br />
<math>\displaystyle y_i(\frac{\beta^{T}x_i}{c}+\frac{\beta_0}{c})\geq1</math> <br />
<br />
This is known as the canonical representation of the decision hyperplane.<br />
<br />
For <math>\displaystyle \beta </math> only the direction matters, so rescaling to <math>\displaystyle \frac{\beta}{c} </math> does not change the direction, and the hyperplane remains the same.<br />
<br />
<math>\displaystyle y_i(\beta^{T}x_i+\beta_0)\geq1 </math><br />
<br />
<math>\displaystyle y_i\frac{\beta^{T}x_i+\beta_0}{\|\beta\|}\geq\frac{1}{\|\beta\|} </math><br />
<br />
<math>\displaystyle Margin=\frac{1}{\|\beta\|} </math><br />
<br />
so maximizing the margin is equivalent to minimizing <math>\displaystyle \|\beta\| </math>.<br />
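Collecting the derivation, the maximum-margin problem can be summarized compactly (a standard statement, using the same notation as above):<br />

```latex
\max_{\beta,\,\beta_0}\; \frac{1}{\|\beta\|}
\quad \text{subject to} \quad
y_i(\beta^{T}x_i+\beta_0) \ge 1, \qquad i=1,\dots,N
```

Since maximizing <math>\displaystyle 1/\|\beta\|</math> is the same as minimizing <math>\displaystyle \|\beta\|</math> (or <math>\displaystyle \tfrac{1}{2}\|\beta\|^2</math>), this is a quadratic objective with linear constraints, which the next section solves via its Lagrangian.<br />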
<br />
<br />
<br />
<br />
Reference:<br /><br />
Hastie, T., Tibshirani, R., Friedman, J. (2008). ''The Elements of Statistical Learning'', pp. 129-130.<br />
<br />
=='''Optimizing The Support Vector Machine - November 16th, 2009'''==<br />
The Support Vector Machine is used to find the maximum margin hyperplane, assuming the two classes are separable. The margin is <math>\,min\{y_id_i\}</math>, where <math>\,d_i</math> is the signed distance of point <math>\,i</math> from the hyperplane and <math>\,y_i</math> supplies the sign.<br />
===Maximizing the Support Vector Machine===<br />
<math>\,Margin=min\{y_id_i\}</math> can be rewritten as <math>\,min\left\{\frac{y_i\left(\beta^Tx_i+\beta_0\right)}{|\beta|}\right\}</math>. <br />
<br />Note that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) = 0</math> if <math>\,x_i</math> is on the hyperplane, and <math>\,y_i\left(\beta^Tx_i+\beta_0\right) > 0</math> if <math>\,x_i</math> is ''not'' on the hyperplane.<br />
<br />
This implies <math>\,\exists\, C > 0</math> such that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq C</math> for all <math>\,i</math>.<br />
<br />
Divide through by C to produce <math>\,y_i\left(\frac{\beta^T}{C}x_i + \frac{\beta_0}{C}\right) \geq 1</math>. <br />
<br />
<math>\,\beta, \beta_0</math> determine the hyperplane only through their direction, so dividing through by a constant does not change the hyperplane. Thus, by rescaling <math>\,\beta, \beta_0</math> we can eliminate C, so that <math>\,y_i\left(\beta^Tx_i+\beta_0\right) \geq 1</math>, implying that the lower bound on <math>\,y_i\left(\beta^Tx_i+\beta_0\right)</math> is <math>\displaystyle 1</math>.<br />
<br />
Now, in order to maximize the margin, we simply need to maximize <math>\,\frac{1}{\|\beta\|}</math>, i.e., minimize <math>\,\|\beta\|</math>. <br />
<br />
In other words, find the minimum <math>\,\|\beta\|</math>, s.t. <math>\,min_i\{y_i(\beta^Tx_i+\beta_0)\} = 1</math>.<br />
<br />
Note that we're dealing with the norm of <math>\,\beta</math>. The 1-norm of a vector is the sum of the absolute values of its elements (also known as the taxicab or Manhattan distance); it is sometimes argued to be more accurate, but has a discontinuity in its derivative. The 2-norm, the Euclidean norm (the intuitive length of the vector), is easier to work with: <math>\,\|\beta\|_2 = (\beta^T\beta)^{1/2}</math>. For convenience, we will minimize <math>\,\frac{1}{2}\|\beta\|_2^2 = \frac{1}{2}\beta^T\beta</math>.<br />
<br />
This is an example of a quadratic programming problem: we minimize a quadratic function subject to linear inequality constraints.<br />
<br />
<br />
====Writing Lagrangian Form of Support Vector Machine====<br />
The Lagrangian form is introduced to ensure that the conditions are satisfied, as well as finding an optimal solution. <math>\,\alpha_i</math> are introduced as dual constraints. A [http://www.cs.wisc.edu/dmi/lsvm/ simple algorithm] for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.<br />
<br />
<math>\,L(\beta,\beta_0,\alpha) = \frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\alpha_i\left(y_i(\beta^Tx_i+\beta_0)-1\right)}</math>. To find the optimal value, set the derivative equal to zero.<br />
<br />
<math>\,\frac{\partial L}{\partial \beta} = 0</math>, <math>\,\frac{\partial L}{\partial \beta_0} = 0</math>. Note that <math>\,\frac{\partial L}{\partial \alpha_i}</math> is equivalent to the constraints <math>\left(y_i(\beta^Tx_i+\beta_0)-1\right) \geq 0, \,\forall\, i</math><br />
<br />
First, <math>\,\frac{\partial L}{\partial \beta} = \frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 - \sum_{i=1}^n{\left\{\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i)+\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0-\frac{\partial}{\partial \beta}\alpha_iy_i\right\}}</math><br />
<br />
: <math>\frac{\partial}{\partial \beta}\frac{1}{2}\|\beta\|_2^2 = \beta</math>.<br />
<br />
: <math>\,\frac{\partial}{\partial \beta}(\alpha_iy_i\beta^Tx_i) = \alpha_iy_ix_i</math><br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i\beta_0 = 0</math>. <br />
<br />
: <math>\,\frac{\partial}{\partial \beta}\alpha_iy_i = 0</math>.<br />
<br />
So this simplifies to <math>\,\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n{\alpha_iy_ix_i} = 0</math>. In other words,<br />
<br />
<math>\,\beta = \sum_{i=1}^n{\alpha_iy_ix_i}</math>, <math>\,\beta^T = \sum_{i=1}^n{\alpha_iy_ix_i^T}</math><br />
<br />
Similarly, <math>\,\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n{\alpha_iy_i} = 0</math>.<br />
<br />
This allows us to rewrite the Lagrangian without <math>\,\beta</math>.<br />
<br />
<math>\,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} - \sum_{i=1}^n{\alpha_i\left(y_i\left(\sum_{j=1}^n{\alpha_jy_jx_j^Tx_i} + \beta_0\right) - 1\right)}</math>. <br />
<br />
Because <math>\,\sum_{i=1}^n{\alpha_iy_i} = 0</math>, and <math>\,\beta_0</math> is constant, <math>\,\sum_{i=1}^n{\alpha_iy_i\beta_0} = 0</math>. So this simplifies further, to<br />
<br />
<math>L(\alpha) = \,-\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}} + \sum_{i=1}^n{\alpha_i}</math><br />
This is the dual representation of the maximum-margin problem.<br />
<br />
Because the <math>\,\alpha_i</math> are Lagrange multipliers for inequality constraints, <math>\,\alpha_i \geq 0 \,\forall i</math>.<br />
<br />
This is a much simpler optimization problem.<br />
<br />
=='''The Support Vector Machine algorithm - November 18, 2009'''==<br />
<br />
===Solving the Lagrangian===<br />
<br />
Continuing from the above derivation, we now have the objective that we need to maximize, along with two constraints.<br />
<br />
The Support Vector Machine problem boils down to:<br />
<br />
<math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math><br />
:such that <math>\alpha_i \geq 0</math><br />
:and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
<br />
We are optimizing over <math>\,\alpha</math>, which is our only unknown. Once we know <math>\,\alpha</math>, we can easily find <math>\,\beta</math> and <math>\,\beta_0</math> (see the Support Vector algorithm below for complete details).<br />
<br />
If we examine the Lagrangian equation, we can see that <math>\,\alpha</math> is multiplied by itself; that is, the Lagrangian is quadratic with respect to <math>\,\alpha</math>. Our constraints are linear. This is therefore a problem that can be solved through [http://en.wikipedia.org/wiki/Quadratic_programming quadratic programming] techniques. We will examine how to do this in Matlab shortly.<br />
<br />
We can write the Lagrangian equation in matrix form:<br />
<br />
<math>L(\alpha) = \underline{\alpha}^T\underline{1} - \frac{1}{2}\underline{\alpha}^TS\underline{\alpha}</math><br />
:such that <math>\underline{\alpha} \geq \underline{0}</math><br />
:and <math>\underline{\alpha}^T\underline{y} = 0</math><br />
<br />
Where:<br />
* <math>\underline{\alpha}</math> denotes an <math>\,n \times 1</math> vector; <math>\underline{\alpha}^T = [\alpha_1, ..., \alpha_n]</math><br />
* Matrix <math>\,S</math> has entries <math>\,S_{ij} = y_iy_jx_i^Tx_j = (y_ix_i)^T(y_jx_j)</math><br />
* <math>\,\underline{0}</math> and <math>\,\underline{1}</math> are vectors containing all 0s or all 1s respectively<br />
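As a quick sanity check of this construction (shown in Python for illustration; the lecture code below is Matlab), the entrywise definition <math>\,S_{ij}=y_iy_jx_i^Tx_j</math> agrees with the outer-product form built from <math>\,z_i = y_ix_i</math>:<br />

```python
# Verify S_ij = y_i*y_j*x_i^T x_j equals the Gram matrix of z_i = y_i*x_i.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

X = [[1.0, 2.0], [-1.0, 0.5], [0.0, -2.0]]   # three points in R^2
y = [1, -1, -1]
n = len(X)

# Entry-by-entry definition
S = [[y[i] * y[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]

# Outer-product form: row i of Z is z_i = y_i * x_i, so S_ij = z_i^T z_j
Z = [[y[i] * v for v in X[i]] for i in range(n)]
S2 = [[dot(Z[i], Z[j]) for j in range(n)] for i in range(n)]
```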
<br />
Using this matrix notation, we can use Matlab's built-in quadratic programming routine, [http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/quadprog.html quadprog].<br />
<br />
===Quadprog example===<br />
<br />
Let's use quadprog to find the solution to <math>\,L(\alpha)</math>.<br />
<br />
Matlab's quadprog function minimizes an equation of the following form:<br />
:<math>\min_x\frac{1}{2}x^THx+f^Tx</math><br />
:such that: <math>\,A \cdot x \leq b</math>, <math>\,Aeq \cdot x = beq</math> and <math>\,lb \leq x \leq ub</math><br />
<br />
We can now see why we kept the <math>\frac{1}{2}</math> constant in the original derivation of the equation.<br />
<br />
The function is called as such: <code>x = quadprog(H,f,A,b,Aeq,beq,lb,ub)</code>. The variables correspond to values in the equation above.<br />
<br />
We can now find the solution to <math>\,L(\alpha)</math>. Since <code>quadprog</code> ''minimizes'' its objective, we pose the equivalent problem of minimizing <math>\,\frac{1}{2}\underline{\alpha}^TS\underline{\alpha} - \underline{1}^T\underline{\alpha}</math>; that is, we pass <math>\,H = S</math> and <math>\,f = -\underline{1}</math>.<br />
<br />
We'll use a simple one-dimensional data set, in which each point is essentially -1 or +1 plus Gaussian noise. (Note: you could put the values straight into the quadprog call; they are separated for clarity.)<br />
<br />
 x = [mvnrnd([-1],[0.01],100); mvnrnd([1],[0.01],100)]'; % 1 x 200<br />
 y = [-ones(100,1); ones(100,1)];                        % 200 x 1<br />
 z = y .* x';                % z_i = y_i*x_i, so S(i,j) = y_i*y_j*x_i*x_j<br />
 S = z * z';<br />
 f = -ones(200,1);           % quadprog minimizes (1/2)*a'*S*a + f'*a<br />
 Aeq = y';                   % equality constraint: sum_i alpha_i*y_i = 0<br />
 beq = 0;<br />
 lb = zeros(200,1);          % alpha_i >= 0<br />
 ub = [];                    % there is no upper bound<br />
 alpha = quadprog(S,f,[],[],Aeq,beq,lb,ub);<br />
<br />
This gives us the optimal <math>\,\underline{\alpha}</math>: most entries should be (numerically) zero, and the few points with <math>\,\alpha_i > 0</math> are the support vectors.<br />
<br />
===Examining K.K.T. conditions===<br />
<br />
[http://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions Karush-Kuhn-Tucker conditions] ([http://webrum.uni-mannheim.de/mokuhn/public/KarushKuhnTucker.pdf more info]) give us a closer look into the Lagrangian equation and the associated conditions.<br />
<br />
Suppose we are looking to minimize <math>\,f(x)</math> such that <math>\,g_i(x) \geq 0, \forall{i}</math>. If <math>\,f</math> and <math>\,g_i</math> are differentiable, then the ''necessary'' conditions for <math>\hat{x}</math> to be a local minimum are:<br />
<br />
# At the optimal point, <math>\frac{\partial L}{\partial x}\bigg|_{\hat{x}} = 0</math>; i.e. <math>f'(\hat{x}) - \sum_i{\alpha_ig_i'(\hat{x})}=0</math><br />
# <math>\alpha_i \geq 0</math><br />
# <math>\alpha_ig_i(\hat{x}) = 0, \forall{i}</math><br />
# <math>g_i(\hat{x}) \geq 0</math><br />
<br />
These all have names: condition 2 is called dual feasibility, condition 3 is called complementary slackness, and condition 4 is called primal feasibility. If any of these conditions is violated, then <math>\hat{x}</math> is not an optimum.<br />
<br />
These are all fairly mechanical except for condition 3, complementary slackness. Let's examine it further in our support vector machine problem.<br />
<br />
===Support Vectors===<br />
<br />
Basically, the support vectors are the training points that actually determine the optimal separating hyperplane we are looking for. They are also the most difficult points to classify, and the most informative for classification.<br />
<br />
In our case, the <math>g_i(\hat{x})</math> function is:<br />
:<math>\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1</math><br />
<br />
Substituting <math>\,g_i</math> into KKT condition 3, we get <math>\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0</math><br />
<br />
In the canonical (scaled) representation, each point is either exactly 1 or more than 1 away from the hyperplane.<br />
<br />
'''Case 1: a point > 1 away'''<br />
<br />
If <math>\,y_i(\beta^Tx_i+\beta_0) - 1 > 0</math> then <math>\,\alpha_i = 0</math>.<br />
<br />
In other words, if the point isn't on the margin, then the corresponding <math>\,\alpha</math> value is 0.<br />
<br />
'''Case 2: a point 1 away'''<br />
<br />
Conversely, for a point on the margin, <math>\,\alpha_i</math> can be either 0 or <math>\, > 0</math>; and if <math>\,\alpha_i > 0</math>, then that point must be on the margin.<br />
<br />
That is, if <math>\,\alpha_i > 0</math> then <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math>.<br />
<br />
Points on the margin -- points with corresponding <math>\,\alpha_i > 0</math> -- are called support vectors of that margin.<br />
<br />
===Using support vectors===<br />
<br />
Support vectors are important because the SVM solution depends only on them. If <math>\,\alpha_i = 0</math>, then the point <math>\,x_i</math> contributes nothing to <math>\,\beta = \sum_i{\alpha_iy_ix_i}</math>; only points on the margin -- the support vectors -- contribute to the solution of the SVM problem.<br />
<br />
====The support vector machine algorithm====<br />
<br />
# Solve the quadratic programming problem: <math>\max_{\alpha} L(\alpha) = \sum_{i=1}^n{\alpha_i} - \frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}</math> such that <math>\alpha_i \geq 0</math> and <math>\sum_{i=1}^n{\alpha_i y_i} = 0</math><br />
## Use Matlab's quadprog to find the optimal <math>\,\underline{\alpha}</math><br />
# Find <math>\beta = \sum_{i=1}^n{\alpha_iy_i\underline{x_i}}</math><br />
# Find <math>\,\beta_0</math> by choosing a support vector (a point with <math>\,\alpha_i > 0</math>) and solving <math>\,y_i(\beta^Tx_i+\beta_0) = 1</math><br />
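The three steps above can be sketched end to end. The lecture uses Matlab's quadprog; as a rough illustration, the Python sketch below substitutes a simple penalized projected-gradient ascent for the QP solver (an assumption for illustration, not the lecture's method), then recovers <math>\,\beta</math> and <math>\,\beta_0</math> exactly as in steps 2 and 3:<br />

```python
# Illustrative end-to-end sketch of the SVM algorithm above (hard margin).
# The QP in step 1 is solved by penalized projected-gradient ascent, a
# stand-in for a real QP solver such as Matlab's quadprog.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def svm_train(X, y, lr=0.01, mu=10.0, iters=20000):
    """Approximately maximize sum(alpha) - (1/2) alpha'S alpha
    subject to alpha_i >= 0 and sum_i alpha_i y_i = 0 (via penalty mu)."""
    n, d = len(X), len(X[0])
    S = [[y[i] * y[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(iters):
        ay = sum(alpha[i] * y[i] for i in range(n))   # equality-constraint violation
        for i in range(n):
            grad = 1.0 - sum(S[i][j] * alpha[j] for j in range(n)) - mu * ay * y[i]
            alpha[i] = max(0.0, alpha[i] + lr * grad)  # project onto alpha_i >= 0
    # Step 2: beta = sum_i alpha_i y_i x_i
    beta = [sum(alpha[i] * y[i] * X[i][k] for i in range(n)) for k in range(d)]
    # Step 3: beta_0 from a support vector s (largest alpha): y_s(beta'x_s + beta_0) = 1
    s = max(range(n), key=lambda i: alpha[i])
    beta0 = y[s] - dot(beta, X[s])   # y_s in {-1,+1}, so 1/y_s = y_s
    return alpha, beta, beta0

# Tiny separable 1-D training set
X = [[-2.0], [-1.5], [1.5], [2.0]]
y = [-1, -1, 1, 1]
alpha, beta, beta0 = svm_train(X, y)
pred = [1 if dot(beta, xi) + beta0 > 0 else -1 for xi in X]
```

On this toy data only the two inner points end up with <math>\,\alpha_i > 0</math>: they are the support vectors, and they alone determine the learned hyperplane.<br />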
<br />
===Example in Matlab===<br />
<br />
The following code, taken verbatim from the lecture, shows how to use Matlab built-in SVM routines (found in the Bioinformatics toolkit) to do classification through support vector machines.<br />
<br />
load 2_3;<br />
[U,Y] = princomp(X');<br />
data = Y(:,1:2);<br />
l = [-ones(1,200) ones(1,200)];<br />
[train,test] = crossvalind('holdOut',400);<br />
% Gives indices of train and test; so, train is a matrix of 0 or 1, 1 where the point should be used as part of the training set<br />
svmStruct = svmtrain(data(train,:), l(train), 'showPlot', true);<br />
<br />
[[File:Svm1.png|frame|center|The plot produced by training on some of the 2_3 data's first two features.]]<br />
<br />
yh = svmclassify(svmStruct, data(test,:), 'showPlot', true);<br />
<br />
[[File:Svm2.png|frame|center|The plot produced by testing some of the 2_3 data.]]</div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4500stat8412009-10-29T00:30:33Z<p>Ipargaru: /* Neural Networks (NN) - October 28, 2009 */</p>
<hr />
<div><br />
<br />
<br />
Given a new input <math>\,X \in \mathcal{X} </math>,<br />
the classification rule <math>\,h</math> predicts a corresponding label <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X{_\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> be the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a randomly drawn point, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
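As a small illustration (a Python sketch with a hypothetical classifier <math>\,h</math>), the empirical error rate is just the fraction of misclassified training points:<br />

```python
# Empirical (training) error rate: L_hat = (1/n) * sum_i I(h(x_i) != y_i).

def empirical_error(h, X, Y):
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(X)

# Hypothetical 1-D rule: predict 1 when x > 0, else 0
h = lambda x: 1 if x > 0 else 0
X = [-2.0, -0.5, 0.5, 3.0]
Y = [0, 1, 1, 1]
# h misclassifies only x = -0.5, so the empirical error rate is 1/4
```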
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to compute the posterior probability of each class for a given object from its prior probability via Bayes' formula, and then assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> that<br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use the ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the probability <math>\,r(X)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by the ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality Theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively, this theorem says that we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when <math>\,y</math> is the most probable label for <math>\,x</math>.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We're going to predict whether a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
*Whether the student's GPA > 3.0 (G)<br />
*Whether the student had a strong math background (M)<br />
*Whether the student is a hard worker (H)<br />
*Whether the student passed or failed the course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
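The computation in this example can be sketched as follows (a Python illustration; the class-conditional values are hypothetical, chosen only to reproduce the numerator 0.025 and denominator 0.125 above with priors of 0.5):<br />

```python
# Posterior r(x) = P(Y=1|X=x) via Bayes' formula, then classify by r > 1/2.

def posterior(lik1, lik0, prior1=0.5, prior0=0.5):
    num = lik1 * prior1
    return num / (num + lik0 * prior0)

# Hypothetical likelihoods: P(X=(0,1,0)|Y=1) = 0.05, P(X=(0,1,0)|Y=0) = 0.20
r = posterior(lik1=0.05, lik0=0.20)   # = 0.025 / 0.125 = 0.2
y_hat = 1 if r > 0.5 else 0           # 0.2 < 1/2, so predict "fail" (class 0)
```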
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because in practice <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math> in the Bayes equation are generally unknown, so the value of <math>\,r(X)</math> cannot be computed; this makes the Bayes rule inconvenient to apply directly.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the Naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first treats probability as something updated by observation, while the second treats probability as an objective quantity. They represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics, there have been two major schools: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data are a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single event. For example, a frequentist cannot predict the weather of tomorrow because tomorrow is only one unique event, and cannot be referred to a frequency in a lot of samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a <math>\,50\%</math> chance of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, one does not see the man directly, but instead judges from photos (labels) of him whether a given person is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier is the optimal method, it cannot be used in most practical situations, since the prior probabilities and class conditional densities are usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Empirical risk minimization: choose a set of classifiers <math>\,H</math> and find <math>\,h^* \in H</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the covariances <math>\Sigma_k \,\forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> by canceling the common normalizing constant, since both covariances equal <math>\Sigma</math> under the LDA assumption.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
This shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, the decision boundary is always linear; in <math>d</math> dimensions, the regions are separated by hyperplanes. <br />
<br />
In the special case where the priors are equal (<math>\,\pi_k=\pi_l</math>), e.g. when each class has the same number of samples, the boundary surface or line lies halfway between <math>\,\mu_k</math> and <math>\,\mu_l</math>.<br />
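This special case can be checked numerically. Below is a minimal NumPy sketch (not from the course code; <math>\,\mu_k, \mu_l, \Sigma</math> are made-up values) that builds the boundary coefficients from the last line of the derivation and verifies that, with equal priors, the midpoint of the means lies on the boundary.<br />

```python
import numpy as np

# Hypothetical class parameters (illustrative only)
mu_k = np.array([1.0, 2.0])
mu_l = np.array([4.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
pi_k = pi_l = 0.5

Sigma_inv = np.linalg.inv(Sigma)
# From the derivation: log(pi_k/pi_l) - (1/2)(mu_k' S^-1 mu_k - mu_l' S^-1 mu_l)
#                      + x' S^-1 (mu_k - mu_l) = 0, i.e. a'x + b = 0
a = Sigma_inv @ (mu_k - mu_l)
b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sigma_inv @ mu_k - mu_l @ Sigma_inv @ mu_l)

midpoint = (mu_k + mu_l) / 2
boundary_value = a @ midpoint + b  # 0 when the priors are equal
```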
<br />
===QDA===<br />
The concept is the same: find the boundary where the classification error rates between classes are equal, except that the assumption that each cluster has the same covariance <math>\,\Sigma</math> (equal to the mean of the <math>\Sigma_k \forall k</math>) is removed; each class keeps its own <math>\,\Sigma_k</math>.<br />
<br />
<br />
Starting from the point where QDA diverges from LDA:<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian for each <math>\,k</math>, the Bayes classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
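To make the theorem concrete, here is a hedged NumPy sketch (class parameters are invented for illustration) of the quadratic rule <math>\,h(x) = \arg\max_{k} \delta_k(x)</math>:<br />

```python
import numpy as np

def delta_quadratic(x, mu, Sigma, pi):
    # delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x-mu_k)' Sigma_k^{-1} (x-mu_k) + log(pi_k)
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(pi))

# Two hypothetical classes
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
pis    = [0.5, 0.5]

x = np.array([2.5, 2.8])
scores = [delta_quadratic(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
label = int(np.argmax(scores))  # index k attaining the largest delta_k(x)
```

When all classes share a covariance, the linear form of <math>\,\delta_k</math> gives the same argmax after dropping the terms common to every class.<br />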
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
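In code, the plug-in estimates look as follows; this is a NumPy sketch on synthetic data (everything here is illustrative, not the course's Matlab code):<br />

```python
import numpy as np

# Synthetic two-class data: 60 points around 0, 40 points around 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)),
               rng.normal(3.0, 1.0, size=(40, 2))])
y = np.array([0] * 60 + [1] * 40)

n = len(y)
pi_hat, mu_hat, Sigma_hat, n_k = {}, {}, {}, {}
for k in np.unique(y):
    Xk = X[y == k]
    n_k[k] = len(Xk)
    pi_hat[k] = n_k[k] / n            # pi_k estimate: n_k / n
    mu_hat[k] = Xk.mean(axis=0)       # mu_k estimate: class mean
    C = Xk - mu_hat[k]
    Sigma_hat[k] = C.T @ C / n_k[k]   # ML covariance (divide by n_k)

# Pooled (common) covariance, weighted by class sizes as in the formula above
Sigma_pooled = sum(n_k[k] * Sigma_hat[k] for k in n_k) / n
```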
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data in each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
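A tiny numerical illustration of Case 1 (all values made up): with <math>\, \Sigma_k = I </math>, <math>\,\delta_k</math> reduces to the negative half squared distance to <math>\,\mu_k</math> plus <math>\,log(\pi_k)</math>, so a point equidistant from two centers is assigned to the class with the larger prior.<br />

```python
import numpy as np

# Two hypothetical class centers and unequal priors
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
pis = [0.9, 0.1]
x = np.array([2.0, 2.0])  # equidistant from both centers

# delta_k(x) = -1/2 ||x - mu_k||^2 + log(pi_k) when Sigma_k = I
deltas = [-0.5 * np.sum((x - mu) ** 2) + np.log(pi) for mu, pi in zip(mus, pis)]
label = int(np.argmax(deltas))  # ties in distance are broken by the prior
```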
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we have <math>\, U=V</math>; here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
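The transformation can be checked numerically. The following NumPy sketch (with a hypothetical <math>\,\Sigma</math>) computes <math> \, x^* = S^{-\frac{1}{2}}U^\top x </math> for sampled data and verifies that the transformed data has (approximately) identity covariance, so Case 1 applies.<br />

```python
import numpy as np

# Hypothetical covariance to be "whitened"
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
S, U = np.linalg.eigh(Sigma)     # Sigma = U diag(S) U', since Sigma is symmetric
W = np.diag(S ** -0.5) @ U.T     # the transform S^{-1/2} U'

rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(2), Sigma, size=100000)
X_star = X @ W.T                 # apply x* = W x to every sample
cov_star = np.cov(X_star.T)      # should be close to the identity
```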
<br />
Note that when we have multiple classes, the same transformation must be applied to all of them; otherwise we would have to assume ahead of time which class a data point belongs to in order to choose a transformation. All classes therefore need to have the same shape for this method to be applicable, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose the two classes have different shapes and we are given a data point to classify. Which transformation can we use? If we apply the transformation of class A, we have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class with the remaining <math>\,K-1</math> classes, giving <math>\,K-1</math> pairwise differences. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
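These counts are easy to tabulate. A small Python sketch (formulas copied from above):<br />

```python
# Parameter counts for K classes in d dimensions, as derived above:
# LDA needs (K-1)(d+1); QDA needs (K-1)(d(d+3)/2 + 1).
def lda_params(K, d):
    return (K - 1) * (d + 1)

def qda_params(K, d):
    return (K - 1) * (d * (d + 3) // 2 + 1)

# e.g. for K = 2 classes in d = 64 dimensions (as in the raw 2_3 digit data)
lda_64 = lda_params(2, 64)
qda_64 = qda_params(2, 64)
```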
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function, but here we analyze its code to see how it differs from the plain SVD method. The following is the code of princomp with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the numbers of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, note the following differences from the plain SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
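The same princomp-versus-SVD check can be reproduced outside Matlab. The NumPy sketch below (synthetic data; all names are illustrative) computes the scores once via the SVD and once via the eigenvectors of the covariance, and confirms they agree up to sign, mirroring y = score, v = U above.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 64))     # rows = observations, as princomp expects

# "SVD method": center, decompose, project
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
score_svd = Xc @ Vt.T              # representation in the principal component space

# "princomp-style method": eigenvectors of the sample covariance
eigvals, V = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
order = np.argsort(eigvals)[::-1]  # eigh returns ascending order; flip it
V = V[:, order]
score_eig = Xc @ V

# Columns agree up to an arbitrary sign flip per component
same = all(np.allclose(score_svd[:, j], score_eig[:, j], atol=1e-6) or
           np.allclose(score_svd[:, j], -score_eig[:, j], atol=1e-6)
           for j in range(score_svd.shape[1]))
```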
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate a vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, with <math>\,v</math> diagonal, that we cannot estimate directly with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
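The heart of the trick is just that a linear function of the augmented features equals a quadratic function of the original ones. A minimal NumPy check (all vectors invented, with <math>\,v</math> taken diagonal):<br />

```python
import numpy as np

# Hypothetical coefficients of the quadratic function g(x)
w = np.array([1.0, -2.0, 0.5])   # linear part
v = np.array([0.3, 0.0, -1.0])   # diagonal quadratic part

x = np.array([2.0, 1.0, -1.0])
x_star = np.concatenate([x, x ** 2])   # augmented features [x; x^2]
w_star = np.concatenate([w, v])        # stacked coefficients [w; v]

linear_score = w_star @ x_star               # linear in the new space
quadratic_score = v @ (x ** 2) + w @ x       # quadratic in the old space
```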
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS) # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\,\underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So the variances of the projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite covariance matrices, and so it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
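This closed form is easy to check numerically. The following is a minimal sketch in Python/NumPy rather than the course's Matlab; the data is simulated and all variable names are illustrative. It computes <math>S_{w}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> and confirms that, up to sign and scale, it equals the leading eigenvector of <math>S_{w}^{-1}S_{B}</math>.<br />

```python
import numpy as np

# Simulated two-class data (illustrative values only).
rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([1, 1], [[1, 0.5], [0.5, 2]], size=200)
X2 = rng.multivariate_normal([5, 3], [[1, 0.5], [0.5, 2]], size=200)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1.T) + np.cov(X2.T)            # within class covariance S_W
Sb = np.outer(mu1 - mu2, mu1 - mu2)         # between class covariance S_B

# Closed form: w proportional to S_W^{-1} (mu_1 - mu_2).
w = np.linalg.solve(Sw, mu1 - mu2)
w /= np.linalg.norm(w)

# Cross-check against the leading eigenvector of S_W^{-1} S_B.
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
v = np.real(vecs[:, np.argmax(np.real(vals))])
v /= np.linalg.norm(v)
assert abs(abs(w @ v) - 1.0) < 1e-6         # same direction up to sign
```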
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of a "2" or a "3". Here X1 gets all the "2"s and X2 gets all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathworks.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \frac{1}{n_{i}}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant; since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is another, more general way to obtain <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
The first term in the last line is the within class covariance <math>\mathbf{S}_{W}</math>, so we define the<br />
second term to be the general between class covariance matrix <math>\mathbf{S}_{B}</math>; thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
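The identity <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically. A small Python/NumPy sketch on simulated classes is given below; note that it uses the unnormalized scatter matrices (no <math>\frac{1}{n_{i}}</math> factor), for which the decomposition holds exactly.<br />

```python
import numpy as np

# Three simulated classes in 2-D (sizes and means are arbitrary choices).
rng = np.random.default_rng(1)
Xs = [rng.normal(m, 1.0, size=(n, 2)) for m, n in [(0.0, 50), (3.0, 70), (6.0, 80)]]
X = np.vstack(Xs)
mu = X.mean(axis=0)

# Unnormalized within and between class scatter matrices.
Sw = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in Xs)
Sb = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in Xs)
St = (X - mu).T @ (X - mu)              # total scatter

assert np.allclose(St, Sw + Sb)         # S_T = S_W + S_B holds exactly
```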
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
To compare this with the general form, note that <math>n\mathbf{\mu}=n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}</math>, so that<br />
<math>\mathbf{\mu}_{1}-\mathbf{\mu}=\frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu}=-\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>. Substituting into the general form,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} & =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \left(\frac{n_{1}n_{2}^{2}}{n^{2}}+\frac{n_{2}n_{1}^{2}}{n^{2}}\right)(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}\,\mathbf{S}_{B^{\ast}}<br />
\end{align}<br />
</math><br />
So the general <math>\mathbf{S}_{B}</math> is simply a positive multiple of the two class <math>\mathbf{S}_{B^{\ast}}</math>, and both lead to the same optimal direction.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, for each data point <math>\mathbf{x}_{i}</math> we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
\qquad i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues of<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\mathbf{k-1}</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B}) \leq k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case: the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues of<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
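Putting the multi-class procedure together, here is a Python/NumPy sketch on simulated data (all names illustrative): it builds <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math>, takes the eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> with the largest <math>k-1</math> eigenvalues, and projects the data.<br />

```python
import numpy as np

# Simulated 3-class data in 4 dimensions (k = 3, d = 4).
rng = np.random.default_rng(2)
means = [np.array([0., 0, 0, 0]), np.array([3., 3, 3, 3]), np.array([6., 0, 0, 0])]
Xs = [rng.normal(m, 1.0, size=(100, 4)) for m in means]
X = np.vstack(Xs)
mu = X.mean(axis=0)

# Within and between class scatter matrices.
Sw = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in Xs)
Sb = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in Xs)

# Eigenvectors of S_W^{-1} S_B; keep the k-1 = 2 largest eigenvalues.
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(np.real(vals))[::-1]
W = np.real(vecs[:, order[:2]])         # d x (k-1) transformation matrix
Z = X @ W                               # projected (k-1)-dimensional data

# rank(S_W^{-1} S_B) <= k-1, so at most k-1 eigenvalues are nonzero.
assert (np.real(vals) > 1e-6).sum() <= 2
```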
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(x) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
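A quick numerical illustration (Python/NumPy, simulated data) of the least squares solution and the hat matrix: <math>\mathbf{H}</math> is symmetric and idempotent, and the residuals <math>\mathbf{y}-\mathbf{\hat y}</math> are orthogonal to the columns of <math>\mathbf{X}</math>.<br />

```python
import numpy as np

# Simulated design matrix with a leading column of 1s (illustrative values).
rng = np.random.default_rng(3)
n, d = 30, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X^T X)^{-1} X^T y
H = X @ np.linalg.solve(X.T @ X, X.T)           # hat matrix H
y_hat = H @ y                                   # fitted values

assert np.allclose(H, H.T)                      # symmetric
assert np.allclose(H @ H, H)                    # idempotent (a projection)
assert np.allclose(X.T @ (y - y_hat), 0)        # residuals orthogonal to X
```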
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The code is given below, with an explanation of each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by transposing the data and appending a row of ones.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\ \underline{x}_i^T (1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)\ exp(\underline{\beta}^T\underline{x}_i)\ \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i\ \underline{x}_i^T \frac{exp(\underline{\beta}^T\underline{x}_i)}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i\ \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> via the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
i.e. by differentiating <math>\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right]</math> with respect to <math>\underline{\beta}^T</math>.<br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
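The Newton-Raphson iteration above can be sketched in Python/NumPy (simulated data, illustrative names; here <math>X</math> is <math>{d}\times{n}</math> with data points as columns, matching the notation above):<br />

```python
import numpy as np

# Simulated logistic data; first row of X is all 1s (intercept).
rng = np.random.default_rng(4)
n = 200
X = np.vstack([np.ones(n), rng.normal(size=n)])     # d x n, columns are points
true_beta = np.array([-1.0, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-true_beta @ X)))

beta = np.zeros(2)                                  # beta = 0 as starting value
for _ in range(25):
    P = 1 / (1 + np.exp(-beta @ X))                 # vector of P(x_i; beta)
    grad = X @ (y - P)                              # dl/dbeta = X(Y - P)
    H = (X * (P * (1 - P))) @ X.T                   # X W X^T (negative Hessian)
    beta = beta + np.linalg.solve(H, grad)          # Newton-Raphson update

P = 1 / (1 + np.exp(-beta @ X))
assert np.linalg.norm(X @ (y - P)) < 1e-6           # gradient ~ 0 at the MLE
```
With non-separable data the iteration typically converges in a handful of steps; step-size halving can be added as a safeguard.<br />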
<br />
Recall that linear regression by least squares solves <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
which gives <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}\ (Z-X^T\underline{\beta})^T W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
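The closed-form WLS estimator above can be checked numerically. Below is a minimal NumPy sketch; the variable names and the row-per-observation layout are assumptions (the notes store observations as the columns of <math>X</math>):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))          # one observation per row (the notes
                                     # use observations as columns instead)
y = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)    # positive weights w_i > 0

# Closed form: [sum_i w_i x_i x_i^T]^{-1} [sum_i w_i x_i y_i]
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# The same estimator via ordinary least squares on sqrt(w)-rescaled data,
# which minimizes the identical weighted sum of squared errors S(beta).
sw = np.sqrt(w)
beta_check, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
```

Both routes give the same <math>\hat\beta^{WLS}</math>, confirming that weighting the normal equations and rescaling the data are equivalent.<br />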
<br />
Each Newton-Raphson step is therefore a weighted linear regression of the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we instead construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, the parameter vector will be <math>(d+1)\times{1}</math>.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case, though it does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. When it does not, only local convergence of the method can be proved: the iteration converges provided the initial point is close enough to the exact solution. In practice, however, it is rare for an initial value to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving can be used to deal with this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
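The pseudo code above can be turned into a short NumPy sketch (the function name is illustrative; as in the notes, <math>X</math> is <math>d\times{n}</math> with observations as columns):<br />

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=100):
    """Newton-Raphson / IRLS for logistic regression, following the
    pseudo code above. X is d x n (observations as columns), y in {0,1}."""
    d, n = X.shape
    beta = np.zeros(d)                          # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(beta @ X)))   # step 3: P(x_i; beta)
        w = p * (1.0 - p)                       # step 4: diagonal of W
        z = X.T @ beta + (y - p) / w            # step 5: adjusted response Z
        beta_new = np.linalg.solve((X * w) @ X.T, (X * w) @ z)  # step 6
        if np.max(np.abs(beta_new - beta)) < tol:
            break                               # step 7: converged
        beta = beta_new
    return beta_new
```

At the optimum the score <math>X(\underline{Y}-\underline{P})</math> is zero, which gives a simple check that the iteration has converged.<br />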
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far considered only the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to classify the data by logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general, however, the posteriors are no longer complements of each other, as they are in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K&gt;2 class problem isn't as 'nice' as in the 2-class problem, since we don't have the same simplification.<br />
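The K-class posteriors above can be computed directly. A minimal NumPy sketch (the function name is illustrative):<br />

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """Posteriors of the K-class logistic model above.

    betas: the K-1 coefficient vectors beta_1, ..., beta_{K-1}
    (class K is the reference class in the denominator).
    Returns the length-K vector [P(Y=1|x), ..., P(Y=K|x)]."""
    scores = np.array([b @ x for b in betas])   # beta_i^T x, i = 1..K-1
    e = np.exp(scores)
    denom = 1.0 + e.sum()                       # 1 + sum_k exp(beta_k^T x)
    return np.append(e / denom, 1.0 / denom)    # last entry: reference class K
```

With all <math>\beta_i=0</math> the posteriors are uniform, and in every case they sum to 1, as noted above.<br />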
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem is not convex and has no unique global minimum. The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable, the algorithm can be shown to converge to a separating hyperplane; the proof of this convergence is known as the ''perceptron convergence theorem''. For overlapping classes, however, convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can find the decision boundary even if we don't know how to draw the line ourselves; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1, but the right answer is +1; so the perceptron adjusts its line and we try the next example. Eventually the perceptron will get all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;  % learning rate<br />
for j=1:100<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);  % negative iff point i is misclassified<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);  % adjust the boundary toward the point<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;  % no misclassified points remain<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, terminating when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to the constant factor <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is proportional to the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach] which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
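Both <math>\phi</math> and these gradients are easy to evaluate numerically. A minimal sketch (names are illustrative, and the data points are stored as rows here):<br />

```python
import numpy as np

def perceptron_criterion(beta, beta0, X, y):
    """phi(beta, beta_0) = -sum over misclassified i of y_i(beta^T x_i + beta_0),
    plus its two gradients, following the derivation above.
    X: n x d data rows, y: labels in {-1, +1}."""
    margins = y * (X @ beta + beta0)              # positive iff classified correctly
    M = margins < 0                               # the misclassified set M
    phi = -np.sum(margins[M])                     # sum of positive numbers
    grad_beta = -(y[M, None] * X[M]).sum(axis=0)  # -sum_{i in M} y_i x_i
    grad_beta0 = -np.sum(y[M])                    # -sum_{i in M} y_i
    return phi, grad_beta, grad_beta0
```

If no points are misclassified, <math>M</math> is empty and <math>\phi=0</math> with zero gradients, which is exactly the stopping condition of the algorithm.<br />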
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if it is too large, the algorithm may “skip over” the minimum it is trying to find, possibly oscillating forever between two points on either side of it.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Consider yourself on a mountain peak, wanting to reach the land below as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the gradient. However, if the mountain has a saddle shape and you unfortunately start in the middle, you will eventually arrive at the saddle point or a local minimum and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over <math>i</math> (all data points). This is an alternative to the original gradient descent algorithm (sometimes called batch gradient descent), namely stochastic gradient descent, in which we approximate the true gradient by evaluating it on a single training example. This means that <math>{\beta}</math> is improved using only one sample at a time. For a large data set, say a population database, it is very time-consuming to sum over millions of samples; with stochastic gradient descent we can treat the problem sample by sample and still obtain decent results in practice.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected through signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal - the processing element's output signal. <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network, but it can take many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there can be '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1).<br />
<br />
===Activation Function===<br />
The activation function is a term frequently used in classification by NN. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]], so its derivative cannot be taken. Thus, we replace it by a smooth continuous function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>Function <math>\displaystyle \sigma </math> can have any form, but typically the logit form <math>\sigma(a)=\frac {1}{1+e^{-a}}</math> is used.<br />
<br />
By assigning weights to the connectors in the neural network (see diagram above) we weigh the input that comes into each perceptron to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called a '''Feed-Forward Neural Network'''.<br />
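The forward pass through one hidden layer can be sketched as follows (the shapes and names are assumptions: one row of <code>U</code> per hidden unit, logistic activation):<br />

```python
import numpy as np

def sigma(a):
    """Logistic activation sigma(a) = 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + np.exp(-a))

def feed_forward(x, U, w):
    """One hidden layer: a_i = u_i^T x, z_i = sigma(a_i), y_hat = sum_i w_i z_i.

    U: p x d hidden-weight matrix (one row u_i per hidden unit),
    w: length-p output weights."""
    a = U @ x          # weighted sums entering the hidden units
    z = sigma(a)       # hidden-unit outputs
    return w @ z, z    # model output and the hidden activations
```

Each layer's outputs feed the next layer's weighted sums, which is all "feed-forward" means here.<br />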
<br />
===Back-propagation===<br />
For a while the neural network model was just an idea, since there were no algorithms for training it until the '''back-propagation''' algorithm was popularized by Rumelhart, Hinton and Williams in 1986. <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied gradient descent algorithm for optimizing the weights. Back-propagation uses this idea of gradient descent to train neural network. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output units, turning it into a classification problem.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each perceptron we have a function <math>\displaystyle z_i=\sigma(a_i)</math> that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>. The <math>\displaystyle z_i</math>'s are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find the derivative of the model error with respect to the output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to the hidden weights <math>\displaystyle u_i</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
Notice that the weighted sum on the output of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math> and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Where <math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} = \sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
Having calculated the error that the output creates, we can propagate this error back to the previous layers.<br />
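The two derivatives above, <math>\partial err/\partial w_i</math> and <math>\partial err/\partial u_{jl}=\delta_j z_l</math>, can be checked against numerical differentiation. A sketch for a single hidden layer with one output unit (so each <math>\delta_j</math> has only the output as its downstream unit), under the assumption <math>err=(y-\hat y)^2</math> with logistic <math>\sigma</math>:<br />

```python
import numpy as np

def sigma(a):
    """Logistic activation."""
    return 1.0 / (1.0 + np.exp(-a))

def backprop_grads(x, y, U, w):
    """Gradients of err = (y - y_hat)^2 for one hidden layer.

    U: p x d hidden weights (row j is u_j), w: length-p output weights."""
    a = U @ x                            # inputs to the hidden units
    z = sigma(a)                         # hidden outputs z_j = sigma(a_j)
    y_hat = w @ z                        # model output
    d_err = -2.0 * (y - y_hat)           # d err / d y_hat
    grad_w = d_err * z                   # d err / d w_j = d_err * z_j
    delta = d_err * w * z * (1.0 - z)    # delta_j = sigma'(a_j) * w_j * d_err
    grad_U = np.outer(delta, x)          # d err / d u_{jl} = delta_j * x_l
    return grad_w, grad_U
```

Note that <math>\sigma'(a)=\sigma(a)(1-\sigma(a))=z(1-z)</math> for the logistic activation, which is why the error signal can be propagated back using only quantities already computed in the forward pass.<br />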
<br />
==Notes==<br />
<references/></div>
<hr />
<div><br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> is the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
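The empirical error rate is straightforward to compute for any classification rule <math>\,h</math>; a minimal sketch (the function and variable names are illustrative):<br />

```python
def empirical_error_rate(h, X, Y):
    """L_h = (1/n) * sum of I(h(X_i) != Y_i) over the training set,
    where h is any classification rule (a callable)."""
    n = len(X)
    return sum(1 for x_i, y_i in zip(X, Y) if h(x_i) != y_i) / n
```

For example, with the threshold rule <code>h(x) = 1 if x &gt; 0 else 0</code>, the indicator counts exactly the misclassified training points.<br />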
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via the Bayes formula, and then assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximal over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use the ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(X)=P(Y=1|X=x)</math>. By ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal with respect to the true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively, this theorem says that we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of its being of any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict whether a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
* whether the student’s GPA > 3.0 (G)<br />
* whether the student had a strong math background (M)<br />
* whether the student is a hard worker (H)<br />
* whether the student passed or failed the course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0; that is, we predict that he will fail the course.<br />
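The arithmetic of this example can be checked with a short Python sketch. The likelihood values below are assumptions consistent with the numbers above (the class-conditional table itself is in the image and not reproduced here):<br />

```python
# Hypothetical likelihoods chosen so that P(X|Y=1)P(Y=1) = 0.025
# and the evidence is 0.125, matching the worked example.
p_x_given_pass, p_x_given_fail = 0.05, 0.20
prior_pass = prior_fail = 0.5

evidence = p_x_given_pass * prior_pass + p_x_given_fail * prior_fail
r = p_x_given_pass * prior_pass / evidence   # posterior P(Y=1 | X=(0,1,0))
print(round(r, 6))              # 0.2
print(1 if r > 0.5 else 0)      # 0, i.e. predict "fail"
```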
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods. The reason is that in the Bayes formula discussed above it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>, which makes the Bayes rule impractical in most situations.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the Naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as an objective existence. In fact, they represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics there have been two major schools of thought: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, such as assigning a probability of <math>\,50\%</math> to rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, by contrast, one does not see the man (the object), but judges from photos (the labels) whether a given man is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> here denotes the sum over all <math>\,k</math> classes (a summation, not a covariance matrix).<br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier is optimal, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> minimizing some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known, so estimates of them must be made before we can classify new data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of <math>\Sigma_k</math> over all <math>k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
In fact, this shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes.<br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_k</math> and <math>\,\mu_l</math>.<br />
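As a sanity check on this derivation, here is a pure-Python sketch (with made-up means and covariance, two dimensions for brevity) evaluating the final boundary expression; with equal priors it vanishes exactly at the midpoint of the two means:<br />

```python
import math

def inv2(S):                         # inverse of a 2x2 matrix
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(u, M, v):                   # u' M v for 2-vectors
    return sum(u[i] * M[i][j] * v[j] for i in range(2) for j in range(2))

def lda_boundary_value(x, mu_k, mu_l, Sigma, pi_k, pi_l):
    """log(pi_k/pi_l) - 1/2 (mu_k' S^-1 mu_k - mu_l' S^-1 mu_l - 2 x' S^-1 (mu_k - mu_l))."""
    Sinv = inv2(Sigma)
    diff = [mu_k[0] - mu_l[0], mu_k[1] - mu_l[1]]
    return (math.log(pi_k / pi_l)
            - 0.5 * (quad(mu_k, Sinv, mu_k) - quad(mu_l, Sinv, mu_l)
                     - 2 * quad(x, Sinv, diff)))

mu_k, mu_l = [0.0, 0.0], [2.0, 2.0]
Sigma = [[1.0, 0.3], [0.3, 1.0]]
mid = [1.0, 1.0]                     # halfway between the means
print(lda_boundary_value(mid, mu_k, mu_l, Sigma, 0.5, 0.5))  # 0.0 (up to rounding)
```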
<br />
===QDA===<br />
QDA uses the same idea of finding a boundary where the error rates for classification between classes are equal, except that the assumption that each cluster has the same covariance <math>\,\Sigma</math> (equal to the mean of <math>\Sigma_k</math> over all <math>k</math>) is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
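This rule can be sketched in a few lines of Python (using NumPy for the matrix algebra; the means, covariances and priors below are made-up toy values, not from the lecture):<br />

```python
import numpy as np

# Quadratic discriminant delta_k from the theorem above.
def delta(x, mu, Sigma, pi):
    d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
    return -0.5 * np.log(np.linalg.det(Sigma)) - 0.5 * d2 + np.log(pi)

params = [  # hypothetical (mu_k, Sigma_k, pi_k) for two classes
    (np.array([0.0, 0.0]), np.eye(2), 0.5),
    (np.array([3.0, 3.0]), 2 * np.eye(2), 0.5),
]
x = np.array([0.5, 0.2])
scores = [delta(x, mu, S, pi) for mu, S, pi in params]
print(int(np.argmax(scores)))    # 0, i.e. x is classified to the first class
```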
<br />
===In practice===<br />
We need to estimate the unknown parameters, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
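These plug-in estimates can be sketched directly (a Python illustration with one-dimensional features for brevity; the data below is made up):<br />

```python
# Hypothetical 1-D sketch of the sample estimates pi_hat, mu_hat and the
# pooled (common) variance from the formulas above.
def plug_in_estimates(xs, ys):
    classes = sorted(set(ys))
    n = len(xs)
    pi_hat, mu_hat, var_hat, n_k = {}, {}, {}, {}
    for k in classes:
        pts = [x for x, y in zip(xs, ys) if y == k]
        n_k[k] = len(pts)
        pi_hat[k] = n_k[k] / n
        mu_hat[k] = sum(pts) / n_k[k]
        var_hat[k] = sum((x - mu_hat[k]) ** 2 for x in pts) / n_k[k]
    pooled = sum(n_k[k] * var_hat[k] for k in classes) / n   # ML common variance
    return pi_hat, mu_hat, pooled

xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 1, 1, 1]
pi_hat, mu_hat, pooled = plug_in_estimates(xs, ys)
print(pi_hat[0], mu_hat[0], mu_hat[1], round(pooled, 4))  # 0.5 2.0 8.0 0.6667
```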
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
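The transformation can be checked numerically. Here is a NumPy sketch (an arbitrary made-up covariance matrix, not course data) showing that after mapping through <math> \, S^{-\frac{1}{2}}U^\top </math>, the Mahalanobis distance under <math>\,\Sigma</math> becomes ordinary Euclidean distance:<br />

```python
import numpy as np

# Build the map x -> S^(-1/2) U^T x from the eigendecomposition Sigma = U S U^T.
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])   # hypothetical covariance
S, U = np.linalg.eigh(Sigma)                 # eigenvalues S, eigenvectors in U's columns
W = np.diag(S ** -0.5) @ U.T                 # x* = W x

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.5])
mahalanobis = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
euclidean = np.sum((W @ x - W @ mu) ** 2)    # plain squared distance in x* space
print(bool(np.isclose(mahalanobis, euclidean)))   # True
```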
<br />
Note that when we have multiple classes, they must all have the same transformation; otherwise we would have to assume ahead of time that a data point belongs to one class or the other. All classes therefore need to have the same shape for this method to be applicable, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose two classes have different shapes; to transform them to the same shape, we would have to choose a transformation before knowing which class a given data point belongs to. For example, if we use the transformation of class A, then we have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class with each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: Each difference <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters (the symmetric matrix <math>\,a</math> alone contributes <math>\frac{d(d+1)}{2}</math> free entries). Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
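The counts plotted above are easy to tabulate with a small Python helper (an aside, not from the lecture):<br />

```python
# Number of parameters per the formulas above, for K classes in d dimensions.
def lda_params(d, K):
    return (K - 1) * (d + 1)

def qda_params(d, K):
    return (K - 1) * (d * (d + 3) // 2 + 1)

print(lda_params(10, 2), qda_params(10, 2))   # 11 66
```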
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks each data point belongs to, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into the two classes.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform principal component analysis using the SVD method. In fact, Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function, but here we will analyze its code to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r<n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, <code>princomp</code> centers <math>\,X</math> by subtracting off the column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example of performing PCA using <code>princomp</code> and SVD, respectively, to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
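The same equivalence can be reproduced outside Matlab. Here is a hypothetical NumPy version on random data: taking the SVD of the centered data matrix, the projection onto the right singular vectors equals <math>\,U</math> scaled by the singular values.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))          # 50 observations, 3 variables (made up)
Xc = X - X.mean(axis=0)                   # center the columns, as princomp does

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
score_svd = Xc @ Vt.T                     # project onto the principal directions
score_direct = U * s                      # equivalently, U * diag(s)
print(bool(np.allclose(score_svd, score_direct)))   # True
```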
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus a separate <math>\,\Sigma_k</math> for each class; each extra covariance matrix adds <math>\,\frac{d(d+1)}{2}</math> estimated parameters, which makes QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix containing the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
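A minimal sketch of the augmentation step (plain Python; the weight vector below is a made-up example, not an estimated one):<br />

```python
# Append squared features so that a linear rule in the augmented space
# is quadratic in the original space.
def augment(x):
    return x + [v ** 2 for v in x]

x = [1.5, -2.0]
print(augment(x))    # [1.5, -2.0, 2.25, 4.0]

# A linear rule w*' x* in the augmented space; this hypothetical w* picks
# out x1^2 + x2^2, a purely quadratic function of the original x.
w_star = [0.0, 0.0, 1.0, 1.0]
g = sum(w * v for w, v in zip(w_star, augment(x)))
print(g)             # 6.25
```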
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
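The same experiment can be sketched outside Matlab. The following Python/NumPy snippet is a hedged stand-in (2_3.mat is a course data set, so synthetic concentric classes are used instead, and the two-class LDA is hand-rolled): LDA on the raw two-dimensional data fails, while LDA on the quadratically augmented data succeeds.<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 2_3.mat: class 0 is an inner blob, class 1 a
# surrounding ring, so no linear boundary in the raw 2-D space works well.
X0 = rng.normal(0.0, 0.5, size=(200, 2))
angles = rng.uniform(0.0, 2.0 * np.pi, 200)
X1 = np.c_[2.5 * np.cos(angles), 2.5 * np.sin(angles)] + rng.normal(0.0, 0.2, (200, 2))
X = np.vstack([X0, X1])
y = np.r_[np.zeros(200), np.ones(200)]

def lda_accuracy(X, y):
    """Fit a hand-rolled two-class LDA (pooled covariance) and return training accuracy."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    S = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)   # within-class (pooled) covariance
    w = np.linalg.solve(S, m1 - m0)                 # LDA direction
    c = w @ (m0 + m1) / 2.0                         # midpoint decision threshold
    pred = (X @ w > c).astype(float)
    return (pred == y).mean()

acc_linear = lda_accuracy(X, y)                          # raw 2-D features
acc_quadratic = lda_accuracy(np.hstack([X, X ** 2]), y)  # after appending squares
```

On this synthetic data the augmented classifier should separate the classes almost perfectly, while the raw one is near chance.<br />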
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA contrasts with that of our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to. (The <code>mvrnorm</code> and <code>lda</code> functions require the <code>MASS</code> package, loaded with <code>library(MASS)</code>.)<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line. (For a true PCA the columns of X should be centred first.)<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, and the two clouds may have different sizes. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the size of each class, which is represented by its covariance.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points into a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices, and it is invertible provided at least one of them is positive-definite.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So <math>\underline{w}</math> is proportional to <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />
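This closed form can be checked numerically. The sketch below (Python/NumPy rather than the course's Matlab; the synthetic Gaussian classes mirror the course example, and all variable names are illustrative) confirms that <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> points in the same direction as the top eigenvector of <math>S_{W}^{-1}S_{B}</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes sharing the covariance used in the course examples.
Sigma = np.array([[1.0, 1.5], [1.5, 3.0]])
L = np.linalg.cholesky(Sigma)
X1 = rng.standard_normal((300, 2)) @ L.T + np.array([1.0, 1.0])
X2 = rng.standard_normal((300, 2)) @ L.T + np.array([5.0, 3.0])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = np.cov(X1.T) + np.cov(X2.T)               # within-class covariance
S_B = np.outer(mu1 - mu2, mu1 - mu2)            # between-class covariance

# Closed form: w is proportional to S_W^{-1} (mu1 - mu2) ...
w_closed = np.linalg.solve(S_W, mu1 - mu2)

# ... and matches the top eigenvector of S_W^{-1} S_B up to scale.
evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
w_eig = evecs[:, np.argmax(evals.real)].real

cosine = abs(w_closed @ w_eig) / (np.linalg.norm(w_closed) * np.linalg.norm(w_eig))
```

Because <math>S_{B}</math> has rank one, <math>S_{W}^{-1}S_{B}</math> has a single nonzero eigenvalue, and its eigenvector is exactly the closed-form direction.<br />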
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes in separate colours.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (Plain sums, without a <math>\frac{1}{n_{i}}</math> factor, are used here so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> below holds exactly.)<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not as easy to obtain directly. One simplification is to note that the total covariance <math>\mathbf{S}_{T}</math> of the data is fixed; since <math>\mathbf{S}_{W}</math> is easy to compute, we can get <math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
(The cross terms vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})=0</math> for each class.) Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and a second term, we can denote that second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>; thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
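The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically. The following Python/NumPy sketch (synthetic data; all names are illustrative) builds all three scatter matrices with plain sums, matching the derivation above, and checks that the identity holds to machine precision:<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# Three classes of 2-D points (all scatter matrices below use plain sums).
Xs = [rng.normal(m, 1.0, size=(50, 2)) for m in ([0.0, 0.0], [4.0, 1.0], [2.0, 5.0])]
X = np.vstack(Xs)
mu = X.mean(axis=0)                                   # total mean

S_W = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in Xs)
S_B = sum(len(Xi) * np.outer(Xi.mean(axis=0) - mu, Xi.mean(axis=0) - mu) for Xi in Xs)
S_T = (X - mu).T @ (X - mu)                           # total scatter

max_abs_err = np.abs(S_T - (S_W + S_B)).max()         # should vanish up to rounding
```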
<br />
Recall that in the two-class problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
In fact, using <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>, one can check that <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B^{\ast}}</math>, so the two are proportional.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem, a Lagrange multiplier <math>\Lambda</math>, which is in fact a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that the <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, thus set the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
In fact, <math>\mathbf{\Lambda}</math> has <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B})=k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
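The whole multi-class recipe can be sketched in a few lines of Python/NumPy (synthetic Gaussian classes; all names and parameter choices are illustrative): build <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math>, then take the top <math>k-1</math> eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> as the columns of <math>\mathbf{W}</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(3)
k, d = 3, 4   # three classes in four dimensions -> project to k-1 = 2

# Synthetic Gaussian classes with well-separated means.
means = [np.zeros(d),
         np.r_[5.0, np.zeros(d - 1)],
         np.r_[0.0, 5.0, np.zeros(d - 2)]]
Xs = [rng.normal(m, 1.0, size=(100, d)) for m in means]
X = np.vstack(Xs)
mu = X.mean(axis=0)

S_W = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in Xs)
S_B = sum(len(Xi) * np.outer(Xi.mean(axis=0) - mu, Xi.mean(axis=0) - mu) for Xi in Xs)

# Columns of W are the eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(evals.real)[::-1]
W = evecs.real[:, order[:k - 1]]   # d x (k-1) projection matrix
Z = X @ W                          # projected data, n x (k-1)
top = evals.real[order]            # only k-1 eigenvalues are (numerically) nonzero
```

Consistent with the rank argument above, only <math>k-1=2</math> of the four eigenvalues are numerically nonzero.<br />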
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(x) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> column vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math> and <math>\,y_{1}, ..., y_{p}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the least squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
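These formulas can be checked with a short Python/NumPy sketch (synthetic data; the coefficient values are illustrative). The hat matrix is an orthogonal projection, so <math>\mathbf{H}^2 = \mathbf{H}</math>, and its trace equals the number of fitted parameters:<br />

```python
import numpy as np

rng = np.random.default_rng(4)

n = 50
# Design matrix with a leading column of ones (the intercept position).
X = np.c_[np.ones(n), rng.standard_normal((n, 2))]
beta_true = np.array([2.0, -1.0, 0.5])          # illustrative coefficients
y = X @ beta_true + 0.01 * rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X^T X)^{-1} X^T y
H = X @ np.linalg.solve(X.T @ X, X.T)           # hat matrix X (X^T X)^{-1} X^T
y_hat = H @ y                                   # fitted values

idempotent_err = np.abs(H @ H - H).max()        # H is a projection: H @ H = H
```

With low noise the least squares estimate recovers the generating coefficients closely.<br />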
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by transposing the data and appending a row of ones (for the intercept term), giving a <math>3 \times 400</math> matrix.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
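As a quick numerical check, here is a minimal Python/NumPy sketch of these two posteriors (the coefficient vector and input are made-up values for illustration):<br />

```python
import numpy as np

def posterior_y1(beta, x):
    """P(Y=1 | X=x) = exp(beta^T x) / (1 + exp(beta^T x))."""
    a = beta @ x
    return np.exp(a) / (1.0 + np.exp(a))

beta = np.array([0.5, -1.2, 2.0])   # hypothetical coefficients
x = np.array([1.0, 0.3, -0.7])      # hypothetical input

p1 = posterior_y1(beta, x)
p0 = 1.0 - p1
# P(Y=0 | X=x) simplifies to 1 / (1 + exp(beta^T x)), as derived above
assert np.isclose(p0, 1.0 / (1.0 + np.exp(beta @ x)))
```

Both posteriors lie strictly between 0 and 1 and sum to one by construction.<br />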
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, taking the log of both sides, we get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
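To illustrate the idea before applying it to the log-likelihood, here is a one-dimensional Newton-Raphson sketch in Python; the objective <math>l(\theta)=\theta-e^{\theta}</math> is a made-up concave function whose maximum is at <math>\theta=0</math>:<br />

```python
import math

def newton_raphson(grad, hess, theta0, tol=1e-10, max_iter=50):
    """Find a stationary point using first and second derivatives."""
    theta = theta0
    for _ in range(max_iter):
        step = grad(theta) / hess(theta)   # Newton step: l'(theta) / l''(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Maximize l(theta) = theta - exp(theta); l'(theta) = 1 - exp(theta) = 0 at theta = 0.
theta_hat = newton_raphson(grad=lambda t: 1 - math.exp(t),
                           hess=lambda t: -math.exp(t),
                           theta0=1.0)
assert abs(theta_hat) < 1e-8
```

The same update, with the gradient and Hessian replaced by their vector and matrix counterparts, is what the next section applies to <math>l(\underline\beta)</math>.<br />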
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criterion''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website with a Matrix Reference Manual where you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative is obtained if we first reduce the occurrences of <math>\underline{\beta}</math> to one using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
i.e., by differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
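These matrix expressions can be checked numerically against finite differences of the log-likelihood. The following Python/NumPy sketch uses random made-up data with the same <math>d\times{n}</math> column-per-observation convention as the notes:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))               # columns are the data points x_i
y = rng.integers(0, 2, size=n).astype(float)
beta = rng.normal(size=d)

p = 1.0 / (1.0 + np.exp(-(X.T @ beta)))   # vector P with entries P(x_i; beta)
W = np.diag(p * (1.0 - p))

grad = X @ (y - p)                        # dl/dbeta       = X (Y - P)
hess = -X @ W @ X.T                       # d2l/dbeta^2    = -X W X^T

def loglik(b):
    a = X.T @ b
    return np.sum(y * a - np.log1p(np.exp(a)))   # sum y_i b^T x_i - log(1 + exp(b^T x_i))

# central finite differences of the log-likelihood agree with X(Y - P)
eps = 1e-6
num_grad = np.array([(loglik(beta + eps * e) - loglik(beta - eps * e)) / (2 * eps)
                     for e in np.eye(d)])
assert np.allclose(grad, num_grad, atol=1e-4)
# the Hessian is negative semidefinite, consistent with a concave log-likelihood
assert np.all(np.linalg.eigvalsh(hess) <= 1e-10)
```
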
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
whose solution is <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
A weighted linear regression of the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
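The equivalence between the closed-form WLS estimator and ordinary least squares on rescaled data can be verified numerically. This Python/NumPy sketch uses made-up data with the same <math>d\times{n}</math> convention as the notes:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 40
X = rng.normal(size=(d, n))          # columns are observations, as in the notes
y = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)    # positive weights w_i

W = np.diag(w)
# closed-form WLS estimator: (X W X^T)^{-1} X W y
beta_wls = np.linalg.solve(X @ W @ X.T, X @ W @ y)

# same answer from ordinary least squares after scaling each point by sqrt(w_i)
sw = np.sqrt(w)
beta_ols, *_ = np.linalg.lstsq((X * sw).T, y * sw, rcond=None)
assert np.allclose(beta_wls, beta_ols)
```
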
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, the coefficient vector will be <math>(d+1)\times{1}</math>.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure usually converges, since the log-likelihood function is concave. When it does not, only local convergence of the method can be proven, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, it is rare for an initial value to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_i,i</math> to <math>P(\underline{x}_i;\underline{\beta}))[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
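A minimal Python/NumPy translation of this pseudo code (the synthetic data, tolerance, and iteration cap are illustrative; note that the classes must overlap, since perfectly separable data makes the maximum likelihood estimate diverge):<br />

```python
import numpy as np

def logistic_irls(X, y, tol=1e-8, max_iter=100):
    """Fit logistic regression by the steps above.
    X is d x n (columns are observations); y holds 0/1 labels."""
    d, n = X.shape
    beta = np.zeros(d)                                       # step 1
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X.T @ beta)))              # step 3
        w = p * (1.0 - p)                                    # step 4 (diagonal of W)
        z = X.T @ beta + (y - p) / w                         # step 5
        beta_new = np.linalg.solve((X * w) @ X.T, (X * w) @ z)  # step 6: (XWX^T)^{-1} XWZ
        if np.linalg.norm(beta_new - beta) < tol:            # step 7
            return beta_new
        beta = beta_new
    return beta

# overlapping synthetic classes: intercept row plus one feature
rng = np.random.default_rng(2)
n = 200
X = np.vstack([np.ones(n), rng.normal(size=n)])
true_p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * X[1])))
y = (rng.uniform(size=n) < true_p).astype(float)

beta_hat = logistic_irls(X, y)
p_hat = 1.0 / (1.0 + np.exp(-(X.T @ beta_hat)))
assert np.linalg.norm(X @ (y - p_hat)) < 1e-5   # score equations X(Y - P) = 0 hold
```
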
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. the dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. the dimension.<br />
#Since logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to apply logistic regression to classify the data. This function returns B, which is a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.<br />
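A small Python/NumPy sketch of these K-class posteriors (the coefficient vectors and the input are made-up values, with K = 3):<br />

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """Posteriors for K classes given the K-1 coefficient vectors beta_1..beta_{K-1};
    class K is the reference class in the denominator."""
    scores = np.array([b @ x for b in betas])           # beta_i^T x, i = 1..K-1
    denom = 1.0 + np.sum(np.exp(scores))
    # classes 1..K-1 first, then class K with posterior 1/denom
    return np.append(np.exp(scores) / denom, 1.0 / denom)

betas = [np.array([0.2, -0.5]), np.array([1.0, 0.3])]   # hypothetical, K = 3
x = np.array([0.4, -1.1])
p = multiclass_posteriors(betas, x)
assert np.isclose(p.sum(), 1.0) and np.all(p > 0)       # posteriors still sum to 1
```
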
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Particular to the iterative nature of the solution, the problem is not convex and has no global minimum. The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable then the algorithm is shown to converge in a finite number of steps; the proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
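The same training loop can be written in Python/NumPy; on the (linearly separable) truth table above it stops once every example is on the correct side of the boundary:<br />

```python
import numpy as np

# the truth table from above: rows of features, labels +1 / -1
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

b0, b = 0.0, np.ones(3)   # arbitrary initial boundary
rho = 0.5                 # learning rate

for epoch in range(100):
    changed = False
    for xi, yi in zip(X, y):
        if yi * (b @ xi + b0) <= 0:     # misclassified (or on the boundary)
            b += rho * yi * xi          # move the boundary toward correcting xi
            b0 += rho * yi
            changed = True
    if not changed:                     # a full pass with no updates: done
        break

# once the loop exits without updates, every point is classified correctly
assert np.all(y * (X @ b + b0) > 0)
```

Because this table is separable, the perceptron convergence theorem guarantees the loop terminates in a finite number of passes.<br />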
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of some weights of these inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\underline{\beta}</math> to have unit norm). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], which is a numerical method that takes one predetermined step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Consider yourself on a peak, wanting to get down to the plain as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and you unfortunately start in the middle, you will eventually arrive at the saddle point (where the gradient is also zero) and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm we drop the summation over <math>i</math> (all data points). This is a variant of the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, in which we approximate the true gradient by evaluating it on a single training example. This means that <math>{\beta}</math> is improved using the computation of only one sample. When there is a large data set, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent, we can treat the problem sample by sample and still get decent results in practice.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal - the processing element output signal <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref>. It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network, but it can have many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there are '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded as 0 or 1.<br />
<br />
===Activation Function===<br />
The activation function is a term that appears frequently in classification by neural networks. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of the input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] so its derivative cannot be taken. Thus, we replace it by a smooth continuous function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The function <math>\displaystyle \sigma </math> can have any form, but typically the logistic (sigmoid) form <math>\sigma(a)=\frac {1}{1+e^{-a}}</math> is used.<br />
<br />
By assigning weights to the connectors in the neural network (see diagram above), we weigh the input that comes in to each perceptron to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called a '''Feed-Forward Neural Network'''.<br />
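The forward pass described above can be sketched as follows, assuming the logistic activation <math>\sigma(a)=\frac {1}{1+e^{-a}}</math> and randomly chosen weights (a hypothetical sketch in NumPy, not a trained network):<br />

```python
import numpy as np

def sigmoid(a):
    """Logistic activation: a smooth replacement for the sign function."""
    return 1.0 / (1.0 + np.exp(-a))

def feed_forward(x, weights):
    """Propagate an input through successive weighted layers.

    `weights` is a list of weight matrices, one per layer; each hidden
    layer computes z = sigma(W z_prev), and the signal flows only forward.
    """
    z = x
    for W in weights[:-1]:
        z = sigmoid(W @ z)
    return weights[-1] @ z  # linear output unit, as in the regression case

# One hidden layer of 4 units, one output unit; random weights for illustration.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
y_hat = feed_forward(np.array([0.5, -1.0, 2.0]), weights)
```
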
<br />
===Back-propagation===<br />
For quite a while the neural network model was just an idea; no practical algorithm for training the model existed until 1986, when Geoffrey Hinton and colleagues <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> popularized an algorithm called '''back-propagation'''. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied gradient descent to optimize the weights. Back-propagation uses this same idea of gradient descent to train a neural network. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to multiple output units and thus turned into a classification problem.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each perceptron we have a function that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>, which we called the activation function: <math>\displaystyle z_i=\sigma(a_i)</math>. The <math>\displaystyle z_i</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true response and the network output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to the hidden weights <math>\displaystyle u_i</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
Notice that the weighted sums of the outputs of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Where <math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} = \sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
<br />
Having calculated the error that the output creates, we can propagate this error back to the previous layers.<br />
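Putting the chain-rule steps together, here is a minimal NumPy sketch of back-propagation for a network with one hidden layer and a single linear output unit, with the analytic gradients checked against finite differences; the weights and input are random values assumed for illustration:<br />

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, U, w):
    """One hidden layer: a = U x, z = sigma(a), yhat = sum_i w_i z_i."""
    a = U @ x
    z = sigmoid(a)
    return a, z, w @ z

def gradients(x, y, U, w):
    """Back-propagate err = (y - yhat)^2 through the network."""
    a, z, yhat = forward(x, U, w)
    d_yhat = -2.0 * (y - yhat)       # d err / d yhat
    grad_w = d_yhat * z              # d err / d w_i = (d err / d yhat) * z_i
    # delta_j = sigma'(a_j) * sum over output units of delta_i * u_ij;
    # here there is a single linear output unit whose weights are w.
    delta = d_yhat * w * sigmoid(a) * (1.0 - sigmoid(a))
    grad_U = np.outer(delta, x)      # d err / d u_jl = delta_j * z_l (z_l = x_l here)
    return grad_w, grad_U

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
U, w = rng.normal(size=(4, 3)), rng.normal(size=4)
grad_w, grad_U = gradients(x, y, U, w)

# Sanity-check one weight of each kind against a finite-difference estimate.
eps = 1e-6
def err(U_, w_):
    return (y - forward(x, U_, w_)[2]) ** 2
wp, wm = w.copy(), w.copy()
wp[0] += eps
wm[0] -= eps
num_w0 = (err(U, wp) - err(U, wm)) / (2 * eps)
Up, Um = U.copy(), U.copy()
Up[0, 0] += eps
Um[0, 0] -= eps
num_U00 = (err(Up, w) - err(Um, w)) / (2 * eps)
```
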
<br />
==Notes==<br />
<references/></div>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data-mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include recognition of handwritten postal codes, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification, we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a '''classification rule''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented with its features <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the proportion of points in the training set that <math>\,h</math> does not correctly classify, i.e.,<br />
::<math>\, \hat{L}(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is the indicator function <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
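The empirical error rate is straightforward to compute: average the indicator over the training set. The classification rule and toy data in this NumPy sketch are hypothetical, assumed purely for illustration.<br />

```python
import numpy as np

def empirical_error_rate(h, X, y):
    """Fraction of training points that the classifier h mislabels."""
    predictions = np.array([h(x) for x in X])
    # Mean of the indicator I(h(X_i) != Y_i) over the n training points.
    return np.mean(predictions != y)

# Hypothetical rule: classify as 1 when the first feature is positive.
h = lambda x: 1 if x[0] > 0 else 0
X = np.array([[0.5, 1.0], [-0.3, 2.0], [1.2, -1.0], [-0.7, 0.1]])
y = np.array([1, 0, 0, 1])  # the rule gets the last two points wrong
rate = empirical_error_rate(h, X, y)  # 2 misclassified out of 4
```
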
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then to assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>. Define <math>\,r(X)=P\{Y=1|X=x\}</math>. Given <math>\,X=x</math>, by ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal with respect to the true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively, this theorem says that we cannot do better than classifying <math>\,x\in \mathcal{X}</math> into class <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of its being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0; that is, we predict that the student will fail the course.<br />
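The calculation above can be sketched in code. The likelihood values <math>\,P(X=(0,1,0)|Y=1)=0.05</math> and <math>\,P(X=(0,1,0)|Y=0)=0.2</math> are assumptions made here so that, with the priors of 0.5, they reproduce the ratio <math>\,0.025/0.125</math> from the example:<br />

```python
def bayes_two_class(likelihood1, likelihood0, prior1=0.5, prior0=0.5):
    """Posterior r(x) = P(Y=1 | X=x) via Bayes' formula, plus the Bayes rule.

    Returns the posterior r and the predicted label (1 if r > 1/2, else 0).
    """
    r = likelihood1 * prior1 / (likelihood1 * prior1 + likelihood0 * prior0)
    return r, (1 if r > 0.5 else 0)

# Assumed likelihoods consistent with the student example:
# numerator 0.05 * 0.5 = 0.025, denominator 0.025 + 0.2 * 0.5 = 0.125.
r, label = bayes_two_class(0.05, 0.20)  # r = 0.2, label = 0 (predict "fail")
```
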
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because in practice the quantities in Bayes' formula, such as <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, are generally unknown, so we cannot compute <math>\,r(X)</math>; this makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as an objective existence. In fact, they represent two different schools of statistics.<br />
<br />
Throughout the history of statistics there have been two major schools: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between Bayesians and frequentists.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a single unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables that have a given distribution, and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, such as giving a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist method, on the other hand, one does not see the man (the object), but judges from photos (the label) of this man whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier is the optimal method, it cannot be used in most practical situations, since the prior probability and class conditional density are usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Empirical risk minimization: choose a set of classifiers <math>\,H</math> and find <math>\,h^* \in H</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known, so estimates of them must be made if we want to classify new data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the average of the <math>\Sigma_k</math> over all <math>k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, the decision boundary is always linear. In <math>d</math> dimensions, we separate the regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_k</math> and <math>\,\mu_l</math>.<br />
<br />
===QDA===<br />
The concept is the same: find the boundary where the error rates for classification between the classes are equal, except that the assumption that each cluster has the same covariance matrix <math>\,\Sigma</math> (equal to the average of the <math>\Sigma_k</math>) is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the set of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
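A sketch of the theorem as code: compute <math>\,\delta_k(x)</math> for each class and take the argmax. The class means, covariances, and priors in this NumPy sketch are toy values assumed for illustration; with equal covariances the quadratic rule reduces to the linear (LDA) case.<br />

```python
import numpy as np

def qda_delta(x, mu, Sigma, prior):
    """Quadratic discriminant score delta_k(x) for one class:
    -1/2 log|Sigma| - 1/2 (x-mu)^T Sigma^{-1} (x-mu) + log(pi_k)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * logdet
            - 0.5 * diff @ np.linalg.solve(Sigma, diff)
            + np.log(prior))

def classify(x, mus, Sigmas, priors):
    """Assign x to the class whose discriminant score is largest."""
    scores = [qda_delta(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmax(scores))

# Two toy Gaussian classes with identical (identity) covariances.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]
label = classify(np.array([2.5, 2.8]), mus, Sigmas, priors)  # nearer the second mean
```

With identity covariances the score is just the negative squared Euclidean distance to each mean plus the log prior, exactly Case 1 discussed below.<br />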
<br />
===In practice===<br />
In practice we do not know the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math>, so we use their sample estimates in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
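These estimates can be sketched as follows (a NumPy sketch with toy data assumed for illustration; note the ML covariance divides by <math>\,n_k</math> rather than <math>\,n_k-1</math>):<br />

```python
import numpy as np

def lda_estimates(X, y):
    """Sample estimates of the priors, class means, and pooled covariance."""
    classes = np.unique(y)
    n = len(y)
    priors, mus, Sigmas, counts = [], [], [], []
    for k in classes:
        Xk = X[y == k]
        nk = len(Xk)
        mu = Xk.mean(axis=0)
        # ML covariance estimate for class k (divide by n_k, not n_k - 1).
        Sigma = (Xk - mu).T @ (Xk - mu) / nk
        priors.append(nk / n)
        mus.append(mu)
        Sigmas.append(Sigma)
        counts.append(nk)
    # Pooled covariance: weighted average sum_r n_r Sigma_r / sum_l n_l.
    pooled = sum(nk * S for nk, S in zip(counts, Sigmas)) / n
    return np.array(priors), mus, pooled

# Toy data: two classes of two points each.
X = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [7.0, 5.0]])
y = np.array([0, 0, 1, 1])
priors, mus, pooled = lda_estimates(X, y)
```
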
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
<br />
Note that when we have multiple classes, they must all undergo the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose you have two classes with different shapes and you transform them to the same shape in order to classify a given data point. Which transformation should you use? If you use the transformation of class A, you have already assumed that the data point belongs to class A.<br />
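The transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be sketched as follows, using the eigendecomposition of a single shared covariance matrix; the covariance values and sample size in this NumPy sketch are assumptions for illustration.<br />

```python
import numpy as np

def whiten(X, Sigma):
    """Map each row x of X to x* = S^{-1/2} U^T x, where Sigma = U S U^T.

    After the transform, the class covariance becomes the identity, so
    classification reduces to comparing Euclidean distances (Case 1).
    """
    S, U = np.linalg.eigh(Sigma)       # Sigma is symmetric, so U is orthonormal
    W = np.diag(S ** -0.5) @ U.T
    return X @ W.T

# A toy shared covariance and samples drawn from it.
rng = np.random.default_rng(1)
Sigma = np.array([[4.0, 1.5], [1.5, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=20000)
Xs = whiten(X, Sigma)
cov_after = np.cov(Xs.T)  # approximately the identity matrix
```
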
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare a given class against the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
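The two counts can be compared directly; the class count and dimensions below are arbitrary illustrative values:<br />

```python
def lda_params(K, d):
    """Parameters to estimate for LDA: (K-1) linear boundaries a^T x + b."""
    return (K - 1) * (d + 1)

def qda_params(K, d):
    """Parameters for QDA: (K-1) quadratic boundaries x^T a x + b^T x + c,
    each needing d(d+3)/2 + 1 parameters."""
    return (K - 1) * (d * (d + 3) // 2 + 1)

# Growth with dimension: QDA's count is quadratic in d, LDA's is linear.
counts = [(d, lda_params(3, d), qda_params(3, d)) for d in (2, 10, 100)]
```
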
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of handwritten twos and the last 200 elements are images of handwritten threes. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks each data point belongs to, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives full details; here we analyze the code of <code>princomp()</code> itself to see how it differs from the plain SVD method. The following is the code of princomp, with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in<br />
% SCORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following differences from the plain SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations, columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using SVD and <code>princomp</code> respectively, producing the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that <code>y=score</code> and <code>v=U</code> (up to the sign of each column).<br />
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> parameters of a second symmetric covariance matrix make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
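The same trick can be sketched outside Matlab. Below is a NumPy illustration of ours (not course code) on synthetic data where the classes are a disk and a surrounding ring, so no straight line separates them; appending squared features lets a hand-rolled two-class LDA rule find a quadratic boundary.<br />

```python
import numpy as np

rng = np.random.default_rng(1)

# class 0: a disk around the origin; class 1: a ring around it (not linearly separable)
n = 200
r0, t0 = rng.uniform(0, 1, n), rng.uniform(0, 2*np.pi, n)
r1, t1 = rng.uniform(2, 3, n), rng.uniform(0, 2*np.pi, n)
X = np.vstack([np.c_[r0*np.cos(t0), r0*np.sin(t0)],
               np.c_[r1*np.cos(t1), r1*np.sin(t1)]])
y = np.r_[np.zeros(n, int), np.ones(n, int)]

def lda_accuracy(X, y):
    """Fit a two-class LDA rule and return its training accuracy."""
    m0, m1 = X[y == 0].mean(0), X[y == 1].mean(0)
    Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)  # within-class covariance
    w = np.linalg.solve(Sw, m1 - m0)                # w proportional to Sw^{-1}(mu1 - mu0)
    c = w @ (m0 + m1) / 2                           # threshold at the projected midpoint
    return np.mean((X @ w > c) == y)

acc_linear = lda_accuracy(X, y)           # near chance: the class means nearly coincide
X_star = np.hstack([X, X**2])             # the trick: append x^2 and y^2 as new features
acc_quad = lda_accuracy(X_star, y)        # the augmented rule separates the classes

assert acc_quad > 0.9 and acc_quad > acc_linear
```

The design mirrors the 2_3 example above: the only change between the two runs is the feature augmentation, so the improvement is entirely due to the trick.<br />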
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS) # for mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the pre-knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, with each class having a possibly different size. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, and so it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
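This closed form can be checked numerically. The NumPy sketch below (our illustration, not course code) compares the top eigenvector of <math>S_{W}^{-1}S_{B}</math> with <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> on two Gaussian clouds like those used in the examples.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
cov = [[1, 1.5], [1.5, 3]]
X1 = rng.multivariate_normal([1, 1], cov, 300)
X2 = rng.multivariate_normal([5, 3], cov, 300)

mu1, mu2 = X1.mean(0), X2.mean(0)
Sb = np.outer(mu1 - mu2, mu1 - mu2)   # between-class covariance
Sw = np.cov(X1.T) + np.cov(X2.T)      # within-class covariance

# w as the top eigenvector of Sw^{-1} Sb
vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = np.real(vecs[:, np.argmax(np.real(vals))])

# w from the closed form Sw^{-1}(mu1 - mu2)
w_closed = np.linalg.solve(Sw, mu1 - mu2)

# the two directions agree up to sign and scale
cosine = abs(w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
assert np.isclose(cosine, 1.0)
```

Since <math>S_{B}</math> has rank one for two classes, <math>S_{W}^{-1}S_{B}</math> has a single non-zero eigenvalue, and its eigenvector is exactly the closed-form direction.<br />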
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> is the (unnormalized) scatter matrix of class <math>i</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. The <math>\frac{1}{n_{i}}</math> normalization is dropped here so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain directly. One simplification is to note that the total covariance <math>\mathbf{S}_{T}</math> of the data is fixed; since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, this relationship yields a general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
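The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically with unnormalized scatter matrices, as in the derivation above. Here is a NumPy sketch of ours (arbitrary synthetic classes, not course data):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
# three classes of unequal sizes in d dimensions
Xs = [rng.normal(loc=i, size=(50 + 10*i, d)) for i in range(3)]
X = np.vstack(Xs)
mu = X.mean(0)

# within-class, between-class, and total scatter matrices (unnormalized sums)
Sw = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in Xs)
Sb = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in Xs)
St = (X - mu).T @ (X - mu)

assert np.allclose(St, Sw + Sb)
```

The identity holds exactly for any data set, since it is just the algebraic expansion carried out above.<br />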
<br />
Recall that in the two-class problem, we had<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
while the general form gives<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
These two definitions agree up to a scalar factor. Since <math>n\mathbf{\mu} = n_{1}\mathbf{\mu}_{1} + n_{2}\mathbf{\mu}_{2}</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = \frac{n_{1}}{n}(\mathbf{\mu}_{2}-\mathbf{\mu}_{1})</math>; substituting into the general form yields <math>\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B^{\ast}}</math>, so both lead to the same discriminant directions.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i}, \qquad<br />
i=1,2,\ldots,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the <math>k-1</math> largest<br />
eigenvalues in<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
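The multi-class case can be sketched in NumPy as well (our illustration, not the course code): three 5-dimensional Gaussian classes are projected to <math>k-1 = 2</math> dimensions using the top eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math>, and a nearest-projected-mean rule classifies them almost perfectly.<br />

```python
import numpy as np

rng = np.random.default_rng(4)
k, d, n = 3, 5, 100
means = np.array([[0]*d, [6]+[0]*(d-1), [0, 6]+[0]*(d-2)], float)
Xs = [rng.normal(m, 1.0, size=(n, d)) for m in means]
X = np.vstack(Xs)
y = np.repeat(np.arange(k), n)
mu = X.mean(0)

Sw = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in Xs)  # within-class scatter
Sb = sum(n * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in Xs)

# columns of W = eigenvectors of Sw^{-1} Sb with the k-1 largest eigenvalues
vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(np.real(vals))[::-1]
W = np.real(vecs[:, order[:k-1]])   # d x (k-1) projection matrix

Z = X @ W                           # data projected to k-1 = 2 dimensions
centers = np.array([Z[y == i].mean(0) for i in range(k)])
pred = np.argmin(((Z[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
assert np.mean(pred == y) > 0.95    # classes remain well separated after projection
```

Since <math>\mathbf{S}_{B}</math> has rank at most <math>k-1</math>, only <math>k-1</math> eigenvalues are non-zero, which is why the projection is to <math>k-1</math> dimensions.<br />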
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math> for the within-class scatter. Thus we have the following criterion function:<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
As in the two-class problem, this is equivalent to:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem, a Lagrange multiplier <math>\Lambda</math>, which is in fact a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
In fact, <math>\mathbf{\Lambda}</math> has at most <math>\mathbf{k-1}</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\leq k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case: the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues of<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
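As an illustration (not part of the original notes), the eigenvector computation above can be sketched in Python with NumPy; the scatter matrices, class means, and counts are built from a small synthetic data set:<br />

```python
import numpy as np

def fda_projection(X, y):
    """Columns of W are the eigenvectors of S_W^{-1} S_B with the
    largest eigenvalues. X is n x d; y holds the class labels."""
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)      # within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += Xc.shape[0] * (diff @ diff.T)    # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(evals.real)[::-1]        # sort eigenvalues descending
    k = len(classes)
    return evecs[:, order[:k - 1]].real         # keep the top k-1 eigenvectors

# three well-separated classes in 2-D, so W is 2 x 2 here
rng = np.random.default_rng(0)
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
X = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in means])
y = np.repeat([0, 1, 2], 20)
W = fda_projection(X, y)
print(W.shape)
```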
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given data points <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> with labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Setting the first derivative to zero,<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
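The closed-form solution and the hat matrix can be checked numerically. This Python/NumPy sketch (an illustration with synthetic data, not part of the lecture) fits a linear model and verifies that <math>\mathbf{H}</math> maps <math>\mathbf{y}</math> to the fitted values:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # 1 in first column
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true + 0.01 * rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T y, solved without forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# the hat matrix H maps the observed y to the fitted values y_hat
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
print(np.round(beta_hat, 2))
```

Note that <math>\mathbf{H}</math> is idempotent (<math>\mathbf{H}^2 = \mathbf{H}</math>), a standard property of projection matrices.<br />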
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample'; ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
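To see this inversion concretely, the following Python sketch (illustrative, with made-up coefficients) computes the log-odds for a given input, recovers the two posteriors, and checks that they are valid probabilities:<br />

```python
import numpy as np

# hypothetical coefficients and input (intercept folded into x)
beta = np.array([0.2, -1.5, 0.7])
x = np.array([1.0, 0.5, 2.0])

log_odds = beta @ x                              # beta^T x
p1 = np.exp(log_odds) / (1 + np.exp(log_odds))   # P(Y=1 | X=x)
p0 = 1 / (1 + np.exp(log_odds))                  # P(Y=0 | X=x)

# the posteriors are valid probabilities, sum to one,
# and reproduce the log-odds we started from
print(p1, p0)
```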
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood: the maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
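The gradient formula can be verified numerically. This sketch (illustrative, with random data) compares the analytic derivative <math>\sum_i (y_i - p_i)\underline{x}_i</math> against central finite differences of the log-likelihood:<br />

```python
import numpy as np

def log_lik(beta, X, y):
    # l(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

def grad(beta, X, y):
    # dl/dbeta = sum_i (y_i - p_i) x_i with p_i = logistic(beta^T x_i)
    p = 1 / (1 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = (rng.random(30) < 0.5).astype(float)
beta = rng.normal(size=3)

g = grad(beta, X, y)
eps = 1e-6
g_num = np.array([(log_lik(beta + eps * e, X, y) -
                   log_lik(beta - eps * e, X, y)) / (2 * eps)
                  for e in np.eye(3)])
print(np.max(np.abs(g - g_num)))
```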
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website with a Matrix Reference Manual covering linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> to one via the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that least-squares linear regression finds the minimum <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
whose solution is <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^{T}W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Each Newton-Raphson step is therefore a weighted linear regression on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
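The WLS estimator above can be checked numerically. A Python sketch (an illustration with synthetic data and weights, using the lecture's <math>d \times n</math> convention for <math>\mathbf{X}</math>):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
X = rng.normal(size=(2, n))                  # d x n, as in the lecture
beta_true = np.array([1.0, -2.0])
y = X.T @ beta_true + 0.05 * rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)            # positive weights w_i > 0
W = np.diag(w)

# beta_WLS = (X W X^T)^{-1} (X W y)
beta_wls = np.linalg.solve(X @ W @ X.T, X @ W @ y)
print(np.round(beta_wls, 2))
```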
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> seems to be a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. In the case that it does not, we can only prove local convergence of the method, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: an initial point is seldom so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will resolve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting the <math>\,i</math>th diagonal entry to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
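The pseudo code can be implemented directly. This Python/NumPy sketch (illustrative, not the course's Matlab, using the lecture's <math>d \times n</math> convention for <math>X</math>) fits a two-class logistic regression by iteratively reweighted least squares on synthetic data:<br />

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=50):
    """X is d x n (as in the lecture), y is the n-vector of 0/1 labels."""
    d, n = X.shape
    beta = np.zeros(d)                               # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1 / (1 + np.exp(-(X.T @ beta)))          # step 3: vector P
        w = p * (1 - p)                              # step 4: diagonal of W
        W = np.diag(w)
        z = X.T @ beta + (y - p) / w                 # step 5: Z
        beta_new = np.linalg.solve(X @ W @ X.T, X @ W @ z)   # step 6
        if np.max(np.abs(beta_new - beta)) < tol:    # step 7: stop if converged
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(4)
n = 200
X = np.vstack([np.ones(n), rng.normal(size=n)])      # intercept row + feature
beta_true = np.array([-0.5, 1.5])
p_true = 1 / (1 + np.exp(-(X.T @ beta_true)))
y = (rng.random(n) < p_true).astype(float)
beta_hat = irls_logistic(X, y)
print(np.round(beta_hat, 2))
```

With 200 observations the estimate lands near the generating coefficients, up to sampling noise.<br />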
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to perform logistic regression on the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2\geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem, since we don't have the same simplification.<br />
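The K-class posteriors can be computed directly from the K−1 coefficient vectors. A Python sketch (with hypothetical coefficients, purely for illustration):<br />

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """betas: K-1 coefficient vectors (class K is the reference class).
    Returns the K posterior probabilities P(Y=i | X=x)."""
    scores = np.array([b @ x for b in betas])        # beta_i^T x, i = 1..K-1
    denom = 1 + np.sum(np.exp(scores))
    return np.append(np.exp(scores) / denom,         # classes 1..K-1
                     1 / denom)                      # reference class K

# hypothetical 3-class example in 2-D (intercept folded into x)
betas = [np.array([0.5, 1.0]), np.array([-0.3, 0.2])]
x = np.array([1.0, 0.4])
p = multiclass_posteriors(betas, x)
print(p)
```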
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem is not convex and has no unique global minimum. The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable, the algorithm is shown to converge to a separating solution in a finite number of steps; the proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line ourselves; we just have to give it some labelled examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers, so it guesses. For example, if we input 1,0,0 it may guess -1, but the right answer is +1, so the perceptron adjusts its line and we try the next example. Eventually the perceptron will get all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of the weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary, then<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\underline{\beta}</math> to be normalized so that <math>\|\underline{\beta}\|=1</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math>, and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then both <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> are positive or both are negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step of predetermined size in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate", and <math>(\underline{x_i}, y_i)</math> is a misclassified point. The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
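The update rule above can be sketched in a few lines of Python (an illustrative re-implementation, not part of the original notes; the function name and the choice to treat a zero margin as misclassified are ours):

```python
import numpy as np

def perceptron_train(X, y, rho=0.5, max_iter=100):
    """Perceptron algorithm. X is an (n, d) array of points, y holds
    +1/-1 labels. Sweeps the data repeatedly; whenever a point is
    misclassified (y_i * (beta^T x_i + beta_0) <= 0), the boundary
    is moved toward it by a step of size rho."""
    n, d = X.shape
    beta, beta_0 = np.zeros(d), 0.0
    for _ in range(max_iter):
        changed = False
        for i in range(n):
            if y[i] * (X[i] @ beta + beta_0) <= 0:  # misclassified
                beta = beta + rho * y[i] * X[i]     # beta_new = beta_old + rho * y_i * x_i
                beta_0 = beta_0 + rho * y[i]        # beta_0_new = beta_0_old + rho * y_i
                changed = True
        if not changed:  # a full sweep with no updates: converged
            break
    return beta, beta_0
```

On the six three-feature examples in the table above, this converges to a separating hyperplane after a handful of sweeps.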
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to the Perceptron problem, all of which are separating hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find, possibly oscillating forever between two points on either side of the minimum.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Consider yourself on a mountain peak, wanting to get down to the plain as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the gradient. However, if the mountain has a saddle shape and you unfortunately start in the middle, you will eventually arrive at the saddle point, where the gradient vanishes, and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over <math>i</math> (all data points). This is a variant of the original algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, in which the true gradient is approximated by evaluating it on a single training example, so that <math>{\beta}</math> is improved using the computation from only one sample at a time. When there is a large data set, say a population database, it is very time-consuming to sum over millions of samples; with stochastic gradient descent we can treat the problem sample by sample and still get decent results in practice.<br />
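As a contrast with the per-sample (stochastic) update, the batch gradient of <math>\phi</math>, summed over the whole set <math>M</math> of misclassified points before a step is taken, can be sketched as follows (a Python illustration, not part of the original notes; names are ours):

```python
import numpy as np

def batch_gradient(X, y, beta, beta_0):
    """Batch gradient of phi(beta, beta_0): sums -y_i x_i and -y_i
    over the whole set M of currently misclassified points, in
    contrast to the per-sample update of the perceptron sweep."""
    margins = y * (X @ beta + beta_0)
    M = margins <= 0  # mask of misclassified points
    grad_beta = -np.sum(y[M][:, None] * X[M], axis=0)  # -sum_{i in M} y_i x_i
    grad_beta_0 = -np.sum(y[M])                        # -sum_{i in M} y_i
    return grad_beta, grad_beta_0
```

A batch step would then move <math>(\underline{\beta}, \beta_0)</math> against this summed gradient once per pass over the data, rather than once per misclassified point.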
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information-processing structure consisting of processing elements interconnected via signal channels called connections. Each processing element has a single output connection that branches ("fans out") into as many connections as desired, each carrying the same signal - the processing element's output signal <ref>Theory of the Backpropagation Neural Network, R. Hecht-Nielsen</ref>. It is a multistage regression or classification model represented by a network. Figure 1 shows a typical neural network, but a neural network can take many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there could be '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded as 0 or 1.<br />
<br />
===Activation Function===<br />
"Activation function" is a term that is frequently used in classification by NN. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of the input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]], so its derivative cannot be taken. Thus, we replace it by a smooth continuous function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The function <math>\displaystyle \sigma </math> can have any form, but typically the logistic (logit) form <math>\sigma(a)=\frac {1}{1+e^{-a}}</math> is used.<br />
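The logistic activation and its derivative can be written out as follows (a short Python sketch, not from the original notes; the identity <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math> is what makes the back-propagation updates in the next section cheap to compute):

```python
import numpy as np

def sigma(a):
    """Logistic activation sigma(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sigma_prime(a):
    """Derivative sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = sigma(a)
    return s * (1.0 - s)
```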
<br />
By assigning weights to the connectors in the neural network (see the diagram above), we weigh the input that comes into each perceptron to get an output, which in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called a '''Feed-Forward Neural Network'''.<br />
<br />
===Back-propagation===<br />
For quite a while the neural network model was just an idea, since there were no algorithms for training the model until the 1986 work of Geoffrey Hinton and his collaborators <ref>
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> popularized an algorithm called '''back-propagation'''. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied gradient descent algorithms for optimizing the weights. Back-propagation uses this idea of gradient descent to train a neural network. <br />
<br />
Assume that the output layer has only one unit, so that we are working with a regression problem. Later we will see how this can be extended to multiple output units and thus turned into a classification problem.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each perceptron we have a function, the activation function, that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i=\sigma(a_i)</math>. The <math>\displaystyle z_i</math>'s are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true value and the network output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to the hidden weights <math>\displaystyle u_i</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
Notice that the weighted sums of the outputs of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Where <math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} = \sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math>, with <math>\delta_i = \frac{\partial err}{\partial a_i}</math> the deltas of the layer above.<br />
<br><math>\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math><br />
<br />
So <math>\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)</math><br />
<br><math>\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}</math><br />
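Putting the forward pass and these gradients together for a single hidden layer and one linear output unit gives the following sketch (a Python illustration under the squared-error setup above, not part of the original notes; all names are ours):

```python
import numpy as np

def forward_backward(x, y, U, w):
    """One forward and backward pass for a network with one hidden
    layer of p logistic units and a single linear output unit.

    x : (d,) input vector      U : (p, d) hidden weights u_{jl}
    y : scalar target          w : (p,) output weights w_j
    Returns the squared error and the gradients dE/dw and dE/dU."""
    a = U @ x                        # a_j = sum_l u_{jl} x_l
    z = 1.0 / (1.0 + np.exp(-a))     # z_j = sigma(a_j)
    y_hat = w @ z                    # y_hat = sum_j w_j z_j
    err = (y - y_hat) ** 2

    dE_dyhat = -2.0 * (y - y_hat)
    grad_w = dE_dyhat * z                 # dE/dw_j = dE/dyhat * z_j
    delta = dE_dyhat * w * z * (1.0 - z)  # delta_j = sigma'(a_j) * w_j * dE/dyhat
    grad_U = np.outer(delta, x)           # dE/du_{jl} = delta_j * (input from layer below)
    return err, grad_w, grad_U
```

A finite-difference check confirms that these gradients match the derivatives derived above.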
<br />
==Notes==<br />
<references/></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4497stat8412009-10-29T00:08:05Z<p>Ipargaru: /* Back-propagation */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data-mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing, and so on. <br />
<br />
In classification we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> is the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
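The empirical error rate is straightforward to compute; for instance (a short Python sketch, not part of the original notes):

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training points that the rule h labels incorrectly,
    i.e. (1/n) * sum of the indicator I(h(X_i) != Y_i)."""
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(X)
```

For example, with the rule <code>h = lambda x: 1 if x >= 0 else 0</code>, the empirical error rate is simply the share of mismatched labels on the training set.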
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to compute the posterior probability of a given object from its prior probability via Bayes' formula, and then assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and let <math>\,r(X)=P\{Y=1|X=x\}</math>. Given <math>\,X=x</math>, by ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says that we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict whether a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
Whether the student’s GPA > 3.0 (G)<br />
Whether the student had a strong math background (M)<br />
Whether the student is a hard worker (H)<br />
Whether the student passed or failed the course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
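The computation in this example can be reproduced as follows (a Python sketch, not part of the original notes; the likelihood values 0.05 and 0.2 are the ones implied by the numerator 0.025 and denominator 0.125 above, given the priors of 0.5):

```python
def posterior_pass(lik_pass, lik_fail, prior_pass=0.5):
    """r(x) = P(Y=1 | X=x) for the two-class case via Bayes' formula."""
    prior_fail = 1.0 - prior_pass
    num = lik_pass * prior_pass
    return num / (num + lik_fail * prior_fail)

# Likelihoods implied by the worked example for X = (G=0, M=1, H=0):
# P(X|pass) * 0.5 = 0.025 and the denominator is 0.125, so
# P(X|pass) = 0.05 and P(X|fail) = 0.2.
r = posterior_pass(0.05, 0.2)  # r = 0.2 < 1/2, so the Bayes rule predicts "fail"
```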
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods. The reason is that in the Bayes formula discussed above, it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>, which makes the Bayes rule impractical.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the Naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first treats probability as changing based on observation, while the second treats probability as an objective quantity. In fact, they represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics, there have been two major schools: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a single, unrepeatable event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables with a given distribution, and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, such as a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, by contrast, one does not see the man directly, but judges from photos (the label) of this man whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although it represents the optimal method, the Bayes classifier cannot be used in most practical situations, since usually the prior probability is unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\,H</math> and find <math>\,h^* \in H</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known, so they must be estimated if we want to classify new data points.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the <math>\Sigma_k</math> over all <math>k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
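The halfway-point claim can be checked numerically from the boundary equation derived above (a Python sketch with our own names, not part of the original notes; <code>Sigma_inv</code> must be a symmetric inverse covariance matrix):

```python
import numpy as np

def lda_boundary_value(x, mu_k, mu_l, Sigma_inv, pi_k=0.5, pi_l=0.5):
    """Left-hand side of the LDA boundary equation derived above;
    it is zero exactly when x lies on the decision boundary."""
    return (np.log(pi_k / pi_l)
            - 0.5 * (mu_k @ Sigma_inv @ mu_k
                     - mu_l @ Sigma_inv @ mu_l
                     - 2.0 * x @ Sigma_inv @ (mu_k - mu_l)))
```

With equal priors, the midpoint <math>(\mu_k+\mu_l)/2</math> makes this expression vanish, confirming that the boundary passes halfway between the two means.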
<br />
===QDA===<br />
The concept is the same: find a boundary where the error rates for classification between classes are equal, except that the assumption that each cluster shares a common covariance matrix <math>\,\Sigma</math> (the mean of the <math>\Sigma_k</math> over all <math>k</math>) is removed.<br />
<br />
<br />
We now continue the derivation from the point where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math> and that each conditional density <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian. Then the Bayes classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
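A minimal sketch of the theorem's classification rule (Python/NumPy with made-up class parameters, not the course data): both discriminants are computed and the class with the largest <math>\,\delta_k</math> is returned. With equal covariances the quadratic and linear rules agree.<br />

```python
import numpy as np

def delta_quadratic(x, mu, Sigma, pi):
    """QDA discriminant: -1/2 log|Sigma_k| - 1/2 (x-mu_k)' Sigma_k^-1 (x-mu_k) + log pi_k."""
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(pi))

def delta_linear(x, mu, Sigma_inv, pi):
    """LDA discriminant (common covariance): x' S^-1 mu_k - 1/2 mu_k' S^-1 mu_k + log pi_k."""
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

# Hypothetical two-class setup with a shared covariance.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = np.eye(2)
pis = [0.5, 0.5]
Sigma_inv = np.linalg.inv(Sigma)

x = np.array([2.5, 2.9])          # clearly closer to the second mean
h_q = np.argmax([delta_quadratic(x, m, Sigma, p) for m, p in zip(mus, pis)])
h_l = np.argmax([delta_linear(x, m, Sigma_inv, p) for m, p in zip(mus, pis)])
print(h_q, h_l)   # 1 1 -- with equal covariances both rules agree
```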
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
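These estimates are straightforward to compute. A sketch in Python/NumPy, on a made-up labelled sample (the labels and parameters below are assumptions for illustration only):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical labelled sample: two Gaussian classes in 2-D.
n0, n1 = 150, 250
X = np.vstack([rng.normal([0, 0], 1.0, size=(n0, 2)),
               rng.normal([4, 2], 1.0, size=(n1, 2))])
y = np.array([0] * n0 + [1] * n1)

n = len(y)
pi_hat, mu_hat, Sigma_hat = {}, {}, {}
for k in (0, 1):
    Xk = X[y == k]
    nk = len(Xk)
    pi_hat[k] = nk / n                  # class prior estimate: n_k / n
    mu_hat[k] = Xk.mean(axis=0)         # class mean estimate
    C = Xk - mu_hat[k]
    Sigma_hat[k] = C.T @ C / nk         # ML covariance estimate (divide by n_k)

# Pooled (common) covariance: the weighted average of the class covariances.
Sigma_pooled = (n0 * Sigma_hat[0] + n1 * Sigma_hat[1]) / (n0 + n1)

print(round(pi_hat[0], 3), round(pi_hat[1], 3))   # 0.375 0.625
```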
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^\top(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for classification to be applicable using this method, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and we transform both to the same shape. Given a data point, we must decide which class it belongs to, so which transformation should we use? If we use the transformation of class A, then we have already assumed that the data point belongs to class A.<br />
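The transformation in Case 2 can be checked numerically. The sketch below (Python/NumPy, with a made-up covariance) verifies that the squared Euclidean distance after the transformation <math> \, x^* = S^{-\frac{1}{2}}U^\top x </math> equals the Mahalanobis distance <math>(x-\mu)^\top\Sigma^{-1}(x-\mu)</math> in the original space, which is exactly what reduces Case 2 to Case 1.<br />

```python
import numpy as np

# Hypothetical common covariance (symmetric positive definite).
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
# Since Sigma is symmetric, its eigendecomposition plays the role of Sigma = U S U^T.
S, U = np.linalg.eigh(Sigma)
W = np.diag(S ** -0.5) @ U.T     # the transformation x* = S^{-1/2} U^T x

x  = np.array([1.0, 2.0])
mu = np.array([-1.0, 0.5])

# Mahalanobis distance in the original space...
d2_mahalanobis = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
# ...equals the squared Euclidean distance after whitening both points.
d2_euclid = np.sum((W @ x - W @ mu) ** 2)
print(np.isclose(d2_mahalanobis, d2_euclid))   # True
```

The identity holds because <math>W^\top W = US^{-1}U^\top = \Sigma^{-1}</math>.<br />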
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare one given class against each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
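The two counting formulas above can be written down directly; a small Python sketch (the call at the end simply plugs in the dimensions of the 2_3 digit data, <math>K=2</math> and <math>d=64</math>):<br />

```python
def lda_params(K, d):
    """(K-1) boundaries, each of the form a'x + b with d + 1 parameters."""
    return (K - 1) * (d + 1)

def qda_params(K, d):
    """(K-1) boundaries, each of the form x'ax + b'x + c with
    d(d+1)/2 (symmetric a) + d (b) + 1 (c) = d(d+3)/2 + 1 parameters."""
    return (K - 1) * (d * (d + 3) // 2 + 1)

# For the 64-dimensional 2_3 digit data with two classes:
print(lda_params(2, 64), qda_params(2, 64))   # 65 2145
```

The quadratic growth of the QDA count in <math>d</math> is what the plot above illustrates.<br />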
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform principal component analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function; here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code> with explanations of the emphasized steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
 latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing it with the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
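The same recipe can be sketched outside Matlab. The following Python/NumPy version of the <code>princomp</code> steps (center by column means, take the SVD, use <math>\,V</math> as coefficients) uses random data standing in for 2_3, and checks that the resulting scores are uncorrelated with variances given by the squared singular values:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 64))         # rows = observations, as princomp expects

# "princomp" behaviour: center by column means, then SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc = Vt.T                              # coefficients: the V of X = U d V'
score = Xc @ pc                        # the data expressed in the principal axes

# The scores are orthogonal, with scatter given by the squared singular values.
print(np.allclose(score.T @ score, np.diag(s ** 2)))   # True
```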
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
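The whole trick can be sketched numerically. In the Python/NumPy example below the data are made up so that the labels depend on the radius <math>x_1^2+x_2^2</math>, meaning no straight line in the original two dimensions can separate the classes; appending the squared features lets a purely linear discriminant succeed. The LDA step is a minimal pooled-covariance implementation, not Matlab's <code>classify</code>:<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data that no straight line can separate: class 0 is an inner
# blob, class 1 a surrounding ring (the label depends only on the radius).
n = 1000
X = rng.normal(size=(n, 2)) * 2
y = (np.sum(X ** 2, axis=1) > 4).astype(int)

def lda_fit_predict(Z, y):
    """Minimal LDA: pooled covariance, linear discriminants, argmax rule."""
    mus = [Z[y == k].mean(axis=0) for k in (0, 1)]
    Sw = sum((Z[y == k] - mus[k]).T @ (Z[y == k] - mus[k]) for k in (0, 1)) / n
    Swi = np.linalg.inv(Sw)
    deltas = []
    for k in (0, 1):
        pk = np.mean(y == k)
        deltas.append(Z @ Swi @ mus[k] - 0.5 * mus[k] @ Swi @ mus[k] + np.log(pk))
    return np.argmax(np.column_stack(deltas), axis=1)

acc_linear = np.mean(lda_fit_predict(X, y) == y)       # raw features: near chance
X_star = np.hstack([X, X ** 2])                        # append the squared features
acc_trick = np.mean(lda_fit_predict(X_star, y) == y)   # the trick: much better
print(acc_linear < 0.75, acc_trick > 0.85)
```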
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
 >> library(MASS)
 >> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, and each class may have a different size. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the size of each class, which is represented by its covariance.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\,\underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, and so it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
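The two-class solution can be checked numerically. The following is a Python/NumPy sketch (rather than the Matlab used elsewhere in these notes; the synthetic data and variable names are our own), verifying that the top eigenvector of <math>S_{W}^{-1}S_{B}</math> and the direct formula <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> give the same direction:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes sharing the covariance used in the Matlab example below
cov = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1, 1], cov, size=300)
X2 = rng.multivariate_normal([5, 3], cov, size=300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1.T) + np.cov(X2.T)       # within-class scatter S_W
Sb = np.outer(mu1 - mu2, mu1 - mu2)    # between-class scatter S_B

# Direct formula: w is proportional to S_W^{-1} (mu1 - mu2)
w_direct = np.linalg.solve(Sw, mu1 - mu2)

# Eigenvector route: eigenvector of S_W^{-1} S_B with the largest eigenvalue
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
w_eig = np.real(vecs[:, np.argmax(np.real(vals))])

# The two answers agree up to scale and sign
w1 = w_direct / np.linalg.norm(w_direct)
w2 = w_eig / np.linalg.norm(w_eig)
print(abs(w1 @ w2))   # close to 1
```

Both routes work because <math>S_{B}</math> has rank one, so <math>S_{W}^{-1}S_{B}</math> has a single nonzero eigenvalue.<br />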
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathworks.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (The <math>\frac{1}{n_{i}}</math> factor is omitted here so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is constant. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get <math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is another generalization of <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
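The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically. The following Python/NumPy sketch (the synthetic three-class data are our own choice) computes all three scatter matrices from data and checks the identity:<br />

```python
import numpy as np

rng = np.random.default_rng(1)

# Three small clusters standing in for a k = 3 class problem
classes = [rng.normal(loc=m, size=(50, 2)) for m in ([0, 0], [4, 1], [2, 5])]
X = np.vstack(classes)
mu = X.mean(axis=0)

# Unnormalized scatter matrices, as in the derivation above
Sw = sum((C - C.mean(axis=0)).T @ (C - C.mean(axis=0)) for C in classes)
Sb = sum(len(C) * np.outer(C.mean(axis=0) - mu, C.mean(axis=0) - mu) for C in classes)
St = (X - mu).T @ (X - mu)

print(np.allclose(St, Sw + Sb))   # True: S_T = S_W + S_B
```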
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Apparently, the two expressions are very similar: the first two terms of the expansion match the general form up to the class sizes <math>n_{1}</math> and <math>n_{2}</math>.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which is actually a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\mathbf{k-1}</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\leq k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
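Putting the multi-class procedure together, here is a Python/NumPy sketch (synthetic data and variable names are our own, not part of the original notes) that builds <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math> for <math>k=3</math> classes in <math>d=4</math> dimensions and projects onto the top <math>k-1=2</math> eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# k = 3 classes in d = 4 dimensions -> project to k - 1 = 2 dimensions
means = [[0, 0, 0, 0], [3, 1, 0, 2], [1, 4, 2, 0]]
classes = [rng.normal(loc=m, size=(60, 4)) for m in means]
X = np.vstack(classes)
mu = X.mean(axis=0)

Sw = sum((C - C.mean(axis=0)).T @ (C - C.mean(axis=0)) for C in classes)
Sb = sum(len(C) * np.outer(C.mean(axis=0) - mu, C.mean(axis=0) - mu) for C in classes)

# Columns of W are eigenvectors of S_W^{-1} S_B with the largest eigenvalues
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(np.real(vals))[::-1]
W = np.real(vecs[:, order[:2]])   # the d x (k-1) transformation matrix

Z = X @ W                         # projected data, one (k-1)-vector per point
print(Z.shape)                    # (180, 2)
```

Since <math>\mathbf{S}_{B}</math> has rank at most <math>k-1</math>, only two eigenvalues are (numerically) nonzero here.<br />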
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math> our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
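The closed-form solution and the hat matrix can be illustrated with a short Python/NumPy sketch (the synthetic data and coefficients are our own); it checks that <math>\mathbf{H}\mathbf{y}</math> reproduces the fitted values <math>\mathbf{X}\hat\beta</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(3)

n, d = 50, 3
# n x (d+1) design matrix with 1 in the first position of each row
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
beta_true = np.array([2.0, 1.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed form: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H maps y to the fitted values
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print(np.allclose(y_hat, X @ beta_hat))   # True
```

As a projection matrix, <math>\mathbf{H}</math> is idempotent: applying it twice gives the same fitted values.<br />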
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following Matlab example; each step of the code is explained.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by transposing the sample matrix and adding a row of ones to the data.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
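As a small sanity check, the two class probabilities can be computed directly; the following Python sketch (our own illustration, with arbitrary values of <math>\underline{\beta}</math> and <math>\underline{x}</math>) confirms they lie in <math>(0,1)</math> and sum to one:<br />

```python
import numpy as np

def p1(x, beta):
    """P(Y=1 | X=x) under the logistic model."""
    z = beta @ x
    return np.exp(z) / (1.0 + np.exp(z))

# Illustrative values; the first component of x plays the role of an intercept
beta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])

prob1 = p1(x, beta)
prob0 = 1.0 - prob1
print(prob1, prob0)   # both in (0, 1), summing to 1
```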
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the known distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, taking the log of both sides gives<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math> you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], it's a very useful website including a Matrix Reference Manual that you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained if we first reduce the occurrences of <math>\underline{\beta}</math> using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiate <math>\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right]</math> with respect to <math>\underline{\beta}^T</math>.<br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
whose solution is <math>\underline{\hat{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
This is a weighted linear regression on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
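The equivalence between the summation form and the matrix form <math>\left[\mathbf{XWX}^{T}\right]^{-1}\mathbf{XWz}</math> of the WLS estimator can be checked numerically; the following Python/NumPy sketch (random data of our own) computes both:<br />

```python
import numpy as np

rng = np.random.default_rng(4)

n, d = 40, 2
X = rng.normal(size=(d, n))         # d x n: columns are the inputs x_i, as in the notes
y = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)   # positive weights w_i
W = np.diag(w)

# Summation form of the WLS estimator
A = sum(w[i] * np.outer(X[:, i], X[:, i]) for i in range(n))
b = sum(w[i] * y[i] * X[:, i] for i in range(n))
beta_sum = np.linalg.solve(A, b)

# Matrix form: (X W X^T)^{-1} X W y
beta_mat = np.linalg.solve(X @ W @ X.T, X @ W @ y)

print(np.allclose(beta_sum, beta_mat))   # True
```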
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, similar to linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> seems to be a suitable starting value for the Newton-Raphson iteration procedure in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. If it does not, we can only prove local convergence of the method, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: it is uncommon for an initial point to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is <math>\,d</math>-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is <math>\,d</math>-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to apply logistic regression to classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem since we don't have the same simplification.<br />
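The posteriors above are simple to evaluate given the fitted coefficient vectors. A small illustrative sketch (the function name and coefficient layout are assumptions, not part of the lecture):<br />

```python
import numpy as np

def posteriors(betas, x):
    """K-class logistic posteriors from the K-1 coefficient vectors
    beta_1..beta_{K-1} (rows of `betas`); class K is the reference class."""
    scores = np.exp(betas @ x)                       # exp(beta_i^T x), i = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)    # length K
```

By construction the returned vector has length K and sums to 1, as noted above.<br />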
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem has no global minimum (it is not convex). The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable then the algorithm is shown to converge to a separating hyperplane. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line ourselves. We just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
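The same training loop can be written compactly in Python; this is a sketch mirroring the MATLAB code above (not lecture code), with the update applied whenever <math>y_i(\underline{\beta}^T\underline{x}_i+\beta_0)\le 0</math>, so points exactly on the boundary also count as misclassified:<br />

```python
import numpy as np

def perceptron(X, y, rho=0.5, max_epochs=100):
    """Rosenblatt's perceptron.  X: d x n array (observations as columns),
    y: length-n array of +1/-1 labels.  Returns (b, b0) such that
    sign(b^T x + b0) reproduces y when the classes are separable."""
    d, n = X.shape
    b, b0 = np.ones(d), 0.0
    for _ in range(max_epochs):
        changed = False
        for i in range(n):
            if y[i] * (b @ X[:, i] + b0) <= 0:   # misclassified (or on boundary)
                b += rho * y[i] * X[:, i]        # move boundary toward the point
                b0 += rho * y[i]
                changed = True
        if not changed:                          # a full pass with no mistakes
            break
    return b, b0
```

On the six example points in the table, this loop terminates with every point on the correct side of the plane.<br />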
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of some weights of these inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> on the direction that is orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\|\underline{\beta}\|=1</math>; otherwise divide by <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that takes a predetermined step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate", and <math>(\underline{x_i}, y_i)</math> is a misclassified point. The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points. <br />
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to the perceptron problem, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Consider yourself on a mountain peak, wanting to get down to the plain as fast as possible. Which direction should you step? Intuitively it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and unfortunately you initially stand in the middle, then you will eventually arrive at the saddle point (a stationary point that is not a minimum) and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over <math>\,i</math> (all data points). This is an alternative to the original gradient descent algorithm (sometimes called batch gradient descent), namely stochastic gradient descent, where we approximate the true gradient by evaluating it on a single training example. This means that <math>{\beta}</math> gets improved by the computation of only one sample. When there is a large data set, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent, we can treat the problem sample by sample and still get decent results in practice.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal - the processing element's output signal <ref>Theory of the Backpropagation Neural Network, R. Necht-Nielsen </ref>. It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network, but it can have many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there can be '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1).<br />
<br />
===Activation Function===<br />
Activation function is a term that is frequently used in classification by NN. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]] so the derivative cannot be taken. Thus, we replace it by a smooth continuous function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The function <math>\displaystyle \sigma </math> can have any form, but typically the <math>\sigma(a)=\frac {1}{1+e^{-a}}</math> (logit) form is used.<br />
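For illustration (a sketch, not lecture code), the logistic activation and its derivative are cheap to compute; the identity <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math> is what makes the back-propagation updates in the next section convenient:<br />

```python
import numpy as np

def sigma(a):
    """Logistic activation: a smooth replacement for the sign function."""
    return 1.0 / (1.0 + np.exp(-a))

def sigma_prime(a):
    """Its derivative satisfies sigma'(a) = sigma(a) * (1 - sigma(a))."""
    s = sigma(a)
    return s * (1.0 - s)
```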
<br />
By assigning weights to the connectors in the neural network (see diagram above) we weigh the input that comes into each perceptron to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called a '''Feed-Forward Neural Network'''.<br />
<br />
===Back-propagation===<br />
For quite a while the neural network model was just an idea, since no algorithms for training the model existed until Geoffrey Hinton in 1986 <ref>http://www.cs.toronto.edu/~hinton/backprop.html</ref> came up with an algorithm called '''back-propagation'''. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied gradient descent algorithms to optimize the weights. Back-propagation uses this idea of gradient descent to train the neural network. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output units and thus turned into a classification problem.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each perceptron we have a function, which we called the activation function, that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i=\sigma(a_i)</math>. The <math>\displaystyle z_i</math>'s are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find the derivative of the model error with respect to the output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
Notice that the weighted sums of the outputs of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
Where <math>\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} = \sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}</math><br />
<br><math>\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)</math><br />
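Putting the two derivatives together for a single hidden layer gives the following sketch (an illustration with <math>\displaystyle err=||y-\hat y||^2</math> and the logistic activation; the dimensions and names are made up, not lecture code):<br />

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, U, w):
    """One hidden layer: a = U x, z = sigma(a), yhat = sum_i w_i z_i."""
    z = sigma(U @ x)
    return z, w @ z

def backprop(x, y, U, w):
    """Gradients of err = (y - yhat)^2 w.r.t. the output weights w
    and the hidden weights U, following the chain rule above."""
    z, yhat = forward(x, U, w)
    d_yhat = -2.0 * (y - yhat)            # d err / d yhat
    grad_w = d_yhat * z                   # d err / d w_i  = d_yhat * z_i
    delta = d_yhat * w * z * (1.0 - z)    # delta_j = d err / d a_j, sigma' = z(1-z)
    grad_U = np.outer(delta, x)           # d err / d u_{jl} = delta_j * (input)_l
    return grad_w, grad_U
```

A finite-difference check on any single weight confirms the chain-rule gradients.<br />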
<br />
==Notes==<br />
<references/></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4496stat8412009-10-28T23:35:41Z<p>Ipargaru: /* Back-propagation */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classfication-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rising fields of data-mining, bioinformatics, machine learning and so on, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification we attempt to approximate a function <math>\,h</math>, by using a training data set, that will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X{_\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> be the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
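For illustration, the empirical error rate is straightforward to compute for any classification rule <math>\,h</math> (a sketch; the rule used in the example below is made up):<br />

```python
def empirical_error_rate(h, X, Y):
    """L_n(h): the fraction of training points that h misclassifies."""
    mistakes = [1 if h(x) != y_true else 0 for x, y_true in zip(X, Y)]
    return sum(mistakes) / len(mistakes)
```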
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the probability <math>\,r(X)=P\{Y=1|X=x\}</math>. Given <math>\,X=x</math>, by ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability that it is of any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
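The arithmetic of this example can be sketched as follows; the likelihood values <math>\,P(X=(0,1,0)|Y=1)=0.05</math> and <math>\,P(X=(0,1,0)|Y=0)=0.2</math> are the ones implied by the numerator <math>\,0.025</math> and the evidence <math>\,0.125</math> above, read off for illustration:<br />

```python
def posterior_y1(lik_y1, lik_y0, prior_y1=0.5, prior_y0=0.5):
    """r(x) = P(Y=1 | X=x) via Bayes' formula."""
    numerator = lik_y1 * prior_y1
    evidence = numerator + lik_y0 * prior_y0
    return numerator / evidence

# Student example: P(X=(0,1,0) | Y=1) = 0.05 and P(X=(0,1,0) | Y=0) = 0.2
r = posterior_y1(0.05, 0.2)   # = 0.025 / 0.125 = 0.2 < 0.5, so predict "fail"
```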
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because in practice it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math> and thus to calculate the value of <math>\,r(X)</math>, which makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first treats probability as a degree of belief that changes with observation, while the second treats probability as an objective, long-run frequency. They represent two different schools of statistics.<br />
<br />
Historically, statistics has had two major schools of thought: Bayesian and frequentist. The two schools represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample(there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many repeated samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, such as a <math>\,50\%</math> probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. Under the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). Under the frequentist approach, one does not see the man directly, but instead judges from repeated observations, such as photos (labels) of this man, whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
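With this notation, the posterior for each class can be sketched directly (a small Python illustration; the one-dimensional Gaussian class densities and means below are made-up examples, not from the course data):<br />

```python
import math

def posteriors(class_densities, priors, x):
    """P(Y=k | X=x) = f_k(x) * pi_k / sum_j f_j(x) * pi_j."""
    weighted = [f(x) * pi for f, pi in zip(class_densities, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical 1-D class conditional densities: unit-variance Gaussians.
def gaussian(mu):
    return lambda x: math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

# At x = 1.0 the class centred at 0 is far more likely than the class at 4.
p = posteriors([gaussian(0.0), gaussian(4.0)], [0.5, 0.5], x=1.0)
```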
<br />
====Approaches====<br />
<br />
Although it is the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability and class conditional density are usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\,\mathcal{H}</math>, and find the <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known, so they must be estimated if we want to classify data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of <math>\Sigma_k</math> over all <math>k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes.<br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
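The coefficients of the linear boundary <math>\,ax+b=0</math> can be computed directly from the derivation above. The following Python sketch uses illustrative means, covariance and priors (not taken from any data set in these notes); the 2&times;2 matrix inverse is written out by hand to keep it self-contained.<br />

```python
import math

def lda_boundary(mu_k, mu_l, sigma, pi_k, pi_l):
    """Return (a, b) for the LDA boundary a^T x + b = 0, from the derivation:
    log(pi_k/pi_l) - 1/2 (mu_k^T S^-1 mu_k - mu_l^T S^-1 mu_l) + x^T S^-1 (mu_k - mu_l) = 0."""
    # Inverse of the 2x2 common covariance matrix, written out explicitly.
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    inv = [[s22 / det, -s12 / det], [-s21 / det, s11 / det]]

    def mat_vec(m, v):
        return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

    def dot(u, v):
        return u[0] * v[0] + u[1] * v[1]

    diff = [mu_k[0] - mu_l[0], mu_k[1] - mu_l[1]]
    a = mat_vec(inv, diff)   # coefficient of x: Sigma^{-1}(mu_k - mu_l)
    b = (math.log(pi_k / pi_l)
         - 0.5 * (dot(mu_k, mat_vec(inv, mu_k)) - dot(mu_l, mat_vec(inv, mu_l))))
    return a, b

# Illustrative parameters: identity covariance, equal priors.
a, b = lda_boundary(mu_k=[1.0, 1.0], mu_l=[3.0, 3.0],
                    sigma=[[1.0, 0.0], [0.0, 1.0]], pi_k=0.5, pi_l=0.5)
# With equal priors the boundary passes through the midpoint (2, 2) of the means.
on_boundary = a[0] * 2.0 + a[1] * 2.0 + b
```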
<br />
===QDA===<br />
The concept is the same: finding a boundary where the error rates for classification between the classes are equal, except that the assumption that each cluster has the same covariance <math>\,\Sigma</math> (equal to the mean of <math>\Sigma_k</math> over all <math>k</math>) is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
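In one dimension the effect of unequal variances is easy to see. The parameters below are made up for illustration: two classes with the same mean but different variances, where the cross term <math>\,x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x</math> produces a quadratic boundary (two crossing points) even though the means coincide.<br />

```python
import math

def qda_score_1d(x, mu, var, pi):
    """delta_k(x) = -1/2 log(var_k) - (x - mu_k)^2 / (2 var_k) + log(pi_k)."""
    return -0.5 * math.log(var) - (x - mu) ** 2 / (2 * var) + math.log(pi)

def classify(x):
    # Class k is narrow (variance 1), class l is wide (variance 4); same mean 0.
    k = qda_score_1d(x, mu=0.0, var=1.0, pi=0.5)
    l = qda_score_1d(x, mu=0.0, var=4.0, pi=0.5)
    return 'k' if k > l else 'l'

# Near the shared mean the narrow class wins; far away the wide class wins.
labels = [classify(0.0), classify(5.0), classify(-5.0)]
```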
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
We need to estimate the prior and the class parameters, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{l=1}^{k}n_l} </math><br />
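These plug-in estimates can be sketched as follows for one-dimensional data (a Python illustration; the toy samples are made-up numbers purely to exercise the formulas, and the pooled variance is the sample-size-weighted average above):<br />

```python
def estimate_parameters(xs, ys):
    """Plug-in estimates pi_hat_k, mu_hat_k, var_hat_k for 1-D data,
    plus the pooled (common) variance used by LDA."""
    n = len(xs)
    classes = sorted(set(ys))
    pi_hat, mu_hat, var_hat = {}, {}, {}
    for k in classes:
        points = [x for x, y in zip(xs, ys) if y == k]
        n_k = len(points)
        pi_hat[k] = n_k / n                                           # pi_hat_k = n_k / n
        mu_hat[k] = sum(points) / n_k                                 # sample mean of class k
        var_hat[k] = sum((x - mu_hat[k]) ** 2 for x in points) / n_k  # ML variance of class k
    # Pooled variance: per-class estimates weighted by class sample sizes.
    pooled = sum(pi_hat[k] * var_hat[k] for k in classes)
    return pi_hat, mu_hat, var_hat, pooled

# Toy data: class 1 around 0, class 2 around 4.
xs = [0.0, 1.0, -1.0, 4.0, 5.0, 3.0]
ys = [1, 1, 1, 2, 2, 2]
pi_hat, mu_hat, var_hat, pooled = estimate_parameters(xs, ys)
```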
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data of each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
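Case 1 therefore amounts to a nearest-centre rule adjusted by the log prior. A small Python sketch, with illustrative centres and priors:<br />

```python
import math

def classify_identity_cov(x, mus, priors):
    """delta_k = -1/2 ||x - mu_k||^2 + log(pi_k); return the k maximizing delta_k."""
    best_k, best_delta = None, -math.inf
    for k, (mu, pi) in enumerate(zip(mus, priors)):
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))  # squared Euclidean distance
        delta = -0.5 * sq_dist + math.log(pi)                   # adjust by the log prior
        if delta > best_delta:
            best_k, best_delta = k, delta
    return best_k

# Two classes with identity covariance; the point is nearer the second centre.
label = classify_identity_cov([2.5, 2.5], mus=[[0.0, 0.0], [3.0, 3.0]], priors=[0.5, 0.5])
```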
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
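A sketch of this whitening transformation for a 2&times;2 covariance (in Python; the closed-form eigendecomposition of a symmetric 2&times;2 matrix keeps it dependency-free, and the example covariance is a made-up diagonal matrix):<br />

```python
import math

def whiten_2d(sigma, x):
    """Transform x to x* = S^{-1/2} U^T x, where sigma = U S U^T is symmetric 2x2."""
    (a, b), (_, c) = sigma                      # sigma = [[a, b], [b, c]]
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    tr, det = a + c, a * c - b * b
    gap = math.sqrt(tr * tr / 4.0 - det)
    s1, s2 = tr / 2.0 + gap, tr / 2.0 - gap
    # Corresponding eigenvectors (columns of U); [s - c, b] solves (sigma - s I)v = 0.
    if b != 0:
        u1, u2 = [s1 - c, b], [s2 - c, b]
    elif a >= c:
        u1, u2 = [1.0, 0.0], [0.0, 1.0]
    else:
        u1, u2 = [0.0, 1.0], [1.0, 0.0]
    def unit(v):
        n = math.hypot(v[0], v[1])
        return [v[0] / n, v[1] / n]
    u1, u2 = unit(u1), unit(u2)
    # Each coordinate of U^T x, scaled by 1/sqrt(eigenvalue).
    return [(u1[0] * x[0] + u1[1] * x[1]) / math.sqrt(s1),
            (u2[0] * x[0] + u2[1] * x[1]) / math.sqrt(s2)]

# With a diagonal covariance the transform simply rescales each axis.
x_star = whiten_2d([[4.0, 0.0], [0.0, 1.0]], [2.0, 3.0])
```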
<br />
Note that when we have multiple classes, they must all use the same transformation; otherwise, we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for classification to be applicable using this method, so this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and we want to transform them to a common shape before classifying a given data point. Which transformation should we use? If we use the transformation of class A, we have implicitly assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class against the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> pairwise differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
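A quick sanity check of these counts (a small Python sketch; <code>K</code> is the number of classes, <code>d</code> the dimension, and the quadratic term is counted as the <math>\frac{d(d+1)}{2}</math> entries of a symmetric matrix):<br />

```python
def lda_param_count(K, d):
    # K-1 pairwise differences, each a linear function a^T x + b with d+1 parameters.
    return (K - 1) * (d + 1)

def qda_param_count(K, d):
    # Each difference x^T a x + b^T x + c: symmetric matrix d(d+1)/2, vector d, constant 1.
    per_difference = d * (d + 1) // 2 + d + 1   # equals d(d+3)/2 + 1
    return (K - 1) * per_difference
```

For example, with two classes in two dimensions LDA needs 3 parameters while QDA needs 6; the gap grows quadratically in <code>d</code>.<br />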
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function, but here we will analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
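The augmentation step can be sketched as follows (in Python rather than Matlab for brevity; the helper names and the weight values are made up for illustration). Appending the squared coordinates makes a linear function of the augmented vector quadratic in the original coordinates.<br />

```python
def augment_with_squares(points):
    """Map each d-dimensional point x to x* = (x_1,...,x_d, x_1^2,...,x_d^2)."""
    return [list(x) + [xi ** 2 for xi in x] for x in points]

# A linear function w*^T x* in the augmented space ...
def linear_score(w_star, x_star):
    return sum(wi * xi for wi, xi in zip(w_star, x_star))

# ... is quadratic in the original x: w^T x plus v applied to the squared coordinates.
x = [2.0, 3.0]
x_star = augment_with_squares([x])[0]   # [2.0, 3.0, 4.0, 9.0]
w_star = [1.0, -1.0, 0.5, 0.5]          # [w_1, w_2, v_1, v_2], illustrative weights
score = linear_score(w_star, x_star)    # 2 - 3 + 2 + 4.5 = 5.5
```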
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, with each class possibly having a different size and shape. To separate the two classes we must determine, for a given point, which class mean is closest, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the squared Euclidean distance between the projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices; assuming at least one of them is positive-definite, <math>\, S_{W}</math> is positive-definite and therefore invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So <math>\underline{w}</math> is proportional to <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>; since only the direction of <math>\underline{w}</math> matters, we can take <math>\underline{w} = S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />
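To make this concrete, here is a minimal pure-Python sketch (the toy data is made up for illustration; no external libraries) that computes the Fisher direction <math>\underline{w} \propto S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> for two classes in two dimensions, and checks that it scores at least as well as other directions under the objective <math>\underline{w}^T S_{B} \underline{w} / \underline{w}^T S_{W} \underline{w}</math>:<br />

```python
# Toy illustration (not the course code): compute the two-class Fisher
# direction w ~ Sw^{-1}(mu1 - mu2) in 2-D using only the standard library.

def mean2(pts):
    n = float(len(pts))
    return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]

def cov2(pts):
    # 2x2 covariance (normalized by n) of a list of 2-D points
    m = mean2(pts)
    n = float(len(pts))
    c = [[0.0, 0.0], [0.0, 0.0]]
    for p in pts:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                c[i][j] += d[i] * d[j] / n
    return c

def inv2(a):
    # inverse of a 2x2 matrix
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[ a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det,  a[0][0] / det]]

class1 = [(1.0, 1.2), (1.5, 2.0), (0.5, 0.8), (1.2, 1.5)]  # made-up cloud 1
class2 = [(5.0, 3.1), (5.5, 3.8), (4.5, 2.6), (5.2, 3.4)]  # made-up cloud 2

mu1, mu2 = mean2(class1), mean2(class2)
S1, S2 = cov2(class1), cov2(class2)
Sw = [[S1[i][j] + S2[i][j] for j in range(2)] for i in range(2)]  # within-class
diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]                         # mu1 - mu2

Swi = inv2(Sw)
w = [Swi[0][0] * diff[0] + Swi[0][1] * diff[1],
     Swi[1][0] * diff[0] + Swi[1][1] * diff[1]]   # Fisher direction

def fisher_ratio(v):
    # (v^T Sb v) / (v^T Sw v); since Sb = diff diff^T, v^T Sb v = (v . diff)^2
    num = (v[0] * diff[0] + v[1] * diff[1]) ** 2
    den = sum(v[i] * Sw[i][j] * v[j] for i in range(2) for j in range(2))
    return num / den

# w maximizes the generalized Rayleigh quotient, so it beats any other direction:
print(fisher_ratio(w) >= fisher_ratio([1.0, 0.0]))  # True
```

Because <math>\underline{w}</math> is the exact maximizer of the ratio, any competing direction (for example a coordinate axis) gives a value no larger.<br />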
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in MATLAB.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes separately.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the most discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this MATLAB example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of a "2" or a "3". Here X1 holds all the "2"s and X2 holds all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> is the scatter matrix of class <math>i</math> and <math>\mathbf{\mu}_{i} = \frac{1}{n_{i}}\sum_{j: y_{j}=i}\mathbf{x}_{j}</math> is its mean. (The scatter matrices are left unnormalized so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification is to note that the total covariance <math>\mathbf{S}_{T}</math> of the data is fixed; since <math>\mathbf{S}_{W}</math> is easy to compute, we can get <math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general definition of <math>\mathbf{S}_{B}</math>. Denote the total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
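As a quick sanity check, the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically. The sketch below (plain Python, with made-up one-dimensional data and three classes) confirms that the total scatter equals within-class scatter plus between-class scatter:<br />

```python
# Verify S_T = S_W + S_B on made-up 1-D data with k = 3 classes.
# The scatter sums are left unnormalized, matching the derivation above.

classes = [
    [1.0, 1.5, 0.5, 1.2],   # class 1
    [5.0, 5.5, 4.5],        # class 2
    [9.0, 8.5, 9.5, 9.2],   # class 3
]

all_pts = [x for c in classes for x in c]
n = len(all_pts)
mu = sum(all_pts) / n                      # total mean

S_T = sum((x - mu) ** 2 for x in all_pts)  # total scatter

S_W = 0.0
S_B = 0.0
for c in classes:
    mu_i = sum(c) / len(c)                           # class mean
    S_W += sum((x - mu_i) ** 2 for x in c)           # within-class scatter
    S_B += len(c) * (mu_i - mu) ** 2                 # between-class scatter

print(abs(S_T - (S_W + S_B)) < 1e-9)  # True: the decomposition holds
```

In one dimension this is exactly the classical analysis-of-variance decomposition of the total sum of squares.<br />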
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T} - (\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T} - (\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T} + (\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Since <math>\mathbf{\mu}</math> is a weighted average of <math>\mathbf{\mu}_{1}</math> and <math>\mathbf{\mu}_{2}</math>, both <math>\mathbf{\mu}_{1}-\mathbf{\mu}</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu}</math> are scalar multiples of <math>\mathbf{\mu}_{1}-\mathbf{\mu}_{2}</math>, so <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{B^{\ast}}</math> differ only by a scalar factor and lead to the same discriminant direction.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution is that the columns of the transformation matrix <math>\mathbf{W}</math> are the eigenvectors corresponding to the largest <math>k-1</math> eigenvalues of<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
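This identity is easy to verify numerically; here is a small pure-Python check on an arbitrary made-up <math>2\times 3</math> matrix:<br />

```python
# Check ||X||^2 = Tr(X^T X): the squared (Frobenius) norm equals the trace
# of X^T X.  X below is an arbitrary made-up 2x3 matrix.

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

# Left-hand side: sum of squared entries.
norm_sq = sum(x * x for row in X for x in row)

# Right-hand side: Tr(X^T X); entry (j, j) of X^T X is sum_i X[i][j]^2.
trace = sum(sum(X[i][j] ** 2 for i in range(2)) for j in range(3))

print(norm_sq, trace)  # 91.0 91.0
```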
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
As in the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which is a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\leq k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case: the columns of the transformation matrix <math>\mathbf{W}</math> are the eigenvectors corresponding to the largest <math>k-1</math> eigenvalues of<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
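The rank observation above can be checked numerically. In the pure-Python sketch below (made-up data, <math>k=2</math> classes in <math>d=2</math> dimensions), <math>\mathbf{S}_{B}</math> is built from the class-mean deviations, which span only <math>k-1=1</math> independent directions, so its determinant is zero up to rounding:<br />

```python
# With k = 2 classes in d = 2 dimensions,
# S_B = sum_i n_i (mu_i - mu)(mu_i - mu)^T has rank at most k - 1 = 1,
# so det(S_B) = 0 (up to floating-point error).  Made-up class clouds.

class1 = [(1.0, 1.2), (1.5, 2.0), (0.5, 0.8)]
class2 = [(5.0, 3.1), (5.5, 3.8), (4.5, 2.6), (5.2, 3.4)]

def mean2(pts):
    n = float(len(pts))
    return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]

n1, n2 = len(class1), len(class2)
mu1, mu2 = mean2(class1), mean2(class2)
mu = [(n1 * mu1[i] + n2 * mu2[i]) / (n1 + n2) for i in range(2)]  # total mean

Sb = [[0.0, 0.0], [0.0, 0.0]]
for n_i, mu_i in [(n1, mu1), (n2, mu2)]:
    d = [mu_i[0] - mu[0], mu_i[1] - mu[1]]   # class-mean deviation from mu
    for i in range(2):
        for j in range(2):
            Sb[i][j] += n_i * d[i] * d[j]

det_Sb = Sb[0][0] * Sb[1][1] - Sb[0][1] * Sb[1][0]
print(abs(det_Sb) < 1e-9)  # True: S_B is rank 1, not full rank
```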
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{d}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given training data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
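The closed-form solution <math>\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}</math> can be illustrated with a tiny pure-Python sketch. The data is made up and lies exactly on the line <math>\,y = 1 + 2x</math>, so solving the normal equations recovers <math>\hat\beta_0 = 1,\ \hat\beta_1 = 2</math>:<br />

```python
# Least squares via the normal equations X^T X beta = X^T y, using only
# the standard library.  Made-up data lying exactly on the line y = 1 + 2x.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
n = len(xs)

# X is n x 2 with a column of ones; form X^T X and X^T y directly.
sx  = sum(xs)
sxx = sum(x * x for x in xs)
sy  = sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))

# Invert the 2x2 matrix X^T X = [[n, sx], [sx, sxx]] and apply it to X^T y.
det = n * sxx - sx * sx
b0 = ( sxx * sy - sx * sxy) / det   # intercept beta_0
b1 = (-sx  * sy + n  * sxy) / det   # slope beta_1

print(b0, b1)  # 1.0 2.0
```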
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
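This failure is easy to demonstrate with a small pure-Python sketch (the 0/1 labels and inputs are made up): fitting ordinary least squares to class labels produces "probabilities" outside <math>[0,1]</math> at the extremes:<br />

```python
# Illustration of the note above: least squares fit to made-up 0/1 labels
# yields fitted values outside [0, 1].

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.0, 1.0, 1.0]   # class labels treated as numeric targets
n = len(xs)

sx, sxx = sum(xs), sum(x * x for x in xs)
sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))

# Solve the 2x2 normal equations for intercept b0 and slope b1.
det = n * sxx - sx * sx
b0 = ( sxx * sy - sx * sxy) / det
b1 = (-sx  * sy + n  * sxy) / det

predictions = [b0 + b1 * x for x in xs]
print(predictions)  # approximately [-0.1, 0.3, 0.7, 1.1]: endpoints leave [0, 1]
```

Logistic regression, introduced below, fixes exactly this problem.<br />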
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by transposing the PCA scores and appending a row of ones, so that each column of x is a data point with an intercept term.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the data points, coloured by predicted class: blue if the fitted value exceeds 0.5, red otherwise.<br />
<br />
[[File: linearregression.png|center|frame| The figure shows the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
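These two formulas can be sketched directly in Python (plain standard library; the inputs are made up). Note that the two posteriors always lie in <math>(0,1)</math> and sum to one, which is exactly what the linear regression model failed to guarantee:<br />

```python
import math

# Logistic model for two classes: p1 = exp(t) / (1 + exp(t)) and
# p0 = 1 / (1 + exp(t)), where t = beta^T x.  The t values are made up.

def posteriors(t):
    e = math.exp(t)
    p1 = e / (1.0 + e)      # P(Y = 1 | X = x)
    p0 = 1.0 / (1.0 + e)    # P(Y = 0 | X = x)
    return p1, p0

for t in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    p1, p0 = posteriors(t)
    assert 0.0 < p1 < 1.0 and 0.0 < p0 < 1.0
    assert abs(p1 + p0 - 1.0) < 1e-12

# At t = 0 the model is indifferent between the classes:
print(posteriors(0.0))  # (0.5, 0.5)
```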
<br />
===Fitting a Logistic Regression===<br />
Logistic regression fits a distribution to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood: the maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the observed data <math>\displaystyle{x_{1},...,x_{n}}</math> under the model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
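The simplification above is an algebraic identity, so it is easy to check numerically. Here is a small Python sketch (the toy data and <math>\underline{\beta}</math> are arbitrary values chosen for illustration, not from the lecture) comparing the direct Bernoulli log-likelihood with the simplified form:

```python
import math

def log_lik_direct(beta, X, y):
    # sum of log p(x_i; beta), with p as defined above
    total = 0.0
    for x, yi in zip(X, y):
        eta = sum(b * xj for b, xj in zip(beta, x))
        p = math.exp(eta) / (1 + math.exp(eta))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def log_lik_simplified(beta, X, y):
    # sum of y_i * beta^T x_i - log(1 + exp(beta^T x_i))
    total = 0.0
    for x, yi in zip(X, y):
        eta = sum(b * xj for b, xj in zip(beta, x))
        total += yi * eta - math.log(1 + math.exp(eta))
    return total

# toy data: three 2-dimensional points with binary labels
X = [(1.0, 2.0), (0.5, -1.0), (-1.5, 0.3)]
y = [1, 0, 1]
beta = (0.4, -0.7)

assert abs(log_lik_direct(beta, X, y) - log_lik_simplified(beta, X, y)) < 1e-12
```

Both routes must agree for any choice of <math>\underline\beta</math>, since the simplification is exact.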
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used, which requires the second derivative in addition to the first. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{(exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website containing a Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained with less algebra by first using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math> to reduce the occurrences of <math>\underline{\beta}</math> in the score to one,<br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
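As a sanity check, the second derivative <math>-\sum_i \underline{x}_i\underline{x}_i^T\,P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> can be verified against a finite-difference derivative of the score. A scalar (single-feature) Python sketch with made-up numbers:

```python
import math

def p(beta, x):
    # logistic probability for a scalar feature
    return math.exp(beta * x) / (1 + math.exp(beta * x))

def score(beta, X, y):
    # dl/dbeta = sum (y_i - p_i) x_i, scalar case
    return sum((yi - p(beta, xi)) * xi for xi, yi in zip(X, y))

X = [1.0, -0.5, 2.0]
y = [1, 0, 0]
beta, h = 0.3, 1e-6

# central finite difference of the score vs. the closed-form Hessian
numeric = (score(beta + h, X, y) - score(beta - h, X, y)) / (2 * h)
analytic = -sum(xi * xi * p(beta, xi) * (1 - p(beta, xi)) for xi in X)

assert abs(numeric - analytic) < 1e-5
```

Note the second derivative is always negative here, which reflects the concavity of the log-likelihood.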
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^T W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Each Newton-Raphson step can therefore be viewed as a weighted linear regression on the iteratively recomputed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
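The WLS formula above can be illustrated with a one-predictor Python sketch (the data and weights below are made up): the estimator reduces to a ratio of weighted cross-moments, and the fitted coefficient satisfies the weighted normal equation exactly.

```python
# Weighted least squares with a single predictor: the general formula
# beta_hat = [sum w_i x_i x_i^T]^{-1} [sum w_i x_i y_i] reduces to a scalar ratio.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
w = [1.0, 0.5, 2.0, 1.0]   # arbitrary positive weights

beta_hat = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y)) / \
           sum(wi * xi * xi for wi, xi in zip(w, x))

# the weighted residuals are orthogonal to the weighted predictor,
# i.e. the normal equation sum w_i x_i (y_i - x_i beta_hat) = 0 holds
residual_moment = sum(wi * xi * (yi - xi * beta_hat) for wi, xi, yi in zip(w, x, y))
assert abs(residual_moment) < 1e-12
```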
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math> as a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, convergence is not guaranteed in general. The procedure will usually converge, since the log-likelihood function is concave. When it does not, only local convergence of the method can be proven, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, though, choosing an appropriate initial value is rarely a problem: it is uncommon for an initial guess to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving can be used to remedy non-convergence. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
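The pseudo code above can be sketched in Python. This is a minimal sketch, not a production implementation: the two-feature data (first feature fixed at 1 to play the role of an intercept), the hand-rolled 2×2 solve, and the fixed iteration count in place of the convergence test are all illustrative assumptions.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def irls(X, y, iters=25):
    """Newton-Raphson / IRLS for 2-class logistic regression.
    X: list of 2-tuples (first entry is the constant 1); y: 0/1 labels."""
    b = [0.0, 0.0]                        # step 1: beta <- 0
    for _ in range(iters):
        # steps 3-4: probabilities and weights at the current beta
        p = [sigmoid(b[0] * x0 + b[1] * x1) for x0, x1 in X]
        w = [pi * (1 - pi) for pi in p]
        # Newton step via the normal equations (X W X^T) delta = X (y - p);
        # with d = 2 the 2x2 system is solved directly
        a11 = sum(wi * x0 * x0 for wi, (x0, x1) in zip(w, X))
        a12 = sum(wi * x0 * x1 for wi, (x0, x1) in zip(w, X))
        a22 = sum(wi * x1 * x1 for wi, (x0, x1) in zip(w, X))
        g1 = sum((yi - pi) * x0 for yi, pi, (x0, x1) in zip(y, p, X))
        g2 = sum((yi - pi) * x1 for yi, pi, (x0, x1) in zip(y, p, X))
        det = a11 * a22 - a12 * a12
        b = [b[0] + (a22 * g1 - a12 * g2) / det,
             b[1] + (-a12 * g1 + a11 * g2) / det]
    return b

# overlapping toy data, so the MLE exists
X = [(1, -2.0), (1, -1.0), (1, -0.5), (1, 0.5), (1, 1.0), (1, 2.0)]
y = [0, 0, 1, 0, 1, 1]

beta = irls(X, y)
# at the optimum the score X(y - p) vanishes
p = [sigmoid(beta[0] * x0 + beta[1] * x1) for x0, x1 in X]
assert abs(sum(yi - pi for yi, pi in zip(y, p))) < 1e-8
assert abs(sum((yi - pi) * x1 for yi, pi, (x0, x1) in zip(y, p, X))) < 1e-8
```

The toy data is deliberately overlapping: on perfectly separable data the maximum likelihood estimate does not exist and the iterates diverge.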
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression we have so far considered only the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 or to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>; no assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly with the dimension.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically with the dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and logistic regression with an example, again using the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to fit a logistic regression and classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2\geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem, since we don't have the same simplification.<br />
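The K-class posteriors above can be computed with a small Python sketch (the two coefficient vectors below are hypothetical values for a 3-class model); the K posteriors always sum to 1:

```python
import math

def softmax_posteriors(betas, x):
    """Posteriors for the K-class logistic model: betas holds the K-1
    coefficient vectors; class K is the reference class."""
    scores = [math.exp(sum(b * xj for b, xj in zip(beta, x))) for beta in betas]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

# hypothetical 3-class model on 2-dimensional inputs
betas = [(0.5, -1.0), (-0.3, 0.8)]
x = (1.0, 2.0)
post = softmax_posteriors(betas, x)

assert len(post) == 3
assert abs(sum(post) - 1.0) < 1e-12
assert all(0.0 < pi < 1.0 for pi in post)
```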
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like least squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Because of the iterative nature of the solution, the problem is not convex and has no unique global minimum: the algorithm does not converge to a unique hyperplane, and the solution found depends on the size of the gap between the classes. If the classes are separable, the algorithm can be shown to converge to a separating hyperplane; the proof of this convergence is known as the ''perceptron convergence theorem''. For overlapping classes, however, convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1, but the right answer is +1. So the perceptron adjusts its line, and we try the next example. Eventually the perceptron will get all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];                        % labels from the table above<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';  % one column per training example<br />
b_0=0;                                     % intercept<br />
b=[1;1;1];                                 % initial weight vector (arbitrary guess)<br />
rho=.5;                                    % learning rate<br />
for j=1:100<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);        % positive iff example i is classified correctly<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);           % move the boundary toward the misclassified point<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0                  % a full pass with no mistakes: converged<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, terminating when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\underline{\beta}</math> to be normalized so that <math>\|\underline{\beta}\|=1</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to solve <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that takes a predetermined step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
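The update rule above, together with the termination test, can be sketched in Python (mirroring the earlier MATLAB example; the toy data, learning rate, and iteration cap below are arbitrary choices):

```python
def perceptron(X, y, rho=0.5, max_iter=1000):
    """Perceptron algorithm: repeatedly move (beta, beta0) toward a
    misclassified point until none remain (assumes separable data)."""
    d = len(X[0])
    beta, beta0 = [0.0] * d, 0.0
    for _ in range(max_iter):
        misclassified = False
        for x, yi in zip(X, y):
            # y_i (beta^T x_i + beta0) <= 0 means point i is on the wrong side
            if yi * (sum(b * xj for b, xj in zip(beta, x)) + beta0) <= 0:
                beta = [b + rho * yi * xj for b, xj in zip(beta, x)]
                beta0 += rho * yi
                misclassified = True
        if not misclassified:          # a full pass with no mistakes: done
            return beta, beta0
    return beta, beta0

# linearly separable toy data: the sign of the first feature determines the label
X = [(2.0, 1.0), (1.5, -0.5), (1.0, 2.0), (-1.0, 0.5), (-2.0, -1.0), (-1.5, 1.0)]
y = [1, 1, 1, -1, -1, -1]
beta, beta0 = perceptron(X, y)

# every training point ends up on the correct side of the boundary
assert all(yi * (sum(b * xj for b, xj in zip(beta, x)) + beta0) > 0
           for x, yi in zip(X, y))
```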
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if it is too large the algorithm may "skip over" the minimum it is trying to find and oscillate indefinitely around it.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Imagine standing on a peak and wanting to reach the ground as quickly as possible. Which direction should you step? Intuitively, it is the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and you unfortunately start in the middle, you will eventually arrive at the saddle point (a stationary point that is not the global minimum) and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm we dropped the summation over <math>i</math> (all data points). This is actually a variant of the original algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, in which the true gradient is approximated by evaluating it on a single training example. This means that <math>{\beta}</math> is improved using the computation for only one sample at a time. With a large data set, say a population database, it is very time-consuming to sum over millions of samples; with stochastic gradient descent we can process the data sample by sample and still get decent results in practice.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information-processing structure consisting of processing elements interconnected by signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal: the processing element's output signal <ref>
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref>. It is a multistage regression or classification model represented by a network. Figure 1 shows an example of a typical neural network, but neural networks can take many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there can be '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1).<br />
<br />
===Activation Function===<br />
Activation function is a term that is frequently used in classification by neural networks. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of the input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]], so its derivative cannot be taken. Thus, we replace it by a smooth continuous function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The function <math>\displaystyle \sigma </math> can have any form, but typically the logistic form <math>\sigma(a)=\frac {1}{1+e^{-a}}</math> is used.<br />
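The reason a smooth activation matters is that back-propagation needs derivatives. A quick numerical sketch (made-up evaluation point) checks the closed-form derivative of the logistic activation, <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>, against a finite difference:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# unlike sign(), the logistic activation has a usable derivative everywhere:
# sigma'(a) = sigma(a) * (1 - sigma(a))
a, h = 0.7, 1e-6
numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
analytic = sigmoid(a) * (1 - sigmoid(a))
assert abs(numeric - analytic) < 1e-9

# it squashes any real input into (0, 1) and equals 0.5 at the origin
assert sigmoid(0.0) == 0.5
assert 0.0 < sigmoid(-10.0) < sigmoid(10.0) < 1.0
```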
<br />
By assigning weights to the connectors in the neural network (see the diagram above), we weight the input that comes into each perceptron to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called a '''feed-forward neural network'''.<br />
<br />
===Back-propagation===<br />
For quite a while the neural network model was just an idea, since no algorithm for training the model existed until Geoffrey Hinton in 1986 <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> came up with an algorithm called '''back-propagation'''. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we discussed perceptrons, we applied gradient descent algorithms to optimize the weights. Back-propagation uses the same idea of gradient descent to train a neural network. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to multiple output units and thus turned into a classification problem.<br />
<br />
[[File:backpropagation.png|300px|]]<br />
<br />
Note that we make a distinction between the output weights <math>\displaystyle (w_i)</math> and the hidden weights <math>\displaystyle (u_{jl})</math>. <br />
<br><br>Within each perceptron we have a function, called the '''activation function''', that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i=\sigma(a_i)</math>. The <math>\displaystyle z_i</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true value and the network output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First, find the derivative of the model error with respect to the output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
<br />
<br>'''Now we need to find the derivative of the model error with respect to the hidden weights <math>\displaystyle u_{jl}</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
[[File:propagationhidden.png|300px|]]<br />
<br />
Notice that the weighted sums of the outputs of the perceptrons at layer <math>\displaystyle l</math> are the inputs into the perceptrons at layer <math>\displaystyle j</math>, and so on for all hidden layers. <br />
<br />
So, using the chain rule, and defining <math>\delta_j \equiv \frac{\partial err}{\partial a_j}</math>,<br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}</math><br />
<br><math>\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l</math><br />
<br />
where the weighted sums at each layer are<br />
<math>a_j=\sum_l u_{jl} z_l</math><br />
<math>a_i=\sum_j u_{ij} z_j</math><br />
<br />
Applying the chain rule once more, <math>\delta_j=\frac{\partial err}{\partial a_j}=\sum_i \frac{\partial err}{\partial a_i}\frac{\partial a_i}{\partial a_j}=\sigma'(a_j)\sum_i u_{ij}\delta_i</math>, so the <math>\displaystyle \delta</math>'s at one layer are computed from the <math>\displaystyle \delta</math>'s at the layer above — which is why the algorithm is called ''back-propagation''.<br />
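These gradients can be verified numerically. Below is a minimal Python sketch (the data point, weights, and the finite-difference helper are illustrative assumptions, not part of the lecture) for a network with one hidden layer, where the inputs to the hidden weights are simply the input features.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def backprop_gradients(x, y, U, w):
    # Forward pass: a_j = sum_l u_jl x_l, z_j = sigma(a_j), y_hat = sum_j w_j z_j
    a = [sum(u * xi for u, xi in zip(row, x)) for row in U]
    z = [sigmoid(aj) for aj in a]
    y_hat = sum(wj * zj for wj, zj in zip(w, z))
    # err = (y - y_hat)^2, so d err / d y_hat = -2 (y - y_hat)
    d_yhat = -2.0 * (y - y_hat)
    # Output weights: d err / d w_j = d_yhat * z_j
    gw = [d_yhat * zj for zj in z]
    # delta_j = d err / d a_j = d_yhat * w_j * sigma'(a_j), with sigma' = z (1 - z)
    delta = [d_yhat * wj * zj * (1.0 - zj) for wj, zj in zip(w, z)]
    # Hidden weights: d err / d u_jl = delta_j * z_l (here z_l = x_l, the inputs)
    gU = [[dj * xl for xl in x] for dj in delta]
    return gw, gU

def numeric_grad_w(x, y, U, w, j, eps=1e-6):
    # Finite-difference check on one output weight (hypothetical helper)
    def err(wv):
        z = [sigmoid(sum(u * xi for u, xi in zip(row, x))) for row in U]
        return (y - sum(wj * zj for wj, zj in zip(wv, z))) ** 2
    wp = list(w); wp[j] += eps
    wm = list(w); wm[j] -= eps
    return (err(wp) - err(wm)) / (2 * eps)

# Made-up data point and weights
x, y = [0.3, -0.2], 1.0
U = [[0.1, 0.4], [-0.3, 0.2]]   # hidden weights u_jl
w = [0.5, -0.7]                 # output weights w_j
grad_w, grad_U = backprop_gradients(x, y, U, w)
```

The analytic gradients agree with central differences to high precision, which is a quick way to catch sign errors in the derivation.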
<br />
==Notes==<br />
<references/></div>
<hr />
<div>
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, natural language processing, and so on. <br />
<br />
In classification, we attempt to approximate a function <math>\,h</math> by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented with its features <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
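The empirical error rate is straightforward to compute. Here is a short sketch (Python; the threshold classifier and labels are made-up toy values):

```python
def empirical_error_rate(h, X, Y):
    # Fraction of training points that classifier h labels incorrectly:
    # L_hat(h) = (1/n) * sum_i I(h(X_i) != Y_i)
    n = len(X)
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / n

# Toy example: a made-up threshold classifier on 1-d inputs
h = lambda x: 1 if x > 0 else 0
X = [-2, -1, 0.5, 3, -0.5]
Y = [0, 1, 1, 1, 0]
rate = empirical_error_rate(h, X, Y)   # h misclassifies only x = -1
```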
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then to assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximal over all members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two possible values, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the probability <math>\,r(X)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes' formula'' we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal with respect to the true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0; that is, we predict that he will fail the course.<br />
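The calculation above can be reproduced in a few lines (Python sketch; the class-conditional likelihoods 0.05 and 0.2 are assumed values, chosen so that the numerator and denominator match the quoted 0.025 and 0.125, since the exact entries come from the data table):

```python
def bayes_posterior(lik1, lik0, prior1=0.5, prior0=0.5):
    # r(x) = P(Y=1|X=x) via Bayes' formula
    num = lik1 * prior1
    return num / (num + lik0 * prior0)

# Assumed likelihoods P(X=(0,1,0)|Y=1) = 0.05 and P(X=(0,1,0)|Y=0) = 0.2,
# which reproduce 0.025 / 0.125 = 0.2 from the example above.
r = bayes_posterior(lik1=0.05, lik0=0.2)
label = 1 if r > 0.5 else 0   # Bayes classification rule
```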
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because in the Bayes formula discussed above it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>; this makes the Bayes rule impractical in most situations.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first treats probability as something that changes based on observation, while the second treats probability as an objective quantity. In fact, they represent two different schools of statistics.<br />
<br />
Throughout the history of statistics there have been two major schools: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables with a given distribution, and probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or belief. For example, a Bayesian can predict tomorrow's weather, such as a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist method, one does not see the man (the object) but instead sees photos (the labels) of him, and judges from them whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although it is the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probabilities and class conditional densities are usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\,H</math> and find <math>\,h^* \in H</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each class has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the class covariances <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, the decision boundary is always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_k</math> and <math>\,\mu_l</math>.<br />
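The derivation can be checked numerically. The sketch below (Python with NumPy; the means, covariance, and priors are made-up values) computes the boundary coefficients <math>\,a=\Sigma^{-1}(\mu_k-\mu_l)</math> and <math>\,b=\log(\pi_k/\pi_l)-\frac{1}{2}(\mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l)</math> from the final equation above, and verifies that with equal priors the midpoint of the two means lies on the boundary <math>\,a^\top x+b=0</math>.

```python
import numpy as np

def lda_boundary(mu_k, mu_l, Sigma, pi_k, pi_l):
    # Coefficients a, b of the linear boundary a^T x + b = 0 derived above
    Sinv = np.linalg.inv(Sigma)
    a = Sinv @ (mu_k - mu_l)
    b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)
    return a, b

# Made-up class means and shared covariance
mu_k = np.array([1.0, 2.0])
mu_l = np.array([3.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
a, b = lda_boundary(mu_k, mu_l, Sigma, 0.5, 0.5)

# With equal priors, the midpoint of the means satisfies a^T x + b = 0
midpoint = (mu_k + mu_l) / 2
```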
<br />
===QDA===<br />
The concept is the same: we find the boundary where the error rates for classification between the classes are equal, except that the assumption that each class has the same covariance <math>\,\Sigma</math> (equal to the mean of the <math>\Sigma_k \forall k</math>) is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes, with general form <math>\,x^\top ax+b^\top x+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariances of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
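The theorem translates directly into code. Below is a sketch (Python with NumPy; the two classes and their parameters are made-up values) of the rule <math>\,h(x) = \arg\max_{k} \delta_k(x)</math> using the quadratic form of <math>\,\delta_k</math>:

```python
import numpy as np

def delta_quadratic(x, mu, Sigma, pi):
    # delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(pi))

def classify_bayes(x, mus, Sigmas, pis):
    # h(x) = argmax_k delta_k(x)
    deltas = [delta_quadratic(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
    return int(np.argmax(deltas))

# Made-up two-class example with identity covariances (Case 1 below):
# the rule reduces to picking the nearest class mean.
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.eye(2)]
pis = [0.5, 0.5]
```

With equal identity covariances and equal priors, <math>\,\delta_k</math> is maximized by the class whose mean is closest in Euclidean distance.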
<br />
===In practice===<br />
We do not know the true values of the priors and class parameters, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in their place, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
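These estimators can be sketched as follows (Python with NumPy; the function name and the toy data are illustrative, not from the lecture):

```python
import numpy as np

def estimate_parameters(X, y, K):
    # Sample estimates of pi_k, mu_k, Sigma_k and the pooled covariance
    n = len(y)
    pis, mus, Sigmas, ns = [], [], [], []
    for k in range(K):
        Xk = X[y == k]                  # rows of X belonging to class k
        nk = len(Xk)
        mu = Xk.mean(axis=0)            # mu_hat_k = (1/n_k) sum_{i: y_i=k} x_i
        D = Xk - mu
        Sigmas.append(D.T @ D / nk)     # ML estimate: divide by n_k
        pis.append(nk / n)              # pi_hat_k = n_k / n
        mus.append(mu)
        ns.append(nk)
    # Pooled covariance: sum_r n_r Sigma_r / sum_l n_l
    pooled = sum(nk * S for nk, S in zip(ns, Sigmas)) / n
    return pis, mus, Sigmas, pooled

# Toy data: two classes in two dimensions
X = np.array([[0., 0.], [1., 1.], [2., 2.], [5., 5.]])
y = np.array([0, 0, 1, 1])
pis, mus, Sigmas, pooled = estimate_parameters(X, y, K=2)
```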
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}\log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. The class for which this prior-adjusted distance is smallest maximizes <math>\,\delta_k</math>, and according to the theorem we then classify the point into that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that the data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
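This transformation can be sketched as follows (Python with NumPy; the covariance, point, and mean are made-up values): after computing <math>\,x^* = S^{-\frac{1}{2}}U^\top x</math> from the eigendecomposition <math>\,\Sigma = USU^\top</math>, the Mahalanobis distance in the original space equals the Euclidean distance in the transformed space.

```python
import numpy as np

Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])   # made-up common covariance
S, U = np.linalg.eigh(Sigma)                  # Sigma = U diag(S) U^T

def sphere(v):
    # x* = S^{-1/2} U^T x
    return np.diag(S ** -0.5) @ U.T @ v

x = np.array([1.0, 2.0])
mu = np.array([0.5, -0.5])
# Mahalanobis distance in the original space ...
d2_mahalanobis = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
# ... equals the squared Euclidean distance after the transformation
d2_euclidean = np.sum((sphere(x) - sphere(mu)) ** 2)
```

After this transformation the problem is exactly Case 1, so classification reduces to comparing prior-adjusted Euclidean distances to the transformed class centers.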
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for classification to be applicable using this method, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose you have two classes with different shapes and consider transforming them to the same shape. Given a data point, which transformation should you use to decide its class? If you use the transformation of class A, you have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class against the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: Each difference <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters, since the symmetric matrix <math>\,a</math> has <math>\frac{d(d+1)}{2}</math> free entries. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
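The two counts can be compared with a couple of one-line helpers (Python sketch; the function names are ours):

```python
def lda_params(K, d):
    # (K-1) pairwise differences, each a^T x + b with d + 1 parameters
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # Each difference x^T a x + b^T x + c: the symmetric matrix a has
    # d(d+1)/2 entries, plus d for b and 1 for c, i.e. d(d+3)/2 + 1 total
    return (K - 1) * (d * (d + 3) // 2 + 1)
```

For example, with <math>\,K=3</math> classes in <math>\,d=10</math> dimensions, LDA needs 22 parameters while QDA needs 132, which illustrates why QDA is far less robust in high dimensions.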
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
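The boundary plotted above is the zero level set of the discriminant function <math>k + l^{T}x + x^{T}Qx</math> built from <code>coeff(1,2)</code>. As an illustration of classifying by the sign of this function, here is a Python/NumPy sketch; the coefficient values below are made-up placeholders, not the values fitted to the 2_3 data.<br />

```python
import numpy as np

# Hypothetical coefficients of the kind returned in coeff(1,2):
k = 0.5                                  # constant term
l = np.array([1.0, -2.0])                # linear term
Q = np.array([[0.3, 0.1],
              [0.1, -0.2]])              # quadratic term

def quad_discriminant(x):
    """Evaluate k + l'x + x'Qx; its sign decides between the two classes."""
    x = np.asarray(x, dtype=float)
    return k + l @ x + x @ Q @ x

def classify_point(x):
    # Positive side of the boundary -> class 1, negative side -> class 2.
    return 1 if quad_discriminant(x) > 0 else 2

print(classify_point([0.0, 0.0]))  # the constant term dominates at the origin
```

The same sign test, applied pointwise, is what <code>classify</code> does internally once the boundary coefficients are known.<br />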
<br />
'''Recall: An analysis of the <code>princomp</code> function in Matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently; the Matlab help file on <code>princomp</code> gives the details. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data from Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can verify that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code>.<br />
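The same agreement can be sketched outside Matlab. Here is a Python/NumPy illustration (random data standing in for 2_3) that the SVD of the centered data and the eigendecomposition of the sample covariance give the same principal component scores, up to the sign of each component:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 64))          # rows = observations, as princomp expects

# Center by subtracting column means, exactly as princomp does internally.
Xc = X - X.mean(axis=0)

# SVD route: X_c = U S V^T; the columns of V are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = Xc @ Vt.T                  # representation in PC space

# Covariance route: eigendecomposition of the sample covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]       # eigh returns ascending order
eigvecs = eigvecs[:, order]
scores_eig = Xc @ eigvecs

# The two score matrices agree up to the sign of each component.
agree = np.allclose(np.abs(scores_svd), np.abs(scores_eig))
```

The sign ambiguity is expected: each principal direction is only defined up to a factor of -1.<br />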
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^{d}</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate, where <math>\,v</math> is a diagonal <math>d \times d</math> matrix with diagonal entries <math>v_1, \dots, v_d</math>.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
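The identity behind the trick, <math>\underline{w}^{*T}x^{*} = \underline{w}^{T}x + \sum_{j} v_{j}x_{j}^{2}</math>, can be checked numerically. Here is a Python/NumPy sketch with arbitrary <math>w</math>, <math>v</math> and <math>x</math>; the quadratic term is diagonal, matching the construction above (no cross terms <math>x_{i}x_{j}</math>):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
w = rng.normal(size=d)       # linear coefficients
v = rng.normal(size=d)       # diagonal quadratic coefficients
x = rng.normal(size=d)       # a data point

# Augmented vectors as defined in the text.
w_star = np.concatenate([w, v])          # [w_1..w_d, v_1..v_d]
x_star = np.concatenate([x, x**2])       # [x_1..x_d, x_1^2..x_d^2]

lhs = w_star @ x_star                    # linear function in the augmented space
rhs = w @ x + x @ np.diag(v) @ x         # quadratic function in the original space
ok = np.isclose(lhs, rhs)
```

A linear decision boundary in the augmented space therefore corresponds to a quadratic boundary in the original space.<br />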
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the prior knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class having a possibly different size. To separate the two classes we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive definite matrices and so it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
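This closed form can be checked numerically against the top eigenvector of <math>S_{W}^{-1}S_{B}</math>. Here is a Python/NumPy sketch, not the course's Matlab code; the means and covariance are arbitrary illustrative choices:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
mu1, mu2 = np.array([1.0, 1.0]), np.array([5.0, 3.0])
Sigma = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal(mu1, Sigma, size=300)
X2 = rng.multivariate_normal(mu2, Sigma, size=300)

# Sample means, within-class and between-class covariance.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
S_B = np.outer(m1 - m2, m1 - m2)

# Route 1: top eigenvector of S_W^{-1} S_B.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w_eig = eigvecs[:, np.argmax(eigvals.real)].real

# Route 2: the closed form S_W^{-1}(mu1 - mu2).
w_closed = np.linalg.solve(S_W, m1 - m2)

# The two directions are proportional (compare unit vectors up to sign).
u1 = w_eig / np.linalg.norm(w_eig)
u2 = w_closed / np.linalg.norm(w_closed)
proportional = np.allclose(np.abs(u1), np.abs(u2))
```

In practice the closed form is cheaper: it needs one linear solve rather than an eigendecomposition.<br />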
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced by Matlab.<br />
<br />
The following is the code to produce the figure step by step, with explanations of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathworks.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (The scatter matrices are left unnormalized here, so that the decomposition derived below holds exactly.)<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant; since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is another, more general form for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain (the middle cross terms vanish since <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i}) = \mathbf{0}</math>)<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
The first term is the within class covariance <math>\mathbf{S}_{W}</math>; we define the second term to be<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
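This decomposition can be verified numerically. Here is a Python/NumPy sketch using unnormalized scatter matrices (three arbitrary synthetic classes; the identity <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> holds exactly in this convention):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
# Three classes with different means (arbitrary choices for illustration).
classes = [rng.normal(loc=m, size=(50, 2)) for m in ([0, 0], [4, 1], [2, 5])]

X = np.vstack(classes)
mu = X.mean(axis=0)                       # total mean

S_W = np.zeros((2, 2))
S_B = np.zeros((2, 2))
for Xi in classes:
    mui = Xi.mean(axis=0)
    D = Xi - mui
    S_W += D.T @ D                        # within-class scatter (unnormalized)
    S_B += len(Xi) * np.outer(mui - mu, mui - mu)   # between-class scatter

D_total = X - mu
S_T = D_total.T @ D_total                 # total scatter

decomposes = np.allclose(S_T, S_W + S_B)
```

Note the <math>n_{i}</math> weight on each between-class term; dropping it breaks the identity.<br />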
<br />
Recall that in the two-class problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\end{align}<br />
</math><br />
In the balanced case <math>\,n_{1}=n_{2}</math> we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = -(\mathbf{\mu}_{2}-\mathbf{\mu})</math>, so this equals<br />
:<math><br />
\begin{align}<br />
2\left[(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}\right]<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Apparently, they are very similar.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this optimization problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues of<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
In fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B})\leq k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues in<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
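The eigenvalue computation above can be checked numerically. Below is a small NumPy sketch (not course code; the three-class synthetic data, dimensions, and variable names are invented for illustration):<br />

```python
import numpy as np

# Synthetic data: k = 3 classes in d = 4 dimensions, n_per = 50 points each.
rng = np.random.default_rng(0)
k, d, n_per = 3, 4, 50
class_means = np.array([[0., 0., 0., 0.], [3., 0., 0., 0.], [0., 3., 0., 0.]])
data = [rng.normal(m, 1.0, size=(n_per, d)) for m in class_means]

mu_i = [x.mean(axis=0) for x in data]     # per-class sample means
mu = np.vstack(data).mean(axis=0)         # overall sample mean

# Within-class and between-class scatter matrices.
S_W = sum((x - m).T @ (x - m) for x, m in zip(data, mu_i))
S_B = sum(n_per * np.outer(m - mu, m - mu) for m in mu_i)

# Eigenproblem S_W^{-1} S_B w = lambda w; keep the top k-1 directions.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs.real[:, order[:k - 1]]        # d x (k-1) transformation matrix
```

Since <math>\mathbf{S}_{B}</math> has rank at most <math>k-1</math>, only <math>k-1</math> of the eigenvalues are (numerically) nonzero, which is why the projection has <math>k-1</math> columns.<br />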
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
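As a sanity check of the closed-form solution and the hat matrix, here is a short NumPy sketch (the toy data is invented, not from the course):<br />

```python
import numpy as np

# Toy data: n = 5 observations of a single input, plus an intercept column.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
X = np.column_stack([np.ones_like(x), x])   # n x (d+1) design matrix

# beta_hat = (X^T X)^{-1} X^T y, via the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T; the fitted values are y_hat = H y.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
```

Because <math>\mathbf{H}</math> is a projection matrix, applying it twice changes nothing: <math>\mathbf{H}\mathbf{H}=\mathbf{H}</math>.<br />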
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
 >>x=[sample';ones(1,400)];<br />
Construct x by adding a row of ones to the transposed data, so that each column is an input vector with a constant 1 for the intercept.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
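A minimal numerical sketch of these two posteriors (NumPy rather than the course's Matlab; the coefficient and input values are arbitrary illustrations):<br />

```python
import numpy as np

def p_class1(beta, x):
    """P(Y=1 | X=x) = exp(beta^T x) / (1 + exp(beta^T x))."""
    t = np.exp(beta @ x)
    return t / (1.0 + t)

beta = np.array([0.5, -1.0])   # made-up coefficients
x = np.array([1.0, 2.0])       # made-up input
p1 = p_class1(beta, x)
p0 = 1.0 - p1                  # equals 1 / (1 + exp(beta^T x))
```
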
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
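The simplification above can be verified numerically by comparing the two forms of the log-likelihood (a hedged NumPy sketch; the data and coefficients are invented):<br />

```python
import numpy as np

def loglik_product_form(beta, X, y):
    """Log-likelihood as sum_i log p(x_i; beta), using the combined density."""
    p = np.exp(X @ beta) / (1.0 + np.exp(X @ beta))   # P(Y=1 | x_i)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def loglik_simplified(beta, X, y):
    """The simplified form: sum_i y_i beta^T x_i - log(1 + exp(beta^T x_i))."""
    eta = X @ beta
    return np.sum(y * eta - np.log(1.0 + np.exp(eta)))

X = np.array([[1.0, -1.0], [1.0, 0.5], [1.0, 2.0]])   # rows are x_i^T
y = np.array([0.0, 1.0, 1.0])
beta = np.array([0.1, 0.7])
```
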
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x_i)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math><br />
* <math>P(Y=0 | X=x_i)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math> you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], it's a very useful website including a Matrix Reference Manual that you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> to one via the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math><br />
<br />
and then evaluating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
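The two forms of the Newton step are algebraically identical; a quick NumPy check with random, made-up data (<math>X</math> is <math>d\times n</math>, following the notes' convention):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 8
X = rng.normal(size=(d, n))                   # columns are the inputs x_i
Y = rng.integers(0, 2, size=n).astype(float)  # 0/1 labels
beta_old = rng.normal(size=d)

P = 1.0 / (1.0 + np.exp(-(X.T @ beta_old)))   # P(x_i; beta_old)
W = np.diag(P * (1.0 - P))

# Newton step written two ways:
newton_form = beta_old + np.linalg.solve(X @ W @ X.T, X @ (Y - P))
Z = X.T @ beta_old + np.linalg.solve(W, Y - P)
wls_form = np.linalg.solve(X @ W @ X.T, X @ W @ Z)
```
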
<br />
Recall that linear regression by least squares solves <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
for which we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^{T}W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Each Newton-Raphson step is thus a weighted linear regression on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we constructed the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case; however, it does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. When it does not, only local convergence of the method can be proved, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, choosing an appropriate initial value is rarely a problem: an initial point is seldom so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will resolve non-convergence. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_i,i</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
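The pseudo code above can be sketched in Python/NumPy (an illustrative reimplementation, not course code; the one-dimensional data set is made up and deliberately non-separable so that the iteration converges to a finite <math>\underline{\beta}</math>):<br />

```python
import numpy as np

def fit_logistic_irls(X, Y, tol=1e-8, max_iter=50):
    """Newton-Raphson / IRLS for logistic regression.

    X is d x n (first row all ones for the intercept), Y holds 0/1 labels.
    """
    d = X.shape[0]
    beta = np.zeros(d)                                      # step 1
    for _ in range(max_iter):
        P = 1.0 / (1.0 + np.exp(-(X.T @ beta)))             # step 3
        W = np.diag(P * (1.0 - P))                          # step 4
        Z = X.T @ beta + np.linalg.solve(W, Y - P)          # step 5
        beta_new = np.linalg.solve(X @ W @ X.T, X @ W @ Z)  # step 6
        if np.max(np.abs(beta_new - beta)) < tol:           # step 7
            return beta_new
        beta = beta_new
    return beta

x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
Y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.vstack([np.ones_like(x), x])
beta = fit_logistic_irls(X, Y)
```

At convergence the first derivative <math>X(\underline{Y}-\underline{P})</math> is (numerically) zero, which is exactly the maximum-likelihood condition derived earlier.<br />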
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to classify the data with logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K classes. The model is specified by K - 1 log-odds terms, where the Kth class, appearing in the denominator, can be chosen arbitrarily as the reference class.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general, the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.<br />
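A sketch of the K-class posteriors (NumPy; the coefficient vectors are arbitrary illustrations with K = 3):<br />

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """Posteriors of the K-class logistic model.

    betas holds the K-1 coefficient vectors (class K is the reference);
    returns the vector [P(Y=1|x), ..., P(Y=K|x)].
    """
    scores = np.array([b @ x for b in betas])
    denom = 1.0 + np.sum(np.exp(scores))
    return np.append(np.exp(scores) / denom, 1.0 / denom)  # last entry: class K

betas = [np.array([0.2, 1.0]), np.array([-0.5, 0.3])]   # K - 1 = 2 made-up vectors
x = np.array([1.0, 0.5])
p = multiclass_posteriors(betas, x)
```

The shared denominator <math>1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)</math> is what guarantees the K posteriors sum to one.<br />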
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Because of the iterative nature of the solution, the problem has no unique global minimum (the criterion is not convex). The algorithm does not converge to a unique hyperplane, and the solution found depends on the starting values and on the size of the gap between the classes. If the classes are linearly separable, the algorithm is guaranteed to converge to a separating hyperplane in a finite number of steps; the proof of this is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];             % labels<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';  % columns are feature vectors<br />
b_0=0;                          % initial intercept<br />
b=[1;1;1];                      % initial weight vector (arbitrary guess)<br />
rho=.5;                         % learning rate<br />
for j=1:100                     % at most 100 passes over the data<br />
    changed=0;<br />
    for i=1:6<br />
        d=(b'*x(:,i)+b_0)*y(i); % negative iff point i is misclassified<br />
        if d<0<br />
            b=b+rho*x(:,i)*y(i);    % move the boundary toward the point<br />
            b_0=b_0+rho*y(i);<br />
            changed=1;<br />
        end<br />
    end<br />
    if changed==0               % converged: no misclassified points<br />
        break;<br />
    end<br />
end<br />
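For comparison, the same update loop can be sketched in Python/NumPy (an illustrative translation of the Matlab above, using the truth table from the example; here a point exactly on the boundary is also treated as misclassified):<br />

```python
import numpy as np

# Feature vectors from the table above, one per column; labels are +1/-1.
X = np.array([[1, 1, 1, 0, 0, 1],
              [0, 0, 1, 0, 1, 1],
              [0, 1, 0, 1, 1, 1]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

b = np.ones(3)   # initial guess for the weights
b0 = 0.0         # intercept
rho = 0.5        # learning rate

for _ in range(1000):                         # passes over the data
    changed = False
    for i in range(X.shape[1]):
        if (b @ X[:, i] + b0) * y[i] <= 0:    # point i misclassified
            b += rho * y[i] * X[:, i]         # nudge the boundary toward it
            b0 += rho * y[i]
            changed = True
    if not changed:                           # a full pass with no mistakes: done
        break
```

Since this data set is linearly separable, the loop terminates with every training point on the correct side of the boundary.<br />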
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0=1</math> is a constant input whose weight <math>\,\beta_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data; <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a weighted linear combination of these inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, terminating when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\|\underline{\beta}\|=1</math>, so that this is the signed length of the projection). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step in the direction opposite to the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
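As a one-dimensional illustration of gradient descent itself (the quadratic objective below is a toy example, not the perceptron criterion):

```python
# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
# Each step moves against the gradient by a fixed learning rate rho.
def gradient_descent(grad, x0, rho=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - rho * grad(x)
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
assert abs(x_min - 3) < 1e-6   # converges to the minimizer x = 3
```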
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
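The update rule above can be sketched in a few lines of Python. This is a minimal sketch with made-up toy data; labels are coded <math>\,y_i \in \{-1,+1\}</math> as in the derivation above, and the function name is our own.

```python
# Minimal perceptron: sweep the data, update (beta, beta0) on each mistake.
def perceptron(X, y, rho=0.1, max_iter=1000):
    beta = [0.0] * len(X[0])
    beta0 = 0.0
    for _ in range(max_iter):
        mistakes = 0
        for xi, yi in zip(X, y):
            s = sum(b * x for b, x in zip(beta, xi)) + beta0
            if yi * s <= 0:                      # misclassified (or on boundary)
                beta = [b + rho * yi * x for b, x in zip(beta, xi)]
                beta0 += rho * yi
                mistakes += 1
        if mistakes == 0:                        # converged: all points correct
            return beta, beta0
    return beta, beta0                           # may not converge if not separable

X = [(2, 2), (3, 3), (-2, -1), (-3, -2)]         # toy, linearly separable data
y = [1, 1, -1, -1]
beta, beta0 = perceptron(X, y)
```

If the classes are not separable, the loop simply runs out its iteration budget, matching the convergence caveats listed below.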
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions for the Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if it is too large, the algorithm may "skip over" the minimum it is trying to find and oscillate forever between two points on either side of it.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on the gradient descent algorithm'''<br />
Imagine standing on a mountain peak and wanting to reach the ground as fast as possible. Which direction should you step? Intuitively, it is the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and you unfortunately start in the middle, you may end up at the saddle point, where the gradient is zero but which is not a minimum, and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm, we dropped the summation over <math>i</math> (all data points). This is actually an alternative to the original algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, in which we approximate the true gradient by evaluating it on only a single training example. This means that <math>{\beta}</math> is improved by the computation on only one sample. When the data set is large, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent, we can process the problem sample by sample and still get decent results in practice.<br />
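A one-step sketch of the stochastic variant on the perceptron criterion (toy data; `sgd_step` is our own illustrative name — each step uses one misclassified point rather than the sum over all of them):

```python
import random

random.seed(0)

# One stochastic step: pick a single misclassified point and update with it,
# instead of summing the gradient over all misclassified points (batch).
def sgd_step(beta, beta0, X, y, rho=0.1):
    wrong = [(xi, yi) for xi, yi in zip(X, y)
             if yi * (sum(b * x for b, x in zip(beta, xi)) + beta0) <= 0]
    if not wrong:
        return beta, beta0                  # nothing misclassified: done
    xi, yi = random.choice(wrong)
    beta = [b + rho * yi * x for b, x in zip(beta, xi)]
    return beta, beta0 + rho * yi

beta, beta0 = [0.0, 0.0], 0.0
X = [(1, 2), (2, 1), (-1, -2), (-2, -1)]    # toy, linearly separable data
y = [1, 1, -1, -1]
for _ in range(100):
    beta, beta0 = sgd_step(beta, beta0, X, y)
```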
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected by signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal - the processing element's output signal <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref>. It is a multistage regression or classification model represented by a network. Figure 1 shows an example of a typical neural network, but neural networks can take many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there can be '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1).<br />
<br />
===Activation Function===<br />
The activation function is a term frequently used in classification by NN. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of the input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]], so its derivative cannot be taken. Thus, we replace it by a smooth continuous function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The function <math>\displaystyle \sigma </math> can have any form, but typically the logistic (logit) form <math>\sigma(a)=\frac {1}{1+e^{-a}}</math> is used.<br />
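A minimal sketch of the logistic activation and its derivative; the closed form <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math> is standard and is what makes this choice convenient for the back-propagation derivatives below.

```python
import math

# Logistic (sigmoid) activation: a smooth, differentiable stand-in for sign.
def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

# Its derivative has the convenient closed form sigma(a) * (1 - sigma(a)).
def sigma_prime(a):
    s = sigma(a)
    return s * (1.0 - s)

# Sanity checks: sigma(0) = 1/2, and the closed-form derivative agrees
# with a central finite-difference approximation.
assert sigma(0) == 0.5
h = 1e-6
assert abs((sigma(0.3 + h) - sigma(0.3 - h)) / (2 * h) - sigma_prime(0.3)) < 1e-8
```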
<br />
By assigning weights to the connectors in the neural network (see diagram above), we weight the input that comes into each perceptron to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called a '''Feed-Forward Neural Network'''.<br />
<br />
===Back-propagation===<br />
For quite a while the neural network model was just an idea, since no algorithms had been invented for training the model until Geoffrey Hinton in 1986 <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> came up with an algorithm called '''back-propagation'''. After that, a number of other training algorithms and various configurations of Neural Networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied gradient descent algorithms for optimizing the weights. Back-propagation uses this idea of gradient descent to train neural network. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output units and thus turn into a classification problem.<br />
<br />
[[File:backpropagation.png|400px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each perceptron we have a function that takes inputs <math>\displaystyle a_i</math> and produces outputs <math>\displaystyle z_i</math>, which we call the activation function: <math>\displaystyle z_i=\sigma(a_i)</math>. The <math>\displaystyle z_i</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true value and the resulting output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find derivative of the model error with respect to output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
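This output-weight gradient can be checked numerically with a finite-difference approximation (the weights, hidden outputs, and target below are toy numbers, illustrative only):

```python
# Check d(err)/d(w_i) = -2 (y - yhat) z_i for err = (y - yhat)^2,
# where yhat = sum_i w_i z_i (toy numbers).
w = [0.4, -0.3, 0.7]      # output weights w_i
z = [1.0, 2.0, -1.0]      # hidden-layer outputs z_i
y = 0.5                   # target

def err(w):
    yhat = sum(wi * zi for wi, zi in zip(w, z))
    return (y - yhat) ** 2

yhat = sum(wi * zi for wi, zi in zip(w, z))
analytic = [-2 * (y - yhat) * zi for zi in z]

# Central finite differences agree with the analytic gradient.
h = 1e-6
for i in range(3):
    wp = list(w); wp[i] += h
    wm = list(w); wm[i] -= h
    numeric = (err(wp) - err(wm)) / (2 * h)
    assert abs(numeric - analytic[i]) < 1e-6
```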
<br />
<br>'''Now we need to find the derivative of the model error with respect to the hidden weights <math>\displaystyle u_i</math>'''<br />
<br>Consider the following diagram, which opens up the hidden layers of the neural network:<br />
<br />
The diagram labels three consecutive layers by indices <math>\displaystyle l</math>, <math>\displaystyle j</math>, and <math>\displaystyle i</math>. A unit in layer <math>\displaystyle j</math> receives the outputs <math>\displaystyle z_l</math> of layer <math>\displaystyle l</math> through weights <math>\displaystyle u_{jl}</math>, forms the weighted sum <math>\displaystyle a_j</math>, and emits <math>\displaystyle z_j=\sigma(a_j)</math>; likewise a unit in layer <math>\displaystyle i</math> receives the <math>\displaystyle z_j</math> through weights <math>\displaystyle u_{ij}</math>:<br />
<br />
<math>\sum_l u_{jl} z_l=a_j</math><br />
<math>\sum_j u_{ij} z_j=a_i</math><br />
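A forward pass through one hidden layer, following these equations (the weights and inputs are toy numbers, illustrative only):

```python
import math

# Forward pass: a_j = sum_l u_jl * z_l,  z_j = sigma(a_j),  yhat = sum_j w_j * z_j.
def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

x = [1.0, -2.0]                          # inputs (the z_l of the first layer)
U = [[0.5, -0.1],                        # u_jl: hidden weights, one row per unit j
     [0.3, 0.8]]
w = [1.0, -1.0]                          # w_j: output weights

a = [sum(u * xl for u, xl in zip(row, x)) for row in U]   # weighted sums a_j
z = [sigma(aj) for aj in a]                               # activations z_j
yhat = sum(wj * zj for wj, zj in zip(w, z))               # model output
```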
<br />
==Notes==<br />
<references/></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4493stat8412009-10-28T23:08:47Z<p>Ipargaru: /* Back-propagation */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data-mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification we attempt to approximate a function <math>\,h</math>, by using a training data set, that will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input <math>\,X \in \mathcal{X} </math>,<br />
we can use the classification rule to predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule <math>\,h</math> such that when a new fruit <math>\,X</math> is presented based on its features <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> gives the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier(h) is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
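The empirical error rate can be sketched directly from this definition (the rule `h` and the data below are made up for illustration):

```python
# Empirical error rate: fraction of training points a rule misclassifies.
def empirical_error(h, X, y):
    return sum(1 for xi, yi in zip(X, y) if h(xi) != yi) / len(X)

# Illustrative rule: classify by the sign of the first feature.
h = lambda x: 1 if x[0] > 0 else 0
X = [(2, 1), (-1, 3), (0.5, -2), (-3, -1)]
y = [1, 0, 0, 0]
rate = empirical_error(h, X, y)   # only (0.5, -2) is misclassified
assert rate == 0.25
```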
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then choose the class with the largest posterior probability as the one the object belongs to. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> that<br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and consider the probability <math>\,r(X)=P\{Y=1|X=x\}</math>. Given <math>\,X=x</math>, by ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of it being of any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
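The same computation in a short sketch. The likelihood values 0.05 and 0.2 are recovered from the fractions above (with priors of 0.5, the numerator 0.025 gives <math>\,P(X=(0,1,0)|Y=1)=0.05</math> and the remaining 0.1 gives <math>\,P(X=(0,1,0)|Y=0)=0.2</math>):

```python
# Two-class Bayes rule: classify to 1 iff the posterior r(x) exceeds 1/2.
def posterior(lik1, lik0, prior1=0.5):
    prior0 = 1.0 - prior1
    return lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

r = posterior(lik1=0.05, lik0=0.2)       # = 0.025 / 0.125 = 0.2
prediction = 1 if r > 0.5 else 0
assert abs(r - 0.2) < 1e-9
assert prediction == 0                   # predict the student fails
```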
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods. The reason is that in the Bayes equation discussed above, it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>, which makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as an objective existence. Actually, they represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics, there have been two major schools: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event whose probability cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables that have a given distribution, and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a <math>\,50\%</math> probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist method, one does not see the man (the object), but judges from photos (the label) whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Though it represents the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\,\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation doesn't work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the numbers of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_k</math> and <math>\,\mu_l</math>.<br />
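The decision rule implied by this derivation can be sketched as follows: for each class, compute the linear score <math>x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log(\pi_k)</math> and take the largest. The parameters and function name below are illustrative toy choices.

```python
import numpy as np

# LDA with a shared covariance: classify to the argmax of the linear scores.
def lda_predict(x, mus, Sigma, priors):
    Sinv = np.linalg.inv(Sigma)
    deltas = [x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(p)
              for mu, p in zip(mus, priors)]
    return int(np.argmax(deltas))

mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]   # toy class means
Sigma = np.eye(2)                                    # shared covariance
priors = [0.5, 0.5]

assert lda_predict(np.array([0.5, 0.5]), mus, Sigma, priors) == 0
assert lda_predict(np.array([3.5, 3.5]), mus, Sigma, priors) == 1
# Equal priors: the boundary passes halfway between the two means.
assert lda_predict(np.array([2.0, 2.0]) + 1e-6, mus, Sigma, priors) == 1
```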
<br />
===QDA===<br />
The concept is the same: find a boundary where the classification error rates between classes are equal, except that the assumption that each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the <math>\Sigma_k \forall k</math>, is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{l=1}^{k}n_l} </math><br />
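These sample estimates can be sketched directly (the data and the function name `estimate` are our own illustrative choices):

```python
import numpy as np

# Sample estimates of pi_k, mu_k, Sigma_k from labeled data, plus the
# pooled covariance used when a common Sigma is assumed.
def estimate(X, y):
    classes = sorted(set(y))
    n = len(y)
    pis, mus, Sigmas, ns = [], [], [], []
    for k in classes:
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        diff = Xk - mu
        pis.append(len(Xk) / n)
        mus.append(mu)
        Sigmas.append(diff.T @ diff / len(Xk))   # ML estimate, divides by n_k
        ns.append(len(Xk))
    pooled = sum(nk * S for nk, S in zip(ns, Sigmas)) / n
    return pis, mus, Sigmas, pooled

X = np.array([[0.0, 0], [2, 2], [4, 4], [6, 6]])  # toy data
y = np.array([0, 0, 1, 1])
pis, mus, Sigmas, pooled = estimate(X, y)
assert pis == [0.5, 0.5]
assert np.allclose(mus[0], [1, 1]) and np.allclose(mus[1], [5, 5])
```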
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,- \frac{1}{2}log(|I|)</math>, is zero since <math>\,|I|=1</math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we compute the distance from a point to each class center and adjust it by the log of the prior, <math>\,log(\pi_k)</math>; the class whose adjusted distance is smallest maximizes <math>\,\delta_k</math>, and by the theorem we classify the point to that class <math>\,k</math>. Geometrically, <math>\, \Sigma_k = I </math> means the data of each class is spherical.<br />
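The Case 1 rule therefore reduces to nearest-center classification adjusted by the log prior. A minimal NumPy sketch (Python, not part of the original Matlab material):

```python
import numpy as np

def classify_spherical(x, mus, log_priors):
    """With Sigma_k = I, delta_k(x) = -0.5 * ||x - mu_k||^2 + log(pi_k);
    return the index k that maximizes delta_k."""
    deltas = [-0.5 * np.sum((x - mu) ** 2) + lp
              for mu, lp in zip(mus, log_priors)]
    return int(np.argmax(deltas))
```

With equal priors this is exactly nearest-mean classification.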
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are eigenvectors of <math>\,X^\top X</math>.<br />
So if <math>\, X</math> is symmetric, we have <math>\, U=V</math>; here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume, ahead of time, which class a data point belongs to in order to choose a transformation. All classes therefore need to have the same shape (covariance) for this method to apply, which is exactly the LDA setting.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose the two classes have different shapes and we try to transform both to a common shape: given a new data point, which class's transformation should we apply? Using the transformation of class A already assumes that the data point belongs to class A.<br />
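The sphering transformation <math>x^* = S^{-\frac{1}{2}}U^\top x</math> can be sketched in NumPy (a Python illustration with names of our own choosing):

```python
import numpy as np

def whiten(X, Sigma):
    """Map each row x of X to S^{-1/2} U^T x, where Sigma = U S U^T.

    After the transform the class covariance becomes the identity,
    so the Case 1 (Euclidean distance) rule applies.
    """
    S, U = np.linalg.eigh(Sigma)            # eigh: Sigma is symmetric
    W = np.diag(1.0 / np.sqrt(S)) @ U.T     # W = S^{-1/2} U^T
    return X @ W.T
```
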
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class against the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: Each difference <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}d(d+1) + d + 1 = \frac{d(d+3)}{2}+1</math> parameters (the symmetric matrix <math>\,a</math> alone needs <math>\frac{d(d+1)}{2}</math>). Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
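The two counts can be checked with a trivial pair of helpers (hypothetical names, written here in Python):

```python
def lda_params(K, d):
    # K-1 linear differences a^T x + b, each with d + 1 parameters
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # each quadratic difference x^T a x + b^T x + c: the symmetric matrix a
    # has d(d+1)/2 free entries, plus d for b and 1 for c
    return (K - 1) * (d * (d + 3) // 2 + 1)
```
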
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform principal component analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently; the Matlab help file on <code>princomp</code> gives the full details. Here we analyze the code of <code>princomp()</code> itself to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations, columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
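The same equivalence is easy to verify in NumPy, mirroring the princomp conventions (rows are observations, center by column means, use <math>\,V</math> as the loadings). This is an illustrative Python sketch, not the Matlab above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))       # rows are observations, as princomp expects
Xc = X - X.mean(axis=0)             # center by subtracting column means

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
coeff = Vt.T                        # analogue of princomp's pc (the V, not the U)
score = Xc @ coeff                  # analogue of princomp's score

# score is exactly U scaled by the singular values
assert np.allclose(score, U * s)
```
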
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
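A minimal sketch of the feature augmentation in NumPy (Python; the helper name is ours):

```python
import numpy as np

def augment_quadratic(X):
    """Map each row x to x* = [x_1..x_d, x_1^2..x_d^2].

    A linear boundary in the augmented space corresponds to a
    quadratic boundary in the original space."""
    return np.hstack([X, X ** 2])
```

For example, the one-dimensional classes <math>|x| < 1</math> and <math>|x| > 2</math> are not linearly separable in <math>x</math>, but a single threshold on the <math>x^2</math> coordinate separates them.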
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the pre-knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To be able to separate the two classes we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, and so it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
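The closed form <math>\underline{w} \propto S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> is a one-liner in NumPy (a Python sketch; we solve a linear system rather than forming the inverse explicitly):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class FDA direction: w proportional to S_W^{-1} (mu1 - mu2).

    X1, X2 are (n_i, d) arrays, one row per observation."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.cov(X1, rowvar=False, bias=True) + np.cov(X2, rowvar=False, bias=True)
    w = np.linalg.solve(Sw, mu1 - mu2)      # S_W^{-1} (mu1 - mu2)
    return w / np.linalg.norm(w)            # only the direction matters
```
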
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through the figure produced by matlab.<br />
<br />
The following code produces the figure step by step, with explanations of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200), X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200), X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (The unnormalized scatter form is used for <math>\mathbf{S}_{W,i}</math> so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived later in this section holds exactly.)<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain directly. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant; since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general derivation of <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Comparing with the relationship <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> assumed above, we denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>; thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
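This decomposition is easy to verify numerically. The following is a small NumPy sketch (the notes use Matlab; Python is used here for a self-contained check). The three-class Gaussian data is hypothetical, and the scatter matrices are computed in the unnormalized form used in the derivation above.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data for k = 3 classes, 10 points each; any data set works.
classes = [rng.normal(loc=m, size=(10, 2)) for m in ([0, 0], [3, 1], [1, 4])]
mu = np.vstack(classes).mean(axis=0)          # total mean vector

# Unnormalized within-, between-, and total scatter matrices
S_W = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in classes)
S_B = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in classes)
S_T = sum(np.outer(x - mu, x - mu) for x in np.vstack(classes))

assert np.allclose(S_T, S_W + S_B)            # S_T = S_W + S_B
```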
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Since <math>n\mathbf{\mu} = n_{1}\mathbf{\mu}_{1} + n_{2}\mathbf{\mu}_{2}</math>, we have <math>n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu}) = -n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})</math>, so both <math>\mathbf{S}_{B^{\ast}}</math> and <math>\mathbf{S}_{B}</math> are scalar multiples of the same rank-one matrix <math>(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}</math>; the two forms of the between class covariance are therefore equivalent in the two class case.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
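As a sketch of this recipe (hypothetical three-class data in NumPy, rather than the Matlab used elsewhere in these notes): the eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> with the largest <math>k-1</math> eigenvalues form the columns of <math>\mathbf{W}</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 4-D data, k = 3 classes -> project onto k - 1 = 2 dimensions.
means = [[0, 0, 0, 0], [4, 0, 1, 0], [0, 4, 0, 1]]
classes = [rng.normal(loc=m, size=(30, 4)) for m in means]
mu = np.vstack(classes).mean(axis=0)

S_W = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in classes)
S_B = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in classes)

# Eigenvectors of S_W^{-1} S_B, sorted by eigenvalue; keep the largest k - 1.
vals, vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(vals.real)[::-1]
W = vecs.real[:, order[:2]]          # d x (k-1) transformation matrix
Z = np.vstack(classes) @ W           # projected data, one (k-1)-vector per row
```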
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
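A quick numerical illustration of this criterion (a hedged NumPy sketch with made-up, well-separated three-class data): the FDA directions should yield a larger trace ratio than an arbitrary projection.<br />

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical 3-D data for k = 3 well-separated classes.
classes = [rng.normal(loc=m, size=(40, 3)) for m in ([0, 0, 0], [4, 0, 1], [0, 4, 1])]
mu = np.vstack(classes).mean(axis=0)
S_W = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in classes)
S_B = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in classes)

def phi(W):
    """Trace-ratio criterion Tr[W' S_B W] / Tr[W' S_W W]."""
    return np.trace(W.T @ S_B @ W) / np.trace(W.T @ S_W @ W)

vals, vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
W_fda = vecs.real[:, np.argsort(vals.real)[::-1][:2]]   # top k - 1 directions
W_rand = rng.normal(size=(3, 2))                        # arbitrary projection

assert phi(W_fda) > phi(W_rand)
```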
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
In fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \leq k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and corresponding outputs <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
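These formulas can be checked directly. Below is a NumPy sketch (not part of the original notes) with simulated inputs: it forms <math>\hat\beta</math> and the hat matrix, then verifies that <math>\mathbf{H}</math> is idempotent and that the normal equations hold.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
# n x (d+1) design matrix with 1 in the first position of each row
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])          # hypothetical coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y          # (X^T X)^{-1} X^T y
H = X @ np.linalg.inv(X.T @ X) @ X.T                 # hat matrix
y_hat = H @ y                                        # fitted values

assert np.allclose(H @ H, H)                         # H is idempotent
assert np.allclose(X.T @ (y - X @ beta_hat), 0, atol=1e-8)   # normal equations
```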
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample;ones(1,400)];<br />
Construct x by appending a row of ones to the data.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame|The figure shows the classification of the data points in 2_3.m by the linear regression model.]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
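A tiny sketch of this model (hypothetical coefficients, NumPy): by construction, the two posteriors lie in (0, 1) and sum to one.<br />

```python
import numpy as np

def posterior(beta, x):
    """P(Y=1|X=x) and P(Y=0|X=x) under the logistic model."""
    e = np.exp(beta @ x)
    return e / (1 + e), 1 / (1 + e)

beta = np.array([0.5, -1.0])   # hypothetical coefficients
x = np.array([2.0, 1.0])
p1, p0 = posterior(beta, x)

assert np.isclose(p1 + p0, 1.0) and 0 < p1 < 1
```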
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the fitted distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
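The algebra above can also be verified numerically. Here is a small sketch (made-up data and coefficients, NumPy) checking that the original and simplified forms of the log-likelihood agree.<br />

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(15, 2))                  # rows are the x_i (hypothetical data)
y = rng.integers(0, 2, size=15).astype(float)
beta = np.array([0.7, -0.4])                  # hypothetical coefficients
t = X @ beta                                  # linear predictors beta' x_i

p = np.exp(t) / (1 + np.exp(t))
l_full = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # product form, logged
l_simplified = np.sum(y * t - np.log(1 + np.exp(t)))       # simplified form above

assert np.allclose(l_full, l_simplified)
```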
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
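A finite-difference check of this derivative (a NumPy sketch with simulated data; the labels and coefficients are hypothetical):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))                      # rows are the x_i
y = (X @ [1.0, -1.0] + rng.normal(size=20) > 0).astype(float)
beta = np.array([0.3, 0.2])                       # hypothetical coefficients

def loglik(b):
    return np.sum(y * (X @ b) - np.log(1 + np.exp(X @ b)))

p = 1 / (1 + np.exp(-X @ beta))
grad = X.T @ (y - p)                              # sum_i (y_i - p_i) x_i

# central finite differences, one coordinate at a time
eps = 1e-6
num = np.array([(loglik(beta + eps * e) - loglik(beta - eps * e)) / (2 * eps)
                for e in np.eye(2)])

assert np.allclose(grad, num, atol=1e-4)
```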
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here]. It is a very useful website containing a Matrix Reference Manual, where you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiating <math>\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math> with respect to <math>\underline{\beta}^T</math>.<br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^{T}W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
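As a sketch (hypothetical weights and data, NumPy), the closed-form WLS estimator above agrees with an ordinary least squares fit on square-root-weighted data:<br />

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 40, 2
X = rng.normal(size=(n, d))                  # rows are the x_i^T
y = X @ [1.0, 2.0] + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)            # positive weights w_i

# hat beta_WLS = [sum w_i x_i x_i^T]^{-1} [sum w_i x_i y_i]
A = sum(w[i] * np.outer(X[i], X[i]) for i in range(n))
b = sum(w[i] * X[i] * y[i] for i in range(n))
beta_wls = np.linalg.solve(A, b)

# same estimator via sqrt-weighted ordinary least squares
sw = np.sqrt(w)
beta_chk, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

assert np.allclose(beta_wls, beta_chk)
```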
<br />
The Newton-Raphson update is thus a weighted linear regression of the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
on the inputs <math>\mathbf{X}</math>.<br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:'''Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model like <math>\underline{\beta}^T\underline{x}</math>. If we construct the model like <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then similar to linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> seems to be a suitable starting value for the Newton-Raphson iteration procedure in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. In the case that it does not, we can only prove local convergence of the method, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem; it is uncommon for an initial point to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
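The pseudo code above can be sketched as follows (NumPy rather than the Matlab used in these notes; the simulated data and the iteration cap are hypothetical choices). The notes' convention that <math>X</math> is <math>d\times n</math> is kept.<br />

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
# d x n input matrix with a row of ones, as in the notes
X = np.vstack([np.ones(n), rng.normal(size=n)])
# labels drawn from a logistic model with hypothetical coefficients (0.5, 2.0)
Y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(0.5 + 2.0 * X[1])))).astype(float)

beta = np.zeros(2)                                # step 1: beta <- 0
for _ in range(25):
    P = 1 / (1 + np.exp(-X.T @ beta))             # step 3: P(x_i; beta)
    W = np.diag(P * (1 - P))                      # step 4: diagonal weights
    Z = X.T @ beta + np.linalg.solve(W, Y - P)    # step 5: working response
    beta_new = np.linalg.solve(X @ W @ X.T, X @ W @ Z)   # step 6: WLS update
    if np.allclose(beta_new, beta, atol=1e-8):    # step 7: convergence check
        beta = beta_new
        break
    beta = beta_new

# at convergence the score X(Y - P) should vanish
P_final = 1 / (1 + np.exp(-X.T @ beta))
assert np.allclose(X @ (Y - P_final), 0, atol=1e-6)
```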
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to classify the data with logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
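As a rough Python sketch (not part of the original MATLAB session), the fitted coefficients above can be used to compute the class-1 posterior and apply this classification rule; the β values are simply copied from the mnrfit output:<br />

```python
import math

# Coefficients copied from the mnrfit output above (intercept, X1, X2).
beta = [0.1861, -5.5917, -3.0547]

def posterior_class1(x1, x2):
    """P(Y=1 | X=x) under the fitted two-class logistic model."""
    a = beta[0] + beta[1] * x1 + beta[2] * x2
    return math.exp(a) / (1 + math.exp(a))

def classify(x1, x2):
    """Classification rule: class 1 iff the linear score is non-negative."""
    return 1 if beta[0] + beta[1] * x1 + beta[2] * x2 >= 0 else 2

# A point far on the negative side of the boundary is assigned class 2.
print(classify(1.0, 1.0))
```

Note that `classify` thresholds the same linear score whose sign determines which posterior is larger, so it agrees with comparing the two posteriors directly.<br />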
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem since we don't have the same simplification.<br />
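The K-class posteriors above can be computed directly; here is a small illustrative Python sketch (the β vectors are made up for illustration, not fitted values) which also verifies that the posteriors sum to 1:<br />

```python
import math

def posteriors(x, betas):
    """Posteriors for a K-class logistic model given the K-1 coefficient
    vectors beta_1, ..., beta_{K-1}; class K is the reference class."""
    scores = [sum(b_j * x_j for b_j, x_j in zip(b, x)) for b in betas]
    denom = 1 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]   # classes 1 .. K-1
    probs.append(1 / denom)                         # reference class K
    return probs

# Hypothetical coefficients for a 3-class problem with 2 features.
betas = [[1.0, -0.5], [-0.3, 0.8]]
p = posteriors([0.2, 0.4], betas)
print(p, sum(p))
```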
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Because the underlying problem is not convex, the iterative solution has no guaranteed global minimum. The algorithm does not converge to a unique hyperplane, and the solution found depends on the size of the gap between the classes. If the classes are separable, the algorithm can be shown to converge to a separating hyperplane; the proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even when we don't know how to draw the line ourselves. We just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1]; % labels<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]'; % one column per data point<br />
b_0=0; % intercept<br />
b=[1;1;1]; % initial weight vector<br />
rho=.5; % learning rate<br />
for j=1:100<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i); % positive iff point i is classified correctly<br />
if d<0<br />
b=b+rho*x(:,i)*y(i); % move the boundary toward the misclassified point<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break; % no misclassified points left<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of some weights of these inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (an initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is proportional to <math>\underline{\beta}^T\underline{x_{i}}</math>. <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is, up to the constant factor <math>1/\|\underline{\beta}\|</math>, the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach] which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if the value is too large the algorithm may “skip over” the minimum it is trying to find and oscillate indefinitely between points on either side of it.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Imagine standing on a peak and wanting to reach the ground as quickly as possible. Which direction should you step? Intuitively, it is the direction in which the height decreases fastest, which is given by the gradient. However, if the mountain has a saddle shape and you unfortunately start in the middle, you may end up at the saddle point and get stuck there, just as gradient descent can get stuck at a local minimum.<br />
In addition, note that in the final form of our gradient descent algorithm we drop the summation over <math>\,i</math> (all data points). This is a variant of the original (batch) gradient descent algorithm called stochastic gradient descent, in which the true gradient is approximated by evaluating it on a single training example, so <math>{\beta}</math> is improved using the computation from only one sample at a time. When the data set is large, say a population database, summing over millions of samples is very time-consuming. With stochastic gradient descent we can process the data sample by sample and still get decent results in practice.<br />
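The stochastic update described above can be sketched in Python; this mirrors the earlier MATLAB example (same data, initial weights, and learning rate), adjusting the weights one misclassified sample at a time:<br />

```python
# Training data from the earlier perceptron example.
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
y = [1, 1, 1, -1, -1, -1]

beta = [1.0, 1.0, 1.0]   # initial guess for the weights
beta0 = 0.0              # intercept
rho = 0.5                # learning rate

for _ in range(100):                     # cap the number of passes
    changed = False
    for xi, yi in zip(X, y):
        score = sum(b * v for b, v in zip(beta, xi)) + beta0
        if score * yi <= 0:              # point misclassified (or on the boundary)
            beta = [b + rho * yi * v for b, v in zip(beta, xi)]
            beta0 += rho * yi
            changed = True
    if not changed:                      # converged: no misclassified points
        break

print(beta, beta0)
```

Since this data set is linearly separable, the perceptron convergence theorem guarantees the loop terminates with every training point on the correct side of the boundary. (One small difference from the MATLAB sketch: here a point lying exactly on the boundary also triggers an update.)<br />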
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal - the processing element's output signal <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref>. It is a multistage regression or classification model represented by a network. Figure 1 shows a typical neural network, but a neural network can take many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there can be '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded as (0,1).<br />
<br />
===Activation Function===<br />
Activation function is a term that is frequently used in classification by NN. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]], so its derivative cannot be taken. Thus, we replace it with a smooth continuous function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The function <math>\displaystyle \sigma </math> can have any form, but typically the <math>\sigma(a)=\frac {1}{1+e^{-a}}</math> (logit) form is used.<br />
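A quick Python sketch of this logistic activation, illustrating the property that makes it convenient for gradient-based training, <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>:<br />

```python
import math

def sigma(a):
    """Logistic activation: a smooth, differentiable stand-in for sign."""
    return 1.0 / (1.0 + math.exp(-a))

def sigma_prime(a):
    """Closed-form derivative of the logistic function."""
    s = sigma(a)
    return s * (1.0 - s)

# Compare the closed form with a finite-difference approximation.
a, h = 0.7, 1e-6
numeric = (sigma(a + h) - sigma(a - h)) / (2 * h)
print(sigma_prime(a), numeric)
```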
<br />
By assigning weights to the connectors in the neural network (see the diagram above), we weight the input that comes into each perceptron to get an output, which in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called a '''feed-forward neural network'''.<br />
<br />
===Back-propagation===<br />
For quite a while the neural network model was just an idea; no practical algorithm for training the model existed until 1986, when Geoffrey Hinton and his collaborators <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> popularized an algorithm called '''back-propagation'''. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied gradient descent algorithms to optimize the weights. Back-propagation uses this same idea of gradient descent to train a neural network. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output units and thus turned into a classification problem.<br />
<br />
[[File:backpropagation.png|400px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each perceptron we have a function that takes an input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>, which we call the activation function: <math>\displaystyle z_i=\sigma(a_i)</math>. The <math>\displaystyle z_i's</math> are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true target and the resulting network output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
<br>'''First find the derivative of the model error with respect to the output weights <math>\displaystyle w_i</math>'''<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math> <br />
<br><math>\frac{\partial err}{\partial w_i}=-2(y-\hat y) \cdot z_i</math><br />
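This output-weight gradient can be checked numerically; a small illustrative Python sketch with made-up numbers (note that differentiating <math>(y-\hat y)^2</math> with respect to <math>w_i</math> gives <math>-2(y-\hat y)z_i</math>; watch the sign):<br />

```python
# Finite-difference check of d err / d w_i for err = (y - yhat)^2,
# where yhat = sum_i w_i * z_i. Numbers are made up for illustration.
z = [0.2, -0.5, 0.9]      # hidden-unit outputs z_i
w = [0.1, 0.4, -0.3]      # output weights w_i
y = 1.0                   # target

def err(weights):
    yhat = sum(wi * zi for wi, zi in zip(weights, z))
    return (y - yhat) ** 2

yhat = sum(wi * zi for wi, zi in zip(w, z))
analytic = [-2.0 * (y - yhat) * zi for zi in z]   # chain-rule result

h = 1e-6
numeric = []
for i in range(len(w)):
    wp = list(w); wp[i] += h
    wm = list(w); wm[i] -= h
    numeric.append((err(wp) - err(wm)) / (2 * h))

print(analytic, numeric)
```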
<br />
<br>'''Now we need to find the derivative of the model error with respect to hidden weights <math>\displaystyle u_i's</math>'''<br />
<br>Consider the following diagram that opens up the hidden layers of the neural network:<br />
<br />
==Notes==<br />
<references/></div>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classfication-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, and human language processing. <br />
<br />
In classification we attempt to approximate a function <math>\,h</math>, using a training data set, that will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input <math>\,X \in \mathcal{X} </math>,<br />
we can use the classification rule to predict the corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> is the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:The ''''true error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:The ''''empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
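The empirical error rate is just the fraction of misclassified training points; a minimal Python sketch (the rule and the labels here are made up for illustration):<br />

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training points that the rule h misclassifies."""
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(X)

# Hypothetical rule and training set for illustration.
h = lambda x: 1 if x > 0 else 0
X = [-2.0, -0.5, 0.3, 1.0, 2.5]
Y = [0, 1, 1, 1, 1]          # the point at -0.5 is misclassified by h
print(empirical_error_rate(h, X, Y))   # 1 of 5 points -> 0.2
```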
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then choose the class with the largest posterior probability as the one the object belongs to. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximal over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use the ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case in which <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(X)=P\{Y=1|X=x\}</math>. Given <math>\,X=x</math>, by the ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality Theorem''': The Bayes rule is optimal in true error rate, that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says that we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability that it is of any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
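The arithmetic in this example can be reproduced in Python; the likelihood values below are the ones implied by the numbers above (with the stated priors of 0.5 each):<br />

```python
def posterior(lik1, lik0, prior1=0.5, prior0=0.5):
    """r(X) = P(Y=1|X=x) via Bayes' formula for the two-class case."""
    return lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

# Likelihoods implied by the worked example above for X = (0, 1, 0):
# P(X=(0,1,0)|Y=1) = 0.05 and P(X=(0,1,0)|Y=0) = 0.2.
r = posterior(0.05, 0.2)
print(r)                       # 0.025 / 0.125 = 0.2
yhat = 1 if r > 0.5 else 0     # Bayes classification rule
print(yhat)                    # predict class 0: "fail"
```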
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods. The reason is that in the Bayes equation discussed before it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>, which makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first one treats probability as changing based on observation, while the second one treats probability as an objective quantity. In fact, they represent two different schools of thought in statistics.<br />
<br />
In the history of statistics there have been two major schools: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample(there is a frequency).<br />
#Parameters are fixed and unknown constant.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, such as a 50% chance of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, one does not see the man (the object) himself, but judges whether he is Jack from photos (the label) of the man.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although it represents the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\,\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in dimensions greater than 2.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are unknown, so we must estimate them if we want to classify new data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(G=k|X=x)=Pr(G=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>p</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the priors are equal (<math>\,\pi_k=\pi_l</math>, e.g. when each class has the same number of samples), the boundary surface or line lies halfway between <math>\,\mu_k</math> and <math>\,\mu_l</math>.<br />
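As a quick numerical check (a Python sketch with assumed parameter values), the boundary above can be written as <math>\,a^\top x+b=0</math> with <math>\,a=\Sigma^{-1}(\mu_k-\mu_l)</math>, and with equal priors the midpoint of the two means lies exactly on it:<br />

```python
import numpy as np

# Sketch: coefficients of the linear LDA boundary a^T x + b = 0 between two
# classes, following the derivation above. All parameter values are made up.
mu_k = np.array([1.0, 1.0])
mu_l = np.array([3.0, 2.0])
Sigma = np.array([[1.0, 0.0], [0.0, 1.0]])   # shared covariance (assumed)
pi_k, pi_l = 0.5, 0.5                        # equal priors (assumed)

Sigma_inv = np.linalg.inv(Sigma)
a = Sigma_inv @ (mu_k - mu_l)
b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sigma_inv @ mu_k - mu_l @ Sigma_inv @ mu_l)

# With equal priors the midpoint of the means satisfies a^T x + b = 0:
midpoint = (mu_k + mu_l) / 2
residual = a @ midpoint + b   # ~0
```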
<br />
===QDA===<br />
The concept is the same idea of finding a boundary where the error rate for classification between classes are equal, except the assumption that each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math> is removed.<br />
<br />
<br />
The derivation proceeds as in LDA, up to the point where QDA diverges:<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the class <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
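The rule in the theorem can be sketched directly; the following Python snippet (with illustrative, made-up parameters) evaluates the quadratic <math>\,\delta_k(x)</math> for each class and takes the argmax:<br />

```python
import numpy as np

# Sketch of the classification rule in the theorem: evaluate delta_k(x)
# for each class and classify to the argmax. Parameters are made up.
def delta(x, mu, Sigma, pi):
    Sigma_inv = np.linalg.inv(Sigma)
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ Sigma_inv @ d
            + np.log(pi))

mus    = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
pis    = [0.5, 0.5]

x = np.array([0.5, 0.2])
scores = [delta(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
label = int(np.argmax(scores))   # x is near the first mean, so class 0
```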
<br />
===In practice===<br />
In practice the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we use their sample estimates in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{r=1}^{k}n_r} </math><br />
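These estimates are straightforward to compute; here is a minimal Python sketch on a tiny made-up data set with labels in {1, 2}:<br />

```python
import numpy as np

# Sketch of the sample estimates above on a tiny synthetic data set;
# all numbers are made up for illustration.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],   # class 1
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])  # class 2
y = np.array([1, 1, 1, 2, 2, 2])
n = len(y)

pi_hat, mu_hat, Sigma_hat = {}, {}, {}
for k in (1, 2):
    Xk = X[y == k]
    n_k = len(Xk)
    pi_hat[k] = n_k / n                         # hat{pi}_k = n_k / n
    mu_hat[k] = Xk.mean(axis=0)                 # hat{mu}_k
    centered = Xk - mu_hat[k]
    Sigma_hat[k] = centered.T @ centered / n_k  # ML estimate (divide by n_k)

# Pooled (common) covariance, as in the formula above:
Sigma_pooled = sum((y == k).sum() * Sigma_hat[k] for k in (1, 2)) / n
```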
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class with the minimum adjusted distance maximizes <math>\,\delta_k</math>, and according to the theorem we classify the point to that class <math>\,k</math>.<br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>.<br />
So if <math>\, X</math> is symmetric, we have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
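The transformation can be sketched numerically; the following Python snippet (with an assumed covariance) whitens the data via <math>\,x^* = S^{-\frac{1}{2}}U^\top x</math> and checks that its covariance becomes approximately the identity:<br />

```python
import numpy as np

# Sketch of the whitening step described above: decompose Sigma = U S U^T,
# map x -> x* = S^{-1/2} U^T x; the transformed data then has (approximately)
# identity covariance, so Case 1 (Euclidean distance to each mean) applies.
rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])     # assumed common covariance
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=5000)

S, U = np.linalg.eigh(Sigma)                   # Sigma = U diag(S) U^T
W = np.diag(S ** -0.5) @ U.T                   # the transformation S^{-1/2} U^T
X_star = X @ W.T

C = np.cov(X_star.T)                           # approximately the 2x2 identity
```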
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape (covariance) for this method to be applicable, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and we transform them to a common shape in order to decide which class a given data point belongs to. Which transformation should we use? If, for example, we use the transformation of class A, then we have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare a given class against the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> decision boundaries in total. Each boundary <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
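A small sketch tabulating these counts per class comparison (i.e. per factor of <math>\,K-1</math>) makes the gap visible:<br />

```python
# Sketch: the parameter counts above, per pairwise class comparison.
def lda_params(d):
    return d + 1                      # a^T x + b: d slopes + 1 intercept

def qda_params(d):
    return d * (d + 3) // 2 + 1       # symmetric quadratic term + linear + constant

# QDA's count grows quadratically in the dimension d, LDA's only linearly:
counts = {d: (lda_params(d), qda_params(d)) for d in (2, 10, 100)}
```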
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function; here we analyze its code to see how it differs from the SVD method. The following is the code of princomp with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n) % pad with zeros when the rank r is less than n<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the free parameters of a symmetric <math>\,d \times d</math> covariance matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, with <math>v</math> a diagonal matrix with entries <math>v_1,\dots,v_d</math>, that we cannot estimate directly.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
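A minimal Python sketch of the augmentation (with hypothetical weights) confirms that a linear rule in the augmented space is exactly a quadratic rule in the original space:<br />

```python
import numpy as np

# Sketch of the augmentation trick: append the squared coordinates as extra
# features, so a linear classifier in the augmented space corresponds to a
# (diagonal) quadratic classifier in the original space. Data is synthetic.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))                 # original d = 2 features
X_star = np.hstack([X, X ** 2])               # now d = 4: [x1, x2, x1^2, x2^2]

# A linear rule w*^T x* + b in the augmented space...
w_star = np.array([1.0, -2.0, 0.5, 0.5])      # hypothetical weights [w; v]
b = -1.0
score = X_star @ w_star + b

# ...equals the quadratic rule w^T x + x^T diag(v) x + b in the original space.
w, v = w_star[:2], w_star[2:]
score_quadratic = X @ w + (X ** 2) @ v + b
```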
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices; provided this sum is positive definite, it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
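The closed form above can be checked numerically. Below is a minimal sketch in Python/NumPy, used here as an assumed stand-in for the Matlab that appears elsewhere in these notes; the two-class data are synthetic, drawn with the same means and covariance as the Matlab example below.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 1.5], [1.5, 3.0]])  # shared class covariance
X1 = rng.multivariate_normal([1, 1], cov, size=300)
X2 = rng.multivariate_normal([5, 3], cov, size=300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)  # within class covariance
Sb = np.outer(mu1 - mu2, mu1 - mu2)                       # between class covariance

# Closed form: w is proportional to Sw^{-1} (mu1 - mu2)
w_closed = np.linalg.solve(Sw, mu1 - mu2)

# Eigenvector route: eigenvector of Sw^{-1} Sb with the largest eigenvalue
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
w_eig = vecs[:, np.argmax(vals.real)].real

# The two directions agree up to scale and sign
cos = abs(w_closed @ w_eig) / (np.linalg.norm(w_closed) * np.linalg.norm(w_eig))
print(cos)  # ~1.0
```

The cosine between the two computed directions is numerically 1, confirming that the top eigenvector of <math>S_{W}^{-1}S_{B}</math> is proportional to <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />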
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes separately.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
First, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> is the (unnormalized) scatter matrix of class <math>i</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. Leaving out a <math>\frac{1}{n_{i}}</math> factor keeps this definition consistent with the decomposition of the total covariance derived later in this section.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One useful observation<br />
is that the total covariance <math>\mathbf{S}_{T}</math> of the data is fixed;<br />
since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more direct way to obtain <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
The decomposition above expresses the total covariance <math>\mathbf{S}_{T}</math> as the within class covariance <math>\mathbf{S}_{W}</math><br />
plus a second term; we define that second term to be<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
Recall that in the two-class problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
The first two terms match the general form up to the weights <math>n_{1}</math> and <math>n_{2}</math>, so the two expressions are very similar in structure.<br />
<br />
Now we try to find the optimal transformation. For each data point we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
\qquad i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of each matrix, i.e. the product of its<br />
eigenvalues, as a scalar measure of its size.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the <math>k-1</math> largest<br />
eigenvalues in<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem, a Lagrange multiplier <math>\Lambda</math>, which is actually a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\le k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case: the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the <math>k-1</math> largest<br />
eigenvalues in<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
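The multi-class recipe above can be illustrated with a short Python/NumPy sketch (Python stands in for the Matlab used elsewhere in these notes; the three-class data and all parameter values are invented for illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 3, 4                                     # 3 classes in 4 dimensions
means = rng.normal(size=(k, d)) * 3
X = np.vstack([rng.normal(size=(100, d)) + m for m in means])
y = np.repeat(np.arange(k), 100)

mu = X.mean(axis=0)                             # total mean
Sw = np.zeros((d, d))
Sb = np.zeros((d, d))
for i in range(k):
    Xi = X[y == i]
    mui = Xi.mean(axis=0)
    Sw += (Xi - mui).T @ (Xi - mui)             # within class scatter S_W
    Sb += len(Xi) * np.outer(mui - mu, mui - mu)  # between class scatter S_B

# Columns of W: eigenvectors of Sw^{-1} Sb with the k-1 largest eigenvalues
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(vals.real)[::-1]
W = vecs[:, order[:k - 1]].real                 # d x (k-1) projection matrix
Z = X @ W                                       # projected data in k-1 dimensions
print(Z.shape)
```

Since <math>rank(S_{W}^{-1}S_{B})\le k-1</math>, only <math>k-1</math> of the eigenvalues computed here are (numerically) nonzero, which is why the projection keeps exactly <math>k-1</math> directions.<br />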
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors (the least squares criterion).<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
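The closed-form estimate and the hat matrix can be checked numerically. A brief Python/NumPy sketch (with made-up data; the design matrix is assumed to carry a leading column of ones):<br />

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
# n x (d+1) design matrix with 1 in the first position of each row
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
beta_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ beta_true + 0.01 * rng.normal(size=n)   # small noise

# beta_hat = (X^T X)^{-1} X^T y, via a linear solve rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T maps y to the fitted values y_hat
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
print(beta_hat)
```

Note that <math>\mathbf{H}</math> is a projection, so applying it twice gives the same result as applying it once.<br />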
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by transposing the sample matrix and appending a row of ones (for the intercept term); x is then a 3-by-400 matrix.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
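As a quick numerical sanity check of these two formulas, here is a tiny Python sketch (the coefficient and input values are made up for illustration):<br />

```python
import numpy as np

def p1(beta, x):
    """P(Y=1 | X=x) under the logistic model."""
    z = beta @ x
    return np.exp(z) / (1.0 + np.exp(z))

beta = np.array([0.8, -1.2])   # illustrative coefficients
x = np.array([2.0, 1.0])       # illustrative input
p_one = p1(beta, x)
p_zero = 1.0 / (1.0 + np.exp(beta @ x))
print(p_one, p_zero)           # p_one is about 0.599; the two probabilities sum to 1
```

Both probabilities always lie strictly between 0 and 1 and sum to one, which is exactly what the linear regression model failed to guarantee.<br />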
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood: the maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of observing the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))- exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can verify this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a useful Matrix Reference Manual covering linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>, and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math>.<br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
whose solution is <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Here <math>\underline{\beta}^{new}</math> is obtained by a weighted linear regression on the iteratively recomputed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
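The closed-form WLS estimator above is easy to verify numerically. Below is a small NumPy sketch (Python rather than the Matlab used elsewhere in these notes; the data are made up) that checks <math>\hat\beta^{WLS}=(XWX^T)^{-1}XW\underline{y}</math> against an ordinary least-squares solve on square-root-weighted data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))          # columns are the observations x_i (d x n, as in the notes)
y = rng.normal(size=n)
w = rng.uniform(0.1, 1.0, size=n)    # positive weights w_i
W = np.diag(w)

# Closed form: beta_WLS = (X W X^T)^{-1} (X W y)
beta_wls = np.linalg.solve(X @ W @ X.T, X @ W @ y)

# Equivalent trick: scale each row of X^T and each y_i by sqrt(w_i),
# then run ordinary least squares.
A = np.sqrt(w)[:, None] * X.T
b = np.sqrt(w) * y
beta_ols = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(beta_wls, beta_ols))   # the two solutions agree
```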
<br />
<br />
'''note:'''Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model like <math>\underline{\beta}^T\underline{x}</math>. If we construct the model like <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then similar to linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case, although convergence is not guaranteed in general. The procedure will usually converge, since the log-likelihood function is concave. When it does not, only local convergence of the method can be proved, which means the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: it is uncommon for the starting point to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving can be used to remedy such cases. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009),121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
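The pseudo code translates almost line for line into NumPy. Below is a Python sketch (rather than the Matlab used elsewhere in these notes); the two-class sample data are made up, and the small clip on the weights is a numerical safeguard that is not part of the pseudo code:

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=100):
    """X: d x n matrix whose columns are the x_i; y: length-n vector of 0/1 labels."""
    d, n = X.shape
    beta = np.zeros(d)                                # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X.T @ beta))         # step 3: P(x_i; beta)
        w = np.clip(p * (1.0 - p), 1e-12, None)       # step 4: diagonal of W
        z = X.T @ beta + (y - p) / w                  # step 5: Z
        beta_new = np.linalg.solve((X * w) @ X.T, (X * w) @ z)  # step 6
        if np.max(np.abs(beta_new - beta)) < tol:     # step 7: convergence test
            return beta_new
        beta = beta_new
    return beta

# Made-up two-class data in 2D, with an intercept row of ones prepended.
rng = np.random.default_rng(1)
pts = np.hstack([rng.normal(-1, 1, size=(2, 40)), rng.normal(1, 1, size=(2, 40))])
X = np.vstack([np.ones(80), pts])
y = np.array([0] * 40 + [1] * 40)

beta = irls_logistic(X, y)
accuracy = ((X.T @ beta > 0).astype(int) == y).mean()
print(beta, accuracy)
```

Note that `X * w` multiplies each column of <math>X</math> by the corresponding weight, so `(X * w) @ X.T` and `(X * w) @ z` are exactly <math>XWX^T</math> and <math>XWZ</math> without forming the <math>n\times n</math> matrix <math>W</math>.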
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
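The first difference is easy to demonstrate numerically: fitting <math>\,P(Y=1|X=x)</math> by least squares can produce fitted "probabilities" outside [0,1], while the logistic form cannot. A small NumPy sketch (Python, with made-up one-dimensional data):

```python
import numpy as np

# Made-up 1-D inputs with 0/1 labels; the point at x = 10 is far from the rest.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Least-squares fit of P(Y=1|X=x) by b0 + b1*x.
A = np.vstack([np.ones_like(x), x]).T
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]
linear_fit = b0 + b1 * x
print(linear_fit.max())            # exceeds 1: not a valid probability

# The logistic form exp(t)/(1 + exp(t)) always lies strictly inside (0, 1).
t = b0 + b1 * x
logistic_fit = np.exp(t) / (1 + np.exp(t))
print(logistic_fit.min(), logistic_fit.max())
```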
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to classify the data with logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2\geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem since we don't have the same simplification.<br />
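A quick way to see that these posteriors behave as claimed is to compute them directly. The sketch below (Python, with made-up coefficient vectors <math>\beta_1,\dots,\beta_{K-1}</math>) evaluates the K posteriors and confirms they sum to 1:

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """betas: (K-1) x d matrix whose rows are the beta_i; x: length-d input.
    Returns the K posteriors P(Y=i|X=x), with class K as the reference class."""
    scores = np.exp(betas @ x)            # exp(beta_i^T x), i = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)

betas = np.array([[1.0, -0.5],
                  [0.3,  0.8]])           # K = 3 classes, d = 2 (made-up values)
p = multiclass_posteriors(betas, np.array([0.2, 0.4]))
print(p, p.sum())                         # the posteriors sum to 1
```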
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem has no unique global solution (the criterion is not convex). The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable then the algorithm is shown to converge to some separating hyperplane. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line. We just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
The perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1, but the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron gets all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of the weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> takes the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\|\underline{\beta}\|=1</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that takes steps of a predetermined size in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in "skipping over" the minimum that the algorithm is trying to find, possibly oscillating forever around it.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Consider yourself on a peak, wanting to reach the ground as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and you unfortunately start in the middle, you may end up at the saddle point or a local minimum and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm we dropped the summation over <math>i</math> (all data points). This is an alternative to the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, in which we approximate the true gradient by evaluating it on a single training example. This means that <math>{\beta}</math> is improved using only one sample at a time. When there is a large data set, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent we can process the samples one by one and still get decent results in practice.<br />
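The sample-by-sample update described here is what the Matlab snippet in the previous lecture's example does. For reference, the same loop in Python (using the table's six training points, with the same starting values and learning rate as the Matlab version):

```python
import numpy as np

# Training points from the earlier example; columns are feature vectors.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=float).T
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)

beta, beta0, rho = np.ones(3), 0.0, 0.5
for _ in range(1000):                              # cap on the number of passes
    changed = False
    for i in range(X.shape[1]):                    # stochastic update: one sample at a time
        if y[i] * (beta @ X[:, i] + beta0) < 0:    # misclassified point
            beta += rho * y[i] * X[:, i]           # beta <- beta + rho * y_i * x_i
            beta0 += rho * y[i]
            changed = True
    if not changed:                                # converged: no misclassified points left
        break

print(beta, beta0)
```

Since this data set is linearly separable, the perceptron convergence theorem guarantees the loop terminates with every point on the correct side of the boundary.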
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal - the processing element's output signal <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref>. It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network, but it can have many different forms.<br />
[[File:NN.png|300px|thumb|right|Figure 1: General Structure of a Neural Network.]]<br />
<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there could be '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1).<br />
<br />
===Activation Function===<br />
Activation function is a term that is frequently used in classification by NN. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]], so its derivative cannot be taken. Thus, we replace it by a smooth continuous function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The function <math>\displaystyle \sigma </math> can have any form, but typically the <math>\sigma(a)=\frac {1}{1+e^{-a}}</math> (logit) form is used.<br />
<br />
By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes in to the perceptron to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called a '''Feed-Forward Neural Network'''.<br />
<br />
===Back-propagation===<br />
For quite a while the neural network model was just an idea, since there were no algorithms for training the model until Geoffrey Hinton and his collaborators in 1986 <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> came up with an algorithm called '''back-propagation'''. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied gradient descent algorithms to optimize the weights. Back-propagation uses this idea of gradient descent to train a neural network. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output units and thus turn into a classification problem.<br />
<br />
[[File:backpropagation.png|400px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each perceptron we have a function that takes input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>, which we call the activation function, <math>\displaystyle z_i=\sigma(a_i)</math>. The <math>\displaystyle z_i</math>'s are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
To employ the gradient descent we need to find some derivatives. <br />
<br>First find the derivative of the error with respect to the output weights <math>\displaystyle w_i</math>, and then with respect to the hidden weights <math>\displaystyle u_i</math>.<br />
<br><math>\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math><br />
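Continuing the chain rule above: since <math>\hat y=\sum_i w_i z_i</math>, we get <math>\partial \hat y/\partial w_i = z_i</math>, and the hidden weights pick up an extra factor <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math>. Below is a NumPy sketch (Python) of one forward pass and these gradients, checked against a finite difference; all sizes and values are made up for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(u, w, x):
    """One hidden layer: a = u x, z = sigma(a), yhat = sum_i w_i z_i."""
    a = u @ x
    z = sigmoid(a)
    return z, w @ z

rng = np.random.default_rng(2)
d, p = 4, 3                         # input dimension and number of hidden units
u = rng.normal(size=(p, d))         # hidden weights u
w = rng.normal(size=p)              # output weights w
x = rng.normal(size=d)
y = 1.0                             # regression target

z, yhat = forward(u, w, x)
err = (y - yhat) ** 2

# Chain rule as in the notes: d err / d w_i = (d err / d yhat)(d yhat / d w_i)
derr_dyhat = -2.0 * (y - yhat)
grad_w = derr_dyhat * z                             # d yhat / d w_i = z_i
grad_u = np.outer(derr_dyhat * w * z * (1 - z), x)  # sigma'(a) = sigma(a)(1 - sigma(a))

# Finite-difference check of the first output-weight gradient.
eps = 1e-6
w_pert = w.copy(); w_pert[0] += eps
_, yhat2 = forward(u, w_pert, x)
fd = ((y - yhat2) ** 2 - err) / eps
print(fd, grad_w[0])
```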
<br />
==Notes==<br />
<references/></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4491stat8412009-10-28T22:47:37Z<p>Ipargaru: /* Back-propagation */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rising fields of data-mining, bioinformatics, machine learning and so on, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification, we attempt to approximate a function <math>\,h</math> by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> be the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:The '''empirical error rate (training error rate)''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
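As a concrete illustration, the empirical error rate can be computed directly from its definition. A minimal sketch in Python; the toy rule and labels below are invented purely for illustration:<br />

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training points that the rule h misclassifies."""
    errors = sum(1 for x, y in zip(X, Y) if h(x) != y)
    return errors / len(Y)

# Toy one-dimensional rule: classify as 1 when the input exceeds 0.
h = lambda x: 1 if x > 0 else 0

X = [-2.0, -0.5, 0.3, 1.2, -1.0]
Y = [0, 0, 1, 1, 1]   # only the last point is misclassified by h

print(empirical_error_rate(h, X, Y))  # 0.2
```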
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then to choose the class with the largest posterior probability as the one the object belongs to. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximal over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> that <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case in which <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(x)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes' formula'' we have<br />
<br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal with respect to the true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of it being of any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We're going to predict whether a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
* whether the student's GPA > 3.0 (G)<br />
* whether the student had a strong math background (M)<br />
* whether the student is a hard worker (H)<br />
* whether the student passed or failed the course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
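The calculation above is easy to reproduce. A minimal Python sketch; the likelihood values <code>lik1</code> = P(X=(0,1,0)|Y=1) = 0.05 and <code>lik0</code> = P(X=(0,1,0)|Y=0) = 0.20 are assumed values chosen to match the 0.025/0.125 ratio shown above (the actual values come from the data table in the figure):<br />

```python
def bayes_classify(lik1, lik0, prior1=0.5):
    """Return the posterior r = P(Y=1|X=x) and the Bayes decision."""
    prior0 = 1.0 - prior1
    r = lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)
    return r, (1 if r > 0.5 else 0)

# Assumed likelihoods for the new student with G=0, M=1, H=0:
r, label = bayes_classify(lik1=0.05, lik0=0.20)
print(r, label)   # r is about 0.2 < 1/2, so the student is classified as 0 (fail)
```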
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because in the Bayes formula discussed above it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>; this makes the Bayes rule impractical to apply directly.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first treats probability as a degree of belief that changes with observation, while the second treats probability as an objective quantity. In fact, they represent two different schools of statistics.<br />
<br />
Throughout the history of statistics there have been two major schools: Bayesian and frequentist. The two schools represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data are a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables that have a given distribution, and probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can make statements about tomorrow's weather, such as a 50% chance of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, one does not see the man directly, but instead judges whether he is Jack from many photos (samples) of him.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier is optimal, it cannot be used in most practical situations, since the prior probability and class conditional density are usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Empirical risk minimization: choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation: estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math>, and apply Bayes' formula. <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the <math>\Sigma_k</math> over all <math>k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, the decision boundary is always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_k</math> and <math>\,\mu_l</math>.<br />
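Putting the final expression into code: a short pure-Python sketch, for two dimensions only and with toy numbers of our own choosing, that computes the coefficients of the linear boundary <math>\,a^\top x+b=0</math>:<br />

```python
import math

def inv2(S):
    """Inverse of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def lda_boundary(mu_k, mu_l, Sigma, pi_k, pi_l):
    """Coefficients (a, b) of the LDA decision boundary a'x + b = 0."""
    Si = inv2(Sigma)
    diff = [mu_k[0] - mu_l[0], mu_k[1] - mu_l[1]]
    a = matvec(Si, diff)
    b = math.log(pi_k / pi_l) - 0.5 * (dot(mu_k, matvec(Si, mu_k))
                                       - dot(mu_l, matvec(Si, mu_l)))
    return a, b

# Toy example: identity covariance and equal priors; the boundary
# 2*x1 + 0 = 0 is the vertical line halfway between the two means.
a, b = lda_boundary([1.0, 0.0], [-1.0, 0.0],
                    [[1.0, 0.0], [0.0, 1.0]], 0.5, 0.5)
print(a, b)   # [2.0, 0.0] 0.0
```

With equal priors and identity covariance the boundary passes through the midpoint of the two means, matching the remark above.<br />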
<br />
===QDA===<br />
The concept is the same: find a boundary where the error rates for classification between classes are equal, except that the assumption that each cluster has the same covariance matrix <math>\,\Sigma</math> is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
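The theorem translates directly into code. A Python sketch of the quadratic discriminant <math>\,\delta_k</math>, restricted here to diagonal covariance matrices so that no matrix algebra is needed (a simplifying assumption made purely for brevity):<br />

```python
import math

def delta_quadratic(x, mu, sigma_diag, pi):
    """delta_k = -0.5*log|Sigma_k| - 0.5*(x-mu)' Sigma_k^{-1} (x-mu) + log(pi_k),
    with Sigma_k assumed diagonal for simplicity."""
    log_det = sum(math.log(s) for s in sigma_diag)
    maha = sum((xi - mi) ** 2 / s for xi, mi, s in zip(x, mu, sigma_diag))
    return -0.5 * log_det - 0.5 * maha + math.log(pi)

# Two classes with equal priors but different spreads:
x = [0.0, 0.0]
d1 = delta_quadratic(x, [1.0, 0.0], [1.0, 1.0], 0.5)   # tight class
d2 = delta_quadratic(x, [3.0, 0.0], [9.0, 9.0], 0.5)   # diffuse class
print(1 if d1 > d2 else 2)   # 1: x is assigned to the first class
```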
<br />
===In practice===<br />
In practice the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we use their sample estimates in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
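These estimators are simple to write down in code. A one-dimensional Python sketch, with toy data of our own (the ML variance estimate divides by <math>\,n_k</math>, matching the formula above):<br />

```python
def class_estimates(X, y, k):
    """Sample estimates of pi_k, mu_k and the ML variance for class k
    (one-dimensional features, for simplicity)."""
    pts = [x for x, label in zip(X, y) if label == k]
    n_k, n = len(pts), len(y)
    pi_hat = n_k / n
    mu_hat = sum(pts) / n_k
    var_hat = sum((x - mu_hat) ** 2 for x in pts) / n_k  # ML estimate
    return pi_hat, mu_hat, var_hat

X = [1.0, 2.0, 3.0, 10.0, 12.0]
y = [1, 1, 1, 2, 2]
pi1, mu1, var1 = class_estimates(X, y, 1)
print(pi1, mu1)   # 0.6 2.0
```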
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data in each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,-\frac{1}{2}\log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,\log(\pi_k)</math>. The class for which this adjusted distance is smallest maximizes <math>\,\delta_k</math>, and by the theorem we classify the point to that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that the data in each class are spherical. <br />
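In code, Case 1 amounts to nearest-mean classification with a prior adjustment. A small Python sketch; the means and priors below are illustrative values, not data from the course:<br />

```python
import math

def classify_spherical(x, means, priors):
    """With Sigma_k = I, delta_k reduces to -0.5*||x - mu_k||^2 + log(pi_k);
    return the index k maximizing it."""
    def delta(mu, pi):
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
        return -0.5 * sq_dist + math.log(pi)
    scores = [delta(mu, pi) for mu, pi in zip(means, priors)]
    return scores.index(max(scores))

means = [[0.0, 0.0], [4.0, 4.0]]
priors = [0.5, 0.5]
print(classify_spherical([1.0, 1.0], means, priors))   # 0 (closer to the first mean)
```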
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>.<br />
So if <math>\, X</math> is symmetric, we have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
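A sketch of this transformation in Python, taking <math>\,\Sigma</math> diagonal so that <math>\,U=I</math> and <math>\,S^{-1/2}</math> acts elementwise as <math>\,1/\sqrt{s_i}</math> (a simplification made here for illustration; the general case additionally rotates by <math>\,U^\top</math>):<br />

```python
import math

def whiten(x, s_diag):
    """x* = S^(-1/2) U' x with U = I, i.e. a diagonal covariance."""
    return [xi / math.sqrt(si) for xi, si in zip(x, s_diag)]

# After the transformation, Mahalanobis distance becomes plain Euclidean distance:
s = [4.0, 9.0]
x_star = whiten([2.0, 3.0], s)
mu_star = whiten([0.0, 0.0], s)
sq_dist = sum((a - b) ** 2 for a, b in zip(x_star, mu_star))
print(x_star, sq_dist)   # [1.0, 1.0] 2.0, which equals (2^2)/4 + (3^2)/9
```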
<br />
Note that when we have multiple classes, they must all use the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for classification to be applicable using this method, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariance matrices <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and we wish to classify a new data point. Which transformation should we use? If we apply the transformation of class A, we have already assumed that the point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class against the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> decision boundaries in total. Each boundary <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
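These counts are easy to sanity-check in code. For example, with 2 classes and 64-dimensional inputs (the dimensionality of the 2_3 digit data), LDA needs 65 parameters while QDA needs 2145:<br />

```python
def lda_param_count(K, d):
    """(K-1) boundaries of the form a'x + b, each with d + 1 parameters."""
    return (K - 1) * (d + 1)

def qda_param_count(K, d):
    """(K-1) boundaries x'Ax + b'x + c, each with d(d+3)/2 + 1 parameters."""
    return (K - 1) * (d * (d + 3) // 2 + 1)

print(lda_param_count(2, 64), qda_param_count(2, 64))   # 65 2145
```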
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of LDA's line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform principal component analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function, but here we analyze its code to see how it differs from the SVD method. The following is the code of <code>princomp</code> with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the free parameters of a symmetric <math>d \times d</math> matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>\,v</math> is a diagonal <math>d \times d</math> matrix, that we cannot estimate with a linear method directly.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
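The augmentation step itself is tiny. The following Python sketch (illustrative, not the course's Matlab) appends the square of every feature to each row, producing the higher-dimensional data on which a linear classifier can realize a quadratic boundary:<br />

```python
def augment_quadratic(X):
    """Append the square of every feature: [x1..xd] -> [x1..xd, x1^2..xd^2]."""
    return [row + [v * v for v in row] for row in X]

# Two toy 2-dimensional points become 4-dimensional points.
X = [[1.0, 2.0],
     [3.0, -1.0]]
X_star = augment_quadratic(X)
# X_star[0] is [1.0, 2.0, 1.0, 4.0]
```

Running LDA on <code>X_star</code> then yields a decision rule that is linear in the augmented features but quadratic in the original ones.<br />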
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA contrasts with that of our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function (from the MASS package), given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that each data point belongs to one of two classes. Intuitively speaking, the points of each class form a cloud around the class mean, and each class may have a different size. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the squared Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below.<br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points into a one-dimensional space.<br />
<br />
<br />
We want to maximize the squared Euclidean distance between the projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
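Both covariances are straightforward to compute from labelled data. A NumPy sketch on made-up two-class data (all names and values are illustrative):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(loc=0.0, size=(100, 2))   # class 1 points
X2 = rng.normal(loc=3.0, size=(100, 2))   # class 2 points

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Between class covariance: outer product of the mean difference (rank one).
d = (mu1 - mu2).reshape(-1, 1)
S_B = d @ d.T

# Within class covariance: sum of the two class covariances.
S_W = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
```

Note that <code>S_B</code> has rank one, since it is the outer product of a single vector with itself.<br />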
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite covariance matrices, and so it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
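This closed form is easy to check numerically: the direction <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> should attain a Rayleigh quotient <math>\frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math> at least as large as any other direction. A NumPy sketch with synthetic data (illustrative only):<br />

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([1, 1], [[1, 1.5], [1.5, 3]], size=200)
X2 = rng.multivariate_normal([5, 3], [[1, 1.5], [1.5, 3]], size=200)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
diff = (mu1 - mu2).reshape(-1, 1)
S_B = diff @ diff.T
S_W = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)

# The FDA direction, up to scale: S_W^{-1} (mu1 - mu2).
w_star = np.linalg.solve(S_W, mu1 - mu2)

def rayleigh(w):
    """Ratio of projected between-class to within-class covariance."""
    return float(w @ S_B @ w) / float(w @ S_W @ w)

# No random direction should beat the closed-form solution.
best_random = max(rayleigh(rng.normal(size=2)) for _ in range(1000))
```

Since <math>S_{B}</math> has rank one, the maximum of the quotient is attained exactly along <code>w_star</code>.<br />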
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the first principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the most discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64&times;400 and each column represents an 8&times;8 image of a "2" or a "3". Here X1 gets all the "2"s and X2 gets all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> is the scatter matrix of class <math>i</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant; since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is another, more general form of <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
where the cross terms vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})=0</math> within each class.<br />
<br />
The decomposition above expresses the total covariance <math>\mathbf{S}_{T}</math> as the within class covariance <math>\mathbf{S}_{W}</math><br />
plus a second term; we define that second term to be the<br />
general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
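This decomposition can be verified numerically on a small three-class example. In the NumPy sketch below (illustrative data), the scatter matrices are unnormalized sums of outer products, matching the derivation in this section:<br />

```python
import numpy as np

rng = np.random.default_rng(3)
classes = [rng.normal(loc=c, size=(50, 2)) for c in (0.0, 2.0, 5.0)]
X = np.vstack(classes)
mu = X.mean(axis=0)

def scatter(A, center):
    """Unnormalized scatter matrix: sum of outer products about `center`."""
    B = A - center
    return B.T @ B

S_T = scatter(X, mu)                                          # total scatter
S_W = sum(scatter(C, C.mean(axis=0)) for C in classes)        # within-class scatter
S_B = sum(len(C) * np.outer(C.mean(axis=0) - mu, C.mean(axis=0) - mu)
          for C in classes)                                   # between-class scatter
```

The identity <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> holds exactly, up to floating-point error.<br />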
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since <math>n\mathbf{\mu} = n_{1}\mathbf{\mu}_{1} + n_{2}\mathbf{\mu}_{2}</math>, we have<br />
<math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and<br />
<math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>.<br />
<br />
Substituting these into the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \left(\frac{n_{1}n_{2}^{2}}{n^{2}} + \frac{n_{2}n_{1}^{2}}{n^{2}}\right)(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}\mathbf{S}_{B^{\ast}}<br />
\end{align}<br />
</math><br />
Thus the general <math>\mathbf{S}_{B}</math> is a positive multiple of <math>\mathbf{S}_{B^{\ast}}</math>, so the two forms yield the same discriminant direction.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
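A quick numerical check of this identity (illustrative NumPy):<br />

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3))

frobenius_sq = float(np.sum(X ** 2))    # ||X||^2, summing squared entries
trace_form = float(np.trace(X.T @ X))   # Tr(X^T X)
```

Both expressions compute the same sum of squared matrix entries.<br />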
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \le k-1</math>.<br />
<br />
Therefore, the solution for this problem is the same as in the two-class case: the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
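The full multi-class recipe fits in a few lines of NumPy (synthetic data; since <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> is not symmetric, <code>eig</code> may return values with negligible imaginary parts, so we keep the real part):<br />

```python
import numpy as np

rng = np.random.default_rng(5)
k = 3
classes = [rng.normal(loc=c, size=(60, 4)) for c in (0.0, 3.0, 6.0)]
X = np.vstack(classes)
mu = X.mean(axis=0)

# Unnormalized within-class and between-class scatter matrices.
S_W = sum((C - C.mean(0)).T @ (C - C.mean(0)) for C in classes)
S_B = sum(len(C) * np.outer(C.mean(0) - mu, C.mean(0) - mu) for C in classes)

# Eigenvectors of S_W^{-1} S_B; keep the k-1 with largest eigenvalues.
vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(vals.real)[::-1]
W = vecs.real[:, order[:k - 1]]       # d x (k-1) projection matrix

Z = X @ W                             # projected data, n x (k-1)
```

Note that <code>S_B</code> has rank at most <math>k-1</math>, which is why only <math>k-1</math> discriminant directions carry information.<br />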
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{d}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
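These formulas translate directly into code. A NumPy sketch with made-up data (illustrative), which also confirms that the hat matrix is symmetric and idempotent:<br />

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 50, 3
# Design matrix with a leading column of ones for the intercept: n x (d+1).
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)

# Least-squares solution: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix maps y to the fitted values.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
```

Because <math>\mathbf{H}</math> is an orthogonal projection onto the column space of <math>\mathbf{X}</math>, applying it twice changes nothing.<br />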
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab, with an explanation of each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample;ones(1,400)];<br />
Construct x by appending a row of ones to the data, so that each column is a data point augmented with an intercept term.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Classify each point by thresholding its fitted value at 0.5 and plot the two predicted classes in different colours.<br />
<br />
[[File: linearregression.png|center|frame|The classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
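Solving the log odds model for the posterior makes this explicit. Writing <math>\,p=P(Y=1|X=x)</math>, we have<br />
:<math><br />
\begin{align}<br />
\frac{p}{1-p}&=\exp(\beta^Tx)\\<br />
p&=\exp(\beta^Tx)(1-p)\\<br />
p\,(1+\exp(\beta^Tx))&=\exp(\beta^Tx)\\<br />
p&=\frac{\exp(\beta^Tx)}{1+\exp(\beta^Tx)}<br />
\end{align}<br />
</math><br />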
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood: the maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of observing the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{(exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math> you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], it's a very useful website including a Matrix Reference Manual that you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained more easily by first rewriting the score with the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>, so that <math>\underline{\beta}</math> occurs only once,<br />
<br />
and then computing <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
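As a sketch of this estimator in plain Python (with hypothetical data and weights), for an intercept plus one feature the weighted normal equations reduce to a <math>2\times 2</math> system:<br />

```python
def wls_fit(xs, ys, ws):
    """Weighted least squares for (intercept b0, slope b1):
    minimizes sum_i w_i * (y_i - b0 - b1 * x_i)^2."""
    # Entries of sum_i w_i x_i x_i^T for rows (1, x_i)
    s1  = sum(ws)
    sx  = sum(w * x for w, x in zip(ws, xs))
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    # Entries of sum_i w_i x_i y_i
    sy  = sum(w * y for w, y in zip(ws, ys))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    det = s1 * sxx - sx * sx
    # Closed-form 2x2 solve
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (s1 * sxy - sx * sy) / det
    return b0, b1

# Data lying exactly on y = 1 + 2x: any positive weights recover (1, 2),
# since the residuals are zero regardless of the weighting
b0, b1 = wls_fit([0, 1, 2, 3], [1, 3, 5, 7], [1.0, 2.0, 3.0, 4.0])
```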
<br />
In our setting the weights are <math>w_{i}=P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math>, and each Newton step is a weighted linear regression on the iteratively recomputed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:'''Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model like <math>\underline{\beta}^T\underline{x}</math>. If we construct the model like <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then similar to linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. When it does not, we can only prove local convergence of the method: the iteration converges provided the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem, since an initial point far enough from the exact solution to invalidate the iteration is uncommon. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
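The pseudo code can be sketched in plain Python for the two-parameter case (an intercept plus one feature), where the <math>2\times 2</math> Newton system can be solved in closed form. The data below are hypothetical and deliberately non-separable, so the maximum likelihood estimate is finite:<br />

```python
import math

def logistic_fit(xs, ys, iters=25):
    """Two-parameter logistic regression (intercept b0, slope b1)
    fitted by Newton-Raphson, following the pseudo code above."""
    b0, b1 = 0.0, 0.0                                # step 1: beta <- 0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # P(x_i; beta)
            g0 += y - p                              # score: X(Y - P)
            g1 += (y - p) * x
            w = p * (1.0 - p)                        # Newton weight P(1 - P)
            h00 += w                                 # entries of X W X^T
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # beta_new = beta_old + (X W X^T)^{-1} X (Y - P)
        b0 += ( h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1

b0, b1 = logistic_fit([-2, -1, -0.5, 0.5, 1, 2], [0, 0, 1, 0, 1, 1])
```

A fixed iteration count stands in here for the convergence test of step 7; in practice one would stop once the change in <math>\underline{\beta}</math> falls below a tolerance.<br />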
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. the dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. the dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to classify the data with logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2\geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|The decision boundary found by logistic regression. The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general, however, the posteriors are no longer complements of each other, as they are in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.<br />
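A small sketch in plain Python (with hypothetical coefficient vectors) shows how the K-class posteriors are computed and that they sum to one:<br />

```python
import math

def softmax_posteriors(betas, x):
    """Posterior P(Y=i|X=x) for K classes given the K-1 coefficient
    vectors beta_1, ..., beta_{K-1}; class K is the reference class."""
    # beta_i^T x for each of the K-1 modelled classes
    scores = [sum(b * xj for b, xj in zip(beta, x)) for beta in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1.0 / denom)   # reference class K
    return probs

# Hypothetical 3-class example in 2 dimensions (K-1 = 2 coefficient vectors)
p = softmax_posteriors([[1.0, -0.5], [0.2, 0.3]], [1.0, 2.0])
```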
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Particular to the iterative nature of the solution, the problem is not convex and has no unique global minimum. The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable then the algorithm is shown to converge to a solution. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line. We just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1, but the right answer is +1, so the perceptron adjusts its line and we try the next example. Eventually the perceptron gets all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the length of the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}/\|\underline{\beta}\|</math>. <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is proportional to the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}</math> (equal to it when <math>\|\underline{\beta}\|=1</math>). <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points. <br />
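The update rule above can be sketched in a few lines of plain Python; this mirrors the logic of the earlier Matlab example, using the condition <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0})\le 0</math> to flag a misclassified point (the <math>\le</math> lets the all-zero starting point count as misclassified):<br />

```python
def perceptron_train(X, y, rho=0.5, max_iter=1000):
    """Rosenblatt's perceptron: sweep the data and take one gradient step
    (beta, beta0) += rho * (y_i * x_i, y_i) for each misclassified point."""
    beta, beta0 = [0.0] * len(X[0]), 0.0
    for _ in range(max_iter):
        changed = False
        for xi, yi in zip(X, y):
            if yi * (sum(b * v for b, v in zip(beta, xi)) + beta0) <= 0:
                beta = [b + rho * yi * v for b, v in zip(beta, xi)]
                beta0 += rho * yi
                changed = True
        if not changed:        # a full clean sweep: every point classified
            break
    return beta, beta0

# The separable toy data from the earlier table
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
y = [1, 1, 1, -1, -1, -1]
beta, beta0 = perceptron_train(X, y)
```

Because these classes are separable, the perceptron convergence theorem guarantees the loop exits after finitely many updates with every point on the correct side of the boundary.<br />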
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Christopher M. Bishop, ''Pattern Recognition and Machine Learning'', p. 194.<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Imagine standing on a mountain peak, wanting to reach the ground as quickly as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the gradient. However, if the mountain has a saddle shape and you unfortunately start in the middle, you will end up at the saddle point, which is not a true minimum, and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm we drop the summation over <math>i</math> (all data points). This is a variant of the original algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, where we approximate the true gradient by evaluating it on a single training example, so that <math>{\beta}</math> is improved by the computation on only one sample. When there is a large data set, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent we can treat the problem sample by sample and still get decent results in practice.<br />
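The stochastic update described above can be sketched in a few lines. This is a minimal illustration, not the course's code: the toy data, the learning rate `rho`, and the ±1 label coding are all assumptions.

```python
import random

def sgd_perceptron(points, labels, rho=0.5, epochs=100, seed=0):
    """Stochastic gradient descent for a 2-D perceptron: update
    (w1, w2, b) using one misclassified sample at a time."""
    random.seed(seed)
    w1, w2, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        idx = list(range(len(points)))
        random.shuffle(idx)
        for i in idx:
            x1, x2 = points[i]
            y = labels[i]                          # labels coded as +1 / -1
            if y * (w1 * x1 + w2 * x2 + b) <= 0:   # misclassified point
                # gradient step uses only this one sample (stochastic)
                w1 += rho * y * x1
                w2 += rho * y * x2
                b  += rho * y
    return w1, w2, b

# Two linearly separable toy clusters
pts  = [(1, 1), (2, 1), (1, 2), (-1, -1), (-2, -1), (-1, -2)]
lbls = [1, 1, 1, -1, -1, -1]
w1, w2, b = sgd_perceptron(pts, lbls)
errors = sum(y * (w1 * x1 + w2 * x2 + b) <= 0 for (x1, x2), y in zip(pts, lbls))
print(errors)  # 0: the data is separable, so the algorithm converges
```

Because the data is separable, the perceptron convergence theorem cited above guarantees that the loop stops making updates after finitely many mistakes.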
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information-processing structure consisting of processing elements interconnected by signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal: the processing element's output signal <ref><br />
Theory of the Backpropagation Neural Network, Robert Hecht-Nielsen </ref>. It is a multistage regression or classification model represented by a network.<br />
<br />
<br />
''This diagram is an example of a typical neural network but it can have many different forms.''<br />
[[File:NN.png|500px|thumb|center|Figure 1: General Structure of a Neural Network.]]<br />
<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there can be '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded as 0 or 1.<br />
<br />
===Activation Function===<br />
The term activation function is used frequently when discussing classification by neural networks. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of the input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]], so its derivative cannot be taken. Thus, we replace it by a smooth continuous function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The function <math>\displaystyle \sigma </math> can have any form, but typically the logistic (sigmoid) form <math>\sigma(a)=\frac {1}{1+e^{-a}}</math> is used.<br />
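As a quick check of the logistic activation, the sketch below evaluates <math>\sigma</math> and its derivative, which takes the convenient form <math>\sigma(a)(1-\sigma(a))</math>; the evaluation points are arbitrary.

```python
import math

def sigmoid(a):
    """Logistic activation: a smooth, differentiable stand-in for sign."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    """Derivative of the logistic, in the form sigma(a) * (1 - sigma(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

print(sigmoid(0))        # 0.5: the midpoint of the logistic curve
print(sigmoid_prime(0))  # 0.25: the slope is steepest at a = 0
```

Far from the origin the function saturates near 0 or 1, which is what lets it mimic the hard sign function while remaining differentiable.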
<br />
By assigning weights to the connectors in the neural network (see diagram above), we weight the input that comes into each perceptron to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called a '''Feed-Forward Neural Network'''.<br />
<br />
===Back-propagation===<br />
For quite a while the neural network model was just an idea, since there were no algorithms for training the model until Geoffrey Hinton in 1986 <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> came up with an algorithm called '''back-propagation'''. After that, a number of other training algorithms and various configurations of neural networks were implemented.<br />
<br />
When we were talking about perceptrons, we applied gradient descent algorithms to optimize the weights. Back-propagation uses this idea of gradient descent to train a neural network. <br />
<br />
Assume that the output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to multiple output units and thus turned into a classification problem.<br />
<br />
[[File:backpropagation.png|400px|]]<br />
<br />
Note that we make a distinction between the input weights <math>\displaystyle (w_i)</math> and hidden weights <math>\displaystyle (u_i)</math>. <br />
<br><br>Within each perceptron we have a function that takes an input <math>\displaystyle a_i</math> and outputs <math>\displaystyle z_i</math>; we call it the activation function, <math>\displaystyle z_i=\sigma(a_i)</math>. The <math>\displaystyle z_i</math>'s are the inputs into the final output of the model <math>\Rightarrow \hat y=\sum_{i=1}^p w_i z_i</math><br />
<br />
We can find the error of the neural network output by evaluating the squared difference between the true response and the resulting output <math>\Rightarrow \displaystyle error=||y-\hat y ||^2 </math><br />
<br />
Goal: Find the derivative of the error with respect to <math>\displaystyle w_i</math>, which by the chain rule is<br />
<math>\frac{\partial \, error}{\partial w_i}=\frac{\partial \, error}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}</math><br />
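The chain-rule derivative can be verified numerically on a toy one-output network; all values below are made up for illustration.

```python
# One output unit: yhat = sum_i w_i * z_i, error = (y - yhat)^2.
# Chain rule: d(error)/d(w_i) = d(error)/d(yhat) * d(yhat)/d(w_i)
#                             = -2 * (y - yhat) * z_i
z = [0.5, -0.2, 0.8]     # hidden-unit outputs (made-up values)
w = [0.1, 0.4, -0.3]     # output weights (made-up values)
y = 1.0                  # target

yhat = sum(wi * zi for wi, zi in zip(w, z))
analytic = [-2.0 * (y - yhat) * zi for zi in z]

# Finite-difference check of the same derivative
eps = 1e-6
numeric = []
for i in range(len(w)):
    w_hi = list(w); w_hi[i] += eps
    w_lo = list(w); w_lo[i] -= eps
    e_hi = (y - sum(wi * zi for wi, zi in zip(w_hi, z))) ** 2
    e_lo = (y - sum(wi * zi for wi, zi in zip(w_lo, z))) ** 2
    numeric.append((e_hi - e_lo) / (2 * eps))

print(all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric)))  # True
```

The central difference agrees with the analytic gradient because the error is exactly quadratic in each weight.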
<br />
==Notes==<br />
<references/></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Backpropagation.png&diff=4490File:Backpropagation.png2009-10-28T22:31:32Z<p>Ipargaru: </p>
<hr />
<div></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4489stat8412009-10-28T22:15:54Z<p>Ipargaru: /* Activation Function */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classfication-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{color}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point <math>\,x \in \mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
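The empirical error rate translates directly into code; the classifier and data below are hypothetical stand-ins.

```python
def empirical_error_rate(h, X, Y):
    """L_h = (1/n) * sum of the indicators I(h(X_i) != Y_i)."""
    n = len(X)
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / n

# Toy rule: predict class 1 if x > 0, else class 0 (hypothetical)
h = lambda x: 1 if x > 0 else 0
X = [-2.0, -0.5, 0.3, 1.2, -1.0]
Y = [0, 1, 1, 1, 0]          # h misclassifies only the second point
print(empirical_error_rate(h, X, Y))  # 0.2
```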
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then to assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use the ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case in which <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(x)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by the ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal with respect to the true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says that we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability that it is of any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0; that is, we predict that he will fail the course.<br />
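The arithmetic above can be reproduced with Bayes' formula. Since the frequency table itself is in the image, the class-conditional probabilities below (0.05 and 0.2) are assumed values, chosen only so that they reproduce the quoted numerator 0.025 and denominator 0.125.

```python
def posterior_y1(lik1, lik0, prior1=0.5, prior0=0.5):
    """r(x) = P(Y=1|X=x) via Bayes' formula."""
    num = lik1 * prior1
    return num / (num + lik0 * prior0)

# Assumed class-conditional probabilities for X = (0, 1, 0):
# P(X|Y=1) = 0.05 and P(X|Y=0) = 0.2, matching 0.025 / 0.125 above.
r = posterior_y1(lik1=0.05, lik0=0.2)
print(round(r, 6))           # 0.2
print(1 if r > 0.5 else 0)   # 0: predict "fail"
```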
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods. The reason is that in the Bayes equation discussed above, it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>, which makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as an objective existence. In fact, they represent two different schools of statistics.<br />
<br />
Throughout the history of statistics there have been two major schools: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables that have a given distribution, and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist method, by contrast, one does not see the man (the object) but sees photos (labels) of him in order to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although it is the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known, so some estimate of them must be made if we want to classify the data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the covariances <math>\Sigma_k \ \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(G=k|X=x)=Pr(G=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>p</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
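In one dimension the boundary equation above can be solved in closed form. The sketch below uses made-up parameters; it also shows how unequal priors shift the boundary away from the halfway point between the means.

```python
import math

def lda_boundary_1d(mu_k, mu_l, sigma2, pi_k, pi_l):
    """Solve the 1-D version of the LDA boundary equation
       log(pi_k/pi_l) - (mu_k^2 - mu_l^2)/(2*sigma2)
                      + x*(mu_k - mu_l)/sigma2 = 0
       for x, giving x = (mu_k + mu_l)/2 - sigma2*log(pi_k/pi_l)/(mu_k - mu_l)."""
    return (mu_k + mu_l) / 2 - sigma2 * math.log(pi_k / pi_l) / (mu_k - mu_l)

# Equal priors: the boundary lies halfway between the two means
print(lda_boundary_1d(mu_k=2.0, mu_l=-2.0, sigma2=1.0, pi_k=0.5, pi_l=0.5))  # 0.0

# A larger prior for class k pushes the boundary toward class l's mean
print(lda_boundary_1d(mu_k=2.0, mu_l=-2.0, sigma2=1.0, pi_k=0.8, pi_l=0.2) < 0.0)  # True
```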
<br />
===QDA===<br />
The concept is the same: find a boundary where the error rates for classification between the classes are equal, except that the assumption that each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the covariances <math>\Sigma_k \ \forall k</math>, is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
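A one-dimensional sketch with made-up parameters illustrates the quadratic boundary. After taking logs as above, each class contributes a discriminant score; with unequal variances the scores cross in two places, so the wider class can capture points on both sides of the narrower one.

```python
import math

def delta(x, mu, sigma2, pi):
    """1-D quadratic discriminant score:
       -0.5*log(sigma2) - 0.5*(x - mu)^2 / sigma2 + log(pi)."""
    return -0.5 * math.log(sigma2) - 0.5 * (x - mu) ** 2 / sigma2 + math.log(pi)

# Two classes with different variances (made-up parameters)
mu_k, s2_k, pi_k = 0.0, 1.0, 0.5
mu_l, s2_l, pi_l = 4.0, 9.0, 0.5

def classify(x):
    return 'k' if delta(x, mu_k, s2_k, pi_k) > delta(x, mu_l, s2_l, pi_l) else 'l'

print(classify(0.5))    # k: close to class k's mean
print(classify(5.0))    # l: close to class l's mean
# The wide class l also captures points far on the *other* side of
# class k, which a single linear boundary could never do:
print(classify(-8.0))   # l
```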
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice we do not know the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math>, so we use the sample estimates in their place, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
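The estimates above translate directly into code. Here is a sketch for one-dimensional data with made-up values; note that the per-class variances are the ML estimates (divided by <math>n_k</math> rather than <math>n_k-1</math>), matching the formulas above.

```python
def estimates(X, Y):
    """Sample estimates of pi_k, mu_k, and the pooled (common) variance
       for 1-D data, following the ML formulas above."""
    classes = sorted(set(Y))
    n = len(X)
    pi, mu, s2, nk = {}, {}, {}, {}
    for k in classes:
        xs = [x for x, y in zip(X, Y) if y == k]
        nk[k] = len(xs)
        pi[k] = nk[k] / n
        mu[k] = sum(xs) / nk[k]
        s2[k] = sum((x - mu[k]) ** 2 for x in xs) / nk[k]   # ML estimate
    # Pooled variance: per-class variances weighted by class size
    pooled = sum(nk[k] * s2[k] for k in classes) / n
    return pi, mu, pooled

X = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # made-up data
Y = [0, 0, 0, 1, 1, 1]
pi, mu, pooled = estimates(X, Y)
print(pi[0], mu[0], mu[1], pooled)  # pi_0 = 0.5, means 2 and 11, pooled = 2/3
```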
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero since <math>\,|I|=1</math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it by the log of the prior, <math>\,log(\pi_k)</math>. The class with the minimum adjusted distance will maximize <math>\,\delta_k</math>, and according to the theorem we then classify the point to that class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical.<br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> contains the eigenvectors of <math>\,XX^\top</math> and <math>\,V</math> the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
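For the special case of a diagonal covariance (so that <math>U=I</math> and <math>S</math> holds the per-coordinate variances), the transformation <math>x^* = S^{-\frac{1}{2}}U^\top x</math> just rescales each coordinate by its standard deviation. The numbers below are made up.

```python
import math

# Diagonal covariance: Sigma = U S U^T with U = I and S = diag(4, 1).
# The whitening map x* = S^(-1/2) U^T x divides each coordinate by its
# standard deviation, making the isocontours circular (Case 1 above).
S = [4.0, 1.0]

def whiten(x):
    return [xi / math.sqrt(si) for xi, si in zip(x, S)]

x  = [2.0, 1.0]
mu = [0.0, 0.0]
print(whiten(x))  # [1.0, 1.0]: both coordinates are now one std. dev. out

# The Mahalanobis distance in the original space equals the squared
# Euclidean distance in the whitened space:
maha2 = sum((xi - mi) ** 2 / si for xi, mi, si in zip(x, mu, S))
eucl2 = sum(w ** 2 for w in whiten(x))
print(maha2 == eucl2)  # True
```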
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise, we would have to assume ahead of time that a data point belongs to one class or the other. All classes therefore need to have the same shape for this method of classification to be applicable, so this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose you have two classes with different shapes and you transform them both to the same shape. Given a data point, which class's transformation should you use to decide where the point belongs? If, for example, you use the transformation of class A, then you have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare the differences between one given class and the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
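These counts are easy to tabulate. The sketch below evaluates the two formulas for a few dimensions; <math>d=64</math> corresponds to the 64-dimensional 2_3 digit data used in the Matlab section below, and <math>K=3</math> is an arbitrary example.

```python
def lda_params(K, d):
    """(K-1) boundaries, each a^T x + b with d + 1 parameters."""
    return (K - 1) * (d + 1)

def qda_params(K, d):
    """(K-1) boundaries, each x^T a x + b^T x + c with
       d(d+3)/2 + 1 parameters (a is symmetric)."""
    return (K - 1) * (d * (d + 3) // 2 + 1)

for d in (2, 10, 64):
    print(d, lda_params(3, d), qda_params(3, d))
# At d = 64 with K = 3, LDA needs 130 parameters while QDA needs 4290,
# showing how much less robust QDA is in high dimensions.
```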
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
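For readers without Matlab, the linear rule behind <code>classify(..., 'linear')</code> can be sketched in plain Python/numpy. This is an illustrative sketch assuming equal class priors, not the actual <code>classify</code> implementation: estimate the class means and a pooled covariance, then assign each point to the class with the larger linear discriminant score.<br />

```python
import numpy as np

def lda_fit_predict(X, y):
    """LDA with a pooled covariance estimate; returns predicted labels.
    Equal class priors are assumed for simplicity."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    # pooled (within-class) covariance estimate
    pooled = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1)
                 for c in classes) / (len(y) - len(classes))
    inv = np.linalg.inv(pooled)
    # linear discriminant score: x^T inv mu_c - mu_c^T inv mu_c / 2
    scores = np.column_stack(
        [X @ inv @ means[c] - means[c] @ inv @ means[c] / 2 for c in classes])
    return classes[np.argmax(scores, axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (200, 2)), rng.normal([4, 4], 1, (200, 2))])
y = np.repeat([1, 2], 200)
pred = lda_fit_predict(X, y)
print((pred == y).mean())  # close to 1 for such well-separated classes
```

As with <code>sum(class==group)</code> in Matlab, comparing <code>pred</code> to <code>y</code> gives the empirical error rate.<br />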
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently; the Matlab help file on <code>princomp</code> gives the full details. Here we analyze the code of <code>princomp()</code> itself to see how it differs from the SVD method. The following is the code of princomp, with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, note the following differences from the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off the column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
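The same check can be written in Python/numpy (a sketch on random data in place of 2_3, to keep it self-contained): centering the observations and taking the SVD gives scores identical to projecting onto the right singular vectors, which is exactly what princomp returns.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 400))               # columns are observations, as in 2_3

# center the observations and transpose, so rows are observations (like princomp)
Xc = (X - X.mean(axis=1, keepdims=True)).T

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T                                     # loadings (princomp's first output)
score = Xc @ V                               # scores (princomp's second output)

# the scores also fall straight out of the SVD as U * diag(s)
print(np.allclose(score, U * s))             # True
```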
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus a second <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimated parameters make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^{d}</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate with a linear method (here <math>\,v</math> is taken to be diagonal, so the quadratic term contributes one coefficient <math>\,v_i</math> per dimension).<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
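A one-dimensional toy example (a numpy sketch, not from the lecture) shows why the augmentation works: points clustered near the origin versus points far from it cannot be split by any single threshold on <math>\,x</math>, but become linearly separable once the <math>\,x^2</math> coordinate is appended.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.uniform(-1, 1, 100)                                   # class 1: near 0
x2 = np.concatenate([rng.uniform(-4, -2, 50),
                     rng.uniform(2, 4, 50)])                   # class 2: far from 0

# No threshold on x separates the classes (class 2 sits on both sides of class 1),
# but in the augmented feature x^2 the linear threshold x^2 = 2 separates perfectly.
print(np.all(x1**2 < 2) and np.all(x2**2 > 2))                 # True
```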
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function (from the MASS package), given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively, the points of each class form a cloud around the class mean, and the two clouds may have different sizes. To separate the two classes, we must determine which class mean is closest to a given point, while also accounting for the size of each class, which is represented by its covariance.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive semi-definite matrices; provided it is positive definite (as is typically the case in practice), it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
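This proportionality is easy to verify numerically. The following is a Python/numpy sketch (mirroring the notation above, not course-provided code): it compares the top eigenvector of <math>S_{w}^{-1}S_{B}</math> with the closed form <math>S_{w}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([1, 1], [[1, 1.5], [1.5, 3]], 300)
X2 = rng.multivariate_normal([5, 3], [[1, 1.5], [1.5, 3]], 300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)   # within-class covariance
Sb = np.outer(mu1 - mu2, mu1 - mu2)                        # between-class covariance

# eigenvector of Sw^{-1} Sb with the largest eigenvalue
vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = np.real(vecs[:, np.argmax(np.real(vals))])

# closed form: proportional to Sw^{-1} (mu1 - mu2)
w = np.linalg.inv(Sw) @ (mu1 - mu2)

# the two directions agree up to scale: |cos(angle)| = 1
cos = abs(w @ w_eig) / (np.linalg.norm(w) * np.linalg.norm(w_eig))
print(np.isclose(cos, 1.0))                                # True
```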
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the principal component direction and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the most discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of a "2" or a "3". Here X1 holds all the "2"s and X2 holds all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{1}{n_{i}}\sum_{j: y_{j}=i}\mathbf{x}_{j}</math>. (The class scatter matrices are left unnormalized here so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not as easy to obtain directly. One simplification is to note that the total covariance <math>\mathbf{S}_{T}</math> of the data is fixed; since <math>\mathbf{S}_{W}</math> is easy to compute, we can get <math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is another, more general expression for <math>\mathbf{S}_{B}</math>. Denote the total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
(The cross terms in the expansion vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i}) = 0</math> for each class.)<br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
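This decomposition is easy to check numerically; the following numpy sketch (with unnormalized class scatter matrices, matching the derivation above) verifies <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> on synthetic data:<br />

```python
import numpy as np

rng = np.random.default_rng(4)
k, d = 3, 2
centers = [np.zeros(d), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
Xs = [rng.normal(c, 1.0, (100, d)) for c in centers]   # k classes, 100 points each
X = np.vstack(Xs)
mu = X.mean(axis=0)                                    # total mean

St = sum(np.outer(x - mu, x - mu) for x in X)          # total scatter
Sw = sum(np.outer(x - Xi.mean(axis=0), x - Xi.mean(axis=0))
         for Xi in Xs for x in Xi)                     # within-class scatter
Sb = sum(len(Xi) * np.outer(Xi.mean(axis=0) - mu, Xi.mean(axis=0) - mu)
         for Xi in Xs)                                 # between-class scatter

print(np.allclose(St, Sw + Sb))                        # True
```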
<br />
Recall that in the two-class problem we had<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Since <math>\mathbf{\mu} = \frac{1}{n}(n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2})</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>. Substituting these gives<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \frac{n_{1}n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
So the general <math>\mathbf{S}_{B}</math> is proportional to the two-class <math>\mathbf{S}_{B^{\ast}}</math>, and both lead to the same discriminant direction.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this optimization problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math>, i.e. they satisfy<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math> for the within-class scatter. Thus we have the following criterion function:<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem, a Lagrange multiplier <math>\Lambda</math>, which is in fact a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> must have <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B})=k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case: the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math>, which satisfy<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
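The multi-class FLDA procedure above can be sketched in a few lines. The following is a minimal NumPy illustration, not course code: the synthetic three-class data and all variable names are hypothetical, and it simply builds the scatter matrices and takes the top <math>k-1</math> eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math>.<br />

```python
import numpy as np

# Hypothetical toy data: k = 3 classes in d = 3 dimensions.
rng = np.random.default_rng(0)
k, d, n_per = 3, 3, 50
means = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
X = np.vstack([rng.normal(means[i], 1.0, size=(n_per, d)) for i in range(k)])
y = np.repeat(np.arange(k), n_per)

mu = X.mean(axis=0)            # overall mean
S_W = np.zeros((d, d))         # within-class scatter
S_B = np.zeros((d, d))         # between-class scatter
for i in range(k):
    Xi = X[y == i]
    mu_i = Xi.mean(axis=0)
    D = Xi - mu_i
    S_W += D.T @ D
    diff = (mu_i - mu).reshape(-1, 1)
    S_B += len(Xi) * (diff @ diff.T)

# Columns of W are the eigenvectors of S_W^{-1} S_B belonging to the
# largest k-1 eigenvalues; at most k-1 eigenvalues are nonzero.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs.real[:, order[:k - 1]]   # d x (k-1) transformation matrix
Z = X @ W                            # projected data, one row per point
```

Note that the near-zero trailing eigenvalue confirms the rank argument made above.<br />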
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Setting the first derivative to zero,<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
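The closed-form solution and the hat matrix above can be checked numerically. Below is a small NumPy sketch (not course code) on hypothetical, exactly linear data, so the fitted values reproduce the observations.<br />

```python
import numpy as np

# Hypothetical data: n = 5 points with a single feature, exactly linear.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                          # true relationship y = 2x + 1

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])  # n x (d+1)

# Closed-form least-squares solution beta = (X^T X)^{-1} X^T y.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T maps y to the fitted values.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
```

Here `beta` recovers the intercept 1 and slope 2, and `y_hat` equals `y` since the data lie exactly on a line.<br />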
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of the Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{k}f_{k}(x)\pi_{k}}</math><br/><br />
To make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If it is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then nothing restricts <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample'; ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, so that each column of x is an input vector with a constant term.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
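As a quick sanity check, the two posteriors above are complements and always sum to one. A tiny NumPy sketch with hypothetical coefficients (not from the course data):<br />

```python
import numpy as np

def posterior_one(beta, x):
    """P(Y=1 | X=x) under the two-class logistic model."""
    t = np.exp(beta @ x)
    return t / (1.0 + t)

beta = np.array([0.5, -1.0])   # hypothetical coefficient vector
x = np.array([1.0, 2.0])       # hypothetical input
p1 = posterior_one(beta, x)
p0 = 1.0 - p1                  # equals 1 / (1 + exp(beta^T x))
```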
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, taking the log of both sides, we get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful Matrix Reference Manual where you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> with the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiating <math>\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math> with respect to <math>\underline{\beta}^T</math>.<br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares solves the following minimization: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^T W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
In our case, we perform a weighted linear regression on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
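The equivalence between the summation form and the matrix form of the WLS estimator can be verified numerically. The following NumPy sketch uses hypothetical data and weights (here each observation is a row, so the matrix form reads <math>(\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}\mathbf{y}</math>):<br />

```python
import numpy as np

# Hypothetical data and positive weights w_i > 0.
rng = np.random.default_rng(1)
n, d = 20, 2
X = rng.normal(size=(n, d))                  # one observation per row here
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=n)
w = rng.uniform(0.5, 2.0, size=n)

# WLS estimator: beta = (sum_i w_i x_i x_i^T)^{-1} (sum_i w_i x_i y_i),
# written in matrix form as (X^T W X)^{-1} X^T W y with W = diag(w).
Wd = np.diag(w)
beta_wls = np.linalg.solve(X.T @ Wd @ X, X.T @ Wd @ y)
```

With small noise, `beta_wls` lands close to the generating coefficients (1, -2), and the summation and matrix forms agree exactly.<br />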
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we constructed the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> seems to be a suitable starting value for the Newton-Raphson iteration procedure in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. In the case that it does not, we can only prove local convergence of the method, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: an initial point is seldom so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving can resolve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
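The pseudo code above translates almost line for line. Here is a minimal NumPy sketch of the Newton-Raphson / IRLS iteration, following the course's convention that <math>X</math> is <math>d\times n</math> (one observation per column); the demo data, coefficients, and function name are hypothetical:<br />

```python
import numpy as np

def logistic_irls(X, y, max_iter=25, tol=1e-8):
    """Newton-Raphson / IRLS for two-class logistic regression,
    following the pseudo code above. X is d x n (one observation
    per column), y is a length-n vector of 0/1 labels."""
    d, n = X.shape
    beta = np.zeros(d)                            # step 1
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(beta @ X)))     # step 3
        w = p * (1.0 - p)                         # step 4: diagonal of W
        Z = X.T @ beta + (y - p) / w              # step 5
        XW = X * w                                # X W via broadcasting
        beta_new = np.linalg.solve(XW @ X.T, XW @ Z)  # step 6
        if np.linalg.norm(beta_new - beta) < tol: # step 7
            break
        beta = beta_new
    return beta_new

# Hypothetical demo: overlapping classes generated from a known model.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
X = np.vstack([np.ones(n), x1])                   # include an intercept row
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x1)))
y = (rng.uniform(size=n) < p_true).astype(float)
beta_hat = logistic_irls(X, y)
```

At convergence the score equations <math>X(\underline{Y}-\underline{P})=0</math> are satisfied, and the estimate lies near the generating coefficients.<br />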
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to fit a logistic regression and classify the data. This function returns B, which is a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem is not as clean as in the 2-class problem, since we do not have the same simplification.<br />
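The K-class posteriors above can be computed directly from the <math>K-1</math> coefficient vectors. A small NumPy sketch with hypothetical coefficients for <math>K=3</math> classes in two dimensions:<br />

```python
import numpy as np

def posteriors(betas, x):
    """Posteriors for the K-class logistic model; betas holds the K-1
    coefficient vectors and class K is the (arbitrary) reference."""
    scores = np.exp([b @ x for b in betas])        # exp(beta_i^T x)
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)  # classes 1..K-1, then K

# Hypothetical 3-class example in d = 2.
betas = [np.array([1.0, -0.5]), np.array([-0.3, 0.8])]
x = np.array([0.6, 1.2])
p = posteriors(betas, x)
```

The returned probabilities are strictly positive and sum to one, as required.<br />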
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of the input features and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Particular to the iterative nature of the solution, the problem has no unique global solution (the objective is not convex). The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable, then the algorithm is guaranteed to converge to some separating hyperplane. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line. We just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features: x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
The perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1, but the right answer is +1, so the perceptron adjusts its line and we try the next example. Eventually the perceptron gets all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];        % labels<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';  % one feature vector per column<br />
b_0=0;                     % intercept<br />
b=[1;1;1];                 % initial weight vector<br />
rho=.5;                    % learning rate<br />
for j=1:100                % at most 100 passes over the data<br />
    changed=0;<br />
    for i=1:6<br />
        d=(b'*x(:,i)+b_0)*y(i);   % negative iff point i is misclassified<br />
        if d<0<br />
            b=b+rho*x(:,i)*y(i);  % move the boundary toward the point<br />
            b_0=b_0+rho*y(i);<br />
            changed=1;<br />
        end<br />
    end<br />
    if changed==0          % no misclassified points: converged<br />
        break;<br />
    end<br />
end<br />
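For readers without MATLAB, here is a minimal Python sketch of the same training loop (a hand translation of the code above; the variable names mirror the MATLAB ones, and points lying exactly on the boundary are treated as misclassified):<br />

```python
# Perceptron training on the six example points from the table above.
# b is the weight vector, b_0 the intercept, rho the learning rate.
y = [1, 1, 1, -1, -1, -1]
x = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]

b = [1.0, 1.0, 1.0]
b_0 = 0.0
rho = 0.5

for epoch in range(100):           # at most 100 passes over the data
    changed = False
    for xi, yi in zip(x, y):
        # y_i * (b' x_i + b_0) is non-positive iff point i is misclassified
        d = (sum(bj * xj for bj, xj in zip(b, xi)) + b_0) * yi
        if d <= 0:
            b = [bj + rho * xj * yi for bj, xj in zip(b, xi)]
            b_0 += rho * yi
            changed = True
    if not changed:                # no misclassified points: converged
        break

def predict(xi):
    return 1 if sum(bj * xj for bj, xj in zip(b, xi)) + b_0 > 0 else -1
```

Because these six points are linearly separable, the loop terminates early with every training point classified correctly.<br />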
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of the weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (an initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points, as illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, terminating when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then, assuming <math>\underline{\beta}</math> is normalized so that <math>\|\underline{\beta}\|=1</math>, the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math>. <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
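A quick numeric check of this sign property (the boundary <math>\underline{\beta}, \beta_0</math> and the test point below are made up for illustration):<br />

```python
# y_i * (beta' x_i + beta_0) is positive iff the point is correctly
# classified.  beta, beta_0 and the test point are made up.
beta = [1.0, -1.0]
beta_0 = 0.5

def signed_margin(x, y):
    return y * (beta[0] * x[0] + beta[1] * x[1] + beta_0)

# The point (2, 1) has score 2 - 1 + 0.5 = 1.5 (positive side), so
# labelling it +1 gives a positive product and -1 a negative one.
correct = signed_margin((2.0, 1.0), +1)        # 1.5
misclassified = signed_margin((2.0, 1.0), -1)  # -1.5
```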
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A drawback of this approach is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to the Perceptron problem, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if it is too large the algorithm may &ldquo;skip over&rdquo; the minimum it is trying to find and oscillate between points on either side of it.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Imagine standing on a mountain peak and wanting to reach the ground as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and you unfortunately start in the middle, you may end up at the saddle point (a stationary point that is not a minimum) and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm we dropped the summation over <math>i</math> (all data points). This is an alternative to the original algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, where the true gradient is approximated by evaluating it on a single training example, so that <math>{\beta}</math> is improved using only one sample at a time. When the data set is large (say, a population database), summing over millions of samples for every step is very time-consuming; with stochastic gradient descent we can process the samples one by one and still get decent results in practice.<br />
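The distinction can be sketched as follows (made-up two-dimensional data and step size; starting from <math>\underline{\beta}=0</math>, every point here counts as misclassified):<br />

```python
# Batch vs. stochastic gradient step for phi(beta, beta_0).
# The data points and step size rho are made up for illustration.
rho = 0.5
data = [((1.0, 2.0), +1), ((2.0, 0.5), -1), ((0.5, 0.5), -1)]
beta, beta_0 = [0.0, 0.0], 0.0

def misclassified(b, b0):
    # points with non-positive margin y * (b' x + b0)
    return [(x, y) for x, y in data
            if y * (b[0] * x[0] + b[1] * x[1] + b0) <= 0]

# Batch step: sum the gradient contribution of every misclassified point.
M = misclassified(beta, beta_0)
batch_beta = [beta[j] + rho * sum(y * x[j] for x, y in M) for j in range(2)]
batch_b0 = beta_0 + rho * sum(y for x, y in M)

# Stochastic step: use a single misclassified point per update.
x1, y1 = M[0]
sgd_beta = [beta[j] + rho * y1 * x1[j] for j in range(2)]
sgd_b0 = beta_0 + rho * y1
```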
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information-processing structure consisting of processing elements interconnected by signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal - the processing element's output signal <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref>. It is a multistage regression or classification model represented by a network.<br />
<br />
<br />
''This diagram is an example of a typical neural network but it can have many different forms.''<br />
[[File:NN.png|500px|thumb|center|Figure 1: General Structure of a Neural Network.]]<br />
<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there can be '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1).<br />
<br />
===Activation Function===<br />
The activation function is a term that is frequently used in classification by neural networks. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of the input features. <br />
<br />
[[File:signfuncperceptron.png|200px|]]<br />
<br>The sign function is of the form [[File:signfunc1.png|30px|]], so its derivative cannot be taken. Thus, we replace it by a smooth continuous function <math>\displaystyle \sigma </math> of the form [[File:signfunc2.png|30px|]] and call it the '''activation function'''.<br />
<br>The function <math>\displaystyle \sigma </math> can have any form, but typically the <math>\sigma(a)=\frac {1}{1+e^{-a}}</math> (logit) form is used.<br />
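A minimal Python sketch of this activation and its derivative (the identity <math>\sigma'(a)=\sigma(a)(1-\sigma(a))</math> is a standard property of the logistic function, used later in back-propagation):<br />

```python
import math

def sigma(a):
    """Logistic (logit) activation: a smooth surrogate for the sign function."""
    return 1.0 / (1.0 + math.exp(-a))

def sigma_prime(a):
    """Derivative of the logistic function: sigma(a) * (1 - sigma(a))."""
    s = sigma(a)
    return s * (1.0 - s)
```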
<br />
By assigning weights to the connectors in the neural network (see the diagram above) we weigh the input that comes into each perceptron to get an output, which in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called a '''feed-forward neural network'''.<br />
<br />
===Back-propagation===<br />
For quite a while the neural network model was just an idea, since no algorithms had been invented for training the model until Geoffrey Hinton in 1986 <ref><br />
http://www.cs.toronto.edu/~hinton/backprop.html<br />
</ref> came up with an algorithm called '''back-propagation'''. After that, a number of other training algorithms and various configurations of Neural Networks were implemented.<br />
<br />
When we discussed perceptrons, we applied a gradient descent algorithm to optimize the weights. Back-propagation uses this idea of gradient descent to train a neural network. <br />
<br />
Assume that the output layer has only one unit, so that we are working with a regression problem. Later we will see how this can be extended to more output units and thus turn into a classification problem.<br />
<br />
==Notes==<br />
<references/></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Signfuncperceptron.png&diff=4488File:Signfuncperceptron.png2009-10-28T21:42:24Z<p>Ipargaru: </p>
<hr />
<div></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4487stat8412009-10-28T21:34:28Z<p>Ipargaru: /* Activation Function */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classfication-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification, we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> is the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:The ''''true error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:The ''''empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
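The empirical error rate is straightforward to compute; a minimal sketch (the classifier <math>h</math> and data below are made up for illustration):<br />

```python
# Empirical (training) error rate: the fraction of training points
# that a classifier h labels incorrectly.
def empirical_error(h, X, Y):
    n = len(X)
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / n

h = lambda x: 1 if x >= 0 else 0   # a made-up threshold rule
X = [-2.0, -1.0, 0.5, 1.0, -0.5]
Y = [0, 0, 1, 1, 1]                # the last point will be misclassified
```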
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then choose the class with the largest posterior probability as the one the object is affiliated with. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximal over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(X)=P\{Y=1|X=x\}</math>. Given <math>\,X=x</math>, by ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal with respect to the true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of its being of any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict whether a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
whether the student’s GPA > 3.0 (G),<br />
whether the student had a strong math background (M),<br />
whether the student is a hard worker (H), and<br />
whether the student passed or failed the course.<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
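The same calculation can be sketched in Python. The class-conditional probabilities below are backed out from the numbers quoted above (<math>0.025/0.5=0.05</math> and <math>(0.125-0.025)/0.5=0.2</math>), since the underlying table sits in the figure:<br />

```python
# Posterior for the new student X = (G=0, M=1, H=0), using the numbers
# from the example above.  The likelihoods are inferred from those
# numbers: 0.025 / 0.5 = 0.05 for Y=1 and (0.125 - 0.025) / 0.5 = 0.2
# for Y=0.
prior_pass, prior_fail = 0.5, 0.5
lik_pass, lik_fail = 0.05, 0.2   # P(X=(0,1,0)|Y=1), P(X=(0,1,0)|Y=0)

evidence = lik_pass * prior_pass + lik_fail * prior_fail  # P(X=(0,1,0))
r = lik_pass * prior_pass / evidence                      # P(Y=1|X=(0,1,0))

prediction = 1 if r > 0.5 else 0  # Bayes rule: predict fail (class 0)
```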
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because in practice it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>, which makes the Bayes rule inconvenient to apply.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as an objective quantity. In fact, they represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics there have been two major schools of thought: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, by contrast, one does not see the man himself, but judges from photos (the label) whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although it represents the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability and class conditional density are usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known, so some estimate of them must be made if we want to classify data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the class covariances <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
<br />
===QDA===<br />
The concept is the same: find a boundary where the error rates for classification between classes are equal, except that the assumption that each cluster has the same covariance matrix <math>\,\Sigma</math> (equal to the mean of the <math>\Sigma_k \forall k</math>) is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we use their sample estimates in their place, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, the ML estimate is the pooled covariance<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{r=1}^{k}n_r} </math><br />
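The estimates above can be sketched in a few lines of Python/NumPy (the course uses Matlab; the toy data and variable names here are our own illustration):<br />

```python
# Sample estimates used by LDA/QDA: priors pi_k, means mu_k, ML covariances
# Sigma_k, and the pooled covariance for the common-covariance case.
import numpy as np

rng = np.random.default_rng(0)
# Two toy Gaussian classes in 2 dimensions (60 and 40 points).
X = np.vstack([rng.normal([0, 0], 1.0, (60, 2)),
               rng.normal([3, 1], 1.0, (40, 2))])
y = np.array([0] * 60 + [1] * 40)

classes = np.unique(y)
n = len(y)
priors = {k: np.mean(y == k) for k in classes}               # pi_k = n_k / n
means = {k: X[y == k].mean(axis=0) for k in classes}         # mu_k
covs = {k: np.cov(X[y == k].T, bias=True) for k in classes}  # Sigma_k (ML, divide by n_k)

# Pooled covariance: a weighted average of the per-class estimates.
pooled = sum(np.sum(y == k) * covs[k] for k in classes) / n
```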
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\,\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can compute the squared distance between a point and each center and adjust it by the log of the prior, <math>\,log(\pi_k)</math>; the class with the minimum adjusted distance maximises <math>\,\delta_k</math>, and by the theorem we classify the point to that class. Note that <math>\, \Sigma_k = I </math> means the data in each class is spherical.<br />
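Case 1 can be sketched directly (a Python illustration rather than the course's Matlab; the class centres and priors below are made up):<br />

```python
# Case Sigma_k = I: delta_k(x) = -0.5 * ||x - mu_k||^2 + log(pi_k),
# i.e. nearest-mean classification adjusted by the log prior.
import numpy as np

mus = np.array([[0.0, 0.0], [4.0, 0.0]])   # illustrative class centres
log_pis = np.log([0.5, 0.5])               # equal priors

def classify_identity_cov(x):
    # delta_k for each class; argmax picks the class with the
    # smallest prior-adjusted squared distance.
    deltas = [-0.5 * np.sum((x - mu) ** 2) + lp
              for mu, lp in zip(mus, log_pis)]
    return int(np.argmax(deltas))
```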
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>.<br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthogonal, <math>\,U^{-1}=U^\top</math>)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
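The transformation <math>\, x^* = S^{-\frac{1}{2}}U^\top x </math> can be checked numerically; after it, the sample covariance becomes the identity, reducing Case 2 to Case 1. (A Python/NumPy sketch with made-up data; the course works in Matlab.)<br />

```python
# Whitening transform: decompose Sigma = U S U^T and map x -> S^{-1/2} U^T x.
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 1.2], [1.2, 1.5]])
X = rng.multivariate_normal([0, 0], Sigma, size=5000)

S_hat = np.cov(X.T)                  # sample covariance
U, s, _ = np.linalg.svd(S_hat)       # Sigma = U S U^T (symmetric PSD)
W = np.diag(1.0 / np.sqrt(s)) @ U.T  # the matrix S^{-1/2} U^T
X_star = X @ W.T                     # transform every data point

# Covariance of the transformed data is the identity.
cov_star = np.cov(X_star.T)
```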
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume, ahead of time, which class a data point belongs to in order to choose its transformation. All classes therefore need to have the same shape for this method to be applicable, which is exactly the setting of LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and we want to transform them to a common shape before deciding which class a given data point belongs to. Which transformation should we use? If, for example, we use the transformation of class A, then we have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare a given class with each of the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> pairwise boundaries in total. Each boundary <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the boundaries, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{d(d+1)}{2} + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
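These counts are easy to tabulate (a small Python helper of our own, mirroring the formulas above):<br />

```python
# Number of parameters to estimate, per the formulas above:
# LDA: (K-1)(d+1);  QDA: (K-1)(d(d+3)/2 + 1).
def lda_params(d, K):
    return (K - 1) * (d + 1)

def qda_params(d, K):
    return (K - 1) * (d * (d + 3) // 2 + 1)
```

For example, with the raw 64-dimensional 2_3 data and two classes, LDA needs 65 parameters while QDA needs 2145, which illustrates why QDA is far less robust in high dimensions.<br />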
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of handwritten twos and the last 200 elements are images of handwritten threes. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks each data point belongs to, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform principal component analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently; the Matlab help file on <code>princomp</code> gives the full details of this function. Here we analyze the code of <code>princomp</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations, columns to variables. When using <code>princomp</code> on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that <code>y=score</code> and <code>v=U</code> (possibly up to the sign of each column).<br />
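The equivalence of the two routes can also be checked outside Matlab; here is a Python/NumPy sketch of what <code>princomp</code> does internally (random data and variable names are our own):<br />

```python
# PCA the princomp way: centre the data, take the SVD of
# X_centered / sqrt(m-1), and use V as the principal-component coefficients.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))    # rows = observations, columns = variables

Xc = X - X.mean(axis=0)          # centre by subtracting column means
m = X.shape[0]
U, s, Vt = np.linalg.svd(Xc / np.sqrt(m - 1), full_matrices=False)
pc = Vt.T                        # coefficients (Matlab's pc)
score = Xc @ pc                  # representation in PC space (Matlab's score)
latent = s ** 2                  # eigenvalues of the covariance matrix
```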
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that contain non-linear functions, such as squares, of our original data. We then do LDA on our new higher-dimensional data. The linear boundary provided by LDA, when collapsed back onto the lower dimension, gives us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector, and <math>x \in \mathbb{R}^{d}</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>\,v</math> is diagonal (so the quadratic part contributes only squared terms), that we cannot estimate with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
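The idea can be sketched in a few lines (a Python illustration with made-up 1-D data; the document's own example, in Matlab, follows below). Classes that only a quadratic rule can separate become linearly separable once we append <math>x^2</math> as a new feature:<br />

```python
# The LDA-to-QDA trick: class 0 lives inside [-1, 1], class 1 outside, so no
# linear rule in x separates them; appending x^2 makes them linearly separable.
import numpy as np

rng = np.random.default_rng(3)
x0 = rng.uniform(-0.8, 0.8, 200)                 # class 0
x1 = np.concatenate([rng.uniform(-3, -1.2, 100),
                     rng.uniform(1.2, 3, 100)])  # class 1

def augment(x):
    return np.column_stack([x, x ** 2])          # x* = [x, x^2]

A0, A1 = augment(x0), augment(x1)
# A simple linear rule in the augmented space: project onto the difference
# of class means and threshold at the midpoint.
w = A1.mean(axis=0) - A0.mean(axis=0)
thresh = 0.5 * (A0.mean(axis=0) + A1.mean(axis=0)) @ w

def predict(x):
    # Linear in [x, x^2], hence quadratic in the original x.
    return (augment(np.atleast_1d(x)) @ w > thresh).astype(int)
```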
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS) # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the prior knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To be able to separate the two classes we must determine the class whose mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, and so it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
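This closed-form direction is easy to verify numerically. The sketch below is in Python/NumPy (the course's examples use Matlab and R); the data reuses the means and covariance from the R example above:<br />

```python
# Two-class FDA: the optimal direction is proportional to S_W^{-1}(mu_1 - mu_2).
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1, 1], Sigma, 300)   # class 1
X2 = rng.multivariate_normal([5, 3], Sigma, 300)   # class 2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1.T) + np.cov(X2.T)                   # within-class covariance
w = np.linalg.solve(Sw, mu1 - mu2)                 # w  ∝  S_W^{-1}(mu_1 - mu_2)

# Projecting both classes onto w should separate them well.
p1, p2 = X1 @ w, X2 @ w
```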
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced by Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
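The contrast in the figure can also be checked numerically. Using the same means and covariance as above, the following Python sketch (illustrative only, not part of the original notes) computes the leading PCA direction in closed form and the FDA direction <math>S_{w}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>, and shows that they point in very different directions:<br />

```python
import math

# Shared class covariance and class means, as in the Matlab example above.
C = [[1.0, 1.5], [1.5, 3.0]]
mu1, mu2 = [1.0, 1.0], [5.0, 3.0]

# PCA direction: top eigenvector of the covariance, closed form for a 2x2 matrix.
tr = C[0][0] + C[1][1]
det = C[0][0]*C[1][1] - C[0][1]*C[1][0]
lam_max = (tr + math.sqrt(tr*tr - 4*det)) / 2.0
pca = [C[0][1], lam_max - C[0][0]]     # satisfies (C - lam_max I) pca = 0

# FDA direction: Sw^{-1} (mu1 - mu2) with Sw = 2C.
Sw = [[2*C[0][0], 2*C[0][1]], [2*C[0][1], 2*C[1][1]]]
dsw = Sw[0][0]*Sw[1][1] - Sw[0][1]*Sw[1][0]
d = [mu1[0]-mu2[0], mu1[1]-mu2[1]]
fda = [( Sw[1][1]*d[0] - Sw[0][1]*d[1]) / dsw,
       (-Sw[1][0]*d[0] + Sw[0][0]*d[1]) / dsw]

# The normalized cross product is the sine of the angle between the directions;
# a value near 1 means they are close to perpendicular.
cross = pca[0]*fda[1] - pca[1]*fda[0]
sin_angle = abs(cross) / (math.hypot(*pca) * math.hypot(*fda))
```

For this example the two directions are nearly perpendicular, which is exactly the situation the figure illustrates: the direction of maximum variance is a poor direction for separating the classes.<br />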
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400, and each column represents an 8*8 image of a "2" or a "3". Here X1 holds all the "2"s and X2 all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathworks.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \frac{1}{n_{i}}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
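The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified on toy data. The Python sketch below (with made-up 1-D data, not part of the original notes) uses ''unnormalized'' scatter matrices, i.e. plain sums of squared deviations with <math>\mathbf{S}_{B}</math> weighted by the class sizes <math>n_{i}</math>, for which the identity holds exactly:<br />

```python
# Verify S_T = S_W + S_B on 1-D toy data with k = 3 classes.
# The identity holds for unnormalized scatter matrices (sums of squared
# deviations, with S_B weighted by the class sizes n_i).

data = {1: [1.0, 2.0, 3.0],          # class label -> observations
        2: [6.0, 8.0],
        3: [10.0, 11.0, 12.0, 13.0]}

n = sum(len(xs) for xs in data.values())
mu = sum(x for xs in data.values() for x in xs) / n   # total mean

S_T = sum((x - mu)**2 for xs in data.values() for x in xs)

S_W = 0.0
S_B = 0.0
for xs in data.values():
    mu_i = sum(xs) / len(xs)                          # class mean
    S_W += sum((x - mu_i)**2 for x in xs)             # within-class scatter
    S_B += len(xs) * (mu_i - mu)**2                   # between-class scatter
```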
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Clearly, the two expressions are very similar; they differ only in the class-size weights <math>n_{1}</math> and <math>n_{2}</math>.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the <math>k-1</math><br />
largest eigenvalues in<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, since <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B})\le k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case: the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the <math>k-1</math><br />
largest eigenvalues in<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(x) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
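The normal-equation solution can be illustrated with a small Python sketch (toy data invented for this example, not part of the original notes). For one input plus an intercept, <math>(\mathbf{X}^{T}\mathbf{X})\beta = \mathbf{X}^{T}\mathbf{y}</math> is a <math>2\times 2</math> system that can be solved by hand:<br />

```python
# Least-squares fit y = b0 + b1*x via the normal equations,
# written out for the 2-parameter case (intercept + one input).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]        # toy data: exactly y = 1 + 2x

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x*x for x in xs)
sxy = sum(x*y for x, y in zip(xs, ys))

# Solve the 2x2 system (X^T X) beta = X^T y by Cramer's rule.
det = n*sxx - sx*sx
b0 = (sxx*sy - sx*sxy) / det
b1 = (n*sxy - sx*sy) / det

y_hat = [b0 + b1*x for x in xs]  # fitted values X beta_hat
```

Because the toy data lie exactly on a line, the fitted values reproduce the observations and the residual sum of squares is zero.<br />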
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by transposing the sample matrix and appending a row of ones, giving a <math>3\times{400}</math> matrix.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the data points, colored by whether the fitted value is above or below 0.5.<br />
<br />
[[File: linearregression.png|center|frame|The classification of the data points in 2_3.m by the linear regression model]]<br />
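The same threshold-at-0.5 idea as in the Matlab snippet can be sketched in Python on made-up 1-D data (illustrative only; the 2_3 data set itself is not used here):<br />

```python
# Least-squares regression of 0/1 labels on x, then classification at 0.5,
# mirroring the thresholding loop in the Matlab snippet above.
xs = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]  # class labels coded as 0 and 1

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
b1 = sum((x - xbar)*(y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar)**2 for x in xs)
b0 = ybar - b1*xbar

# Classify: predict class 1 when the fitted value exceeds 0.5.
labels = [1 if b0 + b1*x > 0.5 else 0 for x in xs]
```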
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
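A small Python sketch (illustrative, not part of the original notes) confirms that these two probabilities sum to one, and that the log odds recover the linear function <math>\underline{\beta}^T \underline{x}</math>:<br />

```python
import math

def p1(u):
    # P(Y=1 | X=x) with u = beta^T x
    return math.exp(u) / (1.0 + math.exp(u))

def p0(u):
    # P(Y=0 | X=x) = 1 - P(Y=1 | X=x)
    return 1.0 / (1.0 + math.exp(u))

def log_odds(u):
    # log [ P(Y=1|X=x) / P(Y=0|X=x) ] should return u = beta^T x
    return math.log(p1(u) / p0(u))
```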
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
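The gradient formula above can be checked against a finite-difference approximation of the log-likelihood. The Python sketch below (toy data, scalar <math>\beta</math> for simplicity, not part of the original notes) is illustrative only:<br />

```python
import math

# Log-likelihood and its gradient for scalar beta (d = 1), toy data.
xs = [1.0, 2.0, -1.0]
ys = [1, 0, 1]

def loglik(b):
    # l(beta) = sum_i y_i beta x_i - log(1 + exp(beta x_i))
    return sum(y*b*x - math.log(1.0 + math.exp(b*x))
               for x, y in zip(xs, ys))

def grad(b):
    # dl/dbeta = sum_i (y_i - p(x_i; beta)) x_i
    return sum((y - math.exp(b*x)/(1.0 + math.exp(b*x))) * x
               for x, y in zip(xs, ys))

b, h = 0.3, 1e-6
numeric = (loglik(b + h) - loglik(b - h)) / (2.0*h)  # central difference
analytic = grad(b)
```

The two values agree to several decimal places, supporting the derivative computed in the derivation above.<br />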
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here]. This is a very useful website containing a Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained if we first reduce the occurrences of <math>\underline{\beta}</math> to one using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiate <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right]</math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
which is solved by <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
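For the simplest special case, one parameter and no intercept, the WLS estimator reduces to <math>\hat\beta = \sum_i w_i x_i y_i / \sum_i w_i x_i^2</math>. A Python sketch with made-up data (illustrative only):<br />

```python
# Weighted least squares through the origin: beta = sum(w x y) / sum(w x^2).
# Two observations disagree at the same x; the heavier weight wins.
obs = [(1.0, 1.0, 1.0),   # (x, y, w)
       (1.0, 3.0, 3.0)]

num = sum(w*x*y for x, y, w in obs)
den = sum(w*x*x for x, y, w in obs)
beta_wls = num / den      # pulled toward the observation with weight 3
```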
<br />
Here the weighted linear regression is applied to the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, a <math>d\times{1}</math> vector, because we constructed the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, similar to linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is usually a suitable starting value for the Newton-Raphson iteration. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave, but in general only local convergence can be proved, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, choosing an appropriate initial value is rarely a problem: it is uncommon for the initial point to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
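The pseudo code above can be sketched in Python for a toy 1-D problem with an intercept (the data are invented for this illustration, and the matrix operations are written out by hand for the <math>2\times 2</math> case):<br />

```python
import math

# Newton-Raphson / IRLS for logistic regression with an intercept,
# following the pseudo code above.  Toy, non-separable 1-D data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]

def sigma(u):
    return 1.0 / (1.0 + math.exp(-u))

b0, b1 = 0.0, 0.0                      # beta <- 0
for _ in range(25):
    ps = [sigma(b0 + b1*x) for x in xs]        # P(x_i; beta)
    ws = [p*(1.0 - p) for p in ps]             # diagonal of W
    # Build X W X^T (2x2) and X (y - p) (2x1), with X rows (1, x_i).
    a00 = sum(ws)
    a01 = sum(w*x for w, x in zip(ws, xs))
    a11 = sum(w*x*x for w, x in zip(ws, xs))
    g0 = sum(y - p for y, p in zip(ys, ps))
    g1 = sum((y - p)*x for x, y, p in zip(xs, ys, ps))
    # Newton step: beta <- beta + (X W X^T)^{-1} X (y - p)
    det = a00*a11 - a01*a01
    b0 += ( a11*g0 - a01*g1) / det
    b1 += (-a01*g0 + a00*g1) / det

preds = [sigma(b0 + b1*x) for x in xs]
```

For this symmetric toy data the fitted decision boundary lands midway between the two groups, and points on either side are assigned probabilities below and above 0.5 respectively.<br />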
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>; it is guaranteed to lie between 0 and 1 and to sum to 1.<br />
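This boundedness difference is easy to see numerically. The sketch below uses illustrative, made-up coefficients and evaluates both models at a point far from the origin: the linear "probability" leaves [0,1], while the logistic one cannot.<br />

```python
import math

def linear_posterior(beta0, beta1, x):
    # linear regression estimate of P(Y=1|X=x): unbounded
    return beta0 + beta1 * x

def logistic_posterior(beta0, beta1, x):
    # logistic estimate: always strictly between 0 and 1
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1.0 + math.exp(eta))

# made-up coefficients, for illustration only
b0, b1 = 0.3, 0.2
print(linear_posterior(b0, b1, 10))   # about 2.3: not a valid probability
p = logistic_posterior(b0, b1, 10)
print(p, 1.0 - p)                     # the two class posteriors sum to 1
```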
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. the dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. the dimension.<br />
#Since logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and logistic regression with an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to classify the data with logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2\geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
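The rule above can be applied directly. A small sketch using the coefficients returned by mnrfit (the test points are made up for illustration):<br />

```python
import math

# coefficients from the mnrfit output above
B = (0.1861, -5.5917, -3.0547)

def posterior_y1(x1, x2):
    # P(Y=1|X=x) under the fitted two-class logistic model
    eta = B[0] + B[1] * x1 + B[2] * x2
    return math.exp(eta) / (1.0 + math.exp(eta))

def classify(x1, x2):
    # label 1 iff the linear score is nonnegative, i.e. P(Y=1|X=x) >= 0.5
    eta = B[0] + B[1] * x1 + B[2] * x2
    return 1 if eta >= 0 else 2

print(classify(-1.0, 0.0))  # score 5.7778 -> class 1
print(classify(1.0, 1.0))   # score -8.4603 -> class 2
```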
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem, where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem is not as simple as in the 2-class problem, since we no longer have this simplification.<br />
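A quick numerical check of these posterior formulas, with made-up coefficients for <math>d=2</math> and <math>K=3</math> (so two coefficient vectors):<br />

```python
import math

def multiclass_posteriors(betas, x):
    """Posteriors for K classes given the K-1 coefficient vectors
    beta_1..beta_{K-1}; class K is the reference class in the denominator."""
    scores = [sum(b_j * x_j for b_j, x_j in zip(beta, x)) for beta in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]  # classes 1..K-1
    probs.append(1.0 / denom)                      # class K
    return probs

# illustrative values only
betas = [(0.5, -1.0), (-0.3, 0.2)]
p = multiclass_posteriors(betas, (1.0, 2.0))
print(p, sum(p))  # the K posteriors sum to 1
```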
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem has no unique global minimum (it is not convex). The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable, then the algorithm can be shown to converge; the proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1, but the right answer is +1. So the perceptron adjusts its line, and we try the next example. Eventually the perceptron gets all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
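The same training loop, ported to Python as a sketch (same data, same learning rate, same initial weights, and the same update as the MATLAB code above):<br />

```python
# training data from the table above: 3 features, labels +1/-1
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
y = [1, 1, 1, -1, -1, -1]

def train_perceptron(X, y, rho=0.5, max_epochs=100):
    b = [1.0, 1.0, 1.0]  # initial weights, as in the MATLAB version
    b0 = 0.0
    for _ in range(max_epochs):
        changed = False
        for xi, yi in zip(X, y):
            score = sum(bj * xj for bj, xj in zip(b, xi)) + b0
            if score * yi < 0:  # strictly misclassified point
                # move the boundary toward the misclassified point
                b = [bj + rho * yi * xj for bj, xj in zip(b, xi)]
                b0 += rho * yi
                changed = True
        if not changed:  # a full pass with no updates: stop
            break
    return b, b0

b, b0 = train_perceptron(X, y)
print(b, b0)
```

Note that, like the MATLAB loop, a point with score exactly 0 is not treated as an error, so at convergence every point satisfies <math>y_i(\underline{\beta}^T\underline{x_{i}}+\beta_{0})\geq 0</math>, with some points possibly lying exactly on the boundary.<br />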
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of some weights of these inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0}=0 </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, and it terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to the scaling factor <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is, up to the same scaling factor, the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step of predetermined size in the direction opposite to the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in "skipping over" the minimum that the algorithm is trying to find, possibly oscillating forever between two points on either side of the minimum.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Imagine standing on a peak and wanting to reach the ground as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the gradient. However, if the mountain has a saddle shape and you happen to start in the middle, you will eventually arrive at the saddle point, where the gradient is zero, and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over <math>\,i</math> (all data points). This is an alternative to the original gradient descent algorithm (sometimes called batch gradient descent), namely stochastic gradient descent, where we approximate the true gradient by evaluating it on only a single training example. This means that <math>{\beta}</math> is improved using the computation for only one sample. When there is a large data set, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent, we can treat the problem sample by sample and still get decent results in practice.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal - the processing element's output signal <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref>. It is a multistage regression or classification model represented by a network.<br />
<br />
<br />
''This diagram is an example of a typical neural network, but a neural network can take many different forms.''<br />
[[File:NN.png|500px|thumb|center|Figure 1: General Structure of a Neural Network.]]<br />
<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there can be '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded as 0 or 1.<br />
<br />
===Activation Function===<br />
''Activation function'' is a term that is frequently used in classification by NN. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. The sign function is of the form<br />
[[File:signfunc1.png|30px|]], whose derivative cannot be taken. We replace it by a smooth continuous function of the form [[File:signfunc2.png|30px|]] and call this function <math>\displaystyle \sigma </math><br />
<br />
==Notes==<br />
<references/></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Signfunc2.png&diff=4486File:Signfunc2.png2009-10-28T21:30:21Z<p>Ipargaru: Sign function, smooth and continuous</p>
<hr />
<div>Sign function, smooth and continuous</div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=File:Signfunc1.png&diff=4485File:Signfunc1.png2009-10-28T21:29:25Z<p>Ipargaru: Sign funciton, plus or minus</p>
<hr />
<div>Sign function, plus or minus</div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4484stat8412009-10-28T21:28:12Z<p>Ipargaru: /* Activation Function */</p><br />
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification, we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> is the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
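The empirical error rate is a direct transcription of this formula. A small sketch with a made-up threshold classifier and toy data:<br />

```python
def empirical_error_rate(h, xs, ys):
    # fraction of training points that h labels incorrectly
    n = len(xs)
    return sum(1 for x, y in zip(xs, ys) if h(x) != y) / n

# toy example: a threshold classifier on one feature
h = lambda x: 1 if x > 2.0 else 0
xs = [0.5, 1.5, 2.5, 3.5]
ys = [0, 1, 1, 1]  # h labels the second point incorrectly
print(empirical_error_rate(h, xs, ys))  # 0.25
```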
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then to assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> that <br />
maximizes <math>\,P(Y=y_i|X=x)</math> and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and let <math>\,r(X)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of its being of any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
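The same calculation as a sketch in Python. The full likelihood table is in the figure above, so the values here are only the two joint terms that appear in the worked numbers: 0.025 for pass, and the fail term 0.1 inferred from the quoted denominator (0.125 - 0.025):<br />

```python
# joint terms P(X=(0,1,0)|Y=y) * P(Y=y), read off the worked example above
joint = {1: 0.025,   # pass (quoted directly)
         0: 0.100}   # fail (inferred: 0.125 - 0.025)

# Bayes' formula: posterior probability of passing
r = joint[1] / (joint[1] + joint[0])
print(r)  # 0.2

# Bayes classification rule: predict pass only if r > 1/2
prediction = 1 if r > 0.5 else 0
print(prediction)  # 0: predict the student fails
```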
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods. The reason is that in the Bayes formula discussed above it is generally impossible for us to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>, which makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first one treats probability as changing based on observation, while the second one treats probability as an objective property. They represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics there have been two major schools: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed and unknown constants.<br />
#Not applicable to a single event. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables that have a given distribution, and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist method, by contrast, one does not see the man (the object), but judges from photos (the label) of this man whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although it is the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability and class conditional density are usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Empirical risk minimization: choose a set of classifiers <math>\,H</math> and find <math>\,h^* \in H</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear discriminant analysis and quadratic discriminant analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known; some estimate of these must be made if we want to classify new data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(G=k|X=x)=Pr(G=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>p</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math><br />
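The final linear boundary <math>\,ax+b=0</math> can be checked numerically. A minimal sketch in Python/NumPy, with hypothetical means, shared covariance and priors (not values from the course data):<br />

```python
import numpy as np

# Hypothetical two-class parameters under the LDA assumptions (shared Sigma).
mu_k = np.array([0.0, 0.0])
mu_l = np.array([3.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
pi_k, pi_l = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)

# Read off the last line of the derivation: a^T x + b = 0 with
#   a = Sigma^{-1}(mu_k - mu_l)
#   b = log(pi_k/pi_l) - 1/2 (mu_k' Sigma^{-1} mu_k - mu_l' Sigma^{-1} mu_l)
a = Sigma_inv @ (mu_k - mu_l)
b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sigma_inv @ mu_k - mu_l @ Sigma_inv @ mu_l)

# With equal priors, the midpoint between the means lies exactly on the boundary.
midpoint = 0.5 * (mu_k + mu_l)
print(a @ midpoint + b)  # numerically zero
```

This confirms the special case above: for <math>\,\pi_k=\pi_l</math> the boundary passes halfway between the two means.<br />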
<br />
===QDA===<br />
The concept is the same: find a boundary where the classification error rates between classes are equal, except that the assumption that each cluster has the same covariance <math>\,\Sigma</math> (equal to the mean of the <math>\Sigma_k \forall k</math>) is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
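A sketch of how this quadratic rule is used in practice, with hypothetical parameters having different covariances: the discriminant score is just the log of <math>\,f_k(x)\pi_k</math> up to a constant shared by all classes, so both quantities give the same classification:<br />

```python
import numpy as np

# Hypothetical two-class parameters with *different* covariances (QDA setting).
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.8], [0.8, 1.0]]), 0.5),
]

def delta(x, mu, Sigma, pi):
    """Quadratic discriminant: -1/2 log|S| - 1/2 (x-mu)' S^{-1} (x-mu) + log pi."""
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(pi))

def gaussian_density(x, mu, Sigma):
    d = x - mu
    norm = (2 * np.pi) ** (len(mu) / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / norm

x = np.array([1.2, 0.7])
scores = [delta(x, m, S, p) for m, S, p in params]
posteriors = [gaussian_density(x, m, S) * p for m, S, p in params]
# argmax of delta_k agrees with argmax of f_k(x) pi_k
print(int(np.argmax(scores)), int(np.argmax(posteriors)))
```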
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown, so we use their sample estimates in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
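These estimators are straightforward to implement; a minimal NumPy sketch with hypothetical labelled data (not a course dataset):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical labelled sample: 60 points in class 1, 40 in class 2.
X = np.vstack([rng.normal([0, 0], 1.0, size=(60, 2)),
               rng.normal([3, 1], 1.0, size=(40, 2))])
y = np.array([1] * 60 + [2] * 40)
n = len(y)

pi_hat, mu_hat, Sigma_hat, n_k = {}, {}, {}, {}
for k in (1, 2):
    Xk = X[y == k]
    n_k[k] = len(Xk)
    pi_hat[k] = n_k[k] / n              # \hat{pi}_k = n_k / n
    mu_hat[k] = Xk.mean(axis=0)         # \hat{mu}_k = class mean
    C = Xk - mu_hat[k]
    Sigma_hat[k] = C.T @ C / n_k[k]     # ML (1/n_k) covariance, as in the notes

# Pooled covariance: weighted average of the per-class ML covariances.
Sigma_pooled = sum(n_k[k] * Sigma_hat[k] for k in (1, 2)) / n
print(pi_hat[1], pi_hat[2])
```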
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
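A small sketch of this case with hypothetical centres and priors: when <math>\, \Sigma_k = I </math>, maximizing <math>\,\delta_k</math> reduces to nearest-centre classification adjusted by the log prior:<br />

```python
import numpy as np

# Hypothetical centres and (unequal) priors; Sigma_k = I, so
# delta_k = -1/2 ||x - mu_k||^2 + log(pi_k).
centres = np.array([[0.0, 0.0], [4.0, 0.0]])
priors = np.array([0.9, 0.1])

def classify(x):
    scores = [-0.5 * np.sum((x - mu) ** 2) + np.log(p)
              for mu, p in zip(centres, priors)]
    return int(np.argmax(scores))

# The midpoint between the centres is no longer the tie point:
# the heavier prior pulls the boundary toward the rarer class.
print(classify(np.array([2.0, 0.0])))  # 0, because pi_0 > pi_1
```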
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>.<br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume, ahead of time, which class a data point belongs to. All classes therefore need to have the same shape for this method to be applicable, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Consider that you have two classes with different shapes, then consider transforming them to the same shape. Given a data point, justify which class this point belongs to. The question is, which transformation can you use? For example, if you use the transformation of class A, then you have assumed that this data point belongs to class A.<br />
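The whitening transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> used above can be sketched as follows (hypothetical shared covariance):<br />

```python
import numpy as np

# Hypothetical shared covariance; whiten with x* = S^{-1/2} U^T x.
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
# Sigma is symmetric, so its SVD has U = V and S holds the eigenvalues.
U, S, _ = np.linalg.svd(Sigma)
W = np.diag(S ** -0.5) @ U.T   # the transformation S^{-1/2} U^T

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], Sigma, size=20000)
X_star = X @ W.T

# After the transform the sample covariance is close to the identity,
# so Case 1 (Euclidean distance to the centres) applies.
print(np.round(np.cov(X_star.T), 2))
```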
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare a given class against the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> boundaries in total. Each boundary <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
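The two counts are easy to evaluate; a small sketch (the <math>d=64</math> value echoes the dimensionality of the 2_3 images used below):<br />

```python
def lda_params(K, d):
    # (K-1) boundaries, each a linear form a^T x + b with d+1 parameters
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # each boundary x^T a x + b^T x + c: d(d+1)/2 + d + 1 = d(d+3)/2 + 1 parameters
    return (K - 1) * (d * (d + 3) // 2 + 1)

print(lda_params(2, 64), qda_params(2, 64))  # 65 vs 2145
```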
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file gives the details of this function, but here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code> with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations, columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
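The same equivalence can be checked outside Matlab. A NumPy sketch with random stand-in data (not the 2_3 set); the two routes agree up to the arbitrary signs of the singular vectors:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 64))   # hypothetical stand-in: observations in rows

# Route 1, the "SVD method": centre, take the SVD, project onto V.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
y_svd = Xc @ Vt.T

# Route 2, what princomp does internally: SVD of the centred data
# divided by sqrt(m-1); the right singular vectors are the coefficients.
m = X.shape[0]
_, _, Vt2 = np.linalg.svd(Xc / np.sqrt(m - 1), full_matrices=False)
score = Xc @ Vt2.T

# The scores agree up to the sign of each component (SVD signs are arbitrary).
signs = np.sign((y_svd * score).sum(axis=0))
print(np.allclose(y_svd, score * signs))
```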
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
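A minimal sketch of the augmentation itself, with hypothetical 1-D data and a hand-picked linear rule in the augmented space (an illustration of the idea, not a fitted LDA):<br />

```python
import numpy as np

# Hypothetical 1-D data where the classes are NOT linearly separable:
# class 0 sits in the middle, class 1 on both sides.
x = np.array([-3.0, -2.5, -2.0, 2.0, 2.5, 3.0, -0.5, 0.0, 0.5])
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0])

# Augment: x* = [x, x^2]. A linear rule w*'x* + b in the new space
# is quadratic in the original x.
X_star = np.column_stack([x, x ** 2])

# Hand-picked rule (illustration only): sign(x^2 - 1.5) separates the
# classes, although no single threshold on x alone does.
w_star = np.array([0.0, 1.0])
pred = (X_star @ w_star - 1.5 > 0).astype(int)
print((pred == y).all())
```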
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the pre-knowledge that data points belong to two classes. Intuitively speaking, points of each class form a cloud around the mean of the class, with each class possibly having a different size. To separate the two classes we must determine which class's mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projetion. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math>and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variances of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two variances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the squared Euclidean distance between the projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of the two class covariance matrices; assuming these are positive definite, <math>\,S_{W}</math> is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say that <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
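As a sanity check on this result, the following NumPy sketch (our own illustrative code with synthetic data, not the course data set or the notes' Matlab) verifies numerically that the top eigenvector of <math>S_{w}^{-1}S_{B}</math> is proportional to <math>S_{w}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative two-class data (synthetic; not the course 2_3 data set)
X1 = rng.multivariate_normal([1, 1], [[1.0, 0.3], [0.3, 1.0]], size=300)
X2 = rng.multivariate_normal([5, 3], [[1.0, 0.3], [0.3, 1.0]], size=300)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)  # within class covariance
Sb = np.outer(mu1 - mu2, mu1 - mu2)                       # between class covariance

# Top eigenvector of Sw^{-1} Sb ...
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
w_eig = vecs[:, np.argmax(vals.real)].real
w_eig /= np.linalg.norm(w_eig)

# ... should be proportional to Sw^{-1}(mu1 - mu2)
w_closed = np.linalg.solve(Sw, mu1 - mu2)
w_closed /= np.linalg.norm(w_closed)
# |cosine| between the two unit vectors is 1 when they are parallel
cosine = w_eig @ w_closed
```

Because <math>\,S_{B}</math> has rank one, <math>S_{w}^{-1}S_{B}</math> has a single nonzero eigenvalue, and its eigenvector agrees with the closed-form direction up to sign.<br />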
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Generate 300 points from each of two multivariate normal distributions with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the first principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the leading discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of a "2" or a "3". Here X1 holds all the "2"s and X2 all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> is the scatter matrix of class <math>i</math> (note that no <math>\frac{1}{n_{i}}</math> factor is used, so that the decomposition below holds exactly) and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain directly. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
The first term above is the within class covariance <math>\mathbf{S}_{W}</math>;<br />
we define the second term as the general between class covariance matrix<br />
<math>\mathbf{S}_{B}</math>, thus obtaining<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
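The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be checked numerically. Below is an illustrative NumPy sketch with made-up three-class data (all variable names are ours, not from the notes):<br />

```python
import numpy as np

rng = np.random.default_rng(1)
# Three illustrative classes in 2-D (synthetic data)
groups = [rng.normal(m, 1.0, size=(50, 2)) for m in ([0, 0], [4, 1], [2, 5])]
X = np.vstack(groups)
mu = X.mean(axis=0)

Sw = np.zeros((2, 2))
Sb = np.zeros((2, 2))
for G in groups:
    mu_i = G.mean(axis=0)
    D = G - mu_i
    Sw += D.T @ D                      # within class scatter S_W
    dm = (mu_i - mu).reshape(-1, 1)
    Sb += len(G) * (dm @ dm.T)         # between class scatter S_B

Dt = X - mu
St = Dt.T @ Dt                         # total scatter S_T
# S_T equals S_W + S_B exactly (up to floating point)
```

Note that the identity holds for the unnormalized scatter matrices; if each class term were divided by <math>n_{i}</math> the sum would no longer equal <math>\mathbf{S}_{T}</math>.<br />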
<br />
Recall that in the two-class problem, we used<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>. Substituting into the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \left(\frac{n_{1}n_{2}^{2}}{n^{2}}+\frac{n_{2}n_{1}^{2}}{n^{2}}\right)(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}\,\mathbf{S}_{B^{\ast}}<br />
\end{align}<br />
</math><br />
So the general <math>\mathbf{S}_{B}</math> is proportional to the two-class <math>\mathbf{S}_{B^{\ast}}</math>, and the two definitions lead to the same optimal direction.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
\quad i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues in<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which is actually a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\leq k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues in<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
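The whole multi-class procedure can be sketched compactly in NumPy (our own illustrative code; the function name `fda_transform` and the synthetic data are ours, not from the notes):<br />

```python
import numpy as np

def fda_transform(X, y, k):
    """Columns: the k-1 leading eigenvectors of Sw^{-1} Sb (multi-class FDA)."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in range(k):
        G = X[y == c]
        Dc = G - G.mean(axis=0)
        Sw += Dc.T @ Dc                         # within class scatter
        dm = (G.mean(axis=0) - mu).reshape(-1, 1)
        Sb += len(G) * (dm @ dm.T)              # between class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)
    return vecs[:, order[:k - 1]].real          # d x (k-1) matrix W

rng = np.random.default_rng(2)
means = ([0, 0, 0], [3, 0, 1], [0, 3, 2])
X = np.vstack([rng.normal(m, 0.5, size=(40, 3)) for m in means])
y = np.repeat([0, 1, 2], 40)
W = fda_transform(X, y, k=3)
Z = X @ W   # data projected to k-1 = 2 dimensions
```

With <math>\,k=3</math> classes in <math>\,d=3</math> dimensions, the returned <math>\mathbf{W}</math> is <math>3 \times 2</math> and the data are projected into a two-dimensional space.<br />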
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the least squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Setting the first derivative to zero,<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
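The closed-form solution and the hat matrix can be illustrated with a short NumPy sketch (synthetic data; all variable names are ours). A useful property to check is that <math>\mathbf{H}</math> is a projection matrix, so <math>\mathbf{H}^{2}=\mathbf{H}</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 2
# Design matrix with a leading column of ones for the intercept
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed via a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T; fitted values y_hat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
```

Using `np.linalg.solve` rather than forming the explicit inverse is the standard numerically stable way to evaluate <math>(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}</math>.<br />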
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, so that each column is a data point augmented with a constant 1 for the intercept.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the data points, classified by thresholding the fitted values at 0.5.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression - October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
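These two formulas can be checked numerically; the following Python sketch (function names and the example <math>\underline{\beta}</math>, <math>\underline{x}</math> are ours, chosen arbitrarily) confirms that both posteriors lie in <math>(0,1)</math> and sum to one:<br />

```python
import numpy as np

def p_y1(beta, x):
    """P(Y=1 | X=x) = exp(beta^T x) / (1 + exp(beta^T x))"""
    t = np.exp(beta @ x)
    return t / (1.0 + t)

def p_y0(beta, x):
    """P(Y=0 | X=x) = 1 / (1 + exp(beta^T x))"""
    return 1.0 / (1.0 + np.exp(beta @ x))

beta = np.array([0.8, -0.4])   # arbitrary made-up coefficients
x = np.array([1.0, 2.0])       # arbitrary made-up input
total = p_y1(beta, x) + p_y0(beta, x)   # should equal 1
```

This is exactly the property the linear regression model lacked: the modelled posteriors are automatically valid probabilities.<br />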
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{(exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T)(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math> you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website including a Matrix Reference Manual where you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first rewriting the gradient with the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>, so that <math>\underline{\beta}</math> appears only once, and then differentiating<br />
<br />
<math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math> (since <math>X</math> is <math>{d}\times{n}</math> here, the fitted values are <math>X^T\underline{\beta}</math>).<br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
More generally, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
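As a sanity check, the closed-form WLS estimator above can be sketched in Python (the course examples use MATLAB; the function name and toy data here are hypothetical):

```python
import numpy as np

def wls_estimate(X, y, w):
    """Weighted least squares: X is d x n (columns are the x_i),
    y has length n, w holds the positive weights w_i.  Implements
    beta = [sum_i w_i x_i x_i^T]^{-1} [sum_i w_i x_i y_i] = (XWX^T)^{-1} XWy."""
    # (X * w) broadcasts w across columns, i.e. X @ diag(w).
    return np.linalg.solve((X * w) @ X.T, (X * w) @ y)

# Toy data lying exactly on the line y = 1 + 2t: since the residuals are
# zero, any choice of positive weights recovers the same coefficients.
X = np.array([[1., 1., 1., 1.],
              [0., 1., 2., 3.]])
y = np.array([1., 3., 5., 7.])
beta = wls_estimate(X, y, np.array([1., 2., 3., 4.]))
```

With all weights equal to 1 this reduces to the ordinary least squares solution <math>(XX^T)^{-1}X\underline{y}</math> above.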
<br />
The Newton-Raphson step can therefore be viewed as a weighted linear regression on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we constructed the model as <math>\underline{\beta}^T\underline{x}</math>. If we instead construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. When it does not, we can only prove local convergence of the method, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: it is uncommon for an initial guess to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving can be used to address this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
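The iteration above can be sketched as a short Python function (an illustrative implementation of this pseudo code, not the course's MATLAB code; the function name and the toy data at the end are hypothetical, with the intercept handled by an extra row of ones in <math>X</math>):

```python
import numpy as np

def logistic_irls(X, y, max_iter=50, tol=1e-8):
    """Iteratively reweighted least squares for the two-class model
    P(x; beta) = exp(beta^T x) / (1 + exp(beta^T x)).
    X is d x n (columns are the observations x_i); y holds 0/1 labels."""
    d, n = X.shape
    beta = np.zeros(d)                                # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(beta @ X)))         # step 3: vector P
        w = p * (1.0 - p)                             # step 4: diagonal of W
        z = X.T @ beta + (y - p) / w                  # step 5: Z
        # step 6: beta <- (XWX^T)^{-1} XWZ; (X * w) equals X @ diag(w)
        beta_new = np.linalg.solve((X * w) @ X.T, (X * w) @ z)
        if np.max(np.abs(beta_new - beta)) < tol:     # step 7: stop
            return beta_new
        beta = beta_new
    return beta

# Hypothetical 1-D data with an intercept row (not the 2_3 data); the two
# classes overlap, so the maximum-likelihood beta stays finite.
t = np.array([-2., -1., -0.5, -0.3, 0.3, 0.5, 1., 2.])
y = np.array([0., 0., 0., 1., 0., 1., 1., 1.])
X = np.vstack([np.ones_like(t), t])
beta_hat = logistic_irls(X, y)
p_hat = 1.0 / (1.0 + np.exp(-(beta_hat @ X)))
```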
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1, nor to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and logistic regression with an example. Again, we use the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to perform logistic regression and classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2-class classification problem with a linear function while keeping the posteriors bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem, where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem since we don't have the same simplification.<br />
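As a sketch, the posteriors above can be computed directly from the formulas (Python; the function name and the coefficient vectors are hypothetical). The returned vector sums to 1, and with a single coefficient vector (K = 2) the formula reduces to the earlier two-class posterior:

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """Posteriors P(Y=i|X=x), i = 1..K, for a K-class logistic model,
    given the K-1 coefficient vectors beta_1, ..., beta_{K-1}
    (class K is the reference class in the denominator)."""
    scores = np.array([beta @ x for beta in betas])       # beta_i^T x
    denom = 1.0 + np.sum(np.exp(scores))
    # Classes 1..K-1 get exp(score)/denom; class K gets 1/denom.
    return np.append(np.exp(scores) / denom, 1.0 / denom)

# Hypothetical 3-class example in 2 dimensions.
x = np.array([1.0, 0.5])
p = multiclass_posteriors([np.array([1.0, -2.0]), np.array([0.5, 0.3])], x)

# With one coefficient vector (K = 2) the formula reduces to
# exp(beta^T x) / (1 + exp(beta^T x)); here beta^T x = 0, so P = 0.5.
p2 = multiclass_posteriors([np.array([1.0, -2.0])], x)
```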
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like least squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Due to the iterative nature of the solution, the problem has no unique global solution: the algorithm does not converge to a unique hyperplane, and the solution found depends on the size of the gap between the classes and on the starting point. If the classes are linearly separable, the algorithm is guaranteed to converge to some separating hyperplane; the proof of this fact is known as the ''perceptron convergence theorem''. For overlapping classes, however, convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features: x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
The perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1, but the right answer is +1, so the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;        % initial guess for the intercept<br />
b=[1;1;1];    % initial guess for the weights<br />
rho=.5;       % learning rate<br />
for j=1:100<br />
    changed=0;<br />
    for i=1:6<br />
        d=(b'*x(:,i)+b_0)*y(i);   % negative when point i is misclassified<br />
        if d<0<br />
            b=b+rho*x(:,i)*y(i);<br />
            b_0=b_0+rho*y(i);<br />
            changed=1;<br />
        end<br />
    end<br />
    if changed==0<br />
        break;<br />
    end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of the weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (an initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points, as illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, terminating when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary, then<br /><br /> 
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the length of the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to the constant factor <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is proportional to the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that takes a predetermined step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Consider yourself on a peak, wanting to get to the ground as fast as possible. In which direction should you step? Intuitively it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and you unfortunately start in the middle, you may end up at a saddle point or a local minimum and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm we drop the summation over <math>i</math> (all data points). This is a variant of the original algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, where we approximate the true gradient by evaluating it on a single training example. This means that <math>{\beta}</math> is improved using only one sample at a time. When there is a large data set, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent we can treat the problem sample by sample and still get decent results in practice.<br />
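The stochastic (sample-by-sample) update just described can be sketched in Python, mirroring the earlier MATLAB loop and its toy data; the 200-epoch cap and the variable names are my own choices:

```python
import numpy as np

# The six training points from the earlier table: columns are (x1, x2, x3).
X = np.array([[1., 1., 1., 0., 0., 1.],
              [0., 0., 1., 0., 1., 1.],
              [0., 1., 0., 1., 1., 1.]])
y = np.array([1., 1., 1., -1., -1., -1.])

beta = np.ones(3)   # initial guess for the weights
beta0 = 0.0         # initial guess for the intercept
rho = 0.5           # learning rate

for epoch in range(200):
    changed = False
    for i in range(X.shape[1]):
        # y_i (beta^T x_i + beta0) < 0 exactly when point i is misclassified.
        if y[i] * (beta @ X[:, i] + beta0) < 0:
            beta += rho * y[i] * X[:, i]   # stochastic step on one sample
            beta0 += rho * y[i]
            changed = True
    if not changed:   # a full pass with no misclassified points: done
        break
```

Since this toy data set is linearly separable, the perceptron convergence theorem guarantees the loop terminates with every point on the correct side.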
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected by signal channels called connections. Each processing element has a single output connection that branches ("fans out") into as many connections as desired, each carrying the same signal: the processing element's output signal <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref>. It is a multistage regression or classification model represented by a network.<br />
<br />
<br />
''This diagram is an example of a typical neural network but it can have many different forms.''<br />
[[File:NN.png|500px|thumb|center|Figure 1: General Structure of a Neural Network.]]<br />
<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there can be '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1)<br />
<br />
===Activation Function===<br />
Activation function is a term that is frequently used in classification by NN. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. The sign function is of the form<br />
[[File:signfunc1.png]]<br />
<br />
==Notes==<br />
<references/></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4483stat8412009-10-28T21:15:28Z<p>Ipargaru: /* Neural Networks (NN) - October 28, 2009 */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classfication-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data-mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented, based on its features <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is the indicator function: <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
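As a minimal illustration of this definition, the empirical error rate can be computed directly (Python; the rule h and the data set below are hypothetical):

```python
def empirical_error(h, data):
    """Empirical (training) error rate: the fraction of training pairs
    (x_i, y_i) for which the classification rule h misclassifies x_i."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

# Hypothetical rule and training set: h labels a point by its parity.
data = [(0, 0), (1, 1), (2, 1), (3, 1)]
rate = empirical_error(lambda x: x % 2, data)  # misclassifies only (2, 1)
```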
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then to assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> that <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(X)=P\{Y=1|X=x\}</math>. Given <math>\,X=x</math>, by ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says that we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> exceeds the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0; that is, we predict that the student will fail the course.<br />
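The computation in this example can be reproduced in a few lines (Python; the class-conditional likelihoods 0.05 and 0.20 are inferred from the numerator 0.025 and the denominator 0.125 above, given equal priors of 0.5):

```python
def bayes_posterior(lik1, prior1, lik0, prior0):
    """r(x) = P(Y=1 | X=x) computed via Bayes' formula from the
    class-conditional likelihoods P(X=x|Y=y) and the priors P(Y=y)."""
    return lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

# Values inferred from the worked example: the numerator 0.025 equals
# P(X=(0,1,0)|Y=1) * 0.5, so P(X=(0,1,0)|Y=1) = 0.05, and the
# denominator 0.125 then forces P(X=(0,1,0)|Y=0) = 0.20.
r = bayes_posterior(0.05, 0.5, 0.20, 0.5)
prediction = 1 if r > 0.5 else 0  # Bayes rule: classify 1 iff r(x) > 1/2
```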
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because in the Bayes equation discussed above it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>; this makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the Naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as an objective existence. In fact, they represent two different schools of thought in statistics.<br />
<br />
In the history of statistics there have been two major schools of thought: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how probability is defined. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or belief. For example, a Bayesian can predict tomorrow's weather, such as assigning a probability of <math>\,50\%</math> to rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, on the other hand, one does not see the man (the object) directly, but judges from photos (the labels) of him whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier represents the optimal method, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\,\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known, so estimates of them must be made if we want to classify the data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(G=k|X=x)=Pr(G=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>p</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
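The final line of the derivation can be sketched numerically. The following Python/numpy snippet (with illustrative means, covariance and priors, not data from the course) computes the coefficients of the linear boundary <math>\,a^\top x + b = 0</math> and checks the equal-prior special case, in which the midpoint of the two means lies on the boundary:

```python
import numpy as np

# Illustrative parameters for two classes with a shared covariance.
mu_k = np.array([0.0, 0.0])
mu_l = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
pi_k, pi_l = 0.5, 0.5

Sinv = np.linalg.inv(Sigma)
# From log(pi_k/pi_l) - (1/2)(mu_k' Sinv mu_k - mu_l' Sinv mu_l) + x' Sinv (mu_k - mu_l) = 0:
a = Sinv @ (mu_k - mu_l)                                            # linear coefficients
b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)

# With equal priors, the midpoint between the means satisfies the boundary equation.
mid = (mu_k + mu_l) / 2
print(a @ mid + b)   # ~0
```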
<br />
===QDA===<br />
The concept is the same: we find the boundary where the error rates for classification between classes are equal, except that the assumption that each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math> is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian for each class, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
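The theorem translates directly into code. A minimal Python sketch (with illustrative means, covariance and priors) implements both discriminant functions and classifies a point by arg-max; with a shared covariance the two versions pick the same class:

```python
import numpy as np

# Quadratic discriminant: delta_k = -0.5 log|Sigma_k| - 0.5 (x-mu_k)' Sigma_k^{-1} (x-mu_k) + log pi_k
def delta_quadratic(x, mu, Sigma, pi):
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d + np.log(pi))

# Linear discriminant (shared covariance): delta_k = x' Sigma^{-1} mu_k - 0.5 mu_k' Sigma^{-1} mu_k + log pi_k
def delta_linear(x, mu, Sigma_inv, pi):
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

# Illustrative two-class problem with a shared covariance.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = np.eye(2)
Sinv = np.linalg.inv(Sigma)
x = np.array([2.5, 2.8])

h = max(range(2), key=lambda k: delta_linear(x, mus[k], Sinv, 0.5))
h_q = max(range(2), key=lambda k: delta_quadratic(x, mus[k], Sigma, 0.5))
print(h, h_q)   # both pick class 1: x is closer to (3, 3)
```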
<br />
===In practice===<br />
In practice the true values are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
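These plug-in estimates can be sketched in Python on illustrative random data (two Gaussian classes; the data is synthetic, not the course data):

```python
import numpy as np

# Synthetic labeled data: 50 points around 0, 70 points around 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (70, 2))])
y = np.array([0] * 50 + [1] * 70)

n = len(y)
pi_hat, mu_hat, Sigma_hat = {}, {}, {}
for k in (0, 1):
    Xk = X[y == k]
    pi_hat[k] = len(Xk) / n                    # pi_hat_k = n_k / n
    mu_hat[k] = Xk.mean(axis=0)                # mu_hat_k = mean of class-k points
    centered = Xk - mu_hat[k]
    Sigma_hat[k] = centered.T @ centered / len(Xk)   # ML covariance (divide by n_k)

# Common covariance: weighted average of the class covariances, as above.
Sigma_pooled = sum(len(X[y == k]) * Sigma_hat[k] for k in (0, 1)) / n
print(pi_hat, Sigma_pooled.shape)
```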
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
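The Case-2 transformation can be verified numerically: after whitening with <math>\, x^* = S^{-\frac{1}{2}}U^\top x </math>, the squared Euclidean distance between transformed points equals the Mahalanobis distance in the original space. A Python sketch with an illustrative covariance:

```python
import numpy as np

# Illustrative symmetric covariance; for symmetric Sigma, svd gives Sigma = U S U^T.
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
U, S, _ = np.linalg.svd(Sigma)

def whiten(x):
    # x* = S^{-1/2} U^T x
    return np.diag(S ** -0.5) @ U.T @ x

x = np.array([1.0, -1.0])
mu = np.array([0.5, 0.0])

maha2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # Mahalanobis distance squared
eucl2 = np.sum((whiten(x) - whiten(mu)) ** 2)        # Euclidean distance in x*-space
print(maha2, eucl2)                                  # equal up to rounding
```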
<br />
Note that when we have multiple classes, they must all use the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for this classification method to be applicable. This is why the method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose two classes have different shapes, and consider transforming them to the same shape. Given a data point, which transformation should we use to decide which class the point belongs to? If we use the transformation of class A, for example, then we have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare the differences between one given class and remaining <math>\,K-1</math> classes, totally, there are <math>\,K-1</math> differences. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
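The two counting formulas are easy to compare in code; for example, with <math>\,d=64</math> (the dimensionality of the raw 2_3 images) and two classes:

```python
# Parameter counts for the decision boundaries, per the formulas above.
def lda_params(K, d):
    # (K-1) boundaries, each a'x + b with d + 1 parameters
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # (K-1) boundaries, each x'ax + b'x + c with d(d+3)/2 + 1 parameters
    return (K - 1) * (d * (d + 3) // 2 + 1)

print(lda_params(2, 64), qda_params(2, 64))   # 65 vs 2145
```

The gap grows quadratically in <math>\,d</math>, which is why QDA is far less robust for high-dimensional data.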
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function, but here we analyze its code to see how it differs from the SVD method. The following is the code of <code>princomp</code> with explanations of some of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score and v=U.<br />
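The same equivalence can be checked outside Matlab. A Python/numpy sketch on synthetic data (not the 2_3 data) shows that projecting the centered data onto <math>\,V</math> gives the same scores as reading them off the SVD directly:

```python
import numpy as np

# Synthetic data: rows are observations, columns are variables (as princomp expects).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 64))

Xc = X - X.mean(axis=0)                # center by subtracting column means, like princomp

# SVD method: Xc = U d V^T, scores = Xc @ V
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = Xc @ Vt.T

# Equivalent: scores = U * d, directly from the decomposition
scores_direct = U * d
print(np.allclose(scores_svd, scores_direct))   # True
```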
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimated parameters (the free entries of an additional symmetric <math>\,d \times d</math> covariance matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
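The feature-augmentation step can be sketched in Python (a toy matrix, not the 2_3 data): appending the squared columns gives the augmented <math>\,x^*</math> on which a linear classifier yields a quadratic boundary in the original space.

```python
import numpy as np

def augment_quadratic(X):
    """X is n-by-d (rows are points); returns the n-by-2d matrix [X, X**2]."""
    return np.hstack([X, X ** 2])

X = np.array([[1.0, 2.0],
              [3.0, -1.0]])
X_star = augment_quadratic(X)
print(X_star)
# [[ 1.  2.  1.  4.]
#  [ 3. -1.  9.  1.]]
```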
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Adding a legend makes the graph easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of the data in order to obtain well-separated data points in the new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the prior knowledge that data points belong to one of two classes. Intuitively speaking, the points of each class form a cloud around the class mean, and each class may have a different size. To separate the two classes we must determine which class's mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called the '''within class covariance''' or <math>\,S_{W}</math>.<br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, so it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
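This closed form is easy to check numerically. The following Python/NumPy sketch (the seed is arbitrary; the means and covariance match the Matlab example below) verifies that the leading eigenvector of <math>S_{W}^{-1}S_{B}</math> is parallel to <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian clouds with the means and covariance used in the Matlab example
cov = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1.0, 1.0], cov, size=300)
X2 = rng.multivariate_normal([5.0, 3.0], cov, size=300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = np.cov(X1.T) + np.cov(X2.T)        # within class covariance
S_B = np.outer(mu1 - mu2, mu1 - mu2)     # between class covariance

# Leading eigenvector of S_W^{-1} S_B ...
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w_eig = eigvecs[:, np.argmax(eigvals.real)].real

# ... versus the closed form w proportional to S_W^{-1} (mu1 - mu2)
w_closed = np.linalg.solve(S_W, mu1 - mu2)

# The two directions coincide up to sign and scale
cosine = abs(w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
print(round(cosine, 6))  # 1.0
```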
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the principal direction and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the most discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 in which each column represents an 8*8 image of a "2" or a "3". Here X1 holds all the "2"s and X2 all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \frac{1}{n_{i}}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is a more general form for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
Recall that in the two class problem we defined<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>. Substituting these into the general form,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} & =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \left(\frac{n_{1}n_{2}^{2}}{n^{2}}+\frac{n_{2}n_{1}^{2}}{n^{2}}\right)(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}\,\mathbf{S}_{B^{\ast}}<br />
\end{align}<br />
</math><br />
Thus the two definitions agree up to a constant factor, and so they yield the same discriminant directions.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},\quad<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has <math>\,k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B})=k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
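The multi-class recipe can be sketched end to end. In the following Python/NumPy illustration (the three class means, dimensions, and sample sizes are made-up values), we build <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math> from labelled data and project onto the top <math>k-1</math> eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n_i = 3, 4, 50                    # 3 classes in a 4-dimensional space
means = [np.zeros(d), np.full(d, 3.0), np.array([0.0, 3.0, 0.0, 3.0])]
X = np.vstack([rng.normal(m, 1.0, size=(n_i, d)) for m in means])
y = np.repeat(np.arange(k), n_i)

mu = X.mean(axis=0)                     # total mean vector
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for i in range(k):
    Xi = X[y == i]
    mui = Xi.mean(axis=0)
    S_W += (Xi - mui).T @ (Xi - mui)                 # within-class scatter
    S_B += len(Xi) * np.outer(mui - mu, mui - mu)    # between-class scatter

# Columns of W are the eigenvectors of S_W^{-1} S_B with the largest eigenvalues
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:k - 1]].real

Z = X @ W                               # projection to the (k-1)-dimensional space
print(Z.shape)                          # (150, 2)

# rank(S_W^{-1} S_B) = k - 1, so only k - 1 eigenvalues are numerically nonzero
nonzero = np.sum(np.abs(eigvals.real) > 1e-6 * np.abs(eigvals.real).max())
print(nonzero)                          # 2
```

Note that only <math>k-1=2</math> eigenvalues are numerically nonzero, in agreement with <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B})=k-1</math>.<br />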
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
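The closed-form solution and the hat matrix are easy to verify numerically. A Python/NumPy sketch (the dimensions, true coefficients, and noise level are arbitrary) compares <math>\hat\beta</math> with a library least-squares solver and checks that <math>\mathbf{H}</math> is a projection matrix:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # 1 in first position
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed form: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values via the hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# Agreement with the library least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
print(np.allclose(H @ H, H))              # True: H is a projection (H^2 = H)
```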
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by transposing the data and adding a row of ones for the intercept term. (The transpose is needed: <code>sample</code> is 400*2, so <code>x</code> is 3*400.)<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> (including the intercept) in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct, because it does not force <math>\,r(x)</math> to lie between 0 and 1 or the class probabilities to sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Fitting a Logistic Regression===<br />
Logistic regression fits a distribution to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood: the maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, taking the log of both sides gives<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
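A quick numerical check (a Python sketch with made-up data, not part of the original notes) that this simplified form agrees with the direct Bernoulli log-likelihood:<br />

```python
import math

def log_lik_direct(beta, xs, ys):
    """Sum of log p(x_i; beta) using the Bernoulli combination above."""
    total = 0.0
    for x, y in zip(xs, ys):
        t = sum(b * xi for b, xi in zip(beta, x))
        p = math.exp(t) / (1.0 + math.exp(t))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def log_lik_simplified(beta, xs, ys):
    """The simplified form: sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    total = 0.0
    for x, y in zip(xs, ys):
        t = sum(b * xi for b, xi in zip(beta, x))
        total += y * t - math.log(1.0 + math.exp(t))
    return total

xs = [[1.0, 0.2], [1.0, -1.5], [1.0, 0.9]]
ys = [1, 0, 1]
beta = [0.3, -0.8]
assert abs(log_lik_direct(beta, xs, ys) - log_lik_simplified(beta, xs, ys)) < 1e-12
```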
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{(exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this identity [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a useful Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> to one via the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
<br />
and then differentiating <math>\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math> with respect to <math>\underline{\beta}^T</math>.<br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math>, where <math>X</math> is the <math>{d}\times{n}</math> input matrix as above;<br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Now apply this weighted least squares estimator, with weights <math>w_{i}=p_{i}(1-p_{i})</math>, to the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
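A quick numerical check of this equivalence for a single coefficient (a Python sketch with d = 1 and made-up data; not part of the original notes):<br />

```python
import math

# One Newton/IRLS step for a single coefficient (d = 1), computed two ways:
# as the weighted least squares solution in z, and as beta_old plus the
# Newton correction. The derivation above says the two must agree.
xs = [0.5, -1.0, 2.0, 1.5]
ys = [1, 0, 1, 0]
beta_old = 0.4

ps = [math.exp(beta_old * x) / (1.0 + math.exp(beta_old * x)) for x in xs]
ws = [p * (1.0 - p) for p in ps]                       # diagonal of W
zs = [x * beta_old + (y - p) / w for x, y, p, w in zip(xs, ys, ps, ws)]

denom = sum(w * x * x for w, x in zip(ws, xs))         # scalar X W X^T
beta_wls = sum(w * x * z for w, x, z in zip(ws, xs, zs)) / denom
beta_newton = beta_old + sum(x * (y - p) for x, y, p in zip(xs, ys, ps)) / denom
assert abs(beta_wls - beta_newton) < 1e-10
```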
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. If it does not, we can only prove local convergence of the method, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, though, finding an appropriate initial value is rarely a problem: it is uncommon for an initial value to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
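The pseudo code above can be sketched in Python (a minimal illustration with a made-up two-feature data set, solving the 2×2 Newton system by hand; not part of the original notes):<br />

```python
import math

def irls(xs, ys, tol=1e-10, max_iter=100):
    """Fit beta by the IRLS / Newton-Raphson pseudo code above (d = 2; the
    first feature is a constant 1, playing the role of an intercept)."""
    beta = [0.0, 0.0]                                      # step 1: beta <- 0
    for _ in range(max_iter):
        ts = [beta[0] * x[0] + beta[1] * x[1] for x in xs]
        ps = [math.exp(t) / (1.0 + math.exp(t)) for t in ts]   # step 3
        ws = [p * (1.0 - p) for p in ps]                       # step 4
        # Newton step: beta_new = beta + (X W X^T)^{-1} X (y - p)
        g = [sum(x[j] * (y - p) for x, y, p in zip(xs, ys, ps)) for j in range(2)]
        h = [[sum(w * x[j] * x[k] for w, x in zip(ws, xs)) for k in range(2)]
             for j in range(2)]
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        step = [(h[1][1] * g[0] - h[0][1] * g[1]) / det,
                (-h[1][0] * g[0] + h[0][0] * g[1]) / det]
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if abs(step[0]) < tol and abs(step[1]) < tol:          # step 7
            break
    return beta

xs = [(1, 0.5), (1, -0.5), (1, 1.5), (1, -2.0), (1, 0.2), (1, -0.8)]
ys = [1, 0, 1, 0, 0, 1]   # labels overlap, so the MLE is finite
beta = irls(xs, ys)
# At the maximum, the score sum_i x_i (y_i - p_i) vanishes.
ts = [beta[0] * x[0] + beta[1] * x[1] for x in xs]
ps = [math.exp(t) / (1.0 + math.exp(t)) for t in ts]
for j in range(2):
    assert abs(sum(x[j] * (y - p) for x, y, p in zip(xs, ys, ps))) < 1e-6
```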
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far considered only the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
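A small Python sketch of the first difference (illustrative only; the coefficients and points are made up): the linear "posterior" can leave [0,1], while the logistic one cannot:<br />

```python
import math

def linear_posterior(beta, x):
    """'Posterior' from linear regression: beta^T x, unbounded."""
    return sum(b * xi for b, xi in zip(beta, x))

def logistic_posterior(beta, x):
    """Posterior from logistic regression: always strictly between 0 and 1."""
    t = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(t) / (1.0 + math.exp(t))

beta = [0.2, 1.0]
x_extreme = [1.0, 10.0]
assert linear_posterior(beta, x_extreme) > 1.0        # not a valid probability
assert 0.0 < logistic_posterior(beta, x_extreme) < 1.0
```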
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to fit a logistic regression model and classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
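This rule can be applied directly (a Python sketch added here, using the coefficients reported by mnrfit above; the test points are made up):<br />

```python
import math

# Coefficients reported by mnrfit in the example above.
b0, b1, b2 = 0.1861, -5.5917, -3.0547

def classify_2_3(x1, x2):
    """Return the predicted group (1 or 2) and P(Y=1 | X=x)."""
    t = b0 + b1 * x1 + b2 * x2
    p1 = math.exp(t) / (1.0 + math.exp(t))
    return (1 if t >= 0 else 2), p1

label, p = classify_2_3(-1.0, -1.0)   # t = 0.1861 + 5.5917 + 3.0547 > 0
assert label == 1 and p > 0.5
label, p = classify_2_3(1.0, 1.0)     # t = 0.1861 - 5.5917 - 3.0547 < 0
assert label == 2 and p < 0.5
```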
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.<br />
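A minimal Python sketch of these posteriors (an illustration added here, with made-up coefficient vectors) confirms that the K posteriors are positive and sum to 1:<br />

```python
import math

def multiclass_posteriors(betas, x):
    """Posteriors for a K-class logistic model, given the K-1 coefficient
    vectors beta_1, ..., beta_{K-1}; class K is the reference class."""
    exps = [math.exp(sum(b * xi for b, xi in zip(beta, x))) for beta in betas]
    denom = 1.0 + sum(exps)
    return [e / denom for e in exps] + [1.0 / denom]   # classes 1..K-1, then K

betas = [[0.5, -1.0], [-0.3, 0.8]]      # K = 3 classes, so K - 1 = 2 vectors
post = multiclass_posteriors(betas, [1.0, 2.0])
assert len(post) == 3
assert all(0.0 < p < 1.0 for p in post)
assert abs(sum(post) - 1.0) < 1e-12
```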
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 156.</ref>. <br />
<br />
Particular to the iterative nature of the solution, the problem has no unique global minimum (it is not convex). The algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable then the algorithm is shown to converge to a solution, i.e., a separating hyperplane. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can find the decision boundary line even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
 x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
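The MATLAB loop above can be ported to Python as follows (a sketch, not part of the original notes; the one deliberate change is treating points exactly on the boundary as misclassified, so the loop cannot stall on them):<br />

```python
# Perceptron training on the six example points, mirroring the MATLAB loop.
ys = [1, 1, 1, -1, -1, -1]
xs = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
b0 = 0.0
b = [1.0, 1.0, 1.0]
rho = 0.5

for _ in range(1000):                    # bounded number of passes over the data
    changed = False
    for x, y in zip(xs, ys):
        d = (sum(bj * xj for bj, xj in zip(b, x)) + b0) * y
        if d <= 0:                       # misclassified (or on the boundary)
            b = [bj + rho * xj * y for bj, xj in zip(b, x)]
            b0 = b0 + rho * y
            changed = True
    if not changed:                      # converged: no misclassified points
        break

# The data is linearly separable, so every point ends up correctly classified.
assert all((sum(bj * xj for bj, xj in zip(b, x)) + b0) * y > 0
           for x, y in zip(xs, ys))
```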
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of some weights of these inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> on the direction that is orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math>. <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}</math>, up to the constant factor <math>1/\|\underline{\beta}\|</math> (which does not affect the algorithm). <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach] which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points. <br />
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Consider yourself on a peak, wanting to get to the ground as fast as possible. Which direction should you step? Intuitively it should be the direction in which the height decreases fastest, which is given by the gradient. However, if the mountain has a saddle shape and you unfortunately start in the middle, then you will arrive at the saddle point, where the gradient vanishes, and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm, we dropped the summation over i (all data points). This is actually an alternative to the original gradient descent algorithm (sometimes called batch gradient descent): stochastic gradient descent, where we approximate the true gradient by evaluating it on only a single training example. This means that <math>{\beta}</math> gets improved by the computation of only one sample. When there is a large data set, say a population database, it is very time-consuming to do a summation over millions of samples. With stochastic gradient descent, we can treat the problem sample by sample and still get a decent result in practice.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal - the processing element's output signal <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref>. It is a multistage regression or classification model represented by a network.<br />
<br />
<br />
''This diagram is an example of a typical neural network but it can have many different forms.''<br />
[[File:NN.png|500px|thumb|center|Figure 1: General Structure of a Neural Network.]]<br />
<br />
<br />
In a regression problem there is usually only one unit in the output layer, but in a '''k'''-class classification problem there could be '''k''' units in the output layer, where unit '''k''' represents the probability of class '''k''' and each <math>\displaystyle y_k</math> is coded (0,1)<br />
<br />
===Activation Function===<br />
Activation function is a term that is frequently used in classification by NN. <br />
<br />
In the perceptron, we have a "sign" function that takes the sign of a weighted sum of input features. The sign function is of the form<br />
<br />
==Notes==<br />
<references/></div><br />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rising fields of data-mining, bioinformatics, machine learning and so on, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math>, drawn independently from the same distribution, where <math>\,X_{i} \in \mathcal{X}</math> and <math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input <math>\,X \in \mathcal{X} </math>,<br />
by using the classification rule we can predict the corresponding label <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule <math>\,h</math> such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, the rule <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the proportion of points in the training set that <math>\,h</math> does not correctly classify, i.e.,<br />
::<math>\, \hat{L}(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is the indicator function <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
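As a quick sketch of this definition (the classifier and data below are illustrative stand-ins, not from the notes), the empirical error rate is just the fraction of misclassified training points:<br />

```python
# Empirical (training) error rate: fraction of training points that h misclassifies.
# The classifier h and the data X, Y below are illustrative stand-ins.

def empirical_error_rate(h, X, Y):
    n = len(X)
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / n

# Toy example: a threshold classifier on a single feature.
h = lambda x: 1 if x > 0.5 else 0
X = [0.1, 0.4, 0.6, 0.9]
Y = [0, 1, 1, 1]          # h misclassifies only the second point
print(empirical_error_rate(h, X, Y))   # 0.25
```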
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of each class for a given object from its prior probability via Bayes' formula, and then assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find the <math>y \in \mathcal{Y}</math> for which <math>\,P(Y=y|X=x)</math> is maximal over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. To calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(x)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes' formula'' we have<br />
<br />
:<math><br />
\begin{align}<br />
r(x)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
* whether the student’s GPA was above 3.0 (G)<br />
* whether the student had a strong math background (M)<br />
* whether the student was a hard worker (H)<br />
* whether the student passed or failed the course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(x)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(x)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0; that is, we predict that he will fail the course.<br />
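The calculation above can be sketched in Python. The likelihood values used here (<math>P(X=(0,1,0)|Y=1)=0.05</math> and <math>P(X=(0,1,0)|Y=0)=0.2</math>) are assumed, chosen only to reproduce the numbers in the example, since the actual table comes from the figure.<br />

```python
# Bayes classification of the new student X = (G=0, M=1, H=0).
# The likelihoods below are assumed values that reproduce 0.025/0.125 = 0.2.

prior_pass, prior_fail = 0.5, 0.5
lik_pass = 0.05   # assumed P(X = (0,1,0) | Y = 1)
lik_fail = 0.20   # assumed P(X = (0,1,0) | Y = 0)

numerator = lik_pass * prior_pass                        # 0.025
evidence = lik_pass * prior_pass + lik_fail * prior_fail # 0.125
r = numerator / evidence                                 # posterior P(Y=1 | X=x)

prediction = 1 if r > 0.5 else 0
print(r, prediction)   # r = 0.2 < 1/2, so we predict class 0 (fail)
```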
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods: in practice it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(x)</math>, which makes the Bayes rule inconvenient to apply directly.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN (tree-augmented naive Bayes), BAN (Bayesian-network-augmented naive Bayes) and GBN (general Bayesian network).<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes vs. Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first treats probability as a degree of belief that changes with observation, while the second treats probability as an objective quantity. In fact, they represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics there have been two major schools: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot give a probability for tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g. a 50% chance of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist method, by contrast, one does not see the man (the object), but sees photos (labels) of this man and judges from them whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier represents the optimal method, it cannot be used in most practical situations, since the prior probability and class conditional densities are usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\,\mathcal{H}</math>, find <math>\,h^* \in \mathcal{H}</math>, and minimize some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known, so they must be estimated if we want to classify new data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the covariances <math>\Sigma_k \,\forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(G=k|X=x)=Pr(G=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>p</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math><br />
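To make the result concrete, the linear boundary can be evaluated directly from the class parameters. The sketch below (with made-up values for <math>\,\mu_k, \mu_l, \Sigma, \pi_k, \pi_l</math>) computes the boundary function <math>\,a^\top x+b</math> derived above; a positive value favours class <math>k</math>, a negative value class <math>l</math>, and with equal priors the boundary passes halfway between the two means.<br />

```python
import numpy as np

# Evaluate the LDA boundary function derived above:
#   f(x) = log(pi_k/pi_l) - 1/2 (mu_k' S^-1 mu_k - mu_l' S^-1 mu_l) + x' S^-1 (mu_k - mu_l)
# The means, covariance, and priors below are made-up illustrative values.

mu_k = np.array([0.0, 0.0])
mu_l = np.array([2.0, 2.0])
Sigma = np.eye(2)            # shared covariance (the LDA assumption)
pi_k = pi_l = 0.5

Sinv = np.linalg.inv(Sigma)
b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)
a = Sinv @ (mu_k - mu_l)

f = lambda x: a @ np.asarray(x) + b   # f(x) > 0 -> class k, f(x) < 0 -> class l
print(f([0.0, 0.0]), f([2.0, 2.0]), f([1.0, 1.0]))
```

With equal priors, <code>f([1.0, 1.0])</code> is zero: the midpoint of the two means lies exactly on the boundary, as noted above.<br />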
<br />
===QDA===<br />
The concept is the same: find the boundary where the error rates for classification between the classes are equal, except that the assumption that each cluster has the same covariance matrix <math>\,\Sigma</math> (equal to the mean of <math>\Sigma_k \,\forall k</math>) is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, then the Bayes classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
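The theorem translates directly into code: compute <math>\,\delta_k(x)</math> for every class and take the argmax. Below is a sketch using the quadratic form of <math>\,\delta_k</math> (the Gaussian parameters are illustrative values, not from the notes).<br />

```python
import numpy as np

# Classify by maximizing the discriminant delta_k(x) from the theorem.
# The means, covariances, and priors below are illustrative values.

def delta_quadratic(x, mu, Sigma, pi):
    """delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x-mu_k)' Sigma_k^-1 (x-mu_k) + log pi_k."""
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(pi))

def classify(x, mus, Sigmas, pis):
    """Return the class index k whose delta_k(x) is largest."""
    scores = [delta_quadratic(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
    return int(np.argmax(scores))

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2 * np.eye(2)]   # unequal covariances, i.e. the QDA case
pis = [0.5, 0.5]

print(classify(np.array([0.2, -0.1]), mus, Sigmas, pis))  # near the first mean
print(classify(np.array([2.8, 3.1]), mus, Sigmas, pis))   # near the second mean
```

When all the <code>Sigmas</code> are equal, the same argmax reduces to the linear form of <math>\,\delta_k</math> given above.<br />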
<br />
===In practice===<br />
We do not know the true values of the prior, mean, and covariance, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
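These estimates are simple to compute from training data. A sketch (on a tiny made-up two-class data set) of the sample prior, class means, ML class covariances, and the pooled covariance used by LDA:<br />

```python
import numpy as np

# Sample estimates of pi_k, mu_k, Sigma_k, and the pooled covariance,
# computed on a tiny made-up two-class data set.

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],   # class 0
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

n = len(y)
classes = np.unique(y)
pi_hat, mu_hat, Sigma_hat, n_k = {}, {}, {}, {}
for k in classes:
    Xk = X[y == k]
    n_k[k] = len(Xk)
    pi_hat[k] = n_k[k] / n              # prior: class frequency
    mu_hat[k] = Xk.mean(axis=0)         # class mean
    D = Xk - mu_hat[k]
    Sigma_hat[k] = D.T @ D / n_k[k]     # ML covariance (divide by n_k)

# Pooled covariance for LDA: weighted average of the class covariances.
Sigma_pooled = sum(n_k[k] * Sigma_hat[k] for k in classes) / n
print(pi_hat, mu_hat[0], Sigma_pooled)
```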
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that minimizes this prior-adjusted distance maximizes <math>\,\delta_k</math>, and according to the theorem we then classify the point to that class <math>\,k</math>. The assumption <math>\, \Sigma_k = I </math> means that the data in each class is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general when <math>\,X=USV^\top</math>, <math>\,U</math> is the eigenvectors of <math>\,XX^T</math> and <math>\,V</math> is the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we will have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise, we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for classification to be applicable using this method, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose two classes have different shapes and we transform them to a common shape. Given a data point, which transformation should we use to decide its class? If, for example, we use the transformation of class A, then we have already assumed that the point belongs to class A.<br />
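The transformation <math>\, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> is a whitening step, and can be sketched as follows (the shared covariance below is an illustrative value). After the transform the data has identity covariance, so Case 1 applies.<br />

```python
import numpy as np

# Whitening transform x* = S^(-1/2) U' x, where Sigma = U S U'.
# The shared covariance Sigma below is an illustrative value.

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

# Eigendecomposition of the symmetric matrix Sigma = U S U'
S, U = np.linalg.eigh(Sigma)          # S: eigenvalues, U: orthonormal eigenvectors
W = np.diag(S ** -0.5) @ U.T          # W = S^(-1/2) U'

def whiten(x):
    return W @ x

# The covariance of the transformed data, W Sigma W', is the identity.
print(W @ Sigma @ W.T)
```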
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class against the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> decision boundaries in total. Each boundary <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the boundaries, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
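As a sanity check on these two counts, a couple of lines of Python (the values of <math>K</math> and <math>d</math> are chosen arbitrarily for illustration):<br />

```python
# Number of parameters to estimate, per the LDA and QDA formulas above.

def lda_params(K, d):
    return (K - 1) * (d + 1)

def qda_params(K, d):
    return (K - 1) * (d * (d + 3) // 2 + 1)

# For example, with K = 2 classes in d = 10 dimensions:
print(lda_params(2, 10), qda_params(2, 10))   # 11 vs 66
```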
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function; here we will analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should note the following differences from the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data from Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, <code>princomp</code> centers <math>\,X</math> by subtracting off the column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using <code>princomp</code> and SVD respectively, obtaining the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y = score and v = U.<br />
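The same equivalence can also be sketched outside MATLAB. The following Python/NumPy snippet (an illustrative sketch, not part of the course code; the data and variable names are our own) checks that the principal directions from the SVD of the centered data agree, up to sign, with the eigenvectors of the sample covariance matrix:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# 400 observations of 8 variables, with distinct variances so the
# principal components are well separated.
X = rng.normal(size=(400, 8)) * np.arange(1, 9)

# Center by subtracting the column means (as princomp does).
Xc = X - X.mean(axis=0)
n = X.shape[0]

# PCA via SVD of the scaled, centered data: Xc/sqrt(n-1) = U S V'.
U, s, Vt = np.linalg.svd(Xc / np.sqrt(n - 1), full_matrices=False)
pc_svd = Vt.T          # columns are the principal directions (the "pc" output)
latent_svd = s**2      # eigenvalues of the covariance (the "latent" output)

# PCA via eigendecomposition of the sample covariance, sorted descending.
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# Eigenvalues agree; directions agree up to sign.
assert np.allclose(latent_svd, evals)
assert np.allclose(np.abs(pc_svd.T @ evecs), np.eye(8), atol=1e-6)
```

Either route gives the same subspace; <code>princomp</code> simply packages the SVD route.<br />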
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> parameters of a second covariance matrix make QDA less robust when there are fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector, and <math>x \in \mathbb{R}^{d}</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly with LDA; here <math>\,v</math> is a <math>d \times d</math> diagonal matrix with diagonal entries <math>v_1, \dots, v_d</math>, so <math>x^Tvx = \sum_i v_i x_i^2</math>.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>d \times n</math> data matrix to a quadratic dimension by appending another <math>d \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
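The augmentation step can be sketched in code. The following Python/NumPy example (illustrative only; it uses a hand-rolled two-class LDA rather than MATLAB's <code>classify</code>, and the data set is synthetic) separates two classes with a circular boundary by appending squared features:<br />

```python
import numpy as np

def augment(X):
    """Append squared features: (n, d) -> (n, 2d), as in the trick above."""
    return np.hstack([X, X**2])

def lda_direction(X, y):
    """Two-class LDA direction w = Sw^{-1}(mu1 - mu2), Sw = pooled covariance."""
    X1, X2 = X[y == 1], X[y == 2]
    Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
    return np.linalg.solve(Sw, X1.mean(axis=0) - X2.mean(axis=0))

# Toy data separable by a circle (a quadratic boundary), not by any line.
rng = np.random.default_rng(1)
r = np.r_[rng.uniform(0.0, 1.0, 200),    # class 1: inside the unit circle
          rng.uniform(1.5, 2.5, 200)]    # class 2: an annulus outside it
t = rng.uniform(0, 2 * np.pi, 400)
X = np.column_stack([r * np.cos(t), r * np.sin(t)])
y = np.r_[np.ones(200, int), 2 * np.ones(200, int)]

Xs = augment(X)                          # X* = [x1, x2, x1^2, x2^2]
w = lda_direction(Xs, y)

# Classify by the nearest projected class mean (midpoint threshold).
proj = Xs @ w
m1, m2 = proj[y == 1].mean(), proj[y == 2].mean()
pred = np.where(np.abs(proj - m1) < np.abs(proj - m2), 1, 2)
accuracy = (pred == y).mean()            # near-perfect on this data
```

Linear LDA on the raw two features cannot separate a circle from an annulus, but in the augmented space the boundary is linear.<br />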
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the prior knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, and each class may have a different size. To separate the two classes we must determine the class whose mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices; assuming at least one of them is positive definite, <math>\, S_{W}</math> is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say that the vector <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
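This closed form can be checked numerically. The following Python/NumPy sketch (illustrative only, not course code) verifies that the top eigenvector of <math>S_{W}^{-1}S_{B}</math> is parallel to <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1, 1], Sigma, 300)
X2 = rng.multivariate_normal([5, 3], Sigma, 300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)  # within-class
Sb = np.outer(mu1 - mu2, mu1 - mu2)                       # between-class

# Direction 1: top eigenvector of Sw^{-1} Sb.
evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = np.real(evecs[:, np.argmax(np.real(evals))])

# Direction 2: the closed form w proportional to Sw^{-1}(mu1 - mu2).
w_closed = np.linalg.solve(Sw, mu1 - mu2)

# The two directions coincide up to scale (cosine of 1).
cos = abs(w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
assert np.isclose(cos, 1.0)
```

Because <math>S_{B}</math> has rank one, the eigenproblem collapses to this single direction.<br />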
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced by MATLAB.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> (the scatter matrix of class <math>i</math>; the <math>\frac{1}{n_{i}}</math> factor is omitted so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> below holds exactly) and <math>\mathbf{\mu}_{i} = \frac{1}{n_{i}}\sum_{j: y_{j}=i}\mathbf{x}_{j}</math>.<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not as easy to obtain directly. One simplification is to note that the total covariance <math>\mathbf{S}_{T}</math> of the data does not depend on the class labels; since <math>\mathbf{S}_{T}</math> and <math>\mathbf{S}_{W}</math> are both easy to compute, we can get <math>\mathbf{S}_{B}</math> from the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
The first term is the within class covariance <math>\mathbf{S}_{W}</math>, so we define the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
Recall that in the two class case problem, we had<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>. Substituting these into the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \left(\frac{n_{1}n_{2}^{2}}{n^{2}} + \frac{n_{2}n_{1}^{2}}{n^{2}}\right)(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}\,\mathbf{S}_{B^{\ast}}<br />
\end{align}<br />
</math><br />
So for two classes the general <math>\mathbf{S}_{B}</math> is proportional to <math>\mathbf{S}_{B^{\ast}}</math>, and both lead to the same discriminant direction.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that the <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, thus set the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B}) \leq k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
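As a sketch of the multi-class recipe (illustrative Python/NumPy, not course code; the three Gaussian classes are made up for the example), we build the scatter matrices <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math>, check the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math>, and project onto the top <math>k-1 = 2</math> eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math>:<br />

```python
import numpy as np

rng = np.random.default_rng(3)
means = [np.array([0., 0., 0.]), np.array([4., 0., 0.]), np.array([0., 4., 4.])]
classes = [rng.normal(size=(100, 3)) + m for m in means]   # k = 3 classes, d = 3

mu = np.vstack(classes).mean(axis=0)                       # total mean

Sw = np.zeros((3, 3))                                      # within-class scatter
Sb = np.zeros((3, 3))                                      # between-class scatter
for Xi in classes:
    mui = Xi.mean(axis=0)
    Sw += (Xi - mui).T @ (Xi - mui)
    Sb += len(Xi) * np.outer(mui - mu, mui - mu)

# Total scatter decomposes exactly: S_T = S_W + S_B.
Xc = np.vstack(classes) - mu
St = Xc.T @ Xc
assert np.allclose(St, Sw + Sb)

# W = top k-1 eigenvectors of Sw^{-1} Sb; rank(Sb) <= k-1, so one
# eigenvalue is (numerically) zero.
evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(evals.real)[::-1]
W = evecs.real[:, order[:2]]                               # d x (k-1) projection
Z = np.vstack(classes) @ W                                 # projected data
```

The projected data <code>Z</code> lives in <math>k-1 = 2</math> dimensions, where the three class clouds are maximally separated.<br />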
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,x_{1}, ..., x_{d}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the least squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
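The closed-form solution and the hat matrix above can be sketched in a few lines of Python with NumPy; the data here is made up purely for illustration:<br />

```python
import numpy as np

# Made-up data: n = 5 observations, d = 1 feature, plus an intercept column,
# so X is n x (d+1) with 1 in the first position of each row.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([0.1, 1.1, 1.9, 3.2, 3.9])

# beta_hat = (X^T X)^{-1} X^T y; solving the normal equations directly
# avoids forming an explicit matrix inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T, which maps y to the fitted values.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
```

Note that the hat matrix is symmetric and idempotent, which is why it acts as a projection onto the column space of <math>\mathbf{X}</math>.<br />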
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
 >>x=[sample';ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, so that each column of x is an input vector with a constant term.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame|The figure shows the classification of the data points in 2_3.m by the linear regression model.]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
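As a quick sanity check that these two posteriors are between 0 and 1 and sum to one, a small Python sketch (the values of <math>\beta</math> and <math>x</math> are arbitrary, chosen only to exercise the formulas):<br />

```python
import math

def two_class_posteriors(beta, x):
    """P(Y=1|X=x) and P(Y=0|X=x) under the two-class logistic model."""
    t = sum(b * xi for b, xi in zip(beta, x))       # beta^T x
    p1 = math.exp(t) / (1.0 + math.exp(t))
    p0 = 1.0 / (1.0 + math.exp(t))
    return p1, p0

# Arbitrary beta and x, chosen only to illustrate the formulas.
p1, p0 = two_class_posteriors([0.5, -1.0], [2.0, 1.0])
```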
<br />
===Fitting a Logistic Regression===<br />
Logistic regression fits a distribution to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood: the maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math> you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], it's a very useful website including a Matrix Reference Manual that you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math> to reduce the occurrences of <math>\underline{\beta}</math> to one,<br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares solves the following minimization (with <math>X</math> the <math>{d}\times{n}</math> input matrix defined above): <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^T W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
The Newton-Raphson step can thus be viewed as a weighted linear regression of the iteratively computed (adjusted) response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math> on the inputs, with weights <math>w_{i}=p_{i}(1-p_{i})</math>.<br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case, although it does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. If it does not, only local convergence of the method can be proven, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, it is rare for an initial value to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving will resolve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,W_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
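The pseudo code above can be sketched in Python with NumPy, keeping the lecture's conventions (X is <math>{d}\times{n}</math> with observations as columns, labels are 0/1); the toy data below is made up and non-separable so that the maximum likelihood estimate is finite:<br />

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Newton-Raphson / IRLS for two-class logistic regression.

    X is d x n (each column is an observation x_i) and y holds 0/1 labels,
    matching the conventions in the pseudo code above."""
    d, n = X.shape
    beta = np.zeros(d)                               # step 1: beta <- 0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X.T @ beta)))      # step 3: P(x_i; beta)
        W = np.diag(p * (1.0 - p))                   # step 4: diagonal weights
        z = X.T @ beta + np.linalg.solve(W, y - p)   # step 5: adjusted response Z
        beta_new = np.linalg.solve(X @ W @ X.T, X @ W @ z)   # step 6
        if np.linalg.norm(beta_new - beta) < tol:    # step 7: convergence check
            return beta_new
        beta = beta_new
    return beta

# Made-up toy data: d = 2 (intercept row + one feature), n = 6 points.
X = np.array([[1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
              [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat = irls_logistic(X, y)
```

At the solution the gradient <math>X(\underline{Y}-\underline{P})</math> is numerically zero, which gives an easy check of the fit.<br />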
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far considered only the two-class case, <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to perform logistic regression and classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem, since we don't have the same simplification.<br />
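The K-class posterior formulas can be sketched in Python as follows; the coefficient vectors and input here are arbitrary illustrations:<br />

```python
import math

def multiclass_posteriors(betas, x):
    """Posteriors for K classes from the K-1 coefficient vectors
    beta_1, ..., beta_{K-1}; class K is the reference class."""
    scores = [sum(b * xi for b, xi in zip(beta, x)) for beta in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]   # P(Y=i|X=x), i < K
    probs.append(1.0 / denom)                       # P(Y=K|X=x)
    return probs

# Arbitrary example: K = 3 classes, 2-dimensional x.
probs = multiclass_posteriors([[1.0, -0.5], [0.2, 0.3]], [1.0, 2.0])
```

By construction the K posteriors are positive and sum to one, regardless of the choice of reference class.<br />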
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Particular to the iterative nature of the solution, the problem has no global minimum (it is not convex). It does not converge to give a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable, then the algorithm is shown to converge to a local minimum. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence to a local minimum cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line ourselves; we just have to give it some labelled examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1;1;1];<br />
rho=.5;<br />
for j=1:100;<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);<br />
if d<0<br />
b=b+rho*x(:,i)*y(i);<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end <br />
end<br />
if changed==0<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of these inputs with some weights, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> on the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to the constant factor <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is, up to the same factor, the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach] which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
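The update loop above can be sketched in a few lines of Python (the course examples use Matlab; this NumPy sketch, with made-up toy data and learning rate, is only illustrative):<br />

```python
import numpy as np

def perceptron(X, y, rho=0.1, max_iter=1000):
    """Perceptron algorithm: cycle through the points and, for each
    misclassified point, take a gradient step of size rho."""
    n, d = X.shape
    beta = np.zeros(d)
    beta0 = 0.0
    for _ in range(max_iter):
        misclassified = 0
        for i in range(n):
            # y_i * (beta^T x_i + beta_0) <= 0  <=>  x_i is misclassified
            if y[i] * (X[i] @ beta + beta0) <= 0:
                beta += rho * y[i] * X[i]
                beta0 += rho * y[i]
                misclassified += 1
        if misclassified == 0:   # converged: a separating hyperplane was found
            break
    return beta, beta0

# A linearly separable toy set: class +1 above the line x2 = x1, class -1 below.
X = np.array([[0.0, 1.0], [1.0, 2.0], [1.0, 0.0], [2.0, 1.0]])
y = np.array([1, 1, -1, -1])
beta, beta0 = perceptron(X, y)
```

After convergence every training point satisfies <math>\,y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0})>0</math>, i.e. none is misclassified.<br />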
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to the Perceptron problem, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in "skipping over" the minimum that the algorithm is trying to find, leaving the iterates oscillating between points on either side of it.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Imagine standing on a mountain peak and wanting to reach the ground as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the mountain has a saddle shape and you unluckily start in the middle, you will eventually arrive at the saddle point (a stationary point that is not the global minimum) and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm, the summation over <math>i</math> (all data points) disappears. This is an alternative to the original algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, in which the true gradient is approximated by evaluating it on a single training example. This means that <math>{\beta}</math> is improved using the computation for only one sample. When the data set is very large, say a population database, it is very time-consuming to sum over millions of samples; with stochastic gradient descent we can process the data sample by sample and still obtain decent results in practice.<br />
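To make the batch/stochastic distinction concrete, here is a minimal Python sketch (the toy data is illustrative only): the batch form sums the gradient over all misclassified points, while the stochastic form updates from a single sample.<br />

```python
import numpy as np

def batch_gradient(X, y, beta, beta0):
    """Gradient of phi(beta, beta_0) summed over the set M of misclassified points."""
    M = [i for i in range(len(y)) if y[i] * (X[i] @ beta + beta0) <= 0]
    g_beta = -sum(y[i] * X[i] for i in M)
    g_beta0 = -sum(y[i] for i in M)
    return g_beta, g_beta0

def stochastic_step(beta, beta0, x_i, y_i, rho):
    """Stochastic variant: update using a single misclassified sample."""
    return beta + rho * y_i * x_i, beta0 + rho * y_i

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1, -1])
# With beta = 0 every point sits on the boundary, so both count as misclassified.
g_beta, g_beta0 = batch_gradient(X, y, np.zeros(2), 0.0)
```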
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) - October 28, 2009 ==<br />
<br />
A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal - the processing element output signal <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref>. It is a multistage regression or classification model represented by a network.<br />
<br />
[[File: neural network.png|300px|thumb|center|Figure 1: General Structure of a Neural Network.]]<br />
<br />
==Notes==<br />
<references/></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4478stat8412009-10-28T20:26:37Z<p>Ipargaru: /* Neural Networks (NN) (Lecture October 28, 2009) */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification - 2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data-mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification, we attempt to approximate a function <math>\,h</math>, using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X{_\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> be the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:The ''''true error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:The ''''empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
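As a quick illustration, the empirical error rate is simply the fraction of misclassified training points (a Python sketch; the rule <code>h</code> and the data below are made up):<br />

```python
def empirical_error_rate(h, X, Y):
    """L_hat(h) = (1/n) * sum of the indicators I(h(X_i) != Y_i)."""
    n = len(X)
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / n

# A toy rule that thresholds a scalar feature at 0.5.
h = lambda x: 1 if x > 0.5 else 0
X = [0.1, 0.4, 0.6, 0.9]
Y = [0, 1, 1, 1]   # the second point is misclassified by h
rate = empirical_error_rate(h, X, Y)   # 1 error out of 4 points
```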
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then to assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we are going to find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use the ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(X)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality Theorem''': The Bayes rule is optimal in true error rate, that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of it being of any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
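The arithmetic in this example follows the two-class Bayes formula directly. A Python sketch (the likelihood values 0.05 and 0.2 are the ones implied by the numerator 0.025 and denominator 0.125 above, under the equal priors of 0.5):<br />

```python
def posterior(lik1, prior1, lik0, prior0):
    """r(x) = P(Y=1|X=x) via the two-class form of Bayes' formula."""
    return (lik1 * prior1) / (lik1 * prior1 + lik0 * prior0)

# Values implied by the worked example:
# P(X=(0,1,0)|Y=1) = 0.05 and P(X=(0,1,0)|Y=0) = 0.2, so that
# r(X) = 0.025 / 0.125 = 0.2.
r = posterior(0.05, 0.5, 0.2, 0.5)
prediction = 1 if r > 0.5 else 0   # the Bayes classification rule
```

Since <math>\,r(X)=0.2<\frac{1}{2}</math>, the rule predicts class 0 (fail), matching the calculation above.<br />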
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because in the Bayes equation discussed above it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>, which makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as an objective existence. Actually, they represent two different schools of statistics.<br />
<br />
Throughout the history of statistics, there have been two major schools: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist method, one does not see the man (the object), but can look at photos (the labels) of this man to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier represents the optimal method, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\,H</math> and find <math>\,h^* \in H</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of this third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known, so some estimate of them must be made if we want to classify new data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(G=k|X=x)=Pr(G=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>p</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the numbers of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
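Since the boundary is linear, its coefficients can be computed directly from the class parameters. A NumPy sketch (the means, covariance and priors below are illustrative), reading <math>\,a</math> and <math>\,b</math> off the derivation above:<br />

```python
import numpy as np

def lda_boundary(mu_k, mu_l, Sigma, pi_k, pi_l):
    """Coefficients (a, b) of the LDA decision boundary a^T x + b = 0,
    with a = Sigma^{-1}(mu_k - mu_l) and
    b = log(pi_k/pi_l) - (1/2)(mu_k^T Sigma^{-1} mu_k - mu_l^T Sigma^{-1} mu_l)."""
    Sinv = np.linalg.inv(Sigma)
    a = Sinv @ (mu_k - mu_l)
    b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)
    return a, b

mu_k = np.array([1.0, 0.0])
mu_l = np.array([-1.0, 0.0])
Sigma = np.eye(2)
a, b = lda_boundary(mu_k, mu_l, Sigma, 0.5, 0.5)
# With equal priors the boundary passes through the midpoint of the two means.
mid = (mu_k + mu_l) / 2
```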
<br />
===QDA===<br />
The concept is the same idea of finding a boundary where the error rate for classification between classes are equal, except the assumption that each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math> is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
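The theorem translates directly into code: evaluate <math>\,\delta_k(x)</math> for each class and take the argmax. A NumPy sketch with illustrative parameters (not data from the lectures):<br />

```python
import numpy as np

def delta_qda(x, mu, Sigma, pi):
    """Quadratic discriminant delta_k(x)."""
    Sinv = np.linalg.inv(Sigma)
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ Sinv @ d + np.log(pi))

def delta_lda(x, mu, Sigma, pi):
    """Linear discriminant delta_k(x) for a shared covariance Sigma."""
    Sinv = np.linalg.inv(Sigma)
    return x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(pi)

def classify(x, mus, Sigmas, pis, delta):
    """h(x) = argmax_k delta_k(x)."""
    return max(range(len(mus)), key=lambda k: delta(x, mus[k], Sigmas[k], pis[k]))

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
pis = [0.5, 0.5]
x = np.array([2.5, 2.8])   # a point much closer to the second mean
```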
<br />
===In practice===<br />
We do not know the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math>, so we use their sample estimates in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
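These estimates are straightforward to compute from labelled data. A NumPy sketch (the data below is illustrative); note that the pooled covariance is the sample-size-weighted average of the per-class ML estimates, matching the formula above:<br />

```python
import numpy as np

def estimate_parameters(X, y):
    """Sample estimates of pi_k, mu_k, Sigma_k and the pooled Sigma."""
    n = len(y)
    classes = np.unique(y)
    pi, mu, Sigma = {}, {}, {}
    for k in classes:
        Xk = X[y == k]
        nk = len(Xk)
        pi[k] = nk / n
        mu[k] = Xk.mean(axis=0)
        D = Xk - mu[k]
        Sigma[k] = D.T @ D / nk              # per-class ML estimate
    # Pooled estimate: sum_r n_r * Sigma_r / sum_l n_l
    pooled = sum((y == k).sum() * Sigma[k] for k in classes) / n
    return pi, mu, Sigma, pooled

X = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [7.0, 5.0]])
y = np.array([0, 0, 1, 1])
pi, mu, Sigma, pooled = estimate_parameters(X, y)
```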
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, the columns of <math>\,U</math> are the eigenvectors of <math>\,XX^\top</math> and the columns of <math>\,V</math> are the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we have <math>\, U=V</math>. Here <math>\, \Sigma </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
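This transformation is a standard whitening step and can be computed from the eigendecomposition of <math>\,\Sigma</math>. A NumPy sketch with an illustrative covariance:<br />

```python
import numpy as np

Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])   # an illustrative covariance
S, U = np.linalg.eigh(Sigma)                 # Sigma = U diag(S) U^T
W = np.diag(S ** -0.5) @ U.T                 # x* = S^{-1/2} U^T x = W x

# After the transform the covariance becomes the identity, i.e. Case 1.
transformed = W @ Sigma @ W.T
```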
<br />
Note that when we have multiple classes, they must all have the same transformation; otherwise, we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for this method of classification to be applicable. So this method works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose you have two classes with different shapes and you transform them to the same shape. Given a data point, which transformation should you use to decide its class? For example, if you use the transformation of class A, then you have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we just need to compare one given class against the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
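The two counting formulas can be written down directly (a small Python sketch):<br />

```python
def lda_params(K, d):
    """(K-1) boundaries, each a^T x + b with d+1 parameters."""
    return (K - 1) * (d + 1)

def qda_params(K, d):
    """(K-1) boundaries, each x^T a x + b^T x + c with d(d+3)/2 + 1 parameters
    (a symmetric d x d matrix, a d-vector, and a constant)."""
    return (K - 1) * (d * (d + 3) // 2 + 1)
```

For example, with <math>\,K=2</math> classes in <math>\,d=2</math> dimensions, LDA needs 3 parameters while QDA needs 6; the gap widens quadratically as <math>\,d</math> grows, which is what the plot above shows.<br />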
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks each data point belongs to, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
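For readers working outside Matlab, the pooled-covariance rule behind <code>classify(..., 'linear')</code> can be sketched in Python with NumPy. This is an illustration on synthetic data, not the Matlab implementation, and all variable names here are our own.<br />

```python
import numpy as np

# LDA with a pooled covariance matrix, the model behind classify(..., 'linear').
# Synthetic two-class data; all names are illustrative.
rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0, 0], np.eye(2), 200)  # class 1 cloud
X2 = rng.multivariate_normal([3, 3], np.eye(2), 200)  # class 2 cloud
X = np.vstack([X1, X2])
y = np.r_[np.ones(200), 2 * np.ones(200)]

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)) / 2  # pooled estimate
Sinv = np.linalg.inv(Sigma)

def delta(points, mu):
    # Linear discriminant score for one class (equal priors assumed).
    return points @ Sinv @ mu - 0.5 * mu @ Sinv @ mu

pred = np.where(delta(X, mu1) > delta(X, mu2), 1, 2)
error_rate = np.mean(pred != y)  # empirical error rate, as in the Matlab example
```

As in the Matlab example, comparing the predicted labels to the true labels gives the empirical error rate.<br />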
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform principal component analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the full details of this function, but here we analyze its code to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following differences from the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code>, possibly up to the sign of each column, since singular vectors are only determined up to sign.<br />
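The same equivalence can be sketched in Python with NumPy: center the columns, take the SVD, and use <math>\,V</math> as the coefficients, mirroring what <code>princomp</code> does internally (synthetic data; names are our own).<br />

```python
import numpy as np

# PCA via SVD, mirroring what princomp does: center by column means,
# decompose, and use V as the principal-component coefficients.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))      # rows = observations, columns = variables

Xc = X - X.mean(axis=0)                # center by subtracting column means
U, s, Vt = np.linalg.svd(Xc / np.sqrt(len(X) - 1), full_matrices=False)
pc = Vt.T                              # coefficients: the V of X = U d V'
score = Xc @ pc                        # data represented in PC space
latent = s ** 2                        # eigenvalues of the covariance of X
```

The eigenvalues recovered this way agree with those of the sample covariance matrix, and the coefficient columns are orthonormal.<br />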
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus a second <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the free parameters of another symmetric <math>d \times d</math> covariance matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^{d}</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate directly, where <math>\,v</math> is a diagonal matrix with entries <math>\,v_1,\dots,v_d</math>.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix containing the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
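The construction above can be checked numerically. The following Python/NumPy sketch (our own illustrative names) verifies that a linear function of <math>x^*</math> equals the quadratic <math>g(x)</math> when <math>v</math> is diagonal.<br />

```python
import numpy as np

# Verify the trick: w*'x* (linear in the augmented space) reproduces
# g(x) = x'vx + w'x with a diagonal quadratic term v = diag(v1, ..., vd).
rng = np.random.default_rng(2)
d = 3
w = rng.standard_normal(d)   # linear coefficients
v = rng.standard_normal(d)   # diagonal entries of the quadratic term
x = rng.standard_normal(d)

x_star = np.concatenate([x, x ** 2])   # x* = [x1,...,xd, x1^2,...,xd^2]
w_star = np.concatenate([w, v])        # w* = [w1,...,wd, v1,...,vd]

linear_in_x_star = w_star @ x_star
quadratic_in_x = x @ np.diag(v) @ x + w @ x
```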
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the prior knowledge that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, and each class may have a different size. To separate the two classes we must determine the class whose mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive semi-definite matrices; assuming it is positive definite, it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
Both <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
Therefore the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
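A quick numerical sketch of this result in Python/NumPy (synthetic data, our own names): the direction <math>S_{w}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> attains a Rayleigh ratio at least as large as that of an arbitrary direction.<br />

```python
import numpy as np

# Two-class FDA: w proportional to S_W^{-1}(mu1 - mu2) maximizes w'S_B w / w'S_W w.
rng = np.random.default_rng(3)
cov = [[1, 1.5], [1.5, 3]]
X1 = rng.multivariate_normal([1, 1], cov, 300)
X2 = rng.multivariate_normal([5, 3], cov, 300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)  # within class
S_B = np.outer(mu1 - mu2, mu1 - mu2)                       # between class

w_fda = np.linalg.solve(S_W, mu1 - mu2)  # proportional to the optimal direction

def rayleigh(w):
    return (w @ S_B @ w) / (w @ S_W @ w)

w_rand = rng.standard_normal(2)          # any other direction scores no higher
```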
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA using a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the most discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200), X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200), X_hat(1,201:400), 'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> is the (unnormalized) scatter matrix of class <math>i</math> and <math>\mathbf{\mu}_{i} = \frac{1}{n_{i}}\sum_{j: y_{j}=i}\mathbf{x}_{j}</math> is its mean.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is constant; since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general derivation of <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
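This decomposition is easy to confirm numerically. The Python/NumPy sketch below (synthetic three-class data, our own names) checks <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> using the unnormalized scatter matrices of the derivation above.<br />

```python
import numpy as np

# Verify S_T = S_W + S_B with unnormalized scatter matrices.
rng = np.random.default_rng(4)
groups = [rng.normal(loc=m, size=(50, 2)) for m in ([0, 0], [4, 0], [2, 3])]
X = np.vstack(groups)
mu = X.mean(axis=0)                       # total mean vector

S_W = sum((G - G.mean(axis=0)).T @ (G - G.mean(axis=0)) for G in groups)
S_B = sum(len(G) * np.outer(G.mean(axis=0) - mu, G.mean(axis=0) - mu)
          for G in groups)
S_T = (X - mu).T @ (X - mu)               # total scatter
```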
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Apparently, they are very similar.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},\quad<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
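A hedged Python/NumPy sketch of the multi-class procedure (synthetic data, illustrative names): form <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> and keep the eigenvectors of its largest <math>k-1</math> eigenvalues; since <math>\mathbf{S}_{B}</math> has rank at most <math>k-1</math>, the remaining eigenvalues vanish.<br />

```python
import numpy as np

# Multi-class FDA: project d-dimensional data onto the top k-1 eigenvectors
# of S_W^{-1} S_B. Synthetic data with k = 3 classes in d = 4 dimensions.
rng = np.random.default_rng(5)
k, d, n_per = 3, 4, 60
means = 3 * rng.standard_normal((k, d))
groups = [rng.normal(loc=m, size=(n_per, d)) for m in means]
X = np.vstack(groups)
mu = X.mean(axis=0)

S_W = sum((G - G.mean(axis=0)).T @ (G - G.mean(axis=0)) for G in groups)
S_B = sum(len(G) * np.outer(G.mean(axis=0) - mu, G.mean(axis=0) - mu)
          for G in groups)

evals, evecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(evals.real)[::-1]       # sort eigenvalues descending
W = evecs.real[:, order[:k - 1]]           # d x (k-1) transformation matrix
Z = X @ W                                  # projected data, n x (k-1)
```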
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
As in the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which is actually a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, since <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B})=k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case: the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues of<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
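The eigen-decomposition above is easy to sketch numerically. The examples elsewhere on this page use MATLAB, but here is a minimal NumPy sketch on made-up three-class Gaussian data (the cluster means and sizes are illustrative assumptions, not data from the course):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Three made-up 2-D classes (k = 3), so we keep k - 1 = 2 discriminant directions.
X = [rng.normal(m, 0.5, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])]
mu_i = [x.mean(axis=0) for x in X]              # class means
mu = np.vstack(X).mean(axis=0)                  # overall mean

# Within- and between-class scatter matrices, as defined above.
S_W = sum((x - m).T @ (x - m) for x, m in zip(X, mu_i))
S_B = sum(len(x) * np.outer(m - mu, m - mu) for x, m in zip(X, mu_i))

# Columns of W are the eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs.real[:, order[: len(X) - 1]]
```

Note that <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> is generally not symmetric, so <code>np.linalg.eig</code> (rather than <code>eigh</code>) is used and the real parts are taken.<br />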
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math> our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
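The closed-form estimator and the hat matrix can be verified on made-up data. This is a NumPy sketch (the worked example below uses MATLAB); the true coefficients and noise level are arbitrary illustrative choices:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
# n x (d+1) design matrix with a column of ones in the first position.
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
beta_true = np.array([2.0, 1.0, -1.0, 0.5])          # made-up coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T y  (solve is preferred over an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T maps y to the fitted values.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
```

The hat matrix is idempotent (<math>\mathbf{H}^2=\mathbf{H}</math>), and the residuals <math>\mathbf{y}-\mathbf{\hat y}</math> are orthogonal to the columns of <math>\mathbf{X}</math>, which follows from setting the first derivative of the RSS to zero.<br />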
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample'; ones(1,400)];<br />
Construct x by transposing the data and adding a row of ones, giving a <math>3\times{400}</math> matrix.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
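As a quick sanity check, the two posteriors are complementary by construction. A NumPy sketch (<math>\underline{\beta}</math> and <math>\underline{x}</math> are arbitrary made-up values, with the leading 1 carrying the intercept):<br />

```python
import numpy as np

beta = np.array([0.5, -1.0, 2.0])       # made-up coefficients
x = np.array([1.0, 0.3, -0.7])          # first entry is the constant 1

# P(Y=1|X=x) and P(Y=0|X=x) from the logistic model above.
p1 = np.exp(beta @ x) / (1 + np.exp(beta @ x))
p0 = 1 / (1 + np.exp(beta @ x))
```

Both values lie strictly in (0, 1) and sum to 1, unlike the linear regression estimate of the posterior.<br />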
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the known distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
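The algebraic simplification above is easy to check numerically: the compact form <math>\sum_i y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))</math> must agree with the direct Bernoulli log-likelihood. A NumPy sketch on made-up data (labels and coefficients are arbitrary):<br />

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))             # made-up inputs, one row per observation
y = rng.integers(0, 2, size=20)          # made-up 0/1 labels
beta = np.array([0.3, -0.8, 1.1])        # made-up coefficients

eta = X @ beta                           # the linear predictors beta^T x_i
p = np.exp(eta) / (1 + np.exp(eta))

# Direct Bernoulli log-likelihood ...
ll_direct = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
# ... and the simplified form derived above.
ll_simplified = np.sum(y * eta - np.log(1 + np.exp(eta)))
```

The two quantities agree up to floating-point rounding, confirming the cancellation in the derivation.<br />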
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x_i)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math><br />
* <math>P(Y=0 | X=x_i)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math> you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], it's a very useful website including a Matrix Reference Manual that you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first rewriting the gradient using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>, which reduces the occurrences of <math>\underline{\beta}</math> to one,<br />
<br />
and then differentiating <math>\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right] </math><br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^{T}W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
This is a weighted linear regression on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, the full parameter vector will be <math>(d+1)\times{1}</math>.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. When it does not, we can only prove local convergence of the method, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: it is uncommon for an initial point to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
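The pseudo code above translates almost line for line into NumPy. A sketch on made-up data (here <math>X</math> is <math>d\times{n}</math>, matching the lecture's convention; the true intercept and slope used to generate the labels are illustrative assumptions):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.vstack([np.ones(n), rng.normal(size=n)])          # d x n, row of 1s for the intercept
# Made-up labels drawn from a logistic model with intercept 0.5 and slope 2.
y = (rng.random(n) < 1 / (1 + np.exp(-(0.5 + 2.0 * X[1])))).astype(float)

beta = np.zeros(2)                                       # step 1: beta <- 0
for _ in range(25):
    p = np.exp(beta @ X) / (1 + np.exp(beta @ X))        # step 3: P(x_i; beta)
    W = np.diag(p * (1 - p))                             # step 4: diagonal weight matrix
    Z = X.T @ beta + np.linalg.solve(W, y - p)           # step 5: adjusted response
    beta_new = np.linalg.solve(X @ W @ X.T, X @ W @ Z)   # step 6: weighted least squares
    if np.allclose(beta_new, beta, atol=1e-8):           # step 7: convergence check
        beta = beta_new
        break
    beta = beta_new
```

At convergence the gradient <math>X(\underline{Y}-\underline{P})</math> is (numerically) zero, which is exactly the first-order condition derived earlier.<br />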
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only discussed the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
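The first difference can be seen numerically: the least-squares estimate of <math>\,P(Y=1|X=x)</math> leaves <math>[0,1]</math> for extreme inputs, while the logistic form cannot. A NumPy sketch with made-up 1-D data (the logistic coefficients below are made-up illustrative values, not fitted):<br />

```python
import numpy as np

# Made-up 1-D training data: class 0 near -1, class 1 near +1.
x = np.array([-1.5, -1.0, -0.5, 0.5, 1.0, 1.5])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.vstack([np.ones_like(x), x]).T                    # n x 2 design matrix

# Linear regression estimate of P(Y=1|X=x): unbounded.
beta_lin = np.linalg.solve(X.T @ X, X.T @ y)
p_lin = np.array([1.0, 10.0]) @ beta_lin                 # evaluate at the extreme input x = 10

# Logistic model with made-up coefficients: always strictly inside (0, 1).
beta_log = np.array([0.0, 3.0])
p_log = 1 / (1 + np.exp(-(np.array([1.0, 10.0]) @ beta_log)))
```

At x = 10 the linear estimate exceeds 1, so it cannot be interpreted as a probability, while the logistic value stays in (0, 1).<br />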
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to apply logistic regression to classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other, as is true in the 2 class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.<br />
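The K-class posteriors above still sum to 1, which is easy to verify numerically. A NumPy sketch for K = 3 (the <math>\beta_i</math> vectors and input are made-up illustrative values):<br />

```python
import numpy as np

x = np.array([1.0, 0.4, -1.2])           # input with a leading 1 for the intercept
betas = [np.array([0.2, 1.0, -0.5]),     # beta_1 (made-up)
         np.array([-0.3, 0.7, 0.9])]     # beta_2 (made-up); class K = 3 is the reference

# Posteriors from the K-class model: K - 1 exponential terms over a shared denominator.
denom = 1 + sum(np.exp(b @ x) for b in betas)
posteriors = [np.exp(b @ x) / denom for b in betas] + [1 / denom]
```

Each posterior lies in (0, 1) and the K values sum to 1 by construction, since they share the same denominator.<br />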
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Particular to the iterative nature of the solution, the problem is not convex and has no unique global minimum: the algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are linearly separable then the algorithm is guaranteed to converge to a separating hyperplane in a finite number of steps. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary line even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features:x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1; but the right answer is +1, so the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1,1,1]';<br />
rho=.1;<br />
while 1<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i);   % d<0 means point i is misclassified<br />
if d<0<br />
b=b+rho*y(i)*x(:,i);      % move the boundary toward the misclassified point<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end<br />
end<br />
if changed==0<br />
break;                    % a full pass with no mistakes: converged<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> is the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of these weighted inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points, as illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, and terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math>, up to the constant scaling factor <math>\|\underline{\beta}\|</math>. <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
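The sign criterion above can be checked directly in code. The following is a small illustrative sketch (written in Python rather than the MATLAB used elsewhere in these notes; the function name and the example boundary are made up for illustration):<br />

```python
# Sign test y_i * (beta^T x_i + beta_0): the product is positive
# exactly when x_i lies on the correct side of the boundary.
def is_misclassified(beta, beta_0, x_i, y_i):
    score = sum(b * x for b, x in zip(beta, x_i)) + beta_0
    return y_i * score < 0

# Hypothetical boundary x1 + x2 - 1 = 0, i.e. beta = [1, 1], beta_0 = -1.
beta, beta_0 = [1.0, 1.0], -1.0
print(is_misclassified(beta, beta_0, [2.0, 2.0], +1))  # correct side -> False
print(is_misclassified(beta, beta_0, [2.0, 2.0], -1))  # wrong side  -> True
```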
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach] which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to the Perceptron problem, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if it is too large the algorithm may "skip over" the minimum it is trying to find and oscillate forever between points on either side of it.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Imagine standing at the peak of a mountain and wanting to reach the ground as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the terrain has more than one basin and you unfortunately start in the wrong one, you will eventually arrive at a local minimum and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over <math>i</math> (all data points). This is a variant of the original gradient descent algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, where we approximate the true gradient by evaluating it on a single training example. This means that <math>{\beta}</math> is improved by the computation on only one sample. When there is a large data set, say a population database, it is very time-consuming to sum over millions of samples. With stochastic gradient descent we can treat the problem sample by sample and still get decent results in practice.<br />
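The contrast between the batch update (summing over all misclassified points before moving) and the stochastic update (moving after each single point) can be sketched as follows. This is an illustrative Python translation of the update rule above, not code from the lecture:<br />

```python
def batch_step(beta, beta_0, X, Y, rho):
    # Batch gradient descent: accumulate the gradient over every
    # misclassified point, then update (beta, beta_0) once.
    g = [0.0] * len(beta)
    g0 = 0.0
    for x_i, y_i in zip(X, Y):
        if y_i * (sum(b * x for b, x in zip(beta, x_i)) + beta_0) < 0:
            g = [gj + y_i * xj for gj, xj in zip(g, x_i)]
            g0 += y_i
    return [b + rho * gj for b, gj in zip(beta, g)], beta_0 + rho * g0

def stochastic_step(beta, beta_0, x_i, y_i, rho):
    # Stochastic gradient descent: update immediately when a single
    # point (x_i, y_i) turns out to be misclassified.
    if y_i * (sum(b * x for b, x in zip(beta, x_i)) + beta_0) < 0:
        beta = [b + rho * y_i * xj for b, xj in zip(beta, x_i)]
        beta_0 = beta_0 + rho * y_i
    return beta, beta_0
```

In both cases the step applied to <math>(\underline{\beta}, \beta_0)</math> is <math>\rho\,(y_i\underline{x_i},\, y_i)</math>; the only difference is how many misclassified points contribute before the parameters change.<br />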
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) (Lecture October 28, 2009)==<br />
<br />
A neural network is a parallel, distributed information-processing structure consisting of processing elements interconnected by signal channels called connections. Each processing element has a single output connection which branches ("fans out") into as many connections as desired, each carrying the same signal: the processing element's output signal <ref><br />
Theory of the Backpropagation Neural Network, R. Hecht-Nielsen </ref>. It is a multistage regression or classification model represented by a network.<br />
<br />
==Notes==<br />
<references/></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4477stat8412009-10-28T20:17:12Z<p>Ipargaru: /* The Perceptron (Lecture October 23, 2009) */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification we attempt to approximate a function <math>\,h</math>, by using a training data set, that will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented with its features <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, \hat{L}(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator such that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
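The empirical error rate is straightforward to compute; a minimal Python sketch with a made-up toy classifier (the MATLAB used elsewhere in these notes would work equally well):<br />

```python
def empirical_error_rate(h, X, Y):
    # (1/n) * sum of I(h(x_i) != y_i) over the training set.
    n = len(X)
    return sum(1 for x_i, y_i in zip(X, Y) if h(x_i) != y_i) / n

# Made-up toy classifier: predict 1 when the first feature is positive.
h = lambda x: 1 if x[0] > 0 else 0
X = [[1.0], [-2.0], [3.0], [-0.5]]
Y = [1, 0, 0, 0]
print(empirical_error_rate(h, X, Y))  # only [3.0] is misclassified -> 0.25
```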
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use the ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(X)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by the ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
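The arithmetic of this example can be reproduced with a short sketch. The likelihoods below are inferred from the numbers shown above: with equal priors of 0.5, the numerator 0.025 implies <math>\,P(X=(0,1,0)|Y=1)=0.05</math>, and the evidence 0.125 implies <math>\,P(X=(0,1,0)|Y=0)=0.2</math>. (This is Python rather than the MATLAB used elsewhere in these notes.)<br />

```python
def posterior_r(lik1, lik0, prior1=0.5, prior0=0.5):
    # r(X) = P(Y=1|X=x) via the Bayes formula in the two-class case.
    num = lik1 * prior1
    return num / (num + lik0 * prior0)

# Likelihoods inferred from the worked student example above.
r = posterior_r(lik1=0.05, lik0=0.2)
print(r)                      # 0.2, matching the example
print(1 if r > 0.5 else 0)    # Bayes rule: classify the student as 0 (fail)
```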
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because in the Bayes equation discussed above it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>, which makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes Classifier: the Naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first one considers probability as changing based on observation, while the second one considers probability as an objective existence. They represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics, there have been two major schools: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed and unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist method, one does not see the man, but judges from photos (data) of this man whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier represents the optimal method, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form <math>\,ax+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the numbers of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
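The boundary derived above has the linear form <math>\,a^\top x+b=0</math> with <math>\,a=\Sigma^{-1}(\mu_k-\mu_l)</math> and <math>\,b=\log(\pi_k/\pi_l)-\frac{1}{2}(\mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l)</math>. A two-dimensional Python sketch (the 2x2 inverse is written out by hand; all example numbers are made up):<br />

```python
import math

def inv2(S):
    # Inverse of a 2x2 matrix [[a, b], [c, d]] via the adjugate.
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def lda_boundary(mu_k, mu_l, Sigma, pi_k, pi_l):
    # a = Sigma^{-1} (mu_k - mu_l)
    # b = log(pi_k/pi_l) - 1/2 (mu_k^T Sigma^{-1} mu_k - mu_l^T Sigma^{-1} mu_l)
    Si = inv2(Sigma)
    mv = lambda M, v: [sum(M[i][j] * v[j] for j in range(2)) for i in range(2)]
    quad = lambda v: sum(v[i] * mv(Si, v)[i] for i in range(2))
    a = mv(Si, [mu_k[0] - mu_l[0], mu_k[1] - mu_l[1]])
    b = math.log(pi_k / pi_l) - 0.5 * (quad(mu_k) - quad(mu_l))
    return a, b

# Equal priors, identity covariance, made-up means.
a, b = lda_boundary([2.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5, 0.5)
print(a, b)  # [2.0, 0.0] -2.0, i.e. the boundary 2*x1 - 2 = 0 (x1 = 1)
```

With equal priors and identity covariance the boundary is the perpendicular bisector of the segment joining the two means, i.e. it lies halfway between them.<br />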
<br />
===QDA===<br />
The concept is the same as in LDA: we find the boundary where the error rates for classification between the classes are equal, except that the assumption that each cluster has the same covariance <math>\,\Sigma</math>, equal to the mean covariance of <math>\Sigma_k \forall k</math>, is removed.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form <math>\,ax^2+bx+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
where <br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
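The theorem translates directly into code: evaluate <math>\,\delta_k(x)</math> for every class and take the argmax. A two-dimensional Python sketch of the quadratic (QDA) case, with made-up example parameters:<br />

```python
import math

def delta_k(x, mu, Sigma, pi):
    # delta_k = -1/2 log|Sigma_k| - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log(pi_k)
    (a, b), (c, d) = Sigma
    det = a * d - b * c
    Si = [[d / det, -b / det], [-c / det, a / det]]
    v = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(v[i] * sum(Si[i][j] * v[j] for j in range(2)) for i in range(2))
    return -0.5 * math.log(det) - 0.5 * quad + math.log(pi)

def classify(x, params):
    # params: one (mu, Sigma, pi) triple per class; return argmax_k delta_k(x).
    scores = [delta_k(x, mu, S, pi) for mu, S, pi in params]
    return max(range(len(scores)), key=lambda k: scores[k])

# Made-up parameters for two classes.
params = [([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
          ([4.0, 4.0], [[2.0, 0.0], [0.0, 2.0]], 0.5)]
print(classify([0.5, 0.5], params))  # nearer the first mean  -> 0
print(classify([4.0, 3.5], params))  # nearer the second mean -> 1
```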
<br />
===In practice===<br />
In practice we do not know the true parameters, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
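These sample estimates can be computed directly. A one-dimensional Python sketch, so each covariance reduces to a scalar variance (the data and labels are made up):<br />

```python
def estimate_parameters(X, Y, k):
    # pi_hat = n_k / n; mu_hat = class-k sample mean;
    # sigma2_hat = ML variance estimate, dividing by n_k.
    xs = [x for x, y in zip(X, Y) if y == k]
    n_k, n = len(xs), len(X)
    pi_hat = n_k / n
    mu_hat = sum(xs) / n_k
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n_k
    return pi_hat, mu_hat, sigma2_hat

# Made-up 1-d training data with two classes.
X = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
Y = [0, 0, 0, 1, 1, 1]
p0, m0, s0 = estimate_parameters(X, Y, 0)
p1, m1, s1 = estimate_parameters(X, Y, 1)
print(p0, m0, s0)

# Pooled (common) covariance for LDA: the n_r-weighted average
# of the per-class variances.
pooled = (3 * s0 + 3 * s1) / 6
print(pooled)
```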
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
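Under <math>\, \Sigma_k = I </math> the rule therefore reduces to choosing the nearest class center, adjusted by the log of the prior. A minimal Python sketch with made-up centers:<br />

```python
import math

def classify_identity_cov(x, means, priors):
    # With Sigma_k = I: delta_k = -1/2 * ||x - mu_k||^2 + log(pi_k);
    # choose the class whose prior-adjusted center is closest.
    def delta(mu, pi):
        sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
        return -0.5 * sq + math.log(pi)
    return max(range(len(means)), key=lambda k: delta(means[k], priors[k]))

# Made-up centers with equal priors.
means = [[0.0, 0.0], [3.0, 3.0]]
print(classify_identity_cov([1.0, 1.0], means, [0.5, 0.5]))  # closer to [0,0] -> 0
print(classify_identity_cov([2.5, 2.0], means, [0.5, 0.5]))  # closer to [3,3] -> 1
```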
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (In general, when <math>\,X=USV^\top</math>, <math>\,U</math> contains the eigenvectors of <math>\,XX^\top</math> and <math>\,V</math> contains the eigenvectors of <math>\,X^\top X</math>. <br />
So if <math>\, X</math> is symmetric, we have <math>\, U=V</math>. Here <math>\, \Sigma_k </math> is symmetric.)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\<br />
& = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
& = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume, ahead of time, which class a data point belongs to in order to know which transformation to apply. All classes therefore need to have the same shape (covariance) for this method to be applicable, which is exactly the LDA setting.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose the two classes have different shapes and we want to decide which class a given data point belongs to. Which transformation should we use? If we apply the transformation of class A, we have already assumed that the data point belongs to class A.<br />
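The whitening step <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be sketched as follows (Python/NumPy, with a made-up shared covariance matrix): after the transform, the common covariance becomes the identity, so Case 1 applies.<br />

```python
import numpy as np

# A hypothetical shared covariance matrix (symmetric, positive definite)
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.5]])

# Sigma = U S U^T; eigh returns eigenvalues S and orthonormal eigenvectors U
S, U = np.linalg.eigh(Sigma)
W = np.diag(S ** -0.5) @ U.T   # the map x -> S^{-1/2} U^T x

# The whitened covariance W Sigma W^T is exactly the identity,
# so Euclidean distance (Case 1) is valid on the transformed points.
print(np.allclose(W @ Sigma @ W.T, np.eye(2)))  # True
```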
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need the differences between one given class and the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
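The two counts can be double-checked in a few lines (a Python sketch; the values of <math>\,K</math> and <math>\,d</math> below are arbitrary examples):<br />

```python
def lda_params(K, d):
    # each of the K-1 linear differences a^T x + b has d + 1 parameters
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # each quadratic difference x^T a x + b^T x + c:
    # symmetric a has d(d+1)/2 parameters, b has d, c has 1,
    # and d(d+1)/2 + d + 1 = d(d+3)/2 + 1
    return (K - 1) * (d * (d + 3) // 2 + 1)

print(lda_params(3, 64), qda_params(3, 64))  # 130 4290
```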
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA is correct on only 2 more data points than LDA: we can see one blue point and one red point that lie on the correct side of the curve but did not lie on the correct side of LDA's line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform principal component analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently; the Matlab help file gives the full details. Here we analyze the code of <code>princomp</code> to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in Assignment 1, note that we therefore take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Running both, we can see that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code>.<br />
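The same equivalence can be sketched in Python/NumPy (this mirrors, rather than reproduces, the Matlab above; the data matrix is random): the scores obtained by projecting the centered data onto the right singular vectors equal <math>\,U</math> scaled by the singular values.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 64))   # rows = observations, as princomp expects

Xc = X - X.mean(axis=0)          # center by subtracting column means
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

score_svd = Xc @ Vt.T            # project onto principal directions (like y = X1'*v)
score_us = U * s                 # equivalently U * diag(s), since Xc = U diag(s) V^T

print(np.allclose(score_svd, score_us))  # True
```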
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>\,d</math>-dimensional column vector and <math>x \in \mathbb{R}^d</math> (a vector in <math>\,d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate with a linear method (here <math>\,v</math> is taken to be diagonal, so the quadratic part contributes only the squared terms <math>\,v_i x_i^2</math>).<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
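A small end-to-end sketch of the trick (Python/NumPy on synthetic ring-shaped data, with a hand-rolled two-class LDA using the pooled within-class covariance, so this is an illustration rather than the Matlab <code>classify</code> call used elsewhere in these notes): appending squared features lets a linear classifier realize a quadratic boundary in the original space.<br />

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic two-class data a straight line separates poorly:
# class 0 fills a disk, class 1 is a ring around it
r0, t0 = rng.uniform(0, 1, 200), rng.uniform(0, 2 * np.pi, 200)
r1, t1 = rng.uniform(2, 3, 200), rng.uniform(0, 2 * np.pi, 200)
X = np.vstack([np.c_[r0 * np.cos(t0), r0 * np.sin(t0)],
               np.c_[r1 * np.cos(t1), r1 * np.sin(t1)]])
y = np.r_[np.zeros(200), np.ones(200)]

def augment(X):
    return np.hstack([X, X ** 2])   # x* = [x, x^2]: append squared features

def lda_fit_predict(Xtr, y, Xte):
    # plain two-class LDA: pooled within-class covariance, midpoint threshold
    m0, m1 = Xtr[y == 0].mean(0), Xtr[y == 1].mean(0)
    Sw = np.cov(Xtr[y == 0].T) + np.cov(Xtr[y == 1].T)
    w = np.linalg.solve(Sw, m1 - m0)
    thresh = w @ (m0 + m1) / 2
    return (Xte @ w > thresh).astype(int)

acc_lin = np.mean(lda_fit_predict(X, y, X) == y)
acc_aug = np.mean(lda_fit_predict(augment(X), y, augment(X)) == y)
print(acc_lin, acc_aug)   # the augmented (quadratic) LDA separates far better
```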
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS) # provides mvrnorm and lda<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, and each class may have a different size. To separate the two classes we must determine, for a given point, the class whose mean is closest, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices (assuming each class covariance has full rank), so it is positive-definite and invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars,<br><br />
so <math>\underline{w}</math> is proportional to <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />
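This proportionality is easy to check numerically (a Python/NumPy sketch with synthetic Gaussian classes, matching the parameters of the Matlab example below): the top eigenvector of <math>S_{w}^{-1}\ S_{B}</math> is parallel to <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(4)
cov = [[1.0, 1.5], [1.5, 3.0]]
X1 = rng.multivariate_normal([1, 1], cov, 200)
X2 = rng.multivariate_normal([5, 3], cov, 200)

mu1, mu2 = X1.mean(0), X2.mean(0)
Sb = np.outer(mu1 - mu2, mu1 - mu2)    # between-class covariance S_B
Sw = np.cov(X1.T) + np.cov(X2.T)       # within-class covariance S_W

# eigenvector route: eigenvector of S_W^{-1} S_B with the largest eigenvalue
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
w_eig = vecs[:, np.argmax(vals.real)].real

# closed form: w proportional to S_W^{-1} (mu1 - mu2)
w_closed = np.linalg.solve(Sw, mu1 - mu2)

# |cosine| between the two directions is 1 (equal up to scale and sign)
cos = abs(w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
print(np.isclose(cos, 1.0))  # True
```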
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> is the (unnormalized) scatter matrix of class <math>i</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (The <math>\frac{1}{n_{i}}</math> normalization is dropped here so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant; since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is another way to derive a general form for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> decomposes into the within class covariance <math>\mathbf{S}_{W}</math><br />
plus this second term, we take the second term to be<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, and thus obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
<br />
Recall that in the two class problem we had<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>. Substituting these into the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
Thus, in the two class case, the general <math>\mathbf{S}_{B}</math> is simply a scalar multiple of <math>\mathbf{S}_{B^{\ast}}</math>; the two definitions agree up to the factor <math>\frac{n_{1}n_{2}}{n}</math>.<br />
<br />
Now we seek the optimal transformation. For each observation we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
\qquad i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues of<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
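As a quick numeric check of this identity (a NumPy sketch, not part of the original notes; variable names are our own):<br />

```python
import numpy as np

# verify ||X||^2 = Tr(X^T X), i.e. the squared Frobenius norm
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
lhs = np.sum(X ** 2)          # ||X||^2, summing all squared entries
rhs = np.trace(X.T @ X)       # Tr(X^T X)
```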
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
As in the two class problem, we solve:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem, a Lagrange multiplier <math>\Lambda</math>, which is a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
In fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \le k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues of<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
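The whole multi-class FDA procedure above can be sketched directly from its definitions. The following is an illustrative NumPy sketch (not part of the original Matlab examples; the function and variable names are our own):<br />

```python
import numpy as np

def fda(X, y, k):
    """Fisher discriminant projection onto k-1 dimensions.

    X : (n, d) data matrix, rows are observations.
    y : (n,) integer class labels in {0, ..., k-1}.
    Returns W, a (d, k-1) matrix whose columns are the eigenvectors of
    S_W^{-1} S_B with the k-1 largest eigenvalues.
    """
    n, d = X.shape
    mu = X.mean(axis=0)                   # total mean vector
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for i in range(k):
        Xi = X[y == i]
        ni = len(Xi)
        mu_i = Xi.mean(axis=0)
        S_W += (Xi - mu_i).T @ (Xi - mu_i) / ni     # within class term
        S_B += ni * np.outer(mu_i - mu, mu_i - mu)  # between class term
    # eigenvectors of S_W^{-1} S_B, sorted by decreasing eigenvalue
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(evals.real)[::-1]
    return evecs[:, order[:k - 1]].real
```

Projecting the data is then `Z = X @ W`, matching <math>\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i}</math> for each row.<br />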
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given training data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that the vectors <math>\mathbf{x}_{i}</math> could be numerical inputs, transformations of the original data, e.g. <math>\log \mathbf{x}_{i}</math> or <math>\sin \mathbf{x}_{i}</math>, or basis expansions, e.g. <math>\mathbf{x}_{i}^{2}</math> or <math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as an <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Setting the first derivative to zero,<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
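The closed-form solution and fitted values above can be sketched as follows (a NumPy illustration, not part of the original Matlab examples; names are our own):<br />

```python
import numpy as np

def least_squares(X, y):
    """Closed-form least squares: beta_hat = (X^T X)^{-1} X^T y.

    X : (n, d+1) design matrix with a leading column of ones.
    y : (n,) vector of outputs.
    """
    # solve the normal equations rather than forming the explicit inverse
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    y_hat = X @ beta_hat      # equals H y, with H the hat matrix
    return beta_hat, y_hat

# tiny example: the line y = 1 + 2x is recovered exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
beta_hat, y_hat = least_squares(X, np.array([1.0, 3.0, 5.0, 7.0]))
```

Using `np.linalg.solve` on the normal equations is numerically preferable to computing <math>(\mathbf{X}^{T}\mathbf{X})^{-1}</math> directly.<br />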
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample'; ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, giving a 3-by-400 matrix.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame|The figure shows the classification of the data points in 2_3.m by the linear regression model.]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of [http://en.wikipedia.org/wiki/Generalized_linear_model Generalized Linear Model (GLM)].<br />
<br />
===The Logistic Regression Model===<br />
The logistic regression model for the two class case is defined as<br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br />
<br />
Then we have that<br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
:<math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
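These two posteriors are complements of each other, which a small NumPy sketch (illustrative only; names are our own) makes easy to check:<br />

```python
import numpy as np

def p1(beta, x):
    """P(Y=1 | X=x) = exp(beta^T x) / (1 + exp(beta^T x))."""
    t = np.exp(beta @ x)
    return t / (1 + t)

def p0(beta, x):
    """P(Y=0 | X=x) = 1 / (1 + exp(beta^T x))."""
    return 1 / (1 + np.exp(beta @ x))
```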
<br />
===Fitting a Logistic Regression===<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, taking the log of both sides, we get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
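The simplified final form can be checked numerically against the direct Bernoulli log-likelihood. The following NumPy sketch (illustrative; not from the original notes, and the names are our own) evaluates both:<br />

```python
import numpy as np

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ].

    X : (n, d) matrix with rows x_i, y : (n,) labels in {0, 1}.
    """
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

# direct form: sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
beta = np.array([0.5, -1.0])
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
p = 1 / (1 + np.exp(-(X @ beta)))
direct = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```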
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
The same second derivative can be obtained by first reducing the occurrences of <math>\underline{\beta}</math> using the identity <math>\frac{a}{1+a}=1-\frac{1}{1+a}</math>,<br />
and then differentiating <math>\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right]</math> with respect to <math>\underline{\beta}^T</math>.<br />
<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares solves the minimization <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
whose solution is <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^{T}W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
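The WLS estimator can be sketched in a few lines. This NumPy illustration (not from the original notes; names are our own) uses rows as observations, so the same formula reads <math>(\mathbf{X}^{T}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{W}\mathbf{y}</math> in that convention:<br />

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares: minimizes sum_i w_i (y_i - x_i^T beta)^2.

    X : (n, d) design matrix (rows are observations).
    y : (n,) responses, w : (n,) positive weights.
    """
    XtW = X.T * w                      # X^T W: scale each observation by w_i
    return np.linalg.solve(XtW @ X, XtW @ y)
```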
<br />
We then perform a weighted linear regression on the iteratively recomputed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:'''Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model like <math>\underline{\beta}^T\underline{x}</math>. If we construct the model like <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then similar to linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, convergence is not guaranteed in general. The procedure will usually converge, since the log-likelihood function is concave; when it does not, only local convergence of the method can be proved, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, it is rare for an initial value to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
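The pseudo code above translates almost line for line into the following NumPy sketch (illustrative only, using the same <math>d\times n</math> convention for <math>X</math> as the lecture; names are our own):<br />

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=100):
    """Newton-Raphson / iteratively reweighted least squares for
    logistic regression, following the pseudo code above.

    X : (d, n) input matrix (columns are observations), y : (n,) labels.
    """
    d, n = X.shape
    beta = np.zeros(d)                              # step 1
    for _ in range(max_iter):
        p = 1 / (1 + np.exp(-(X.T @ beta)))         # step 3: P(Y=1 | x_i)
        w = p * (1 - p)                             # step 4: diagonal of W
        z = X.T @ beta + (y - p) / w                # step 5: adjusted response
        beta_new = np.linalg.solve((X * w) @ X.T, X @ (w * z))  # step 6
        if np.max(np.abs(beta_new - beta)) < tol:   # step 7
            return beta_new
        beta = beta_new
    return beta
```

Note `(X * w) @ X.T` forms <math>XWX^T</math> without building the <math>n\times n</math> diagonal matrix explicitly.<br />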
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and logistic regression with an example. Again, we apply them to the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to apply logistic regression to classify the data. This function returns B, which is a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is the decision boundary given by logistic regression. The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K classes. The model is specified by K - 1 equations, where the reference class K in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem since we don't have the same simplification.<br />
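The K-class posteriors above can be sketched as follows (a minimal Python illustration; the coefficient vectors in the demo call are hypothetical toy values, not fitted estimates):

```python
import math

def multiclass_posteriors(x, betas):
    """Posteriors P(Y=i|X=x) for a K-class logistic model.

    betas: list of K-1 coefficient vectors (intercept first); class K
    is the reference class with implicit coefficients of zero.
    x: feature vector WITHOUT the leading 1 (it is prepended here).
    """
    xa = [1.0] + list(x)
    scores = [sum(b * xi for b, xi in zip(beta, xa)) for beta in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]   # classes 1..K-1
    probs.append(1.0 / denom)                       # reference class K
    return probs

# K = 3 classes, so two (hypothetical) coefficient vectors of length d+1 = 3.
p = multiclass_posteriors([0.5, -1.0], [[0.1, 1.0, 0.0], [0.2, -0.5, 0.3]])
```

By construction the K probabilities sum to 1, even though no pair of them is complementary.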
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem is not convex and has no unique global minimum: the algorithm does not converge to a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are linearly separable, then the algorithm is guaranteed to converge to some separating hyperplane; the proof of this is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
<br />
====A Perceptron Example====<br />
<br />
The perceptron network can figure out the decision boundary even if we don't know how to draw the line; we just have to give it some examples first. For example:<br />
{| class="wikitable"<br />
|-<br />
! Features: x1, x2, x3<br />
<br />
! Answer<br />
|-<br />
| 1,0,0<br />
| +1<br />
|-<br />
| 1,0,1<br />
| +1<br />
|-<br />
| 1,1,0<br />
| +1<br />
|-<br />
| 0,0,1<br />
| -1<br />
|-<br />
| 0,1,1<br />
| -1<br />
|-<br />
| 1,1,1<br />
| -1<br />
|}<br />
Then the perceptron starts out not knowing how to separate the answers, so it guesses. For example, we input 1,0,0 and it guesses -1, but the right answer is +1, so the perceptron adjusts its line and we try the next example. Eventually the perceptron will get all the answers right.<br />
<br />
y=[1;1;1;-1;-1;-1];<br />
x=[1,0,0;1,0,1;1,1,0;0,0,1;0,1,1;1,1,1]';<br />
b_0=0;<br />
b=[1,1,1]';<br />
rho=.1;<br />
while 1<br />
changed=0;<br />
for i=1:6<br />
d=(b'*x(:,i)+b_0)*y(i); % negative when point i is misclassified<br />
if d<=0<br />
b=b+rho*y(i)*x(:,i); % gradient-descent update of the boundary<br />
b_0=b_0+rho*y(i);<br />
changed=1;<br />
end<br />
end<br />
if changed==0 % a full pass with no updates: converged<br />
break;<br />
end<br />
end<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>\,x_0</math> corresponds to the model intercept and <math>x_{1},\ldots,x_{d}</math> represent the feature data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of some weights of these inputs, and <math>I(\sum_{j=0}^d \beta_{j}x_{j})</math>, where <math>\,I</math> indicates the sign of the expression, returns the label of the data point. <br />
<br />
<br />
The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The algorithm begins with an arbitrary hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0} </math> (an initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points, as illustrated in Figure 2. It attempts to find the optimal <math>\underline\beta</math> by iteratively adjusting the decision boundary until all points are on the correct side of the boundary, and terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
'''Derivation''':'' The distance between the decision boundary and misclassified points''. <br /><br /><br />
<br />
If <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary then,<br /><br /> <br />
<br />
:<math><br />
\begin{align}<br />
\underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\<br />
\underline{\beta}^T (x_{1}-x_{2})&=0<br />
\end{align}<br />
</math><br />
<br />
<math>\underline{\beta}^T (x_{1}-x_{2})</math> denotes an inner product. Since the inner product is 0 and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary, <math>\underline{\beta}</math> is orthogonal to the decision boundary. <br /><br /><br />
<br />
Let <math>\underline{x_{i}}</math> be a misclassified point. <br /><br /> <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}/\|\underline{\beta}\|</math>; since rescaling <math>\underline{\beta}</math> does not change the boundary, we may take <math>\|\underline{\beta}\|=1</math>, so this is simply <math>\underline{\beta}^T\underline{x_{i}}</math>. <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that takes a step of predetermined size in the direction of the negative gradient (the direction of steepest descent), getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed: <br />
<br />
:<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
:<math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
= <br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix}<br />
+\rho <br />
\begin{pmatrix} <br />
y_i \underline{x_i}\\ <br />
y_i<br />
\end{pmatrix}<br />
</math><br />
where <math>\displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate". The algorithm continues until <math><br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{new}}\\ <br />
\underline{\beta_0}^{\mathrm{new}}<br />
\end{pmatrix}<br />
=<br />
\begin{pmatrix} <br />
\underline{\beta}^{\mathrm{old}}\\ <br />
\underline{\beta_0}^{\mathrm{old}}<br />
\end{pmatrix} </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points. <br />
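The algorithm above can be sketched as stand-alone code (a minimal Python version of the update rule, run on the toy data from the earlier perceptron example; the cap on the number of passes is a practical safeguard, not part of the algorithm):

```python
# Toy data from the earlier table, with labels in {+1, -1}.
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
y = [1, 1, 1, -1, -1, -1]

beta = [0.0, 0.0, 0.0]   # initial guess for the hyperplane normal
beta0 = 0.0              # initial guess for the intercept
rho = 0.1                # learning rate

for _ in range(1000):                # cap the number of passes over the data
    changed = False
    for xi, yi in zip(X, y):
        score = sum(b * v for b, v in zip(beta, xi)) + beta0
        if yi * score <= 0:          # misclassified (or on the boundary)
            # Gradient-descent update derived above: step along y_i * x_i.
            beta = [b + rho * yi * v for b, v in zip(beta, xi)]
            beta0 += rho * yi
            changed = True
    if not changed:                  # a full pass with no updates: converged
        break
```

Because this toy data set is linearly separable, the perceptron convergence theorem guarantees the loop exits with every point strictly on the correct side of the boundary.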
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to the Perceptron problem, all of which are separating hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.<br />
#The [http://annet.eeng.nuim.ie/intro/course/chpt2/convergence.shtml perceptron convergence theorem] states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref><br />
Pattern Recognition and Machine Learning,Christopher M. Bishop,194<br />
</ref>.<br />
*'''Comment on gradient descent algorithm'''<br />
Consider yourself standing on a peak, wanting to reach the ground as fast as possible. Which direction should you step? Intuitively, it should be the direction in which the height decreases fastest, which is given by the negative gradient. However, if the terrain has several valleys and you start in an unlucky spot, you may walk down into a local minimum and get stuck there.<br />
In addition, note that in the final form of our gradient descent algorithm, we drop the summation over i (all data points). This is actually a variant of the original algorithm (sometimes called batch gradient descent) known as stochastic gradient descent, in which we approximate the true gradient by evaluating it on a single training example, so that <math>{\beta}</math> is improved by the computation of only one sample at a time. When the data set is large, say a population database, summing over millions of samples at each step is very time-consuming. With stochastic gradient descent we can treat the problem sample by sample and still get decent results in practice.<br />
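The contrast between the batch and stochastic updates can be sketched as follows (a minimal Python illustration; the helper names and toy numbers are hypothetical):

```python
# Batch vs. stochastic gradient steps for the perceptron criterion phi.
# M is the index set of misclassified points.

def batch_step(beta, beta0, X, y, M, rho):
    """One batch step: sum the gradient contributions of every i in M."""
    grad = [rho * sum(y[i] * X[i][j] for i in M) for j in range(len(beta))]
    beta = [b + g for b, g in zip(beta, grad)]
    beta0 += rho * sum(y[i] for i in M)
    return beta, beta0

def stochastic_step(beta, beta0, xi, yi, rho):
    """One stochastic step: update from a single misclassified sample."""
    beta = [b + rho * yi * v for b, v in zip(beta, xi)]
    beta0 += rho * yi
    return beta, beta0

# One stochastic update from a single misclassified point (toy numbers).
s_beta, s_beta0 = stochastic_step([0.0, 0.0], 0.0, (1.0, 2.0), 1, 0.1)
# One batch update over two misclassified points (toy numbers).
b_beta, b_beta0 = batch_step([0.0, 0.0], 0.0, [(1.0, 0.0), (0.0, 1.0)],
                             [1, -1], {0, 1}, 0.1)
```

Both steps move the boundary in the same direction on average; the stochastic version simply spreads the work over individual samples.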
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
<br />
<br />
<br />
<br />
<br />
==Neural Networks (NN) (Lecture October 28, 2009)==<br />
<br />
==Notes==<br />
<references/></div>
Ipargaru
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=statf09841Proposal&diff=4472
statf09841Proposal
2009-10-26T19:05:02Z
<p>Ipargaru: /* By: Maseeh Ghodsi, Soroush Ghodsi and Ali Ghodsi */</p>
<hr />
<div>''' Use the following format for your proposal (maximum one page)'''<br />
==Project 1 : How to Make a Birdhouse ==<br />
</noinclude><br />
===By: Maseeh Ghodsi, Soroush Ghodsi and Ali Ghodsi===<br />
Write your proposal here<br />
<noinclude><br />
<br />
==Project 1 : Recognizing Cheaters in Multi-Player Online Game Environment ==<br />
</noinclude><br />
===By: Mark Stuart, Mathieu Zerter, Iulia Pargaru ===<br />
<br />
Multiplayer online games constitute a very large market in the entertainment industry that generates billions in revenue.<ref><br />
S. F. Yeung, John C. S. Lui, Jianchuan Liu, Jeff Yan, Detecting Cheaters for Multiplayer Games: Theory, Design, and Implementation<br />
</ref> Multiplayer on-line games are games in which players use characters to perform specific actions and interact with other characters. The number of online game users is rapidly increasing. Computer play-programs are often used to automatically perform actions on behalf of a human player. This type of cheating gives the player an unfair advantage, abuses resources, disrupts other players’ gaming experience, and can even harm servers.<ref>Hyungil Kim, Sungwoo Hong, Juntae Kim, Detection of Auto Programs for MMORPGs</ref> Computer play-programs usually have a specific goal or a task that is repeated often. We suspect that sequences of events and actions created by play-programs are statistically different from the sequences of events generated by a human player. We will be using an on-line game called Tibia, created by CIPSoft, as a case study. <br />
<br />
We have recruited volunteers who agreed to provide us with their gaming information. We are gathering and parsing packets sent by the user to the game server that contain detailed information about the actions performed by the user. The original data consist of: User ID, length of event, time of event, action type, action details, cheating (0 or 1).<br />
The sequences of events produced by humans and by the play-programs will be transformed into a set of features to reveal additional information, such as the periodicity of events, common sequential actions, rare events or actions not performed often, and a measure of the complexity of an action. Various algorithms will be applied to classify the data represented by the set of available attributes. Some similar studies suggest that the following methods perform an effective classification of human vs. machine in an on-line game environment:<br />
*Dynamic Bayesian Network<br />
*Isomap<br />
*Decision Tree <br />
*Artificial Neural Network<br />
*Support Vector Machines<br />
*K nearest neighbours<br />
*Naive Bayesian <br />
<br />
We intend to find a classification algorithm that detects in-game cheating in the on-line game Tibia with reasonable accuracy. <br />
<noinclude></div>
Ipargaru
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4437
stat841
2009-10-24T22:16:17Z
<p>Ipargaru: /* Find \underline{\beta} */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classfication-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data-mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification, we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{color}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:The ''''true error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:The ''''empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
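The empirical error rate can be sketched as stand-alone code (a minimal Python illustration; the rule h and the data are hypothetical toy values):

```python
# Empirical (training) error rate: the fraction of training points
# that a classification rule h labels incorrectly.
def empirical_error_rate(h, X, Y):
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(X)

# Hypothetical toy rule: label 1 when the first feature is positive.
h = lambda x: 1 if x[0] > 0 else 0
X = [(2.0,), (-1.0,), (0.5,), (-3.0,)]
Y = [1, 0, 0, 0]

rate = empirical_error_rate(h, X, Y)   # only (0.5,) is mislabeled
```

Here one of four points is misclassified, so the empirical error rate is 1/4.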
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via the Bayes formula, and then to assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximal over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use the ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the probability <math>\,r(X)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by the ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality Theorem''': The Bayes rule is optimal in true error rate, that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says that we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of it being of any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
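The computation above can be sketched as stand-alone code (a minimal Python illustration; the likelihoods 0.05 and 0.2 are backed out from the quoted numerator 0.025 and evidence 0.125, with both priors equal to 0.5):

```python
# Two-class Bayes rule for the student example above. The likelihoods
# below are inferred from the quoted numerator (0.025) and evidence
# (0.125), assuming equal priors of 0.5.
prior_pass, prior_fail = 0.5, 0.5
lik_pass = 0.05   # P(X=(0,1,0) | Y=1), so that 0.05 * 0.5 = 0.025
lik_fail = 0.20   # P(X=(0,1,0) | Y=0), so that the evidence is 0.125

evidence = lik_pass * prior_pass + lik_fail * prior_fail
r = lik_pass * prior_pass / evidence        # P(Y=1 | X=(0,1,0))
prediction = 1 if r > 0.5 else 0            # Bayes rule
```

The posterior r comes out to 0.2 < 1/2, so the rule predicts class 0 (fail), matching the calculation above.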
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods. The reason is that in the Bayes equation discussed above, the quantities <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math> are generally unknown, so the value of <math>\,r(X)</math> cannot be calculated, which makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the Naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as an objective quantity. In fact, they represent two different schools of thought in statistics.<br />
<br />
Historically, statistics has had two major schools of thought: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single event. For example, a frequentist cannot predict the weather of tomorrow because tomorrow is only one unique event, and cannot be referred to a frequency in a lot of samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist method, on the other hand, one does not see the man (the object), but can look at many photos (labels) of this man to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although it represents the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability and class conditional density are usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\mathcal{H}</math> and find <math>\,h^* \in \mathcal{H}</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation doesn't work very well in dimensions greater than 2.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are unknown, so some estimate of them must be made if we want to classify new data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the class covariances <math>\Sigma_k \,\forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> since the covariance matrices are equal under the LDA assumption, the normalizing constants cancel.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in <math>\,x</math> with general form <math>\,a^\top x + b = 0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(G=k|X=x)=Pr(G=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>p</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the numbers of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
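As a quick check of this result, the boundary coefficients <math>\,a</math> and <math>\,b</math> can be computed directly from the class parameters. The following is a minimal sketch in Python (rather than the Matlab used later in these notes), with illustrative means, covariance, and priors; <code>lda_boundary</code> is a hypothetical helper written for this page, not part of any course code.<br />

```python
import math

def inv2(S):
    # inverse of a 2x2 matrix
    det = S[0][0]*S[1][1] - S[0][1]*S[1][0]
    return [[ S[1][1]/det, -S[0][1]/det],
            [-S[1][0]/det,  S[0][0]/det]]

def matvec(S, v):
    return [S[0][0]*v[0] + S[0][1]*v[1],
            S[1][0]*v[0] + S[1][1]*v[1]]

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1]

def lda_boundary(mu_k, mu_l, Sigma, pi_k, pi_l):
    """Coefficients (a, b) of the LDA boundary a^T x + b = 0,
    read off from the derivation above:
    a = Sigma^{-1}(mu_k - mu_l),
    b = log(pi_k/pi_l) - (1/2)(mu_k^T Sigma^{-1} mu_k - mu_l^T Sigma^{-1} mu_l)."""
    Si = inv2(Sigma)
    a = matvec(Si, [mu_k[0]-mu_l[0], mu_k[1]-mu_l[1]])
    b = math.log(pi_k/pi_l) - 0.5*(dot(mu_k, matvec(Si, mu_k))
                                   - dot(mu_l, matvec(Si, mu_l)))
    return a, b

# equal priors, identity covariance: boundary is the perpendicular
# bisector of the segment joining the two means
a, b = lda_boundary([0.0, 0.0], [2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5, 0.5)
mid = [1.0, 0.0]               # midpoint of the means
print(dot(a, mid) + b)         # 0.0 -- the midpoint lies on the boundary
```

This also illustrates the special case above: with equal priors and a common spherical covariance, the midpoint of the means satisfies the boundary equation exactly.<br />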
<br />
===QDA===<br />
The concept is the same: find the boundary where the classification error rates between classes are equal, except that the assumption of a common covariance matrix <math>\,\Sigma</math> is dropped; each class retains its own covariance <math>\,\Sigma_k</math>.<br />
<br />
<br />
Continuing from the point where QDA diverges from LDA:<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes, with general form <math>\,x^\top a x + b^\top x + c = 0</math>.<br />
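To see the curved boundary concretely, the quadratic discriminant score from the derivation above can be evaluated for two classes with different covariances. This is a Python sketch (not the course Matlab) with made-up parameters: two classes sharing a mean but with different spreads, so the resulting boundary is a circle rather than a line.<br />

```python
import math

def inv2(S):
    det = S[0][0]*S[1][1] - S[0][1]*S[1][0]
    return [[ S[1][1]/det, -S[0][1]/det],
            [-S[1][0]/det,  S[0][0]/det]]

def det2(S):
    return S[0][0]*S[1][1] - S[0][1]*S[1][0]

def delta(x, mu, Sigma, pi):
    """Quadratic discriminant score
    -1/2 log|Sigma| - 1/2 (x-mu)^T Sigma^{-1} (x-mu) + log(pi)."""
    Si = inv2(Sigma)
    d = [x[0]-mu[0], x[1]-mu[1]]
    quad = d[0]*(Si[0][0]*d[0] + Si[0][1]*d[1]) \
         + d[1]*(Si[1][0]*d[0] + Si[1][1]*d[1])
    return -0.5*math.log(det2(Sigma)) - 0.5*quad + math.log(pi)

# two classes with the same mean but different covariances
mu_k, S_k = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
mu_l, S_l = [0.0, 0.0], [[4.0, 0.0], [0.0, 4.0]]   # wider spread

# near the common mean the tight class wins; far away the wide class wins,
# so the decision boundary between them is a circle, not a line
print(delta([0.5, 0.0], mu_k, S_k, 0.5) > delta([0.5, 0.0], mu_l, S_l, 0.5))  # True
print(delta([4.0, 0.0], mu_k, S_k, 0.5) > delta([4.0, 0.0], mu_l, S_l, 0.5))  # False
```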
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,k\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is:<br />
<br />
<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
<br />
where <br />
<br />
:::<math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes:<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{k}n_r\hat{\Sigma}_r}{\sum_{r=1}^{k}n_r} </math><br />
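The estimates above are simple to compute from labelled data. Below is a minimal Python sketch (the notes themselves use Matlab) for 2-D points; <code>estimate_params</code> is a hypothetical helper, and the tiny data set is made up for illustration.<br />

```python
def estimate_params(X, y, k):
    """Sample estimates pi_k, mu_k, Sigma_k for class k (2-D points)."""
    pts = [x for x, yi in zip(X, y) if yi == k]
    n_k, n = len(pts), len(X)
    pi = n_k / n
    mu = [sum(p[0] for p in pts)/n_k, sum(p[1] for p in pts)/n_k]
    # ML covariance (divide by n_k, not n_k - 1, matching the notes)
    S = [[0.0, 0.0], [0.0, 0.0]]
    for p in pts:
        d = [p[0]-mu[0], p[1]-mu[1]]
        for i in range(2):
            for j in range(2):
                S[i][j] += d[i]*d[j]/n_k
    return pi, mu, S

X = [[0, 0], [2, 0], [0, 2], [2, 2], [10, 10], [12, 12]]
y = [1, 1, 1, 1, 2, 2]
pi1, mu1, S1 = estimate_params(X, y, 1)
print(pi1, mu1, S1)   # 4/6 of the points are class 1, centered at (1, 1)
```

The pooled covariance for LDA is then the <math>\,n_k</math>-weighted average of the per-class <math>\,\hat{\Sigma}_k</math>, as in the formula above.<br />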
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class with the minimum prior-adjusted distance maximises <math>\,\delta_k</math>, so according to the theorem we classify the point to that class <math>\,k</math>. <br />
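This nearest-mean rule is easy to sketch. The Python snippet below (illustrative only; <code>classify_spherical</code> is a name invented for this page) picks the class maximising <math>\,\delta_k</math> when <math>\, \Sigma_k = I </math>.<br />

```python
import math

def classify_spherical(x, means, priors):
    """With Sigma_k = I, maximise delta_k(x) = -1/2 ||x - mu_k||^2 + log(pi_k),
    i.e. pick the class with the smallest prior-adjusted squared distance."""
    best, best_score = None, None
    for k, (mu, pi) in enumerate(zip(means, priors)):
        sq = sum((xi - mi)**2 for xi, mi in zip(x, mu))
        score = -0.5*sq + math.log(pi)
        if best_score is None or score > best_score:
            best, best_score = k, score
    return best

means = [[0.0, 0.0], [4.0, 0.0]]
print(classify_spherical([1.0, 0.0], means, [0.5, 0.5]))   # 0: closer to first mean
print(classify_spherical([2.1, 0.0], means, [0.5, 0.5]))   # 1: just past the midpoint
```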
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (in the SVD of a matrix <math>\, X</math>, the columns of <math>\, U</math> are eigenvectors of <math>\, XX^\top </math> and the columns of <math>\, V</math> are eigenvectors of <math>\, X^\top X</math>; when <math>\, X</math> is symmetric these coincide, so <math>\, U=V</math>, and here <math>\, \Sigma_k </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math> \, (x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^\top US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
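The whitening transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be sketched for the simple special case of a diagonal covariance, where <math>\,U = I</math> and <math>\,S</math> holds the variances (a general <math>\,\Sigma</math> would need an eigendecomposition). The Python snippet below (illustrative numbers only) verifies that Euclidean distance in the transformed space equals the Mahalanobis-style distance <math> \, (x-\mu)^\top\Sigma^{-1}(x-\mu) </math> in the original space.<br />

```python
import math

# diagonal covariance: U = I, S = diag(variances)
Sigma = [[4.0, 0.0], [0.0, 1.0]]

def whiten(x):
    # x* = S^(-1/2) U^T x, with U = I here
    return [x[0]/math.sqrt(Sigma[0][0]), x[1]/math.sqrt(Sigma[1][1])]

def mahalanobis_sq(x, mu):
    d = [x[0]-mu[0], x[1]-mu[1]]
    return d[0]**2/Sigma[0][0] + d[1]**2/Sigma[1][1]

x, mu = [2.0, 3.0], [0.0, 1.0]
xs, ms = whiten(x), whiten(mu)
euclid_sq = (xs[0]-ms[0])**2 + (xs[1]-ms[1])**2
print(euclid_sq, mahalanobis_sq(x, mu))   # both 5.0
```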
<br />
Note that when we have multiple classes, they must all use the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to in order to choose its transformation. All classes therefore need to have the same shape (the same covariance) for this method of classification to be applicable, which is exactly the LDA setting.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Consider two classes with different shapes, and imagine transforming them to the same shape. Given a data point, which transformation should we use to decide its class? If we use the transformation of class A, then we have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: Since we only need to compare a given class against each of the remaining <math>\,K-1</math> classes, there are <math>\,K-1</math> differences in total. For each of them, <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
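These two counts can be tabulated for a few dimensions to see how quickly QDA's parameter count grows. A small Python sketch (helper names invented for this page):<br />

```python
def lda_params(K, d):
    # (K-1) boundaries, each a^T x + b with d + 1 parameters
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # per boundary: d(d+1)/2 quadratic terms + d linear terms + 1 constant
    # = d(d+3)/2 + 1, matching the count above
    return (K - 1) * (d * (d + 3) // 2 + 1)

for d in (2, 10, 64):
    print(d, lda_params(2, d), qda_params(2, d))
```

For two classes in 64 dimensions (the raw 2_3 digit data), LDA needs 65 parameters while QDA needs 2145, which is why QDA is far less robust for high-dimensional data.<br />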
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform principal component analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function, but here we analyze its code to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations, and columns to variables. When using <code>princomp</code> on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another covariance matrix <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
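The heart of the trick is that a linear function of the augmented features is a quadratic function of the original features. A small Python sketch (illustrative weights; not the course's Matlab code) makes the identity explicit for a diagonal quadratic term <math>\,v</math>.<br />

```python
def augment(x):
    """Map x to x* = [x_1, ..., x_d, x_1^2, ..., x_d^2]."""
    return x + [xi**2 for xi in x]

# a linear rule w*^T x* in the augmented space ...
w_star = [1.0, -2.0, 0.5, 0.5]          # [w_1, w_2, v_1, v_2], illustrative

def g_linear_in_star(x):
    xs = augment(x)
    return sum(w*xi for w, xi in zip(w_star, xs))

# ... equals the quadratic w^T x + x^T diag(v) x in the original space
def g_quadratic_in_x(x):
    return 1.0*x[0] - 2.0*x[1] + 0.5*x[0]**2 + 0.5*x[1]**2

x = [3.0, -1.0]
print(g_linear_in_star(x), g_quadratic_in_x(x))   # both 10.0
```

Running LDA on the augmented features therefore learns <math>\,\underline{w}^*</math> linearly, while the induced decision boundary in the original space is quadratic.<br />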
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to one of two classes. Intuitively speaking, the points of each class form a cloud around the class mean, with each cloud having a possibly different size. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points into a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
The covariance of class 1 is <math>\,\Sigma_{1}</math> and the covariance of class 2 is <math>\,\Sigma_{2}</math>.<br />
So the variances of the projected points are <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math> respectively.<br />
<br />
Summing these two quantities gives<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices; assuming at least one of them is positive definite, <math>\,S_{W}</math> is positive definite and therefore invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the vector <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
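This proportionality can be checked numerically. Below is a minimal NumPy sketch (not part of the lecture; the two synthetic classes reuse the means and covariance of the Matlab example as illustrative assumptions) verifying that <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> is the top eigenvector of <math>S_{W}^{-1}S_{B}</math>:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic 2-D classes (illustrative parameters).
mu1, mu2 = np.array([1.0, 1.0]), np.array([5.0, 3.0])
cov = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal(mu1, cov, 300)
X2 = rng.multivariate_normal(mu2, cov, 300)

# Within-class covariance S_W and between-class covariance S_B.
S_W = np.cov(X1.T) + np.cov(X2.T)
d = X1.mean(axis=0) - X2.mean(axis=0)
S_B = np.outer(d, d)

# FDA direction: w proportional to S_W^{-1}(mu_1 - mu_2).
w = np.linalg.solve(S_W, d)

# Check: w coincides (up to sign) with the top eigenvector of S_W^{-1} S_B.
vals, vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
top = vecs[:, np.argmax(vals.real)].real
w_unit = w / np.linalg.norm(w)
top_unit = top / np.linalg.norm(top)
print(np.isclose(abs(w_unit @ top_unit), 1.0))
```

Here `np.linalg.solve` computes <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> without forming the inverse explicitly.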
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the optimal projection direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (The class scatter matrices are left unnormalized here, so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is constant. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get <math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general derivation of <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
The first term in this decomposition is the within class covariance <math>\mathbf{S}_{W}</math>,<br />
so we can denote the second term as the general between class covariance matrix <math>\mathbf{S}_{B}</math>; thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
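The decomposition of the total scatter into within and between class parts can be verified numerically on any labelled data set. A small NumPy check (the three-class data is synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 3-class data in R^2 (illustrative means).
Xs = [rng.normal(loc=m, size=(50, 2)) for m in ([0, 0], [3, 1], [1, 4])]
X = np.vstack(Xs)
mu = X.mean(axis=0)

# Within-class scatter: sum over classes of the class scatter matrices.
S_W = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in Xs)

# Between-class scatter: sum of n_i (mu_i - mu)(mu_i - mu)^T.
S_B = sum(len(Xi) * np.outer(Xi.mean(axis=0) - mu, Xi.mean(axis=0) - mu)
          for Xi in Xs)

# Total scatter.
S_T = (X - mu).T @ (X - mu)

print(np.allclose(S_T, S_W + S_B))  # True
```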
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Apparently, they are very similar.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this optimization problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\le k-1</math>.<br />
<br />
Therefore, the solution for this problem is the same as in the two class case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
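The whole multi-class procedure can be sketched in NumPy (the <math>k=3</math> class data and all parameters below are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
k, d, n_i = 3, 5, 100

# Synthetic k-class data in R^d with distinct class means (illustrative).
means = rng.normal(size=(k, d)) * 3
Xs = [rng.normal(size=(n_i, d)) + m for m in means]
X = np.vstack(Xs)
mu = X.mean(axis=0)

# Within-class and between-class scatter matrices.
S_W = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in Xs)
S_B = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in Xs)

# Columns of W: eigenvectors of S_W^{-1} S_B for the k-1 largest eigenvalues.
vals, vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(vals.real)[::-1]
W = vecs[:, order[:k - 1]].real      # d x (k-1) transformation matrix

Z = X @ W                            # projected points, n x (k-1)
print(Z.shape)                       # (300, 2)
print(int(np.sum(vals.real > 1e-6)))  # k-1 eigenvalues are nonzero
```

Note that only <math>k-1</math> of the <math>d</math> eigenvalues are (numerically) nonzero, matching the rank argument above.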
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that the vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, e.g. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, e.g. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Setting the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
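The closed-form solution and the hat matrix can be verified numerically. A minimal NumPy sketch (the regression data is synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Design matrix with an intercept column of ones (n = 50, d = 2).
n = 50
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T; fitted values y_hat = H y.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

print(np.allclose(y_hat, X @ beta_hat))  # fitted values agree
print(np.allclose(H @ H, H))             # H is idempotent
```

The hat matrix "puts the hat on" <math>\mathbf{y}</math>, and being a projection it is idempotent.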
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The code and an explanation of each step are given below.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, so that x is 3×400.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot each point, coloured according to whether its fitted value is above or below 0.5.<br />
<br />
[[File: linearregression.png|center|frame|The figure shows the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modeling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1.<br />
<br />
This logistic regression model for the two class case is defined as: <br/><br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br/> <br />
[[File:Picture1.png |frame|center]]<br />
Then we have that <math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
[[File:Picture2.png |frame|center]]<br />
<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=\underline{x}_i)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math><br />
* <math>P(Y=0 | X=\underline{x}_i)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)\,exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math> you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], it's a very useful website including a Matrix Reference Manual that you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted sum of squared errors<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>\displaystyle w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
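The WLS estimator above is equivalent to ordinary least squares on data rescaled by <math>\sqrt{w_{i}}</math>; a short NumPy check (the data and weights are synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic regression data with positive weights (illustrative).
n, d = 40, 3
X = rng.normal(size=(n, d))          # rows are x_i^T
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)    # w_i > 0

# WLS estimator: [sum w_i x_i x_i^T]^{-1} [sum w_i x_i y_i]
A = sum(w[i] * np.outer(X[i], X[i]) for i in range(n))
b = sum(w[i] * X[i] * y[i] for i in range(n))
beta_wls = np.linalg.solve(A, b)

# Equivalent: ordinary least squares on sqrt(w)-scaled data.
sw = np.sqrt(w)[:, None]
beta_check, *_ = np.linalg.lstsq(sw * X, sw.ravel() * y, rcond=None)

print(np.allclose(beta_wls, beta_check))  # True
```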
<br />
Each Newton-Raphson step can therefore be viewed as a weighted linear regression of the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, a <math>d\times{1}</math> vector, because the model is of the form <math>\underline{\beta}^T\underline{x}</math>. If instead we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> becomes a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case, but it does not guarantee convergence. The procedure usually converges because the log-likelihood is concave, but in general only local convergence can be proved: the iteration converges provided the initial point is close enough to the exact solution. In practice, finding an adequate initial value is rarely a problem, since an initial point far enough from the solution to break the iteration is uncommon. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> If it does occur, step-size halving resolves it. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
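The steps above can be sketched in Python with NumPy. This is an illustrative implementation, not course code: observations are rows of <code>X</code> (the transpose of the wiki's convention), and the example data, seed, and tolerance are hypothetical choices.

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=100):
    """Fit logistic regression by iteratively reweighted least squares.

    X : (n, d) design matrix (rows are observations).
    y : (n,) labels in {0, 1}.
    """
    n, d = X.shape
    beta = np.zeros(d)                       # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))  # step 3: P(x_i; beta)
        w = p * (1.0 - p)                    # step 4: diagonal of W
        z = X @ beta + (y - p) / w           # step 5: adjusted response Z
        # step 6: weighted least squares solve
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new                  # step 7: converged
        beta = beta_new
    return beta

# Hypothetical data with an intercept column and true beta = (-0.5, 2).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_beta = np.array([-0.5, 2.0])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)
beta_hat = irls_logistic(X, y)
print(beta_hat)  # estimate should be roughly near (-0.5, 2)
```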
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1, nor to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>; it is guaranteed to range from 0 to 1 and to sum to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and logistic regression with an example, again using the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to classify the data by logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem since we don't have the same simplification.<br />
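The K-class posteriors above can be computed directly from the <math>K-1</math> coefficient vectors. A minimal NumPy sketch with hypothetical parameters for <math>K=3</math> classes in <math>d=2</math> dimensions:

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """Posteriors for the K-class logistic model.

    betas : (K-1, d) array; row i holds beta_i for classes 1..K-1.
    x     : (d,) input. Class K is the reference class.
    Returns a length-K vector of P(Y=i | X=x).
    """
    scores = betas @ x                 # beta_i^T x for i = 1..K-1
    expo = np.exp(scores)
    denom = 1.0 + expo.sum()           # shared denominator
    return np.append(expo / denom, 1.0 / denom)  # last entry is class K

# Hypothetical coefficients.
betas = np.array([[1.0, -1.0],
                  [0.5, 0.5]])
p = multiclass_posteriors(betas, np.array([0.2, 0.4]))
print(p, p.sum())  # posteriors are positive and sum to 1
```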
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like least squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Because of the iterative nature of the solution, the problem has no unique global minimum, and the algorithm does not converge to a unique hyperplane. If the classes are separable, the algorithm is shown to converge to some separating hyperplane; the proof of this is known as the ''perceptron convergence theorem''. For overlapping classes, however, convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>x_{0}, x_{1},\ldots,x_{d}</math> represent the input data, <math>\sum_{j=1}^d \beta_{j}x_{j}</math> is a linear combination of these features with some weights, and <math>sgn(\sum_{j=1}^d \beta_{j}x_{j})</math> returns the sign of the linear combination. <br />
<br />
<br />
Perceptron seeks a linear separating function between two classes. Since it is linear, the decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0} = 0. </math> The Perceptron algorithm begins with a random hyperplane, and the goal is to minimize the distance between the decision boundary and the misclassified data points, as illustrated in Figure 2. It attempts to find <math>\underline\beta</math> by iteratively moving the decision boundary until all points are on the correct side, and it terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
*'''Derivation''' ''of the distance between the decision boundary and the misclassified points''. <br />
<br />
:Let <math>\underline{x_{i}}</math> be the misclassified point. <br />
<br />
:Assume <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary. <br />
<br />
:Then <math>\underline{\beta}^T\underline{x_{1}}+\beta_{0} = \underline{\beta}^T\underline{x_{2}}+\beta_{0}</math><br />
<br />
:which implies that <math>\underline{\beta}^T (x_{1}-x_{2})=0</math>.<br />
<br />
::Since <math> \underline{\beta}^T</math> is a vector and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary then <math>\underline{\beta}</math> is a vector orthogonal to the decision boundary. <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\|\underline{\beta}\|=1</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that takes a predetermined step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this approach is the possibility of getting stuck in a local minimum. Rosenblatt proposed a simple algorithm to overcome this problem. To continue, the following derivatives are needed: <br />
<br />
<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
<math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] + \rho [y_{i}\underline{x_{i}}\ \ y_{i}] </math> <br />
<br />
where <math> \displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate". The algorithm continues until <math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
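The update rule above can be sketched as a small Python implementation. Labels are taken in {-1, +1}; the data, learning rate, and iteration cap below are hypothetical choices for illustration, not from the lecture.

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_iter=1000):
    """Rosenblatt's perceptron on labels y in {-1, +1}.

    X : (n, d) data. Returns (beta, beta0); terminates early only
    if the classes are linearly separable.
    """
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(max_iter):
        updated = False
        for i in range(n):
            # Misclassified iff y_i (beta^T x_i + beta_0) <= 0
            if y[i] * (X[i] @ beta + beta0) <= 0:
                beta = beta + rho * y[i] * X[i]  # beta^new = beta^old + rho y_i x_i
                beta0 = beta0 + rho * y[i]       # beta_0^new = beta_0^old + rho y_i
                updated = True
        if not updated:
            break  # no misclassified points left
    return beta, beta0

# Hypothetical separable data: two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
beta, beta0 = perceptron(X, y)
print(np.all(y * (X @ beta + beta0) > 0))  # True: all points classified correctly
```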
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in "skipping over" the minimum that the algorithm is trying to find, possibly oscillating forever between the two points on either side of the minimum.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Notes==<br />
<references/></div>
Ipargaru
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4436 stat841 2009-10-24T22:03:36Z
<p>Ipargaru: /* The Perceptron */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification, we attempt to approximate a function <math>\,h</math> by using a training data set, so that the resulting rule can accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{color}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented with its features <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, the rule <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>, either apple or orange.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, \hat{L}(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is the indicator function <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
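The empirical error rate is straightforward to compute; a minimal sketch with a hypothetical one-dimensional classifier and made-up labels:

```python
import numpy as np

def empirical_error_rate(h, X, y):
    """Fraction of training points the classifier h mislabels."""
    preds = np.array([h(x) for x in X])
    return np.mean(preds != y)           # (1/n) sum I(h(X_i) != Y_i)

# Hypothetical classifier: label 1 iff x > 0.
h = lambda x: int(x > 0)
X = np.array([-2.0, -1.0, 0.5, 1.5, -0.5])
y = np.array([0, 0, 1, 0, 1])            # two points disagree with h
print(empirical_error_rate(h, X, y))     # 0.4
```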
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to compute the posterior probability of a given object from its prior probability via the Bayes formula, and then assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use the ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the probability <math>\,r(X)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by the ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of its being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We are going to predict whether a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
whether the student’s GPA > 3.0 (G),<br />
whether the student had a strong math background (M),<br />
whether the student is a hard worker (H),<br />
whether the student passed or failed the course.<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
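The posterior computation in this example is easy to reproduce; the numbers below are exactly the ones quoted above (numerator <math>\,0.025</math> and evidence <math>\,0.125</math> read off from the table):

```python
# Numbers taken from the example above: P(X=(0,1,0)|Y=1)P(Y=1) = 0.025
# and the total evidence P(X=(0,1,0)) = 0.125.
num = 0.025                     # P(X=x|Y=1) * P(Y=1)
evidence = 0.125                # sum over both classes
r = num / evidence              # posterior P(Y=1 | X=(0,1,0))
prediction = 1 if r > 0.5 else 0
print(r, prediction)            # 0.2 0 -> predict "fail"
```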
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because in the Bayes formula discussed above it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>; this makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the Naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as an objective existence. In fact, they represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics, there have been two major schools: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample(there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a unique event and cannot be referred to as a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a <math>\,50\%</math> chance of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, one does not see the man (the object), but judges from photos (the labels) of this man whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier represents the optimal method, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\,H</math> and find <math>\,h^* \in H</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation: estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the <math>\Sigma_k</math> over all <math>k</math>.<br />
<br />
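Assumption 1 can be evaluated numerically; here is a minimal NumPy sketch of the class-conditional density <math>f_k(x)</math>, checked at the mean of a standard bivariate normal, where the density equals <math>1/(2\pi)</math>:

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate normal density f_k(x) from assumption 1."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Sanity check at the mean of a standard bivariate normal: 1/(2*pi).
val = gaussian_density(np.zeros(2), np.zeros(2), np.eye(2))
print(np.isclose(val, 1 / (2 * np.pi)))  # True
```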
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> since both covariance matrices are equal under the LDA assumption, the normalizing constants cancel.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form ax+b=0. <br />
<br />
Indeed, this linear log-odds function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, the decision boundary is always linear. In <math>d</math> dimensions, the regions are separated by hyperplanes. <br />
<br />
In the special case where the two classes contain the same number of samples (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
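The boundary coefficients can be read directly off the final line of the derivation above. Here is a minimal numpy sketch (the function name <code>lda_boundary</code> and the numbers are our own illustration, not part of the course code):

```python
import numpy as np

def lda_boundary(mu_k, mu_l, Sigma, pi_k, pi_l):
    # Boundary a^T x + b = 0 with a = Sigma^{-1}(mu_k - mu_l) and
    # b = log(pi_k/pi_l) - (mu_k' S^-1 mu_k - mu_l' S^-1 mu_l)/2,
    # as in the last line of the derivation.
    Sinv = np.linalg.inv(Sigma)
    a = Sinv @ (mu_k - mu_l)
    b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)
    return a, b

mu_k = np.array([1.0, 1.0])
mu_l = np.array([3.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
a, b = lda_boundary(mu_k, mu_l, Sigma, 0.5, 0.5)

# With equal priors the boundary passes through the midpoint of the means
midpoint = (mu_k + mu_l) / 2
print(abs(a @ midpoint + b))
```

The printed value is zero up to floating point, confirming the remark that with equal priors the boundary lies halfway between the means.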
<br />
===QDA===<br />
QDA uses the same idea of finding the boundary where the classification error rates between classes are equal, except that the assumption of a common covariance matrix <math>\,\Sigma</math> (the mean of the <math>\Sigma_k \forall k</math>) is dropped: each class keeps its own covariance <math>\,\Sigma_k</math>.<br />
<br />
<br />
Following along from the point where QDA diverges from LDA:<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form ax<sup>2</sup>+bx+c=0.<br />
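The quadratic boundary can be evaluated numerically from the final equation above. Below is a small numpy sketch (the helper <code>qda_diff</code> and the example numbers are our own); the boundary is the zero set of this function, and its sign tells us which class is favoured:

```python
import numpy as np

def qda_diff(x, mu_k, mu_l, Sig_k, Sig_l, pi_k, pi_l):
    # Left-hand side of the final QDA equation: positive favours class k,
    # negative favours class l, zero is the (curved) decision boundary.
    def quad(mu, Sig):
        d = x - mu
        return -0.5 * np.log(np.linalg.det(Sig)) \
               - 0.5 * d @ np.linalg.inv(Sig) @ d
    return quad(mu_k, Sig_k) + np.log(pi_k) - quad(mu_l, Sig_l) - np.log(pi_l)

mu_k = np.array([0.0, 0.0])
mu_l = np.array([4.0, 0.0])
Sig_k = np.eye(2)
Sig_l = 4 * np.eye(2)   # different shapes, so the boundary is curved

print(qda_diff(mu_k, mu_k, mu_l, Sig_k, Sig_l, 0.5, 0.5) > 0)  # at mu_k: class k
print(qda_diff(mu_l, mu_k, mu_l, Sig_k, Sig_l, 0.5, 0.5) < 0)  # at mu_l: class l
```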
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is:<br />
<br />
<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
<br />
where <br />
<br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes:<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
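The theorem translates directly into code. Here is a numpy sketch of the Bayes classifier (function names <code>delta</code> and <code>h</code> and the three-class example are our own, purely for illustration):

```python
import numpy as np

def delta(x, mu, Sigma, pi):
    # Quadratic discriminant delta_k(x) from the theorem above.
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d + np.log(pi))

def h(x, mus, Sigmas, pis):
    # Bayes classifier: the class k maximizing delta_k(x).
    return int(np.argmax([delta(x, m, S, p)
                          for m, S, p in zip(mus, Sigmas, pis)]))

# Hypothetical three-class example with equal spherical covariances
mus = [np.array([0., 0.]), np.array([5., 0.]), np.array([0., 5.])]
Sigmas = [np.eye(2)] * 3
pis = [1/3] * 3
print(h(np.array([4.8, 0.3]), mus, Sigmas, pis))  # → 1 (closest to the second mean)
```

With equal priors and identity covariances, <code>h</code> reduces to nearest-mean classification, as derived in Case 1 below.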
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\hat{\Sigma}=\frac{\sum_{r=1}^{K}n_r\hat{\Sigma}_r}{\sum_{r=1}^{K}n_r} </math><br />
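These estimates take only a few lines of numpy. A sketch (the helper name <code>estimate_parameters</code> and the toy data are our own; note the MLE divides by <math>n_k</math>, not <math>n_k-1</math>):

```python
import numpy as np

def estimate_parameters(X, y):
    # MLEs of pi_k, mu_k, Sigma_k for each class, plus the pooled Sigma,
    # matching the formulas above.
    n = len(y)
    classes = np.unique(y)
    pi, mu, Sig = {}, {}, {}
    for k in classes:
        Xk = X[y == k]
        pi[k] = len(Xk) / n
        mu[k] = Xk.mean(axis=0)
        Sig[k] = (Xk - mu[k]).T @ (Xk - mu[k]) / len(Xk)
    # Pooled covariance: weighted average of the class covariances
    pooled = sum(len(X[y == k]) * Sig[k] for k in classes) / n
    return pi, mu, Sig, pooled

X = np.array([[0., 0.], [2., 0.], [5., 5.], [7., 5.]])
y = np.array([1, 1, 2, 2])
pi, mu, Sig, pooled = estimate_parameters(X, y)
print(pi[1], mu[1], pooled[0, 0])  # → 0.5 [1. 0.] 1.0
```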
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero since <math>\,|I|=1</math>. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we compute the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>; the class whose adjusted distance is smallest maximises <math>\,\delta_k</math>, and by the theorem we classify the point to that class <math>\,k</math>. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (in the SVD of a matrix <math>\, X</math>, <math>\, U</math> holds the eigenvectors of <math>\, XX^\top </math> and <math>\, V</math> those of <math>\, X^\top X</math>; when <math>\, X</math> is symmetric these coincide, so <math>\, U=V</math>, and here <math>\, \Sigma_k </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math> \, (x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^\top US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
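The transformation <math> \, x^* = S^{-\frac{1}{2}}U^\top x </math> can be sketched in numpy. For a symmetric <math>\,\Sigma</math> the eigendecomposition <code>eigh</code> gives the same <math>\,U</math> and <math>\,S</math> as the SVD, so we use it here (the example <math>\,\Sigma</math> is our own):

```python
import numpy as np

# Whitening map x -> x* = S^{-1/2} U^T x for a symmetric Sigma.
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
S, U = np.linalg.eigh(Sigma)            # Sigma = U diag(S) U^T
W = np.diag(S ** -0.5) @ U.T            # the transformation matrix

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], Sigma, size=5000)
X_star = X @ W.T

# After the transform the sample covariance is close to the identity,
# so Case 1 (Euclidean distance) applies.
print(np.allclose(np.cov(X_star.T), np.eye(2), atol=0.1))  # → True
```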
<br />
Note that when there are multiple classes, they must all share the same transformation; otherwise we would have to decide ahead of time which class a data point belongs to in order to know which transformation to apply. All classes therefore need to have the same shape (the same covariance) for this method of classification, which is exactly the setting of LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is NO. Suppose two classes have different shapes and we try to transform them to the same shape. To transform a given data point we must first choose which class's transformation to apply, but that choice presupposes the very classification we are trying to make: using the transformation of class A amounts to assuming the point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class against the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
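The two counts can be checked with a couple of one-line functions (names are our own):

```python
def lda_params(K, d):
    # (K-1) linear boundaries, each with d + 1 coefficients.
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # (K-1) quadratic boundaries: the symmetric matrix contributes
    # d(d+1)/2 terms, the linear part d, and the constant 1,
    # giving d(d+3)/2 + 1 per boundary.
    return (K - 1) * (d * (d + 3) // 2 + 1)

print(lda_params(3, 10), qda_params(3, 10))  # → 22 132
```

The quadratic growth in <code>d</code> for QDA is what the plot below illustrates.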
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct in only 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform principal component analysis using the SVD. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA directly; the Matlab help file gives the details of its interface. Here we analyze the source code of <code>princomp()</code> itself to see how it differs from the plain SVD method. The code follows, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using <code>princomp</code> on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, <code>princomp</code> uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using <code>princomp</code> and the SVD respectively, obtaining the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code> (up to the sign of each column).<br />
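The same equivalence can be checked in a language-neutral way. A numpy sketch (all names our own) showing that SVD-based scores match projection onto the covariance eigenvectors, up to column signs:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)                 # center the data, as princomp does

# Route 1: SVD of the centered data; coefficients are the right singular vectors
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = Xc @ Vt.T

# Route 2: eigenvectors of the sample covariance, sorted by eigenvalue
evals, evecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
order = np.argsort(evals)[::-1]
scores_eig = Xc @ evecs[:, order]

# Identical up to the arbitrary sign of each component
print(np.allclose(np.abs(scores_svd), np.abs(scores_eig), atol=1e-6))  # → True
```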
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the free entries of a second symmetric <math>\,d \times d</math> covariance matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math style="vertical-align:0%;">x\ \epsilon\ \Re^d</math> (vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, where <math>v</math> is a diagonal matrix, that we cannot estimate with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
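The feature-lifting step itself is tiny. A numpy sketch (the helper <code>augment</code> is our own; the Matlab version applied to the 2_3 data appears in the example below):

```python
import numpy as np

def augment(X, funcs=(np.square,)):
    # Append extra feature columns f(X) for each function f; with
    # np.square this lifts each row x to (x, x^2) as described above.
    return np.hstack([X] + [f(X) for f in funcs])

X = np.array([[1., 2.], [3., 4.]])
print(augment(X).tolist())  # → [[1.0, 2.0, 1.0, 4.0], [3.0, 4.0, 9.0, 16.0]]
```

Passing <code>funcs=(np.square, np.sin)</code> would instead add both squared and <math>\,sin(x)</math> dimensions.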
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we have the pre-knowledge that data points belong to two classes. Intuitively speaking, points of each class form a cloud around the mean of the class, with each class having a possibly different size. To be able to separate the two classes we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, so it is itself positive definite and therefore has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even more.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
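The proportionality above gives a direct recipe: compute <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> and normalize. As a cross-check, here is a minimal sketch in Python/NumPy (rather than the Matlab used elsewhere in these notes); the synthetic data mirrors the Matlab example that follows and all variable names are illustrative.<br />

```python
import numpy as np

# Two Gaussian classes with a shared covariance, as in the Matlab example below.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1.0, 1.0], cov, size=300)   # class 1
X2 = rng.multivariate_normal([5.0, 3.0], cov, size=300)   # class 2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1.T) + np.cov(X2.T)                          # within-class covariance

# FDA direction: w is proportional to Sw^{-1} (mu1 - mu2)
w = np.linalg.solve(Sw, mu1 - mu2)
w = w / np.linalg.norm(w)

# Project both classes onto w; the projected class means separate clearly.
z1, z2 = X1 @ w, X2 @ w
print(abs(z1.mean() - z2.mean()))
```

Using `np.linalg.solve` instead of explicitly inverting <math>S_{W}</math> is numerically preferable and gives the same direction up to scale.<br />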
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through the following figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the direction that best separates the classes and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of a "2" or a "3". Here X1 holds all the "2"s and X2 all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
Since there are only two classes, <math>\,S_B</math> has rank one, so only the first eigenvector carries discriminative information; we use it to project the data onto a one-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (The <math>\frac{1}{n_{i}}</math> normalization is omitted here so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
where the cross terms vanish because <math>\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})=0</math> for each class <math>i</math>.<br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
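The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically. The following Python/NumPy sketch (class sizes and mean shifts are arbitrary illustrative choices) builds the three scatter matrices from random data and checks the identity.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
# Three classes with different (arbitrary) sizes and mean shifts.
classes = [rng.normal(loc=m, size=(n, d)) for n, m in [(30, 0.0), (50, 2.0), (20, -1.0)]]
X = np.vstack(classes)
mu = X.mean(axis=0)                                   # total mean

# Unnormalized scatter matrices, matching the derivation above.
Sw = sum((C - C.mean(0)).T @ (C - C.mean(0)) for C in classes)
Sb = sum(len(C) * np.outer(C.mean(0) - mu, C.mean(0) - mu) for C in classes)
St = (X - mu).T @ (X - mu)                            # total scatter

print(np.allclose(St, Sw + Sb))                       # the decomposition holds exactly
```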
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form, using <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math>, so that <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} &=<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ &= \frac{n_{1}n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
Thus the general <math>\mathbf{S}_{B}</math> is simply a scalar multiple of <math>\mathbf{S}_{B^{\ast}}</math>, so the two definitions lead to the same discriminant direction.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
\quad i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues of<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \le k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues of<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
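The multi-class procedure can be sketched in a few lines of Python/NumPy (the class sizes and means below are arbitrary illustrative choices): form <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math>, take the eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> with the largest eigenvalues, and project.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 5, 3
# Three classes of 40 points each; the means are arbitrary but non-collinear.
means = [np.zeros(d), 3.0 * np.eye(d)[0], 3.0 * np.eye(d)[1]]
classes = [rng.normal(size=(40, d)) + m for m in means]
X = np.vstack(classes)
mu = X.mean(axis=0)

Sw = sum((C - C.mean(0)).T @ (C - C.mean(0)) for C in classes)
Sb = sum(len(C) * np.outer(C.mean(0) - mu, C.mean(0) - mu) for C in classes)

# Eigenvectors of Sw^{-1} Sb; keep the k-1 with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:k - 1]].real                    # d x (k-1) projection matrix

Z = X @ W                                             # projected data, one row per point
print(Z.shape)
```

Since <math>rank(\mathbf{S}_{B}) \le k-1</math>, only the leading <math>k-1</math> eigenvalues are (generically) nonzero, which is why the projection keeps exactly that many directions.<br />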
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
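The closed-form solution and the hat matrix can be checked numerically. The following Python/NumPy sketch (synthetic data, illustrative names) fits <math>\hat\beta</math> by solving the normal equations and confirms that <math>\mathbf{H}\mathbf{y}</math> equals <math>\mathbf{X}\hat\beta</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 3
# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
beta_true = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values via the hat matrix H = X (X^T X)^{-1} X^T.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print(np.allclose(y_hat, X @ beta_hat))               # H y equals X beta_hat
```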
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by transposing the data and appending a row of ones, so that x is a 3*400 matrix.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Classify each point according to whether its fitted value exceeds 0.5, and plot the two predicted classes in different colours.<br />
<br />
[[File: linearregression.png|center|frame| The figure shows the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modeling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1.<br />
<br />
This logistic regression model for the two class case is defined as: <br/><br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br/> <br />
[[File:Picture1.png |frame|center]]<br />
Then we have that <math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
[[File:Picture2.png |frame|center]]<br />
<br />
Logistic regression tries to fit a distribution to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood: the maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n \left[y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)-(1-y_{i})\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right]\\<br />
&= \displaystyle\sum_{i=1}^n \left[y_{i}\underline{\beta}^T\underline{x_i}-y_{i}\log(1+\exp(\underline{\beta}^T\underline{x_i}))-\log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i}\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right]\\<br />
&=\displaystyle\sum_{i=1}^n \left[y_{i}\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right]\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
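Before moving on, the gradient formula can be sanity-checked against finite differences. The following Python/NumPy sketch (random data, illustrative names) compares the analytic score <math>\sum_i (y_i - p_i)\underline{x}_i</math> to a numerical gradient of the log-likelihood.<br />

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 40, 3
X = rng.normal(size=(n, d))                           # rows are observations x_i
y = (rng.random(n) < 0.5).astype(float)
beta = rng.normal(size=d)

def loglik(b):
    # l(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]
    eta = X @ b
    return np.sum(y * eta - np.log1p(np.exp(eta)))

p = 1.0 / (1.0 + np.exp(-(X @ beta)))
grad_analytic = X.T @ (y - p)                         # sum_i (y_i - p_i) x_i

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([(loglik(beta + eps * e) - loglik(beta - eps * e)) / (2 * eps)
                    for e in np.eye(d)])

print(np.allclose(grad_analytic, grad_fd, atol=1e-4))
```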
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=\underline{x})=P(\underline{x};\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=\underline{x})=1-P(\underline{x};\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta}\, \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website containing a Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i\, \underline{x}_i^T \frac{\exp(\underline{\beta}^T\underline{x}_i)}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i\, \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
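The update above can be sketched directly in Python/NumPy (synthetic data, illustrative names; the code keeps <math>X</math> as a <math>d\times n</math> matrix to match the derivation).<br />

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 500, 3
Xd = rng.normal(size=(n, d))                          # one observation per row
beta_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(Xd @ beta_true)))).astype(float)

X = Xd.T                                              # d x n, matching the notation above
beta = np.zeros(d)                                    # beta = 0 is the usual starting value
for _ in range(50):
    P = 1.0 / (1.0 + np.exp(-(X.T @ beta)))           # P(x_i; beta) for each i
    Wdiag = P * (1.0 - P)                             # diagonal entries of W
    # Newton-Raphson step: beta <- beta + (X W X^T)^{-1} X (y - P)
    step = np.linalg.solve((X * Wdiag) @ X.T, X @ (y - P))
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:                  # stop when beta has converged
        break

print(np.round(beta, 2))
```

Note that `(X * Wdiag) @ X.T` computes <math>XWX^T</math> without materializing the <math>n\times n</math> diagonal matrix <math>W</math>.<br />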
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math>, where <math>X</math> is the <math>d\times n</math> input matrix,<br />
<br />
giving <math>\underline{\beta}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted error sum of squares<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
This is a weighted linear regression of the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we instead construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, the augmented coefficient vector will be <math>(d+1)\times{1}</math>.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure usually converges, since the log-likelihood function is concave, but convergence can fail. In general only local convergence of the method can be proven, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, finding an appropriate initial value is rarely a problem: it is uncommon for a starting point to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving can be used to deal with non-convergence. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
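The pseudo code above can be sketched in NumPy as follows. This is a minimal, hypothetical implementation; the design matrix <math>A</math> stores one observation per row, i.e. <math>A=X^T</math> in the notation of the notes:

```python
import numpy as np

def irls_logistic(A, y, tol=1e-8, max_iter=100):
    """Fit logistic regression by iteratively reweighted least squares.
    A: (n, d) design matrix, one observation per row; y: (n,) labels in {0, 1}."""
    n, d = A.shape
    beta = np.zeros(d)                            # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))       # step 3: P(x_i; beta)
        w = p * (1.0 - p)                         # step 4: diagonal entries of W
        z = A @ beta + (y - p) / w                # step 5: adjusted response Z
        # step 6: weighted least squares update, (A^T W A)^{-1} A^T W z
        beta_new = np.linalg.solve(A.T * w @ A, A.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol: # step 7: check convergence
            return beta_new
        beta = beta_new
    return beta
```

A convenient correctness check: at convergence the score equations <math>A^T(\underline{Y}-\underline{P})=0</math> are satisfied.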
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>; no assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. the dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. the dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to perform logistic regression and classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K classes. The model is specified by K - 1 log-odds equations, where the Kth class, which appears in the denominator, can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem, since we don't have the same simplification.<br />
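These posteriors are easy to evaluate numerically; the following sketch (with arbitrary, made-up coefficients <math>\beta_k</math>) confirms that they are positive and sum to 1:

```python
import numpy as np

def multiclass_posteriors(x, betas):
    """Posteriors for K classes from K-1 coefficient vectors beta_1..beta_{K-1};
    class K is the reference class appearing in the denominator."""
    scores = np.array([b @ x for b in betas])        # beta_i^T x
    denom = 1.0 + np.sum(np.exp(scores))
    return np.append(np.exp(scores), 1.0) / denom    # classes 1..K-1, then K

# Illustrative: K = 3 classes, d = 2 features, arbitrary coefficients
x = np.array([0.5, -1.0])
betas = [np.array([1.0, 2.0]), np.array([-0.5, 0.3])]
p = multiclass_posteriors(x, betas)
assert np.isclose(p.sum(), 1.0) and np.all(p > 0)
```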
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like least squares, Rosenblatt's classifier determines a hyperplane as the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the problem is not convex and has no unique global minimum, and the algorithm does not converge to a unique hyperplane. If the classes are separable, then the algorithm is shown to converge to a separating hyperplane; the proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>x_{0}, x_{1},\ldots,x_{d}</math> represent the input data, <math>\sum_{j=1}^d \beta_{j}x_{j}</math> is a linear combination of these features with some weights, and <math>sgn(\sum_{j=1}^d \beta_{j}x_{j})</math> returns the sign of the linear combination. <br />
<br />
<br />
The perceptron seeks a linear decision boundary between two classes. Since it is linear, the decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The Perceptron algorithm begins with a random hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The goal is to minimize the distance between the decision boundary and the misclassified data points, as illustrated in Figure 2. The algorithm attempts to find <math>\underline\beta</math> by iteratively rotating the decision boundary until all points are on the correct side of the boundary, and it terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
*'''Derivation''' ''of the distance between the decision boundary and the misclassified points''. <br />
<br />
:Let <math>\underline{x_{i}}</math> be the misclassified point. <br />
<br />
:Assume <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary. <br />
<br />
:Then <math>\underline{\beta}^T\underline{x_{1}}+\beta_{0} = \underline{\beta}^T\underline{x_{2}}+\beta_{0}</math><br />
<br />
:which implies that <math>\underline{\beta}^T (x_{1}-x_{2})=0</math>.<br />
<br />
::Since <math> \underline{\beta}^T</math> is a vector and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary then <math>\underline{\beta}</math> is a vector orthogonal to the decision boundary. <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to a factor of <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is proportional to the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that takes a predetermined step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this approach is the possibility of getting stuck in a local minimum; Rosenblatt proposed a simple algorithm to overcome this. To continue, the following derivatives are needed: <br />
<br />
<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
<math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] + \rho [y_{i}\underline{x_{i}}\ \ y_{i}] </math> <br />
<br />
where <math> \displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate". The algorithm continues until <math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
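The update rule above can be sketched as a short training loop. This is an illustrative implementation; the choices of <math>\rho</math> and the iteration cap are arbitrary:

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_iter=1000):
    """Rosenblatt's perceptron. X: (n, d) data, y: (n,) labels in {-1, +1}.
    Terminates early once no point is misclassified (separable case)."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(max_iter):
        updated = False
        for i in range(n):
            if y[i] * (X[i] @ beta + beta0) <= 0:   # point i is misclassified
                beta = beta + rho * y[i] * X[i]     # beta  <- beta  + rho*y_i*x_i
                beta0 = beta0 + rho * y[i]          # beta0 <- beta0 + rho*y_i
                updated = True
        if not updated:  # no misclassified points: a linear classifier is found
            break
    return beta, beta0
```

On separable data the loop stops with all margins <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> positive; on overlapping data it simply runs out of iterations, matching the convergence caveats below.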
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if it is too large the algorithm may "skip over" the minimum it is trying to find and possibly oscillate forever between the two points on either side of the minimum.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Notes==<br />
<references/></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4435stat8412009-10-24T21:55:43Z<p>Ipargaru: /* Multi-Class Logistic Regression */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classfication-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data-mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification, we attempt to approximate a function <math>\,h</math> by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input <math>\,X \in \mathcal{X} </math>,<br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule <math>\,h</math> such that when a new fruit <math>\,X</math> is presented with its features <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, the rule classifies it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> is the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:The ''''true error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:The ''''empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
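The empirical error rate is straightforward to compute; a small sketch with a made-up rule and training set:

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training points that the rule h misclassifies."""
    return sum(h(x) != y for x, y in zip(X, Y)) / len(Y)

# Illustrative rule: label 1 iff x > 0, evaluated on 5 training points
h = lambda x: 1 if x > 0 else 0
X = [-2, -1, 0.5, 1, 3]
Y = [0, 1, 1, 1, 0]            # h disagrees with the labels at x = -1 and x = 3
print(empirical_error_rate(h, X, Y))   # 0.4
```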
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then to assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximized over all members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case in which <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>. Define <math>\,r(x)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal with respect to the true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively, this theorem says we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> exceeds the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
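The arithmetic of this example can be reproduced directly from Bayes' formula. The likelihood values 0.05 and 0.2 below are the ones implied by the numerator 0.025 and denominator 0.125 in the text, given the equal priors:

```python
def posterior(lik1, lik0, prior1=0.5, prior0=0.5):
    """r(x) = P(Y=1 | X=x) by Bayes' formula for the two-class case."""
    return lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

# Likelihoods implied by the worked example: 0.05*0.5 = 0.025, evidence 0.125
r = posterior(lik1=0.05, lik0=0.2)
# r is 0.2 < 1/2, so the new student is classified into class 0 (fail)
```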
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because in practice it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>; this makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the Naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as an objective quantity. In fact, they represent two different schools of statistics.<br />
<br />
Throughout the history of statistics, there have been two major schools: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single event. For example, a frequentist cannot predict the weather of tomorrow because tomorrow is only one unique event, and cannot be referred to a frequency in a lot of samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or beliefs. For example, a Bayesian can predict tomorrow's weather, e.g., a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (the object), and then judges whether his name is Jack (the label). On the other hand, in the frequentist method, one doesn't see the man (the object), but judges from photos (the label) of this man whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Though it represents the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a class of classifiers <math>\,H</math> and find <math>\,h^* \in H</math> minimizing some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation doesn't work very well in dimensions greater than 2.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known, so they must be estimated if we want to classify new data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the covariances <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> since the covariance matrices are equal under the LDA assumption, the normalizing constants cancel.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form ax+b=0. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>d</math> dimensions, we separate regions by hyperplanes.<br />
<br />
In the special case where the number of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math><br />
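The boundary just derived can be checked numerically. Below is a minimal NumPy sketch (illustrative only, not part of the course code; the class parameters are assumed) that builds the linear boundary <math>a^\top x + b = 0</math> from the last line of the derivation and verifies that, with equal priors, it passes through the midpoint of the two means.<br />

```python
import numpy as np

# Hypothetical class parameters (assumed for illustration).
mu_k = np.array([0.0, 0.0])
mu_l = np.array([3.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # shared covariance (LDA assumption)
pi_k, pi_l = 0.5, 0.5

Sinv = np.linalg.inv(Sigma)

# From the derivation: log(pi_k/pi_l)
#   - 1/2 (mu_k' Sinv mu_k - mu_l' Sinv mu_l - 2 x' Sinv (mu_k - mu_l)) = 0,
# i.e. a'x + b = 0 with:
a = Sinv @ (mu_k - mu_l)
b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)

def boundary(x):
    return a @ x + b

# With equal priors the boundary passes through the midpoint of the means.
midpoint = (mu_k + mu_l) / 2
print(np.isclose(boundary(midpoint), 0.0))  # True
```

Points with <math>a^\top x + b > 0</math> are assigned to class <math>k</math>, the rest to class <math>l</math>.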
<br />
===QDA===<br />
The concept is the same idea of finding a boundary where the error rates for classification between the classes are equal, except that the assumption of a common covariance matrix <math>\,\Sigma</math> is dropped: each class keeps its own covariance matrix <math>\,\Sigma_k</math>.<br />
<br />
<br />
Continuing from the point at which the QDA derivation diverges from that of LDA:<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form ax<sup>2</sup>+bx+c=0.<br />
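As a sketch of how this quadratic rule is used in practice, the following NumPy snippet (with assumed, illustrative parameters) evaluates the log of <math>f_k(x)\pi_k</math> for each class and assigns a point to the class with the larger value; the boundary between the two assignments is exactly the quadratic curve above.<br />

```python
import numpy as np

# Hypothetical parameters for two classes with different covariances (QDA).
mu_k, mu_l = np.array([0.0, 0.0]), np.array([3.0, 1.0])
Sigma_k = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma_l = np.array([[0.5, 0.0], [0.0, 0.8]])
pi_k, pi_l = 0.6, 0.4

def log_disc(x, mu, Sigma, pi):
    # -1/2 log|Sigma| - 1/2 (x-mu)' Sigma^{-1} (x-mu) + log(pi)
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(pi))

def classify(x):
    return 'k' if log_disc(x, mu_k, Sigma_k, pi_k) > log_disc(x, mu_l, Sigma_l, pi_l) else 'l'

print(classify(np.array([0.0, 0.0])))  # 'k' -- at mu_k
print(classify(np.array([3.0, 1.0])))  # 'l' -- at mu_l
```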
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is:<br />
<br />
<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
<br />
where <br />
<br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes:<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>\,k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice we do not know the true parameters, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
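A minimal NumPy sketch of these estimators on toy labelled data (the data and labels are assumed for illustration):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy labelled data (assumed for illustration): two classes in 2-D.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (60, 2)),
               rng.normal([3.0, 1.0], 1.0, (40, 2))])
y = np.array([0] * 60 + [1] * 40)
n = len(y)

pi_hat, mu_hat, Sigma_hat, n_k = {}, {}, {}, {}
for k in np.unique(y):
    Xk = X[y == k]
    n_k[k] = len(Xk)
    pi_hat[k] = n_k[k] / n                  # \hat{pi}_k = n_k / n
    mu_hat[k] = Xk.mean(axis=0)             # \hat{mu}_k = class mean
    D = Xk - mu_hat[k]
    Sigma_hat[k] = D.T @ D / n_k[k]         # ML estimate (divide by n_k)

# Common covariance: the n_k-weighted average of the class estimates.
Sigma_pooled = sum(n_k[k] * Sigma_hat[k] for k in n_k) / n
print(pi_hat, Sigma_pooled)
```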
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data in each class is distributed symmetrically around its center <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that minimises this adjusted distance maximises <math>\,\delta_k</math>, and according to the theorem we classify the point to that class <math>\,k</math>.<br />
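This special case can be sketched directly in NumPy (the means and priors below are assumed for illustration): with <math>\, \Sigma_k = I </math>, classification reduces to comparing squared Euclidean distances to the class centers, adjusted by the log priors.<br />

```python
import numpy as np

# Hypothetical class means and priors; Sigma_k = I for every class.
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
priors = [0.5, 0.25, 0.25]

def classify(x):
    # delta_k = -1/2 ||x - mu_k||^2 + log(pi_k); the log|I| term is zero.
    deltas = [-0.5 * np.sum((x - mu) ** 2) + np.log(p)
              for mu, p in zip(means, priors)]
    return int(np.argmax(deltas))

print(classify(np.array([3.9, 0.1])))  # 1: closest to the second mean
# Classes 0 and 2 are equally far from this point; the larger prior wins.
print(classify(np.array([1.9, 2.0])))  # 0
```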
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (in the SVD, the columns of <math>\,U</math> are eigenvectors of <math>\,\Sigma_k\Sigma_k^\top</math> and the columns of <math>\,V</math> are eigenvectors of <math>\,\Sigma_k^\top\Sigma_k</math>; since <math>\,\Sigma_k</math> is symmetric, these coincide and <math>\,U=V</math>)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math> \, (x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^\top US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
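The transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be sketched in NumPy as follows (the covariance matrix is assumed for illustration); after the transform, the data has (approximately) identity covariance, so Case 1 applies.<br />

```python
import numpy as np

# Hypothetical shared covariance (assumed for illustration).
Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])

# Sigma = U S U^T (symmetric, so the SVD coincides with the
# eigendecomposition).
U, S, _ = np.linalg.svd(Sigma)

# x* = S^{-1/2} U^T x : the whitening transform from the derivation.
W = np.diag(S ** -0.5) @ U.T

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=5000)
X_star = X @ W.T

# The sample covariance of the transformed points is close to the identity.
print(np.cov(X_star.T))
```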
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume, ahead of time, which class a data point belongs to in order to choose its transformation. All classes therefore need to have the same shape for this method to be applicable, which is exactly the LDA setting.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose the two classes have different shapes and we wish to transform them to a common shape before classifying a new data point. Which transformation should we use? If we apply the transformation of class A, we have already assumed that the data point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare a given class with each of the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
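The two counts can be compared directly with a small illustrative snippet:<br />

```python
# Parameter counts from the text, as functions of the number of classes K
# and the dimension d.
def lda_params(K, d):
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # (K-1) * (d(d+3)/2 + 1); the product d(d+3) is always even.
    return (K - 1) * (d * (d + 3) // 2 + 1)

for d in (2, 10, 64):
    print(d, lda_params(2, d), qda_params(2, d))
# For d = 64 (the dimension of the 2_3 images) and K = 2, LDA needs
# 65 parameters while QDA needs 2145.
```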
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file for <code>princomp</code> gives the details of this function, but here we analyze its code to see how it differs from the SVD method. The following is the code of <code>princomp</code>, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as coefficients for principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations (the free parameters of a second symmetric <math>\,d \times d</math> covariance matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector and <math>\,x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> (with <math>\,v</math> diagonal) that we cannot estimate with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
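The trick can be sketched in NumPy (the toy data and the plain pooled-covariance LDA below are assumed for illustration): on a disc-versus-ring data set that no line can separate, appending squared features lets LDA find a separating hyperplane in the augmented space, which corresponds to a quadratic curve in the original space.<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data (assumed): class 0 near the origin, class 1 on a surrounding ring.
n = 200
r0, r1 = rng.uniform(0, 1, n), rng.uniform(2, 3, n)
t = rng.uniform(0, 2 * np.pi, 2 * n)
X = np.vstack([np.c_[r0 * np.cos(t[:n]), r0 * np.sin(t[:n])],
               np.c_[r1 * np.cos(t[n:]), r1 * np.sin(t[n:])]])
y = np.array([0] * n + [1] * n)

def lda_fit_predict(Z, y):
    # Plain two-class LDA with a pooled covariance and equal priors.
    mus = [Z[y == k].mean(axis=0) for k in (0, 1)]
    D = np.vstack([Z[y == k] - mus[k] for k in (0, 1)])
    Sinv = np.linalg.inv(D.T @ D / len(Z))
    deltas = np.stack([Z @ Sinv @ m - 0.5 * m @ Sinv @ m for m in mus], axis=1)
    return deltas.argmax(axis=1)

# Linear features cannot separate a disc from a ring...
acc_linear = np.mean(lda_fit_predict(X, y) == y)
# ...but appending squared features makes the classes linearly separable.
Z = np.hstack([X, X ** 2])
acc_quad = np.mean(lda_fit_predict(Z, y) == y)
print(acc_linear, acc_quad)  # acc_linear near chance, acc_quad near 1
```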
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that each data point belongs to one of two classes. Intuitively speaking, the points of each class form a cloud around the class mean, and each class may have a different size. To separate the two classes, we must determine which class mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves the maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\ \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, and is therefore invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So <math>\underline{w}</math> is proportional to <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />
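The simplification above is easy to verify numerically. Below is a minimal sketch (in Python/NumPy rather than the Matlab used elsewhere in these notes, with made-up Gaussian classes): it computes <math>\underline{w}</math> both as the leading eigenvector of <math>S_{W}^{-1}S_{B}</math> and directly as <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>, and confirms that the two directions coincide.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Two illustrative 2-D classes (means and covariance are arbitrary choices)
X1 = rng.multivariate_normal([1, 1], [[1, 0.5], [0.5, 2]], size=300)
X2 = rng.multivariate_normal([5, 3], [[1, 0.5], [0.5, 2]], size=300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1.T) + np.cov(X2.T)      # within class covariance
Sb = np.outer(mu1 - mu2, mu1 - mu2)   # between class covariance

# w as the leading eigenvector of Sw^{-1} Sb
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
w_eig = vecs[:, np.argmax(vals.real)].real

# w directly, using the simplification w ∝ Sw^{-1}(mu1 - mu2)
w_direct = np.linalg.solve(Sw, mu1 - mu2)

# The two unit vectors agree up to sign
u1 = w_eig / np.linalg.norm(w_eig)
u2 = w_direct / np.linalg.norm(w_direct)
print(abs(np.dot(u1, u2)))  # ≈ 1
```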
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the optimal projection direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of a "2" or a "3". Here X1 gets all the "2"s and X2 gets all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (The sums here are left unnormalized, with no <math>\frac{1}{n_{i}}</math> factor, so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix <math>\mathbf{S}_{B}</math> is not as easy to obtain directly. One simplification is to note that the total covariance <math>\mathbf{S}_{T}</math> of the data is fixed. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can get <math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is a more general form for <math>\mathbf{S}_{B}</math>. Denote the total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
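The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> is easy to check numerically. The sketch below (Python/NumPy, with randomly generated illustration data) builds the unnormalized sums that appear in the derivation above for three synthetic classes and verifies the identity.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
# Three synthetic classes in 4-D (sizes and locations are arbitrary)
classes = [rng.normal(loc=m, size=(n, 4)) for m, n in [(0.0, 50), (2.0, 30), (-1.0, 40)]]

X = np.vstack(classes)
mu = X.mean(axis=0)  # total mean

# S_W: sum over classes of sum_j (x_j - mu_i)(x_j - mu_i)^T
Sw = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in classes)
# S_B: sum over classes of n_i (mu_i - mu)(mu_i - mu)^T
Sb = sum(len(Xi) * np.outer(Xi.mean(axis=0) - mu, Xi.mean(axis=0) - mu) for Xi in classes)
# S_T: sum over all points of (x_i - mu)(x_i - mu)^T
St = (X - mu).T @ (X - mu)

print(np.allclose(St, Sw + Sb))  # True
```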
<br />
Recall that in the two class problem, we had<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>. Substituting these into the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & =<br />
\left(\frac{n_{1}n_{2}^{2}}{n^{2}}+\frac{n_{2}n_{1}^{2}}{n^{2}}\right)(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
\frac{n_{1}n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
So the general <math>\mathbf{S}_{B}</math> is proportional to the two class <math>\mathbf{S}_{B^{\ast}}</math>; the scalar factor <math>\frac{n_{1}n_{2}}{n}</math> does not change the direction of the optimal projection.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this problem is that the columns of the transformation matrix <math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the <math>k-1</math> largest eigenvalues in<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function:<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem, a Lagrange multiplier <math>\Lambda</math>, which is actually a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \le k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case: the columns of the transformation matrix <math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the <math>k-1</math> largest eigenvalues in<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
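The multi-class solution can be sketched as follows (Python/NumPy; three made-up Gaussian classes in 5 dimensions, projected down to <math>k-1=2</math> dimensions). The check at the end also illustrates the rank argument: only <math>k-1</math> eigenvalues of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> are nonzero.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
k, d = 3, 5
means = 3.0 * rng.normal(size=(k, d))                      # arbitrary class means
classes = [rng.normal(loc=m, size=(60, d)) for m in means]

X = np.vstack(classes)
mu = X.mean(axis=0)

Sw = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in classes)
Sb = sum(len(Xi) * np.outer(Xi.mean(axis=0) - mu, Xi.mean(axis=0) - mu) for Xi in classes)

# Columns of W are the eigenvectors of Sw^{-1} Sb for the k-1 largest eigenvalues
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(vals.real)[::-1]
W = vecs[:, order[:k - 1]].real    # d x (k-1) transformation matrix

Z = X @ W                          # projected data, n x (k-1)
print(Z.shape)  # (180, 2)
```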
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and labels <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that the vectors <math>\mathbf{x}_{i}</math> could be numerical inputs, transformations of the original data, e.g. <math>\log \mathbf{x}_{i}</math> or <math>\sin \mathbf{x}_{i}</math>, or basis expansions, e.g. <math>\mathbf{x}_{i}^{2}</math> or <math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
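These formulas can be sketched directly (Python/NumPy, with made-up data). The sketch also checks the characteristic property of the hat matrix: it is a projection, so <math>\mathbf{H}\mathbf{H} = \mathbf{H}</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 3
# n x (d+1) design matrix with 1 in the first position of each row
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])    # made-up coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)   # noisy linear response

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
y_hat = H @ y                                  # fitted values

print(np.allclose(H @ H, H))            # True: H is idempotent
print(np.allclose(y_hat, X @ beta_hat)) # True
```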
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample;ones(1,400)];<br />
Construct x by appending a row of ones to the data.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame|The figure shows the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modeling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1.<br />
<br />
This logistic regression model for the two class case is defined as: <br/><br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br/> <br />
[[File:Picture1.png |frame|center]]<br />
Then we have that <math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
[[File:Picture2.png |frame|center]]<br />
<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood: the maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides and get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used, which requires the second derivative in addition to the first. This is demonstrated in the next section.<br />
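Before moving on, the gradient formula can be sanity-checked numerically. The sketch below (Python/NumPy, with made-up data; rows of X are the <math>\underline{x}_i^T</math>) compares the analytic gradient <math>\sum_i (y_i - P(\underline{x}_i;\underline{\beta}))\underline{x}_i</math> with a finite-difference approximation of the log-likelihood.<br />

```python
import numpy as np

def log_lik(beta, X, y):
    # l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]
    u = X @ beta
    return np.sum(y * u - np.log1p(np.exp(u)))

def log_lik_grad(beta, X, y):
    # analytic gradient: sum_i (y_i - P(x_i; beta)) x_i
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 3))               # rows are the x_i^T
y = (rng.random(40) < 0.5).astype(float)   # arbitrary 0/1 labels
beta = rng.normal(size=3)

# Central finite differences, one coordinate at a time
eps = 1e-6
fd = np.array([(log_lik(beta + eps * e, X, y) - log_lik(beta - eps * e, X, y)) / (2 * eps)
               for e in np.eye(3)])
print(np.allclose(fd, log_lik_grad(beta, X, y), atol=1e-4))  # True
```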
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website with a Matrix Reference Manual, where you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\underline{\beta}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} = \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
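Putting the update together, here is a minimal IRLS sketch (Python/NumPy, with simulated data; following the lecture's convention, X is <math>d \times n</math>, and the diagonal of <math>W</math> is stored as a vector; the clipping of the weights is a small stability tweak not in the lecture).<br />

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 3
# d x n design matrix; the first row is an intercept
X = np.vstack([np.ones(n), rng.normal(size=(2, n))])
beta_true = np.array([-0.5, 2.0, -1.0])            # made-up "true" coefficients
p_true = 1.0 / (1.0 + np.exp(-(beta_true @ X)))
y = (rng.random(n) < p_true).astype(float)         # simulated 0/1 labels

beta = np.zeros(d)                                 # starting value beta = 0
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-(beta @ X)))          # P(x_i; beta)
    w = np.clip(p * (1.0 - p), 1e-10, None)        # diagonal of W (clipped for stability)
    z = X.T @ beta + (y - p) / w                   # adjusted response Z
    beta_new = np.linalg.solve((X * w) @ X.T, (X * w) @ z)   # (X W X^T)^{-1} X W Z
    if np.linalg.norm(beta_new - beta) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

# At convergence the gradient X(y - p) is numerically zero
grad = X @ (y - 1.0 / (1.0 + np.exp(-(beta @ X))))
print(np.linalg.norm(grad) < 1e-6)
```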
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted error sum of squares<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Each Newton-Raphson step is thus a weighted linear regression of the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, a <math>d\times{1}</math> vector, because we model the log-odds as <math>\underline{\beta}^T\underline{x}</math>. If we instead use the model <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, the parameter vector will be <math>(d+1)\times{1}</math>.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. In the cases where it does not, only local convergence can be proved: the iteration converges provided the initial point is close enough to the exact solution. In practice, choosing an appropriate initial value is rarely a problem; it is uncommon for a starting point to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
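The pseudo code can be sketched in a few lines. The following is a minimal illustration for a one-dimensional feature with no intercept, where steps 5 and 6 collapse into the scalar Newton update <math>\beta \leftarrow \beta + (\sum_i w_ix_i^2)^{-1}\sum_i x_i(y_i-p_i)</math>; the toy data are made up:<br />

```python
import math

# IRLS for one-dimensional logistic regression without an intercept,
# following the pseudo code above. The two classes overlap, so the
# maximum-likelihood beta is finite and the iteration converges.
def irls_1d(x, y, n_iter=25):
    beta = 0.0                                    # step 1
    for _ in range(n_iter):
        p = [math.exp(beta * xi) / (1 + math.exp(beta * xi)) for xi in x]  # step 3
        w = [pi * (1 - pi) for pi in p]           # step 4: diagonal of W
        num = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        den = sum(wi * xi * xi for wi, xi in zip(w, x))
        beta += num / den                         # steps 5 and 6 combined
    return beta

x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 0, 1, 0, 1, 1]                            # overlapping classes
beta = irls_1d(x, y)
print(beta > 0)   # True: larger x means higher estimated P(Y = 1)
```

Note that on perfectly separable data the likelihood has no finite maximizer and <math>\beta</math> would grow without bound, which is why the toy classes overlap.<br />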
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case where <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 or to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum to 1.<br />
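The boundedness difference is easy to see numerically; a minimal sketch with made-up coefficients:<br />

```python
import math

# A linear function of x is unbounded, but its logistic transform always
# lies strictly between 0 and 1. The coefficients are arbitrary.
beta0, beta1 = 0.5, 0.3

def linear(x):
    return beta0 + beta1 * x

def logistic(x):
    return math.exp(linear(x)) / (1 + math.exp(linear(x)))

print(linear(-100), linear(100))      # -29.5 30.5 : far outside [0, 1]
print(0 < logistic(-100) < 1)         # True
print(0 < logistic(100) < 1)          # True
```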
<br />
===Comparison with LDA===<br />
#The linear logistic model only consider the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. the dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. the dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
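The parameter counts in the comparison above can be verified with a quick sketch:<br />

```python
# Parameter counts from the comparison above: logistic regression fits d
# coefficients (for the model beta^T x), while LDA fits 2d mean
# components, d(d+1)/2 distinct covariance entries, and 2 priors.
def logistic_params(d):
    return d

def lda_params(d):
    return 2 * d + d * (d + 1) // 2 + 2

for d in (1, 2, 10):
    assert lda_params(d) == (d * d + 5 * d + 4) // 2

print(logistic_params(10), lda_params(10))  # 10 77
```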
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html&http://www.google.cn/search?hl=zh-CN&q=mnrfit+matlab&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= mnrfit] to perform logistic regression and classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
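This rule can be applied directly using the coefficients returned by mnrfit above:<br />

```python
# The fitted classification rule, with the coefficients
# B = (0.1861, -5.5917, -3.0547) from the MATLAB session above.
def classify(x1, x2):
    score = 0.1861 - 5.5917 * x1 - 3.0547 * x2
    return 1 if score >= 0 else 2

print(classify(-1.0, 0.0))  # 1: the score 5.7778 is nonnegative
print(classify(1.0, 1.0))   # 2: the score -8.4603 is negative
```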
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general the posteriors are no longer complements of each other, as is true in the 2-class problem where we could express <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math>. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem since we don't have the same simplification.<br />
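The K-class posteriors above can be computed with a short sketch (the coefficient vectors below are made up, with d = 2 and K = 3):<br />

```python
import math

# Posteriors for the K-class logistic model: classes 1..K-1 each have a
# coefficient vector, and class K (the reference) gets an implicit score 0.
def posteriors(x, betas):
    scores = [sum(b * xi for b, xi in zip(beta, x)) for beta in betas]
    denom = 1 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1 / denom)              # P(Y = K | X = x)
    return probs

p = posteriors([1.0, 2.0], [[0.5, -0.2], [-1.0, 0.3]])
print(abs(sum(p) - 1) < 1e-12)           # True: the K posteriors still sum to 1
```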
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to 1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Because the solution is found iteratively, it is not unique: the algorithm does not converge to a single, well-defined hyperplane. If the classes are separable, then the algorithm is shown to converge to a separating hyperplane in a finite number of steps. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
Perceptron can be modeled as shown in Figure 1 of the previous lecture where <math>x_{0}, x_{1},\ldots,x_{d}</math> represent the input data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of some weights of these features, and <math>sgn(\sum_{j=0}^d \beta_{j}x_{j})</math> returns the sign of the linear combination. <br />
<br />
<br />
Perceptron seeks a linear boundary between two classes, which can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The Perceptron algorithm begins with a random hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find a <math>\underline\beta</math> by iteratively rotating the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
*'''Derivation''' ''of the distance between the decision boundary and the misclassified points''. <br />
<br />
:Let <math>\underline{x_{i}}</math> be the misclassified point. <br />
<br />
:Assume <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary. <br />
<br />
:Then <math>\underline{\beta}^T\underline{x_{1}}+\beta_{0} = \underline{\beta}^T\underline{x_{2}}+\beta_{0}</math><br />
<br />
:which implies that <math>\underline{\beta}^T (\underline{x_{1}}-\underline{x_{2}})=0</math>.<br />
<br />
::Since <math> \underline{\beta}^T</math> is a vector and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary then <math>\underline{\beta}</math> is a vector orthogonal to the decision boundary. <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> on the direction that is orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to the constant factor <math>1/\|\underline{\beta}\|</math>, which does not affect the classification). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that, up to the same constant factor, the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
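As a numeric check, normalize <math>\underline{\beta}</math> to unit length so that <math>|\underline{\beta}^T\underline{x_{i}}+\beta_{0}|</math> is exactly the Euclidean distance; the boundary and point below are made up:<br />

```python
import math

# Distance from a point to the boundary beta^T x + beta_0 = 0.  With beta
# scaled to unit length, |beta^T x_i + beta_0| is the Euclidean distance.
beta = [3.0, 4.0]
beta0 = -5.0
norm = math.hypot(beta[0], beta[1])          # ||beta|| = 5
beta_u = [b / norm for b in beta]            # unit normal to the boundary
beta0_u = beta0 / norm

xi = [3.0, 4.0]                              # an arbitrary point
signed = beta_u[0] * xi[0] + beta_u[1] * xi[1] + beta0_u
print(abs(signed))  # ~4.0: distance from (3, 4) to the line 3x + 4y - 5 = 0
```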
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step of predetermined size in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this approach is the possibility of getting stuck in a local minimum. Rosenblatt proposed a simple algorithm to overcome this problem. To continue, the following derivatives are needed: <br />
<br />
<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
<math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] + \rho [y_{i}\underline{x_{i}}\ \ y_{i}] </math> <br />
<br />
where <math> \displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate". The algorithm continues until <math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points. <br />
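The update rule above can be sketched as a short training loop; a minimal example on made-up, linearly separable data, with labels coded as +1 and -1:<br />

```python
# Perceptron training: repeatedly pick a misclassified point (one with
# y_i (beta^T x_i + beta_0) <= 0) and apply the update
# [beta, beta_0] <- [beta, beta_0] + rho * [y_i x_i, y_i].
def perceptron(points, labels, rho=1.0, max_iter=1000):
    beta, beta0 = [0.0, 0.0], 0.0
    for _ in range(max_iter):
        mis = [(x, y) for x, y in zip(points, labels)
               if y * (beta[0] * x[0] + beta[1] * x[1] + beta0) <= 0]
        if not mis:
            break                        # converged: no misclassified points
        x, y = mis[0]
        beta = [beta[0] + rho * y * x[0], beta[1] + rho * y * x[1]]
        beta0 += rho * y
    return beta, beta0

points = [(2.0, 1.0), (3.0, 2.0), (-1.0, -1.0), (-2.0, -3.0)]
labels = [1, 1, -1, -1]
beta, beta0 = perceptron(points, labels)
separated = all(y * (beta[0] * x[0] + beta[1] * x[1] + beta0) > 0
                for x, y in zip(points, labels))
print(separated)  # True
```

For overlapping classes the loop simply stops after <code>max_iter</code> passes without finding a separating hyperplane.<br />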
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find, possibly oscillating forever between two points on either side of the minimum.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Notes==<br />
<references/></div>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classfication-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rising fields of data-mining, bioinformatics, machine learning and so on, classification has becomes a fast developing topic. In the age of information, vast amount of data is generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten post codes recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification we attempt to approximate a function <math>\,h</math>, by using a training data set, that will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>D</math>-dimensional real vectors and <math> \mathcal{Y} </math>, a finite set of labels, We try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g, color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X{_\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> be the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier(h) is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate(training error rate)'''' of a classifier(h) is defined as the frequency that <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
<br />
=== Bayes Classifier ===<br />
<br />
The principle of Bayes Classifier is to calculate posteriori probability of given object from its priors probability via Bayes formula, and then choose the class with biggest posteriori probability as the one what the object affiliated with. Intuitively speaking to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and given object <math>\,X=x</math>, we are going to find out <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use the ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posteriori probability, <math>\,P(Y=y)</math> as the priors probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case that <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>. Consider the probability that <math>\,r(X)=P\{Y=1|X=x\}</math>. Given <math>\,X=x</math>, By ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality Theorem''': The Bayes rule is optimal in true error rate, that is for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(\overline{h}) \le L(h)</math>. Intuitively speaking this theorem is saying we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of being of type <math>\,y</math> for <math>\,x</math> is more than probability of being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, and the reason for the fact is that in the Bayes equation discussed before, it is generally impossible for us to know the <math>\,P(Y=1)</math>, and <math>\,P(X=x|Y=1)</math> and ultimately calculate the value of <math>\,r(X)</math>, which makes Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifier based on Bayes Classifier: Naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first one considers probability as changing based on observation while the second one considers probablity as objective existance. Actually, they represent two different schools in statistics.<br />
<br />
During the history of statistics, there are two major classification methods : Bayes and frequentist. The two methods represent two different ways of thoughts and hold different view to define probability. The followings are the main differences between Bayes and Frequentist.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample(there is a frequency).<br />
#Parameters are fixed and unknown constant.<br />
#Not applicable to single event. For example, a frequentist cannot predict the weather of tomorrow because tomorrow is only one unique event, and cannot be referred to a frequency in a lot of samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or beliefs. For example, Bayesian can predict tomorrow's weather, such as having the probability of <math>\,50%</math> of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In bayes method, at first, one can see this man (object), and then judge whether his name is Jack (label). On the other hand, in Frequentist method, one doesn’t see the man (object), but can see the photos (label) of this man to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Representing the optima method, Bayes classifier cannot be used in most practical situations though, since usually the prior probability is unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose classifiers, find <math>\,h^* \epsilon H</math>, minimize some estimate of <math>\,L(H)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation doesn't work very well with dimension greater than 2.<br />
Linear Discriminate Analysis and Quadratic Discriminate Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in <math>x</math> with the general form <math>\,a^{T}x+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, the decision boundary is always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the number of samples from each class is equal (so that <math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_k</math> and <math>\,\mu_l</math>.<br />
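To make the boundary concrete, here is a small NumPy sketch (not part of the original notes; the means, shared covariance, and priors are made-up values) that evaluates the left-hand side of the final equation above and checks that the midpoint between the two means lies on the boundary when the priors are equal.<br />

```python
import numpy as np

# Hypothetical two-class setup: means, shared covariance, equal priors.
mu_k = np.array([0.0, 0.0])
mu_l = np.array([4.0, 2.0])
Sigma = np.array([[1.0, 0.0], [0.0, 1.0]])
pi_k = pi_l = 0.5

Sigma_inv = np.linalg.inv(Sigma)

def boundary(x):
    # log(pi_k/pi_l) - 1/2 (mu_k' S^-1 mu_k - mu_l' S^-1 mu_l - 2 x' S^-1 (mu_k - mu_l))
    return (np.log(pi_k / pi_l)
            - 0.5 * (mu_k @ Sigma_inv @ mu_k
                     - mu_l @ Sigma_inv @ mu_l
                     - 2 * x @ Sigma_inv @ (mu_k - mu_l)))

# With equal priors the midpoint between the means lies exactly on the boundary.
midpoint = (mu_k + mu_l) / 2
print(abs(boundary(midpoint)) < 1e-12)  # True
```

The sign of <code>boundary(x)</code> then tells us which side of the hyperplane a point falls on.<br />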
<br />
===QDA===<br />
The concept is the same: we find the boundary where the error rates for classification between the classes are equal, except that the assumption of a common covariance <math>\,\Sigma</math> (equal to the mean of the <math>\Sigma_k</math>) is removed, and each class keeps its own covariance <math>\,\Sigma_k</math>.<br />
<br />
<br />
Picking up the derivation at the point where QDA diverges from LDA:<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form ax<sup>2</sup>+bx+c=0.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is:<br />
<br />
<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
<br />
where <br />
<br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes:<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
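As a sketch of the rule in the theorem (illustrative only; the three classes and their parameters below are invented), the following Python code evaluates the quadratic <math>\delta_k</math> for each class and classifies by <math>\arg\max_{k}</math>:<br />

```python
import numpy as np

# Hypothetical parameters for a 3-class problem (not from the notes).
mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
Sigmas = [np.eye(2), 2 * np.eye(2), np.eye(2)]
pis = [1/3, 1/3, 1/3]

def delta(x, mu, Sigma, pi):
    # delta_k = -1/2 log|Sigma_k| - 1/2 (x - mu_k)' Sigma_k^{-1} (x - mu_k) + log(pi_k)
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(pi))

def h(x):
    # Bayes classifier rule: pick the class maximizing delta_k(x).
    return int(np.argmax([delta(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]))

print(h(np.array([0.2, 0.1])), h(np.array([3.8, 0.3])), h(np.array([0.1, 3.9])))  # 0 1 2
```

Each test point is assigned to the class whose center it lies nearest, adjusted by the covariance and the log prior.<br />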
<br />
===In practice===<br />
In practice the priors, means, and covariances are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we assume a common covariance matrix, the ML estimate is the weighted average of the class covariances:<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
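A minimal NumPy sketch of these estimates (the toy two-class sample below is made up for illustration) might look like the following; note the ML covariance divides by <math>n_k</math>, matching the formula above.<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-class sample (labels 1 and 2), purely illustrative.
X = np.vstack([rng.normal([0, 0], 1.0, size=(60, 2)),
               rng.normal([3, 1], 1.0, size=(40, 2))])
y = np.array([1] * 60 + [2] * 40)

n = len(y)
pi_hat, mu_hat, Sigma_hat, n_k = {}, {}, {}, {}
for k in (1, 2):
    Xk = X[y == k]
    n_k[k] = len(Xk)
    pi_hat[k] = n_k[k] / n              # pi_k_hat = n_k / n
    mu_hat[k] = Xk.mean(axis=0)         # mu_k_hat = class-k sample mean
    C = Xk - mu_hat[k]
    Sigma_hat[k] = C.T @ C / n_k[k]     # ML covariance (divide by n_k)

# Pooled covariance for LDA: weighted average of the class covariances.
Sigma_pooled = sum(n_k[k] * Sigma_hat[k] for k in (1, 2)) / n

print(pi_hat[1], pi_hat[2])  # 0.6 0.4
```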
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (in the SVD <math>\, X = USV^\top</math>, the columns of <math>\, U</math> are eigenvectors of <math>\, XX^\top </math> and the columns of <math>\, V</math> are eigenvectors of <math>\, X^\top X</math>; since <math>\, \Sigma_k </math> is symmetric these coincide, so <math>\, U=V</math>)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math> \, (x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^\top US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
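The transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be sketched in NumPy as follows (the covariance below is a made-up example); the check confirms that the Mahalanobis distance under <math>\Sigma</math> equals the Euclidean distance after the transform.<br />

```python
import numpy as np

# Hypothetical shared covariance (must be the same for every class for this trick).
Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])

# Sigma = U S U^T (symmetric, so the SVD matches the eigendecomposition).
U, S, _ = np.linalg.svd(Sigma)
W = np.diag(S ** -0.5) @ U.T          # x* = S^{-1/2} U^T x

x = np.array([1.0, -1.0])
mu = np.array([0.0, 0.0])

# Mahalanobis distance under Sigma equals Euclidean distance after the transform.
d_mahal = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
d_eucl = np.sum((W @ x - W @ mu) ** 2)
print(np.isclose(d_mahal, d_eucl))  # True
```

After applying <code>W</code> to every point and every mean, classification proceeds as in Case 1.<br />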
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to in order to choose its transformation. All classes therefore need to have the same shape for classification to be applicable using this method, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose the two classes have different shapes and we want to transform them to a common shape. Given a data point, which transformation should we apply? If we use the transformation of class A, we have already assumed that the data point belongs to class A, which is exactly what we set out to decide.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class against the remaining <math>\,K-1</math> classes, giving <math>\,K-1</math> pairwise differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters, since the symmetric matrix <math>\,a</math> alone contributes <math>\frac{1}{2}d(d+1)</math> free entries. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
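The two counts can be tabulated with a few lines of Python (a direct transcription of the formulas above, shown here for the two-class case <math>K=2</math>):<br />

```python
def lda_params(K, d):
    # (K - 1) boundaries, each a'x + b with d + 1 parameters.
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # Each boundary x'ax + b'x + c: d(d+1)/2 (symmetric a) + d (b) + 1 (c).
    return (K - 1) * (d * (d + 3) // 2 + 1)

for d in (2, 10, 64):
    print(d, lda_params(2, d), qda_params(2, d))
```

For <math>d=64</math> (the dimension of the raw 2_3 images) this gives 65 parameters for LDA against 2145 for QDA, illustrating the quadratic growth shown in the plot.<br />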
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform principal component analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html&http://www.google.cn/search?hl=zh-CN&q=mathwork+princomp&btnG=Google+%E6%90%9C%E7%B4%A2&aq=f&oq= <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> documents the function's interface; here we analyze the code of <code>princomp()</code> itself to see how it differs from the SVD method. The following is the code of princomp, with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations, columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we therefore take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
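The same equivalence can be checked outside Matlab; this NumPy sketch (with randomly generated data, purely illustrative) performs both computations and confirms that the scores agree, here allowing for a possible sign flip per component since singular vectors are only determined up to sign.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))           # rows = observations, as princomp expects

# princomp-style: center by column means, take V from the SVD of the scaled data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc / np.sqrt(len(X) - 1), full_matrices=False)
coeff = Vt.T                            # principal component coefficients (pc)
score = Xc @ coeff                      # representation in PC space

# SVD-method equivalent: project the centered data onto V of svd(Xc) directly.
_, _, Vt2 = np.linalg.svd(Xc, full_matrices=False)
score2 = Xc @ Vt2.T

# Agreement up to a sign flip per component.
signs = np.sign(np.sum(score * score2, axis=0))
print(np.allclose(score, score2 * signs))  # True
```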
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math>, with <math>\,v</math> a diagonal matrix (so only squared terms appear, no cross terms), that we cannot estimate with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
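A hedged NumPy sketch of the feature-augmentation step (the ring-shaped toy data and the particular threshold are invented for illustration): appending the squared coordinates makes a class structure that is quadratic in <math>x</math> linearly separable in <math>x^*</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy data where a quadratic boundary helps: class 1 inside a disc, class 2 in a ring.
r1 = rng.uniform(0, 1, 200); r2 = rng.uniform(1.5, 2.5, 200)
t = rng.uniform(0, 2 * np.pi, 400)
X = np.vstack([np.c_[r1 * np.cos(t[:200]), r1 * np.sin(t[:200])],
               np.c_[r2 * np.cos(t[200:]), r2 * np.sin(t[200:])]])
y = np.array([1] * 200 + [2] * 200)

# The trick: append squared coordinates, x* = [x1, x2, x1^2, x2^2].
X_star = np.hstack([X, X ** 2])

# A linear rule in the augmented space, w'x* = x1^2 + x2^2, is quadratic in x.
w = np.array([0.0, 0.0, 1.0, 1.0])
pred = np.where(X_star @ w < 1.5, 1, 2)
print(X_star.shape, (pred == y).mean())  # (400, 4) 1.0
```

Here the fixed rule stands in for what LDA would learn on <math>x^*</math>; with real data one would fit the linear classifier rather than hand-pick <code>w</code>.<br />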
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to one of two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, and each class may have a different size. To separate the two classes we must determine which class's mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, and is therefore invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So <math>\underline{w}</math> is proportional to <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />
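This proportionality can be checked numerically. Below is a minimal sketch in Python/NumPy (rather than the course's Matlab), with arbitrary example means and covariances; it confirms that the top eigenvector of <math>S_{W}^{-1}S_{B}</math> is proportional to <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>.<br />

```python
import numpy as np

# Arbitrary example class means and covariances (illustrative values only)
mu1 = np.array([1.0, 1.0])
mu2 = np.array([5.0, 3.0])
Sigma1 = np.array([[1.0, 1.5], [1.5, 3.0]])
Sigma2 = np.array([[1.0, 1.5], [1.5, 3.0]])

# Within and between class covariance
Sw = Sigma1 + Sigma2
Sb = np.outer(mu1 - mu2, mu1 - mu2)

# Eigenvector of Sw^{-1} Sb for the largest eigenvalue ...
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w_eig = eigvecs[:, np.argmax(eigvals.real)].real

# ... is proportional to the closed-form direction Sw^{-1}(mu1 - mu2)
w_closed = np.linalg.inv(Sw) @ (mu1 - mu2)

u = w_eig / np.linalg.norm(w_eig)
v = w_closed / np.linalg.norm(w_closed)
assert np.allclose(abs(u @ v), 1.0)  # same direction up to sign
```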
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through the figure produced by Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create 300 samples from each of two multivariate normal distributions with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right), ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes separately.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the first principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the optimal projection direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of a "2" or a "3". Here X1 holds all the "2"s and X2 all the "3"s.<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
 >> scatter(ones(1,200),X_hat(1,1:200))<br />
 >> hold on<br />
 >> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> is the (unnormalized) scatter matrix of class <math>i</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
Actually, there is a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
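The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived above can be verified numerically. A minimal Python/NumPy sketch (rather than the course's Matlab), using unnormalized scatter matrices as in the derivation and arbitrary synthetic data:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

# Three classes of 2-D points (arbitrary class means and sample sizes)
classes = [rng.normal(loc=m, size=(n, 2)) for m, n in [(0.0, 30), (2.0, 50), (5.0, 20)]]
X = np.vstack(classes)
mu = X.mean(axis=0)  # total mean

Sw = np.zeros((2, 2))
Sb = np.zeros((2, 2))
for Xi in classes:
    mui = Xi.mean(axis=0)
    D = Xi - mui
    Sw += D.T @ D                                  # unnormalized within class scatter
    Sb += len(Xi) * np.outer(mui - mu, mui - mu)   # n_i-weighted between class term

St = (X - mu).T @ (X - mu)                         # total scatter
assert np.allclose(St, Sw + Sb)                    # S_T = S_W + S_B
```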
<br />
Recall that in the two class case, we used<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since <math>\mathbf{\mu} = \frac{1}{n}(n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2})</math>, we have <math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and <math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>. Substituting these into the general form gives<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
Thus the general form of <math>\mathbf{S}_{B}</math> is proportional to the two class <math>\mathbf{S}_{B^{\ast}}</math>, so both lead to the same optimal direction.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
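This trace identity is easy to verify numerically. A small Python/NumPy sketch, where the class means, counts, and <math>\mathbf{W}</math> are arbitrary illustrative values:<br />

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 4, 3

# Arbitrary class means, class sizes, and a d x (k-1) transformation W
mus = rng.normal(size=(k, d))
ns = np.array([30, 50, 20])
mu = (ns[:, None] * mus).sum(axis=0) / ns.sum()   # weighted total mean
W = rng.normal(size=(d, k - 1))

# General between class covariance S_B
Sb = sum(n_i * np.outer(m - mu, m - mu) for n_i, m in zip(ns, mus))

# sum_i n_i ||W^T (mu_i - mu)||^2  ==  Tr[W^T S_B W]
lhs = sum(n_i * np.linalg.norm(W.T @ (m - mu)) ** 2 for n_i, m in zip(ns, mus))
rhs = np.trace(W.T @ Sb @ W)
assert np.isclose(lhs, rhs)
```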
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that the <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, thus set the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has exactly <math>\,k-1</math> nonzero eigenvalues, because <math>rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B})=k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case: the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
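Putting the multi-class procedure together: the sketch below (Python/NumPy rather than the course's Matlab, on arbitrary synthetic data) builds <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math>, takes the eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> for the <math>k-1</math> largest eigenvalues, and projects the data. It also checks the rank property stated above.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 3, 4

# Synthetic k-class data in d dimensions (arbitrary illustrative means)
means = [np.zeros(d), np.full(d, 3.0), np.array([0.0, 4.0, 0.0, 4.0])]
data = [rng.normal(loc=m, size=(40, d)) for m in means]
X = np.vstack(data)
mu = X.mean(axis=0)

# Within and between class scatter
Sw = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in data)
Sb = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in data)

# Columns of W: eigenvectors of Sw^{-1} Sb for the k-1 largest eigenvalues
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:k - 1]].real   # d x (k-1) projection matrix

Z = X @ W                            # projected data, n x (k-1)
assert W.shape == (d, k - 1)
# rank(Sw^{-1} Sb) = k-1: only k-1 eigenvalues are (numerically) nonzero
assert np.sum(eigvals.real > 1e-8) == k - 1
```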
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the least squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
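The least squares solution and the hat matrix can be illustrated with a short sketch in Python/NumPy (rather than the course's Matlab); the data below are arbitrary synthetic values. The checks use two standard properties: the hat matrix is a projection (<math>\mathbf{H}\mathbf{H}=\mathbf{H}</math>), and the residuals are orthogonal to the columns of <math>\mathbf{X}</math>.<br />

```python
import numpy as np

rng = np.random.default_rng(2)

# n observations, d features; X gets a leading column of ones for the intercept
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=n)

# Least squares solution and hat matrix
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

assert np.allclose(H @ H, H)                       # H is idempotent
assert np.allclose(X.T @ (y - y_hat), 0, atol=1e-8)  # residuals orthogonal to X
```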
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{j}f_{j}(x)\pi_{j}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
 >>x=[sample';ones(1,400)];<br />
Construct x by transposing the reduced data and appending a row of ones for the intercept term.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modeling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1.<br />
<br />
This logistic regression model for the two class case is defined as: <br/><br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br/> <br />
[[File:Picture1.png |frame|center]]<br />
Then we have that <math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
[[File:Picture2.png |frame|center]]<br />
<br />
Logistic regression fits a distribution to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood: we choose the <math>\underline\beta</math> that maximizes the probability of obtaining the observed data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
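The gradient formula above can be sanity-checked against a finite-difference approximation of the log-likelihood. A minimal Python/NumPy sketch on arbitrary synthetic data:<br />

```python
import numpy as np

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]"""
    s = X @ beta
    return np.sum(y * s - np.log1p(np.exp(s)))

def gradient(beta, X, y):
    """dl/dbeta = sum_i (y_i - p_i) x_i, with p_i the logistic probability."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))              # rows are the x_i
y = (X[:, 0] + X[:, 1] > 0).astype(float)
beta = np.array([0.3, -0.2])

# Central finite-difference check of the analytic gradient
eps = 1e-6
g = gradient(beta, X, y)
for j in range(2):
    e = np.zeros(2); e[j] = eps
    fd = (log_likelihood(beta + e, X, y) - log_likelihood(beta - e, X, y)) / (2 * eps)
    assert abs(fd - g[j]) < 1e-5
```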
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}\underline{x}_i^T</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful website with a Matrix Reference Manual containing information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T \frac{\exp(\underline{\beta}^T\underline{x}_i)}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\underline{\beta}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
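The Newton-Raphson/IRLS update can be implemented in a few lines. The following Python/NumPy sketch (rather than the course's Matlab) fits a two-parameter logistic model to arbitrary synthetic data, starting from <math>\underline{\beta}=0</math>; the convergence tolerances and iteration cap are illustrative assumptions.<br />

```python
import numpy as np

rng = np.random.default_rng(4)

# d x n data matrix (columns are observations), matching the notes' convention
n, d = 200, 2
X = rng.normal(size=(d, n))
beta_true = np.array([1.5, -2.0])
p_true = 1.0 / (1.0 + np.exp(-(beta_true @ X)))
y = (rng.uniform(size=n) < p_true).astype(float)

beta = np.zeros(d)                        # beta <- 0 starting point
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(beta @ X)))
    W = np.diag(p * (1 - p))              # n x n diagonal weight matrix
    # Newton-Raphson / IRLS step: beta <- beta + (X W X^T)^{-1} X (y - p)
    beta_new = beta + np.linalg.solve(X @ W @ X.T, X @ (y - p))
    if np.allclose(beta_new, beta, atol=1e-10):
        beta = beta_new
        break
    beta = beta_new

# At convergence the score X(y - p) vanishes
p = 1.0 / (1.0 + np.exp(-(beta @ X)))
assert np.allclose(X @ (y - p), 0, atol=1e-6)
```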
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted error sum of squares<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
In our case, a weighted linear regression is applied to the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Note:''' Here <math>\underline{\beta}</math> is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. When it does not, only local convergence of the method can be proved, meaning that the iteration converges only if the initial point is close enough to the exact solution. In practice, choosing an appropriate initial value is rarely a problem; it is uncommon for an initial point to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving will resolve such convergence problems. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
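The pseudocode above can be sketched in plain Python (the examples elsewhere in these notes use MATLAB). The sketch assumes one predictor plus an intercept, so <math>XWX^T</math> is <math>2\times 2</math> and can be inverted by hand; the toy data are made up for illustration:<br />

```python
import math

def irls_logistic(x, y, n_iter=25):
    """Fit P(Y=1|x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)) by
    iteratively reweighted least squares (Newton-Raphson)."""
    b0, b1 = 0.0, 0.0                                   # beta <- 0
    for _ in range(n_iter):
        # P(x_i; beta) for every observation
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]               # diagonal of W
        # Newton step, written as beta_new = beta_old + (XWX^T)^{-1} X (y - p)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        a = sum(w)                                      # entries of X W X^T
        b = sum(wi * xi for wi, xi in zip(w, x))
        d = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * d - b * b
        b0 += (d * g0 - b * g1) / det                   # apply the 2x2 inverse
        b1 += (-b * g0 + a * g1) / det
    return b0, b1

# Hypothetical, non-separable toy data (with separable data the maximum
# likelihood estimates diverge, so the iteration would not settle).
x = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 1]
b0, b1 = irls_logistic(x, y)
```

The update uses the equivalent form <math>\underline{\beta}^{new}=\underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math> derived in the WLS subsection, which avoids forming <math>Z</math> and <math>W^{-1}</math> explicitly.<br />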
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, only the case <math>\,k=0</math> or <math>\,k=1</math> has been discussed so far).<br />
#Both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1 across classes.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>; it is guaranteed to range from 0 to 1 and to sum to 1 across classes.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math>.<br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly with the dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,2d+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically with the dimension.<br />
#Since logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to classify the data with logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression. The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K classes. In the multi-class problem we no longer have the complement <math>\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)</math> that is used in the denominator. The model is specified with K - 1 log-odds terms, where the class in the denominator (here the Kth) can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general, the posteriors are no longer complements of each other, as they are in the 2-class problem. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem, since we lose that simplification.<br />
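The posterior formulas above are easy to evaluate directly. A small Python sketch with hypothetical coefficients for a 3-class, 2-feature problem (class 3 serves as the reference class in the denominator):<br />

```python
import math

def multiclass_posteriors(betas, x):
    """Posteriors for K classes from K-1 coefficient vectors
    (the last class is the reference class in the denominator)."""
    scores = [sum(bj * xj for bj, xj in zip(beta, x)) for beta in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    post = [math.exp(s) / denom for s in scores]   # classes 1..K-1
    post.append(1.0 / denom)                       # reference class K
    return post

# Hypothetical coefficients beta_1, beta_2 for a K=3 problem.
betas = [[1.0, -0.5], [-0.2, 0.8]]
p = multiclass_posteriors(betas, [1.0, 2.0])
```

Whatever the coefficients, the returned posteriors always sum to 1, as noted above.<br />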
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of least squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to 1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like least squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the objective has no unique global minimum (it is not convex), and the algorithm does not converge to a unique hyperplane. If the classes are separable, the algorithm can be shown to converge to a separating hyperplane; the proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>x_{0}, x_{1},\ldots,x_{d}</math> represent the input data, <math>\sum_{j=1}^d \beta_{j}x_{j}</math> is a linear combination of these features with some weights, and <math>sgn(\sum_{j=1}^d \beta_{j}x_{j})</math> returns the sign of the linear combination. <br />
<br />
<br />
The perceptron seeks a linear boundary between two classes. Since it is linear, the decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The perceptron algorithm begins with a random hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. The algorithm attempts to find a <math>\underline\beta</math> by iteratively rotating the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
*'''Derivation''' ''of the distance between the decision boundary and the misclassified points''. <br />
<br />
:Let <math>\underline{x_{i}}</math> be the misclassified point. <br />
<br />
:Assume <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary. <br />
<br />
:Then <math>\underline{\beta}^T\underline{x_{1}}+\beta_{0} = \underline{\beta}^T\underline{x_{2}}+\beta_{0}</math><br />
<br />
:which implies that <math>\underline{\beta}^T (x_{1}-x_{2})=0</math>.<br />
<br />
::Since <math> \underline{\beta}^T</math> is a vector and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary then <math>\underline{\beta}</math> is a vector orthogonal to the decision boundary. <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> on the direction that is orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math>. <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers, where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a predetermined step in the direction opposite to the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this approach is the possibility of getting stuck in a local minimum. Rosenblatt proposed a simple algorithm to overcome this problem. To continue, the following derivatives are needed: <br />
<br />
<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
<math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] + \rho [y_{i}\underline{x_{i}}\ \ y_{i}] </math> <br />
<br />
where <math> \displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate". The algorithm continues until <math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
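The update rule above can be sketched as a plain Python training loop. The data, learning rate, and iteration cap are illustrative assumptions, with labels coded as +1/-1:<br />

```python
def perceptron(X, y, rho=0.1, max_iter=1000):
    """Rosenblatt's perceptron: cycle through the data and, for each
    misclassified point, move the boundary toward it by rho * y_i * x_i."""
    d = len(X[0])
    beta, beta0 = [0.0] * d, 0.0
    for _ in range(max_iter):
        mistakes = 0
        for xi, yi in zip(X, y):
            # y_i (beta^T x_i + beta0) <= 0 means x_i is misclassified
            if yi * (sum(b * xj for b, xj in zip(beta, xi)) + beta0) <= 0:
                beta = [b + rho * yi * xj for b, xj in zip(beta, xi)]
                beta0 += rho * yi
                mistakes += 1
        if mistakes == 0:        # converged: no misclassified points remain
            break
    return beta, beta0

# Linearly separable toy data (made up for illustration).
X = [[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]]
y = [1, 1, -1, -1]
beta, beta0 = perceptron(X, y)
```

Because this toy data set is separable, the loop stops once every point satisfies <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0})>0</math>; on overlapping classes it would run until the iteration cap, as noted below.<br />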
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if it is too large, the algorithm may "skip over" the minimum it is trying to find and possibly oscillate forever between the two points on either side of the minimum.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Notes==<br />
<references/></div>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input, <math>\,X \in \mathcal{X} </math><br />
by using the classification rule we can predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> gives the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:The ''''true error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:The ''''empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
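The empirical error rate is straightforward to compute. A minimal Python sketch with a hypothetical threshold classifier and made-up data:<br />

```python
def empirical_error_rate(h, X, Y):
    """L_h = (1/n) * sum of indicators that h misclassifies a point."""
    n = len(X)
    return sum(1 for xi, yi in zip(X, Y) if h(xi) != yi) / n

# Hypothetical classifier: label 1 if x > 0, else 0.
h = lambda x: 1 if x > 0 else 0
X = [-2.0, -0.5, 0.3, 1.0, 2.0]
Y = [0, 1, 1, 1, 0]                    # two points disagree with h
print(empirical_error_rate(h, X, Y))   # → 0.4
```

The true error rate, by contrast, cannot be computed from the training set alone; it is a probability over the whole distribution of <math>\,(X,Y)</math>.<br />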
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then to assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> that <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. To calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use the ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(X)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by the ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal in true error rate; that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(h) \le L(\overline{h})</math>. Intuitively speaking, this theorem says that we cannot do better than classifying <math>\,x\in \mathcal{X}</math> as <math>\,y</math> when the probability that <math>\,x</math> is of type <math>\,y</math> is greater than the probability of its being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0; that is, we predict that he will fail the course.<br />
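The worked example above can be checked in a few lines of Python. The likelihood values 0.05 and 0.2 are the ones implied by the quoted ratio 0.025/0.125 together with the equal priors; they are inferred here, not read from the table image:<br />

```python
def bayes_posterior(lik1, lik0, prior1=0.5, prior0=0.5):
    """r(x) = P(Y=1|X=x) via the Bayes formula for two classes."""
    return lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

# Implied likelihoods: P(X=(0,1,0)|Y=1) = 0.05, P(X=(0,1,0)|Y=0) = 0.2.
r = bayes_posterior(0.05, 0.2)
# r < 1/2, so the Bayes rule assigns class 0 (predicts a fail).
```

Changing the priors (e.g., if far more students pass than fail) would shift <math>\,r(X)</math> and could flip the decision.<br />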
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because in practice the quantities in the Bayes formula, such as <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, are generally unknown, so the value of <math>\,r(X)</math> cannot be computed. This makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the Naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first treats probability as something that changes based on observation, while the second treats probability as an objective quantity. In fact, they represent two different schools of statistics.<br />
<br />
In the history of statistics, there have been two major schools: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be assigned a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown random variables that have a given distribution, and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a 50% probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, one does not see the man (the object), but judges from photos (the labels) of this man whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier represents the optimal method, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Empirical risk minimization: choose a set of classifiers <math>\,H</math> and find <math>\,h^* \in H</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the covariances <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> since the LDA assumption gives both classes the same <math>\Sigma</math>, so the normalizing constants cancel.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form ax+b=0. <br />
<br />
This linear function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, the decision boundary is always linear. In <math>d</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the priors are equal (for example, equal numbers of samples from each class, so <math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_k</math> and <math>\,\mu_l</math>.<br />
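As a quick numerical check of this derivation (a NumPy sketch, not part of the original notes; the function name is our own), the log-ratio above is affine in <math>x</math>, and with equal priors it vanishes at the midpoint of the two means:<br />

```python
import numpy as np

def lda_log_ratio(x, mu_k, mu_l, Sigma, pi_k, pi_l):
    """Final expression of the derivation:
    log(pi_k/pi_l) - 0.5*(mu_k' S^-1 mu_k - mu_l' S^-1 mu_l
                          - 2 x' S^-1 (mu_k - mu_l)),
    which is affine (linear plus a constant) in x."""
    Si = np.linalg.inv(Sigma)
    return (np.log(pi_k / pi_l)
            - 0.5 * (mu_k @ Si @ mu_k - mu_l @ Si @ mu_l
                     - 2 * x @ Si @ (mu_k - mu_l)))
```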
<br />
===QDA===<br />
The idea is the same: find the boundary where the error rates for classification between the classes are equal, except that we drop the assumption that every class shares a common covariance <math>\,\Sigma</math>; each class keeps its own covariance <math>\,\Sigma_k</math>.<br />
<br />
<br />
The derivation follows that of LDA up to the point where the covariances no longer cancel.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form ax<sup>2</sup>+bx+c=0.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>; if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is:<br />
<br />
<math>\,h(x) = \arg\max_{k} \delta_k(x)</math> <br />
<br />
where <br />
<br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes:<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
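The theorem translates directly into code. Below is a NumPy sketch (illustrative only; the function names are our own, not from any library) of the two discriminant functions and the resulting classification rule:<br />

```python
import numpy as np

def delta_qda(x, mu, Sigma, pi):
    """Quadratic discriminant from the theorem:
    -0.5*log|Sigma_k| - 0.5*(x-mu_k)' Sigma_k^{-1} (x-mu_k) + log(pi_k)."""
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(pi))

def delta_lda(x, mu, Sigma, pi):
    """Linear discriminant for a shared covariance:
    x' Sigma^{-1} mu_k - 0.5*mu_k' Sigma^{-1} mu_k + log(pi_k)."""
    Si = np.linalg.inv(Sigma)
    return x @ Si @ mu - 0.5 * mu @ Si @ mu + np.log(pi)

def classify_bayes(x, mus, Sigmas, pis, delta):
    """h(x) = argmax_k delta_k(x)."""
    return int(np.argmax([delta(x, m, S, p)
                          for m, S, p in zip(mus, Sigmas, pis)]))
```

When all classes share one covariance, the two rules agree, since the terms dropped in the linear form are the same for every <math>k</math>.<br />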
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
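These estimates are straightforward to compute. The following NumPy sketch (illustrative only; the course examples use Matlab) computes the sample priors, class means, per-class ML covariances, and the pooled covariance:<br />

```python
import numpy as np

def estimate_lda_params(X, y):
    """Sample estimates used in practice: priors pi_k, class means mu_k,
    per-class ML covariances Sigma_k, and the pooled covariance Sigma."""
    classes = np.unique(y)
    n = len(y)
    pis, mus, Sigmas, nks = [], [], [], []
    for k in classes:
        Xk = X[y == k]
        nks.append(len(Xk))
        pis.append(len(Xk) / n)                 # pi_k = n_k / n
        mus.append(Xk.mean(axis=0))             # mu_k = mean of class k
        Sigmas.append(np.cov(Xk, rowvar=False, bias=True))  # divide by n_k
    # pooled estimate: Sigma = sum_r(n_r * Sigma_r) / sum_l(n_l)
    Sigma = sum(nk * S for nk, S in zip(nks, Sigmas)) / n
    return pis, mus, Sigmas, Sigma
```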
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
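A minimal NumPy sketch of this special case (illustrative, not from the notes): classification reduces to choosing the nearest mean after adjusting for the log prior.<br />

```python
import numpy as np

def classify_identity_cov(x, mus, pis):
    """With Sigma_k = I, delta_k reduces (up to a shared constant) to
    -0.5 * ||x - mu_k||^2 + log(pi_k): nearest mean, prior-adjusted."""
    scores = [-0.5 * np.sum((x - mu) ** 2) + np.log(pi)
              for mu, pi in zip(mus, pis)]
    return int(np.argmax(scores))
```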
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (for a symmetric positive semi-definite matrix such as <math>\, \Sigma_k </math>, the SVD coincides with the eigendecomposition, so <math>\, U=V</math>)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math> \, (x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^\top US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
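The whitening transform <math> \, x^* = S^{-\frac{1}{2}}U^\top x </math> can be sketched as follows (NumPy, illustrative only). Note that the Euclidean distance in the transformed space equals the Mahalanobis distance under <math>\,\Sigma</math> in the original space:<br />

```python
import numpy as np

def whiten(X, Sigma):
    """Map each row x to S^{-1/2} U^T x, where Sigma = U S U^T.
    The whitened data has identity covariance, reducing to Case 1."""
    U, s, _ = np.linalg.svd(Sigma)   # symmetric PSD: SVD = eigendecomposition
    return X @ U / np.sqrt(s)        # each row becomes (S^{-1/2} U^T x)^T
```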
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise, we would have to decide ahead of time which class a point belongs to in order to pick its transformation. All classes therefore need to have the same shape (covariance) for this method to be applicable, which is why it works for LDA.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose the two classes have different shapes and we transform each to a common shape. To classify a new data point we would need to choose which transformation to apply, but choosing, say, the transformation of class A already presumes that the point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need the pairwise differences between one reference class and the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> decision functions. Each one, <math>\,a^{T}x+b</math>, requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
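These two counts can be encoded directly (an illustrative sketch, not from the notes):<br />

```python
def lda_param_count(K, d):
    """LDA boundaries: K-1 functions a'x + b, each with d + 1 parameters."""
    return (K - 1) * (d + 1)

def qda_param_count(K, d):
    """QDA boundaries: each x'ax + b'x + c needs a symmetric d x d matrix
    (d(d+1)/2 entries) plus d plus 1, i.e. d(d+3)/2 + 1 per boundary."""
    return (K - 1) * (d * (d + 3) // 2 + 1)
```

Comparing the two for growing <math>d</math> shows QDA's quadratic growth in the number of parameters, which is exactly what the plot below illustrates.<br />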
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm assigns to each data point, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently; the Matlab help file on <code>princomp</code> gives the full details. Here we analyze the code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing it with the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following example performs PCA using princomp and SVD respectively, obtaining the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
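The same equivalence can be sketched in NumPy (illustrative only, mirroring what princomp does): center the data, take the SVD of the centred matrix, and use <math>\,V</math> as the coefficients.<br />

```python
import numpy as np

def pca_svd(X):
    """PCA the way princomp does it: center X (rows = observations),
    take the SVD of the centred matrix, and use V as the coefficients.
    Returns (V, scores) with scores = X_centred @ V."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt.T, Xc @ Vt.T
```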
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; since a symmetric <math>\,d \times d</math> covariance matrix has <math>\,\frac{d(d+1)}{2}</math> free entries, these extra estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>\,d</math>-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
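A minimal sketch of the trick (NumPy, with a made-up 1-D example; not part of the original notes): lift the data by appending squared coordinates, then apply a purely linear rule in the lifted space, which acts quadratically in the original space.<br />

```python
import numpy as np

def add_square_features(X):
    """Lift x = [x1..xd] to x* = [x1..xd, x1^2..xd^2], so a rule that is
    linear in x* is quadratic in x."""
    return np.hstack([X, X ** 2])

def linear_rule(X_star, w, b):
    """A purely linear classifier, sign(w'x* + b), in the lifted space."""
    return np.sign(X_star @ w + b)
```

For example, 1-D points labelled by <math>|x| > 2</math> are not linearly separable, but in the lifted space the linear rule with (hypothetical) weights <math>w = (0, 1)</math>, <math>b = -4</math> separates them exactly.<br />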
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math> then the mean of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points onto a one-dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\,\underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices; it is positive-definite (and hence invertible) provided at least one of <math>\,\Sigma_1, \Sigma_2</math> is positive-definite.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
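This proportionality can be verified numerically. Below is a minimal sketch in Python/NumPy (not part of the original lecture; the class means and shared covariance are made up for illustration). It checks that the eigenvector of <math>S_{w}^{-1}S_{B}</math> with the largest eigenvalue points in the same direction as <math>S_{w}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>:

```python
import numpy as np

# Hypothetical two-class setup: means and a shared covariance (illustration only)
mu1 = np.array([1.0, 1.0])
mu2 = np.array([5.0, 3.0])
Sigma = np.array([[1.0, 1.5],
                  [1.5, 3.0]])

S_W = Sigma + Sigma                    # within class covariance
S_B = np.outer(mu1 - mu2, mu1 - mu2)   # between class covariance

# Eigenvector of S_W^{-1} S_B with the largest eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
w_eig = eigvecs[:, np.argmax(eigvals.real)].real

# Direct formula: w proportional to S_W^{-1}(mu1 - mu2)
w_direct = np.linalg.solve(S_W, mu1 - mu2)
w_direct = w_direct / np.linalg.norm(w_direct)

# The two unit vectors agree up to sign
print(abs(w_eig @ w_direct))  # ~1.0
```

Since <math>S_{B}</math> has rank one, <math>S_{w}^{-1}S_{B}</math> has a single nonzero eigenvalue, which is why the two computations give the same direction.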
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathworks.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (Here <math>\mathbf{S}_{W,i}</math> is the unnormalized scatter of class <math>i</math>, so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain directly. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant; since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general form for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the total covariance <math>\mathbf{S}_{T}</math> is the sum of the within class covariance <math>\mathbf{S}_{W}</math><br />
and the between class covariance <math>\mathbf{S}_{B}</math>, we can denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
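The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be confirmed numerically using the unnormalized sums from the derivation above. A minimal sketch in Python/NumPy (synthetic three-class data, made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3-class data in 2 dimensions (illustration only)
classes = [rng.normal(loc=m, size=(50, 2)) for m in ([0, 0], [3, 1], [1, 4])]

X = np.vstack(classes)
mu = X.mean(axis=0)                                    # total mean vector

# Total, within-class, and between-class scatter (unnormalized sums)
S_T = (X - mu).T @ (X - mu)
S_W = sum((c - c.mean(0)).T @ (c - c.mean(0)) for c in classes)
S_B = sum(len(c) * np.outer(c.mean(0) - mu, c.mean(0) - mu) for c in classes)

print(np.allclose(S_T, S_W + S_B))  # True
```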
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
The two expressions are clearly similar. In particular, when the two classes have equal size, <math>\mathbf{\mu}_{1}-\mathbf{\mu} = -(\mathbf{\mu}_{2}-\mathbf{\mu})</math>, and both reduce to a multiple of <math>(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}</math>; the scalar factor does not affect the direction of the solution.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which actually is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\leq k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
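The whole multi-class procedure can be sketched in a few lines of Python/NumPy (not the course's code; the synthetic class layout and seed are made up for illustration): form <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math>, take the eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> with the largest <math>k-1</math> eigenvalues, and project.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n_i = 3, 4, 60   # 3 hypothetical classes in 4 dimensions
classes = [rng.normal(loc=rng.normal(scale=3.0, size=d), size=(n_i, d))
           for _ in range(k)]

mu = np.vstack(classes).mean(axis=0)
S_W = sum((c - c.mean(0)).T @ (c - c.mean(0)) for c in classes)
S_B = sum(len(c) * np.outer(c.mean(0) - mu, c.mean(0) - mu) for c in classes)

# Eigenvectors of S_W^{-1} S_B, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs.real[:, order[:k - 1]]     # d x (k-1) transformation matrix

Z = np.vstack(classes) @ W             # each row is z_i = W^T x_i
print(W.shape, Z.shape)                # (4, 2) (180, 2)
```

Note that only <math>k-1</math> eigenvalues are (numerically) nonzero, which is consistent with <math>rank(\mathbf{S}_{B})\leq k-1</math>.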
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector of coefficients.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
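The closed-form solution and the hat matrix can be checked in a few lines. A minimal sketch in Python/NumPy (random data, made up for illustration), verifying that <math>\mathbf{H}</math> is symmetric and idempotent, as any projection matrix must be:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 3
# Design matrix with a leading column of ones (intercept)
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
y_hat = H @ y                                  # fitted values

print(np.allclose(H, H.T), np.allclose(H @ H, H), np.allclose(y_hat, X @ beta_hat))
```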
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by transposing the projected data and appending a row of ones.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modeling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1.<br />
<br />
This logistic regression model for the two class case is defined as: <br/><br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br/> <br />
[[File:Picture1.png |frame|center]]<br />
Then we have that <math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
[[File:Picture2.png |frame|center]]<br />
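By construction the two posterior probabilities are each in (0, 1) and sum to one for any <math>\underline{\beta}</math> and <math>\underline{x}</math>. A quick check in Python (the coefficients and input here are made up for illustration):

```python
import numpy as np

beta = np.array([0.5, -1.2])       # hypothetical coefficients
x = np.array([2.0, 1.0])           # hypothetical input

z = beta @ x                       # beta^T x = -0.2
p1 = np.exp(z) / (1 + np.exp(z))   # P(Y=1 | X=x)
p0 = 1 / (1 + np.exp(z))           # P(Y=0 | X=x)

print(round(p1, 4))                # 0.4502
print(np.isclose(p0 + p1, 1.0))    # True
```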
<br />
Logistic regression fits a distribution to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood: the maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, taking the log of both sides, we get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=\underline{x}_i)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math><br />
* <math>P(Y=0 | X=\underline{x}_i)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{(exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math> you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], it's a very useful website including a Matrix Reference Manual that you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
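The update above can be sketched end-to-end in Python/NumPy. This is not the course's code; the synthetic data, seed, and "true" coefficients are made up for illustration, and <math>X</math> is stored as a <math>{d}\times{n}</math> matrix to match the notation above:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 2, 200
X = rng.normal(size=(d, n))             # d x n input matrix
beta_true = np.array([1.0, -2.0])       # hypothetical "true" coefficients
Y = (rng.random(n) < 1 / (1 + np.exp(-beta_true @ X))).astype(float)

beta = np.zeros(d)                      # starting value
for _ in range(25):
    P = 1 / (1 + np.exp(-beta @ X))     # vector of P(x_i; beta)
    W = np.diag(P * (1 - P))            # n x n diagonal weight matrix
    step = np.linalg.solve(X @ W @ X.T, X @ (Y - P))
    beta = beta + step                  # Newton-Raphson update
    if np.linalg.norm(step) < 1e-10:    # stop when beta barely changes
        break

print(beta)  # roughly recovers beta_true as n grows
```

At convergence the gradient <math>X(\underline{Y}-\underline{P})</math> is (numerically) zero, which is exactly the maximum-likelihood condition derived earlier.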
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\underline{\beta}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} = \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^T W (Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted error sum of squares<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
In the logistic regression iteration, this weighted linear regression is applied to the iteratively recomputed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case, but it does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. When it does not, only local convergence of the method can be proved, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: it is uncommon for an initial guess to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving will resolve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer <br />
2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
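The pseudo code above can be transcribed into a short Python/NumPy sketch. Variable names follow the notes, the data layout is row-per-observation, and no claim is made that this matches any particular library's implementation:<br />

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=100):
    """Two-class logistic regression via the IRLS steps above.

    X : (n, d) data matrix, rows are observations (prepend a column of
        ones for an intercept term)
    y : (n,) labels in {0, 1}
    """
    beta = np.zeros(X.shape[1])                  # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # step 3: P(x_i; beta)
        w = p * (1.0 - p)                        # step 4: diagonal entries of W
        z = X @ beta + (y - p) / w               # step 5: adjusted response Z
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:  # step 7: convergence check
            return beta_new
        beta = beta_new                          # step 6: update and iterate
    return beta                                  # may not converge if classes are separable
```

Note that for perfectly separable classes the weights <math>w_{i,i}</math> shrink toward zero and the iteration diverges, which is the convergence caveat discussed above.<br />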
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1, nor to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
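The two differences can be checked numerically; a small Python sketch with made-up coefficients:<br />

```python
import numpy as np

beta0, beta = 0.5, 2.0                     # arbitrary illustrative coefficients
x = np.linspace(-5, 5, 101)

linear_p = beta0 + beta * x                # linear model for P(Y=1|X=x)
logistic_p = 1.0 / (1.0 + np.exp(-(beta0 + beta * x)))

# The linear "probability" escapes [0, 1]; the logistic one never does.
linear_escapes = (linear_p.min() < 0) or (linear_p.max() > 1)
logistic_bounded = bool(np.all((logistic_p > 0) & (logistic_p < 1)))
```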
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is <math>d</math>-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
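The parameter counts in points 5 and 6 can be written down as a quick sketch:<br />

```python
def n_params_logistic(d):
    """Logistic regression: one coefficient per dimension, so d parameters
    (linear growth in the dimension)."""
    return d

def n_params_lda(d):
    """LDA: two d-dimensional means, d(d+1)/2 entries of the shared symmetric
    covariance, and two class priors: 2d + d(d+1)/2 + 2 = (d^2 + 5d + 4)/2."""
    return 2 * d + d * (d + 1) // 2 + 2
```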
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to apply logistic regression to classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \ge 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
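The fitted rule is easy to apply outside MATLAB as well; a Python transcription using the coefficients reported above:<br />

```python
B = [0.1861, -5.5917, -3.0547]   # intercept and coefficients from mnrfit above

def classify_2_3(x1, x2):
    """Return class 1 if 0.1861 - 5.5917*x1 - 3.0547*x2 >= 0, else class 2."""
    score = B[0] + B[1] * x1 + B[2] * x2
    return 1 if score >= 0 else 2
```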
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K classes. In the multi-class problem we no longer have the complement <math>P(Y=1|X=x)=1-P(Y=0|X=x)</math>. The model is specified with K - 1 terms, where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general the posteriors are no longer complements of each other, as is true in the 2-class problem. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem, since we don't have the same simplification.<br />
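A minimal Python sketch of these K-class posteriors (the coefficient vectors below are arbitrary illustrations):<br />

```python
import numpy as np

def multiclass_posteriors(x, betas):
    """Posteriors P(Y=i|X=x) for a K-class logistic model, given the K-1
    coefficient vectors beta_1 .. beta_{K-1} (class K is the reference)."""
    scores = np.array([b @ x for b in betas])       # beta_i^T x, i = 1..K-1
    denom = 1.0 + np.exp(scores).sum()
    return np.append(np.exp(scores) / denom,        # classes 1..K-1
                     1.0 / denom)                   # reference class K
```

The returned vector always sums to 1, as noted above, even though the individual posteriors are no longer complements of each other.<br />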
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of least squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to 1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like least squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the objective has no unique global minimum (it is not convex), and the algorithm does not converge to a unique hyperplane. If the classes are separable, then the algorithm is shown to converge to a solution. The proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>x_{0}, x_{1},\ldots,x_{d}</math> represent the input data (with <math>x_{0}=1</math> for the bias term), <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of some weights of these features, and <math>sgn(\sum_{j=0}^d \beta_{j}x_{j})</math> returns the sign of the linear combination. <br />
<br />
<br />
Perceptron seeks a linear function between two classes. Since it is linear, the decision boundary can be represented by<math> \underline{\beta}^T\underline{x}+\beta_{0}. </math> The Perceptron algorithm begins with a random hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0}. </math> The goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find a <math>\underline\beta</math> by iteratively rotating the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
*'''Derivation''' ''of the distance between the decision boundary and the misclassified points''. <br />
<br />
:Let <math>\underline{x_{i}}</math> be the misclassified point. <br />
<br />
:Assume <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary. <br />
<br />
:Then <math>\underline{\beta}^T\underline{x_{1}}+\beta_{0} = \underline{\beta}^T\underline{x_{2}}+\beta_{0}</math><br />
<br />
:which implies that <math>\underline{\beta}^T (x_{1}-x_{2})=0</math>.<br />
<br />
::Since <math> \underline{\beta}^T</math> is a vector and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary then <math>\underline{\beta}</math> is a vector orthogonal to the decision boundary. <br />
<br />
Then the projection of the vector <math>\underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to the scaling factor <math>\|\underline{\beta}\|</math>, which we may take to be 1). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that takes a predetermined step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. Rosenblatt proposed a simple algorithm to overcome this problem. To continue, the following derivatives are needed: <br />
<br />
<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
<math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] + \rho [y_{i}\underline{x_{i}}\ \ y_{i}] </math> <br />
<br />
where <math> \displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate". The algorithm continues until <math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points. <br />
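The update rule above is simple to implement; a Python sketch of the perceptron algorithm (labels are assumed to be in {-1, +1}):<br />

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_iter=1000):
    """Rosenblatt's perceptron, following the update rule above.

    X : (n, d) data matrix, y : (n,) labels in {-1, +1}.
    Returns (beta, beta0). If the classes are not linearly separable,
    the loop simply stops after max_iter passes without converging.
    """
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(max_iter):
        updated = False
        for i in range(len(y)):
            if y[i] * (X[i] @ beta + beta0) <= 0:   # misclassified (or on the boundary)
                beta = beta + rho * y[i] * X[i]     # [beta beta0] += rho [y_i x_i  y_i]
                beta0 = beta0 + rho * y[i]
                updated = True
        if not updated:                             # no misclassified points: done
            break
    return beta, beta0
```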
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if the value is too large, the algorithm may “skip over” the minimum it is trying to find and possibly oscillate forever between the last two points, before and after the minimum.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Notes==<br />
<references/></div>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classfication-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input <math>\,X \in \mathcal{X} </math>,<br />
we can use the classification rule to predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., color, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is the indicator function: <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
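The empirical error rate has a direct one-line implementation; a Python sketch with a toy threshold classifier:<br />

```python
def empirical_error_rate(h, X, Y):
    """L_hat(h) = (1/n) * sum_i I(h(X_i) != Y_i): the fraction of training
    points that the classifier h gets wrong."""
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(Y)
```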
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then to assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximal over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use the ''Bayes formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, and define <math>\,r(X)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by the ''Bayes formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality theorem''': The Bayes rule is optimal in true error rate, that is, for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(\overline{h}) \ge L(h)</math>. Intuitively speaking, this theorem says that we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of <math>\,x</math> being of type <math>\,y</math> is greater than the probability of it being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
*If the student’s GPA > 3.0 (G)<br />
*If the student had a strong math background (M)<br />
*If the student is a hard worker (H)<br />
*If the student passed or failed the course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
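The arithmetic can be verified directly. The class-conditional likelihoods in the sketch below (0.05 and 0.2) are inferred from the numerator 0.025 and denominator 0.125 shown above, given the priors of 0.5:<br />

```python
def bayes_posterior(lik1, lik0, prior1=0.5, prior0=0.5):
    """r(X) = P(Y=1|X=x) computed via Bayes' formula."""
    return lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

r = bayes_posterior(0.05, 0.2)          # likelihoods inferred from 0.025 / 0.125
prediction = 1 if r > 0.5 else 0        # Bayes rule: r = 0.2 < 1/2, so predict fail (0)
```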
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods. The reason is that in the Bayes equation discussed before, the quantities <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math> are generally unknown, so the value of <math>\,r(X)</math> cannot be calculated, which makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first one treats probability as changing based on observation, while the second treats probability as an objective existence. In fact, they represent two different schools of statistics.<br />
<br />
During the history of statistics, there have been two major classification approaches: Bayesian and frequentist. The two methods represent two different ways of thinking and hold different views of how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict the weather of tomorrow, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, such as a <math>\,50\%</math> probability of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian method, one first sees the man (object), and then judges whether his name is Jack (label). In the frequentist method, by contrast, one does not see the man (object), but instead sees photos (label) of him in order to judge whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier represents the optimal method, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\,H</math> and find <math>\,h^* \in H</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation doesn't work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional densities of most data are not known. Some estimate of these must be made if we want to classify new data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same variance <math>\,\Sigma</math> equal to the mean variance of <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function of <math>x</math>, with general form <math>\,a^{T}x+b=0</math>. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, the decision boundary is always linear. In <math>d</math> dimensions, the regions are separated by hyperplanes. <br />
<br />
In the special case where the number of samples from each class are equal (<math>\,\pi_k=\pi_l</math>), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math><br />
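As an illustration (not part of the original notes), the boundary coefficients can be computed directly; a minimal numpy sketch, assuming the true <math>\,\mu_k, \mu_l, \Sigma, \pi_k, \pi_l</math> are known (the function name is ours):<br />

```python
import numpy as np

def lda_boundary(mu_k, mu_l, Sigma, pi_k, pi_l):
    """Return (a, b) such that the LDA boundary between classes k and l
    is the set of x with a @ x + b = 0, following the derivation above."""
    Sigma_inv = np.linalg.inv(Sigma)
    a = Sigma_inv @ (mu_k - mu_l)                       # coefficient of x
    b = (np.log(pi_k / pi_l)
         - 0.5 * (mu_k @ Sigma_inv @ mu_k - mu_l @ Sigma_inv @ mu_l))
    return a, b

mu_k = np.array([0.0, 0.0])
mu_l = np.array([2.0, 2.0])
Sigma = np.eye(2)
a, b = lda_boundary(mu_k, mu_l, Sigma, 0.5, 0.5)
# With equal priors the midpoint (1, 1) lies exactly on the boundary:
print(a @ np.array([1.0, 1.0]) + b)  # prints 0.0
```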
<br />
===QDA===<br />
The concept is the same: find the boundary where the classification error rates between classes are equal, except that the assumption of a common covariance <math>\,\Sigma</math> across all clusters is removed; each class keeps its own covariance <math>\,\Sigma_k</math>.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes, with general form <math>\,x^{T}ax+b^{T}x+c=0</math>.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>; if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is:<br />
<br />
<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
<br />
where <br />
<br />
:::<math> \,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes:<br />
<br />
:::<math> \,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
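An illustrative numpy sketch of this rule (not course code): compute <math>\,\delta_k(x)</math> for each class with the quadratic formula and take the argmax; with a shared <math>\,\Sigma</math> the common terms cancel, so the same code realizes the linear rule.<br />

```python
import numpy as np

def delta_quadratic(x, mu, Sigma, pi):
    """Quadratic discriminant score delta_k(x) from the theorem above."""
    d = x - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * logdet - 0.5 * d @ np.linalg.inv(Sigma) @ d + np.log(pi)

def classify_bayes(x, mus, Sigmas, pis):
    """h(x) = argmax_k delta_k(x)."""
    scores = [delta_quadratic(x, m, S, p) for m, S, p in zip(mus, Sigmas, pis)]
    return int(np.argmax(scores))

mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
pis = [0.5, 0.5]
print(classify_bayes(np.array([2.5, 2.8]), mus, Sigmas, pis))  # prints 1
```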
<br />
===In practice===<br />
In practice we do not know the true values of <math>\,\pi_k,\mu_k,\Sigma_k</math>, so we replace them with their sample estimates, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
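These estimates can be computed directly; an illustrative numpy sketch (helper name ours, not course code):<br />

```python
import numpy as np

def estimate_lda_params(X, y):
    """Sample estimates of pi_k, mu_k, Sigma_k and the pooled Sigma.
    X is n x d with one observation per row; y holds integer class labels."""
    n = len(y)
    classes = np.unique(y)
    pis, mus, Sigmas, nks = {}, {}, {}, {}
    for k in classes:
        Xk = X[y == k]
        nks[k] = len(Xk)
        pis[k] = nks[k] / n
        mus[k] = Xk.mean(axis=0)
        D = Xk - mus[k]
        Sigmas[k] = D.T @ D / nks[k]          # ML estimate (divide by n_k)
    # pooled covariance: the n_k-weighted average of the per-class estimates
    pooled = sum(nks[k] * Sigmas[k] for k in classes) / n
    return pis, mus, Sigmas, pooled

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
y = np.array([0, 0, 1, 1])
pis, mus, Sigmas, pooled = estimate_lda_params(X, y)
```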
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around the center <math>\mu</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (the SVD of a symmetric matrix coincides with its eigendecomposition, so <math>\, U=V</math>; here <math>\, \Sigma_k </math> is symmetric)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math> \, (x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^\top US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
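The transformation <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math> can be sketched in numpy (illustrative, function name ours); since <math>\,\Sigma_k</math> is symmetric, <code>eigh</code> plays the role of the SVD here:<br />

```python
import numpy as np

def whiten(Sigma):
    """Return the transform T = S^{-1/2} U^T from Sigma = U S U^T,
    so that T Sigma T^T = I (the transformed data is spherical)."""
    S, U = np.linalg.eigh(Sigma)              # symmetric: eigh gives U and S
    return np.diag(S ** -0.5) @ U.T

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
T = whiten(Sigma)
print(np.round(T @ Sigma @ T.T, 10))          # identity, up to rounding
```

After applying <code>T</code> to all points and means, Euclidean distance plays the role of the Mahalanobis distance, as in Case 1.<br />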
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to in order to pick its transformation. All classes therefore need to have the same shape (covariance) for classification to be applicable using this method, which is exactly the LDA setting.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes and we transform each to a common shape. Given a new data point, which transformation should we apply to it? If we use the transformation belonging to class A, we have already assumed that the point belongs to class A, which is precisely what we are trying to decide.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare a given class against the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> differences in total. Each difference <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
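A quick numeric check of these counts (illustrative Python, function names ours):<br />

```python
def lda_params(K, d):
    """(K-1) boundaries, each with d+1 parameters."""
    return (K - 1) * (d + 1)

def qda_params(K, d):
    """(K-1) boundaries, each with d(d+3)/2 + 1 parameters."""
    return (K - 1) * (d * (d + 3) // 2 + 1)

# e.g. with K = 2 classes in d = 10 dimensions:
print(lda_params(2, 10), qda_params(2, 10))   # prints 11 66
```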
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks each data point belongs to, and <code>coeff</code>, which contains information about the line that the algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function; here we analyze its code to see how it differs from the SVD method. The following is the code of princomp with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SC<br />
% ORES, the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
 [m,n] = size(x); % get the number of rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
 if (r < n)<br />
 latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients of the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that <code>y</code> equals <code>score</code> and <code>v</code> equals <code>U</code> (up to the sign of each column, since SVD directions are determined only up to sign).<br />
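The same check can be written in numpy (an illustrative translation of the Matlab steps; a random matrix stands in for the 2_3 data here):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 400))       # stand-in for 2_3: columns are observations

Xc = X - X.mean(axis=1, keepdims=True)   # center by subtracting variable (row) means
# SVD route: Xc' = U d V'; the coefficients are the columns of V
U, d, Vt = np.linalg.svd(Xc.T, full_matrices=False)
scores_svd = Xc.T @ Vt.T                 # representation in principal-component space
scores_ud = U * d                        # equivalently, U times diag(d)
print(np.allclose(scores_svd, scores_ud))  # prints True
```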
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimations make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^TVx + \underline{w}^Tx</math> that we cannot estimate with a linear method, where <math>\,V</math> is the diagonal matrix with entries <math>\,v_1,\dots,v_d</math> (so there are no cross terms).<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
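The augmentation itself is a one-liner; a numpy sketch (illustrative, with a hypothetical helper name <code>augment_quadratic</code>):<br />

```python
import numpy as np

def augment_quadratic(X):
    """Append squared features: each row [x1,...,xd] becomes
    [x1,...,xd, x1^2,...,xd^2], so a linear rule in the new space
    is a (cross-term-free) quadratic rule in the original space."""
    return np.hstack([X, X ** 2])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(augment_quadratic(X))
```

Any linear classifier can then be trained on the augmented matrix, exactly as in the Matlab example below.<br />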
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
 >> library(MASS)<br />
 >> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively, the points of each class form a cloud around the class mean, with each class possibly having a different size. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the different size of each class, represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves the maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \cdot \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two covariance matrices; assuming at least one of them is positive definite, it has an inverse.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
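The result above can be checked numerically. The following is a small illustrative sketch in Python/numpy (the course examples use Matlab) that computes the FDA direction as <math>S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math>; the two Gaussian classes and their parameters are arbitrary toy data.<br />

```python
import numpy as np

def fda_direction(X1, X2):
    """Two-class FDA direction, proportional to S_W^{-1}(mu_1 - mu_2).
    X1, X2: arrays of shape (n_i, d), one observation per row."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class covariance: sum of the two class covariances
    Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
    w = np.linalg.solve(Sw, mu1 - mu2)   # solves S_W w = (mu1 - mu2)
    return w / np.linalg.norm(w)         # normalize; only the direction matters

# toy illustration with two Gaussian clouds (parameters are arbitrary)
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 0.5, size=(100, 2))
X2 = rng.normal([3.0, 1.0], 0.5, size=(100, 2))
w = fda_direction(X1, X2)
# separation of the projected class means along w
sep = abs(X1.mean(axis=0) @ w - X2.mean(axis=0) @ w)
```

Projecting onto this direction keeps the two projected class means well separated relative to the within-class spread.<br />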
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Using PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Using FDA to find the principal component and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathworks.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> is the (unnormalized) scatter matrix of class <math>i</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. Leaving the scatter matrices unnormalized makes the decomposition of the total scatter derived below exact.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain directly. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since the first term in the last line is the within class covariance <math>\mathbf{S}_{W}</math>,<br />
we can define the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
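The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically. Below is an illustrative Python/numpy check using unnormalized scatter matrices; the class sizes and centers are arbitrary.<br />

```python
import numpy as np

# three classes of 3-dimensional data (sizes and centers are arbitrary)
rng = np.random.default_rng(1)
groups = [rng.normal(c, 1.0, size=(n, 3)) for c, n in [(0.0, 30), (2.0, 40), (5.0, 50)]]

X = np.vstack(groups)
mu = X.mean(axis=0)                                   # total mean
# unnormalized within-class and between-class scatter matrices
Sw = sum((G - G.mean(0)).T @ (G - G.mean(0)) for G in groups)
Sb = sum(len(G) * np.outer(G.mean(0) - mu, G.mean(0) - mu) for G in groups)
St = (X - mu).T @ (X - mu)                            # total scatter
```

Up to floating-point error, `St` equals `Sw + Sb`, matching the derivation above.<br />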
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T}<br />
\\ & =<br />
((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T}<br />
\\ & =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}-(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
Clearly, the two expressions are very similar.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution for this question is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which is actually a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that the <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, thus set the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\mathbf{k-1}</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\le k-1</math>.<br />
<br />
Therefore, the solution to this problem is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
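The multi-class procedure above can be sketched as follows in Python/numpy (illustrative only; the toy data and class layout are arbitrary): form <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math>, then keep the eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> with the <math>k-1</math> largest eigenvalues.<br />

```python
import numpy as np

def fda_multiclass(X, y):
    """X: (n, d) data, y: (n,) integer labels; returns W of shape (d, k-1)."""
    classes = np.unique(y)
    k, d = len(classes), X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        muc = Xc.mean(axis=0)
        Sw += (Xc - muc).T @ (Xc - muc)                  # within-class scatter
        Sb += len(Xc) * np.outer(muc - mu, muc - mu)     # between-class scatter
    # eigenvectors of S_W^{-1} S_B, sorted by decreasing eigenvalue
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[: k - 1]].real

# toy data: three classes in 4 dimensions
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 1.0, size=(40, 4)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 40)
W = fda_multiclass(X, y)       # projection matrix, shape (4, 2)
```

With <math>k=3</math> classes in <math>d=4</math> dimensions, the data are projected onto a <math>k-1=2</math> dimensional space.<br />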
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{d}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the least squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
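As a quick numerical illustration (a Python/numpy sketch on arbitrary synthetic data), the closed-form solution <math>\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}</math> recovers the true coefficients on nearly noiseless data, and the hat matrix is idempotent (<math>\mathbf{H}\mathbf{H} = \mathbf{H}</math>):<br />

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # first column of 1s
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=n)              # small noise

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y, via solve
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix X (X^T X)^{-1} X^T
y_hat = H @ y                                  # fitted values
```

Using `np.linalg.solve` instead of forming the explicit inverse is the numerically preferred way to evaluate these formulas.<br />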
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, so that x is 3 by 400.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modeling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1.<br />
<br />
This logistic regression model for the two class case is defined as: <br/><br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br/> <br />
[[File:Picture1.png |frame|center]]<br />
Then we have that <math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
[[File:Picture2.png |frame|center]]<br />
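As a quick sanity check (an illustrative Python/numpy snippet), the two posterior probabilities of the logistic model are each strictly between 0 and 1 and sum to one for any value of <math>\underline{\beta}^T \underline{x}</math>:<br />

```python
import numpy as np

t = np.linspace(-10.0, 10.0, 101)        # values of beta^T x
p1 = np.exp(t) / (1.0 + np.exp(t))       # P(Y=1 | X=x)
p0 = 1.0 / (1.0 + np.exp(t))             # P(Y=0 | X=x)
```
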
<br />
Logistic regression fits a distribution to the data; the fitting is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, take the log of both sides, we get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i})))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i})))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T\exp(\underline{\beta}^T\underline{x}_i)}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math> you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], it's a very useful website including a Matrix Reference Manual that you can find information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares (with <math>X</math> the <math>d\times n</math> input matrix) finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\underline{\beta}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^T W(Z-X^T\underline{\beta})</math><br />
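The Newton-Raphson update above can be sketched in code. The following is a hedged Python/numpy implementation of the iteration on arbitrary synthetic data, following the notes' convention that <math>X</math> is <math>d \times n</math> with one column per observation:<br />

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Newton-Raphson / IRLS for logistic regression.
    X: d x n input matrix (one column per observation), y: n labels in {0,1}."""
    d, n = X.shape
    beta = np.zeros(d)                           # start from beta = 0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X.T @ beta))    # P(x_i; beta) for each i
        W = np.diag(p * (1.0 - p))               # n x n diagonal weight matrix
        # Newton step: beta_new = beta_old + (X W X^T)^{-1} X (y - p)
        beta = beta + np.linalg.solve(X @ W @ X.T, X @ (y - p))
    return beta

# toy 1-feature problem with an intercept row of ones (data are arbitrary)
rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = (x + 0.5 * rng.normal(size=200) > 0).astype(float)
X = np.vstack([np.ones(200), x])                 # d = 2, n = 200
beta = irls_logistic(X, y)
acc = float((((X.T @ beta) > 0) == (y > 0.5)).mean())
```

Each iteration solves a weighted least squares problem; in practice the loop is stopped once <math>\underline{\beta}^{new}</math> is sufficiently close to <math>\underline{\beta}^{old}</math> rather than after a fixed number of steps.<br />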
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted error sum of squares<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
A weighted linear regression is then performed on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we instead construct the model as <math>\beta_0+ \underline{\beta}^T\underline{x}</math>, then, similarly to linear regression, the augmented coefficient vector will be <math>(d+1)\times{1}</math>.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration procedure in this case. However, this does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. In the case that it does not, we can only prove local convergence of the method, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem; it is uncommon for an initial guess to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
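The pseudo code above can be sketched in Python (rather than MATLAB) for the simplest case, a one-dimensional model <math>\beta x</math> with no intercept, so every matrix in the update collapses to a scalar. The data below are invented; the classes overlap, so the maximum-likelihood <math>\beta</math> is finite:

```python
import math

def irls_logistic_1d(x, y, n_iter=25):
    """Fit P(Y=1|x) = exp(b*x)/(1+exp(b*x)) by iteratively
    reweighted least squares (Newton-Raphson), starting at b = 0."""
    b = 0.0
    for _ in range(n_iter):
        p = [math.exp(b * xi) / (1.0 + math.exp(b * xi)) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]        # diagonal entries of W
        grad = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        hess = sum(wi * xi * xi for wi, xi in zip(w, x))
        b = b + grad / hess                      # one Newton step (d = 1)
    return b

x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [0, 0, 1, 0, 1, 1]                           # overlapping classes
b = irls_logistic_1d(x, y)
p_pos = math.exp(b * 2.0) / (1.0 + math.exp(b * 2.0))
print(b, p_pos)
```

The fitted slope is positive, so points far to the right get posterior probability of class 1 well above one half.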
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far considered only the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:''' For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}+\beta_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to lie between 0 and 1 and to sum to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and logistic regression with an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to classify the data with logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2 \geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is the decision boundary by logistic regression. The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the sum of the posteriors is 1. In general, however, the posteriors are no longer complements of each other as is true in the 2-class problem. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem, since we don't have the same simplification.<br />
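These posterior formulas can be sketched in a few lines of Python (the coefficient vectors below are hypothetical, not fitted to any data):

```python
import math

def multiclass_posteriors(betas, x):
    """Posteriors for a K-class logistic model given the K-1
    coefficient vectors beta_1, ..., beta_{K-1}; class K is the
    arbitrarily chosen reference class in the denominator."""
    scores = [math.exp(sum(bj * xj for bj, xj in zip(beta, x)))
              for beta in betas]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]  # last entry: class K

# Hypothetical coefficients for a 3-class problem in 2 dimensions.
betas = [[1.0, -0.5], [-0.3, 0.8]]
post = multiclass_posteriors(betas, [0.2, 0.4])
print(post, sum(post))
```

Whatever the coefficients, the K posteriors lie in (0,1) and sum to 1 by construction.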
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of least squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of the data point's features and assign a label of 1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like least squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries; Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Because of the iterative nature of the solution, the problem is not convex, has no unique global minimum, and does not converge to a unique hyperplane. If the classes are separable, the algorithm can be shown to reach a separating hyperplane in a finite number of steps; the proof of this is known as the ''perceptron convergence theorem''. For overlapping classes, however, convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
Perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>x_{0}, x_{1},\ldots,x_{d}</math> represent the input data, <math>\sum_{j=1}^d \beta_{j}x_{j}</math> is a linear combination of these features with weights <math>\beta_{j}</math>, and <math>sgn(\sum_{j=1}^d \beta_{j}x_{j})</math> returns the sign of the linear combination. <br />
<br />
<br />
Perceptron seeks a linear function separating two classes. Since it is linear, the decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The Perceptron algorithm begins with a random hyperplane and aims to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find a <math>\underline\beta</math> by iteratively rotating the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
*'''Derivation''' ''of the distance between the decision boundary and the misclassified points''. <br />
<br />
:Let <math>\underline{x_{i}}</math> be the misclassified point. <br />
<br />
:Assume <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary. <br />
<br />
:Then <math>\underline{\beta}^T\underline{x_{1}}+\beta_{0} = \underline{\beta}^T\underline{x_{2}}+\beta_{0}</math><br />
<br />
:which implies that <math>\underline{\beta}^T (x_{1}-x_{2})=0</math>.<br />
<br />
::Since <math> \underline{\beta}^T</math> is a vector and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary then <math>\underline{\beta}</math> is a vector orthogonal to the decision boundary. <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (taking <math>\underline{\beta}</math> to be of unit length, so that this is the signed length of the projection). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}</math>. <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> are either both positive or both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative, so the product is negative for a misclassified point. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers, where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this approach is the possibility of getting stuck in a local minimum. Rosenblatt proposed a simple algorithm to overcome this problem. To continue, the following derivatives are needed: <br />
<br />
<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
<math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] + \rho [y_{i}\underline{x_{i}}\ \ y_{i}] </math> <br />
<br />
where <math> \displaystyle\rho</math> is the magnitude of each step, called the "learning rate" or the "convergence rate". The algorithm continues until <math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] </math> <br />
or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
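The update rule above can be sketched as a short Python loop. The toy data are invented and linearly separable; for simplicity the sketch starts from the zero hyperplane rather than a random one:

```python
def train_perceptron(X, y, rho=1.0, max_iter=100):
    """Rosenblatt's perceptron on 2-d data with labels +1/-1:
    for each misclassified point, add rho*y_i*x_i to beta and
    rho*y_i to beta0; stop when nothing is misclassified."""
    beta, beta0 = [0.0, 0.0], 0.0
    for _ in range(max_iter):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (beta[0]*xi[0] + beta[1]*xi[1] + beta0) <= 0:  # misclassified
                beta = [beta[0] + rho*yi*xi[0], beta[1] + rho*yi*xi[1]]
                beta0 = beta0 + rho*yi
                updated = True
        if not updated:          # no misclassified points: converged
            break
    return beta, beta0

X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -1.0), (-3.0, -2.0)]  # invented, separable
y = [1, 1, -1, -1]
beta, beta0 = train_perceptron(X, y)
errors = sum(yi * (beta[0]*xi[0] + beta[1]*xi[1] + beta0) <= 0
             for xi, yi in zip(X, y))
print(beta, beta0, errors)
```

On separable data like this the loop terminates with zero misclassified points; on overlapping data it would run until `max_iter`, as the convergence discussion above warns.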
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm also depends on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> can yield quicker convergence, but if the value is too large, the algorithm may "skip over" the minimum it is trying to find and possibly oscillate forever between the two points on either side of the minimum.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Notes==<br />
<references/></div>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, natural language processing and so on. <br />
<br />
In classification, we attempt to approximate a function <math>\,h</math>, by using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input <math>\,X \in \mathcal{X} </math>, we can use the classification rule to predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> returns the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify the points of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is the indicator function <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
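The empirical error rate is straightforward to compute. A minimal sketch in Python, using an invented toy classifier and training set:

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training points that classifier h labels incorrectly:
    (1/n) * sum of the indicator I(h(X_i) != Y_i)."""
    n = len(X)
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / n

# A toy classifier and training set (invented for illustration):
# predict 1 when x > 0, else 0.
h = lambda x: 1 if x > 0 else 0
X = [-2.0, -1.0, 0.5, 1.0, -0.5]
Y = [0, 1, 1, 1, 0]              # h misclassifies only the point x = -1
print(empirical_error_rate(h, X, Y))   # 0.2
```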
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then to assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximal over all members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> which <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
Consider the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>. Define <math>\,r(x)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes' formula'' we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
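When the priors and class-conditional densities are actually known, this rule can be computed directly. A sketch in Python, with two assumed one-dimensional Gaussian class densities <math>N(1,1)</math> and <math>N(-1,1)</math> and equal priors (illustrative choices, not from the lecture):

```python
import math

def gaussian(x, mu, sigma):
    # Density of N(mu, sigma^2) at x.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def r(x, prior1=0.5):
    """r(x) = P(Y=1|X=x) for assumed class densities N(1,1) and N(-1,1)."""
    f1 = gaussian(x, 1.0, 1.0) * prior1
    f0 = gaussian(x, -1.0, 1.0) * (1 - prior1)
    return f1 / (f1 + f0)

h = lambda x: 1 if r(x) > 0.5 else 0   # the Bayes classification rule
print(h(2.0), h(-2.0), r(0.0))
```

With these symmetric densities and equal priors the decision boundary <math>\,D(h)</math> is the single point <math>x=0</math>, where the two posteriors are equal.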
<br />
'''Bayes classification rule optimality Theorem''': The Bayes rule is optimal in true error rate, that is for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(\overline{h}) \le L(h)</math>. Intuitively speaking this theorem is saying we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of being of type <math>\,y</math> for <math>\,x</math> is more than probability of being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
When a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
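The arithmetic of this example can be checked directly; the numerator 0.025 and denominator 0.125 below are the values quoted above (computed from the table of past-student data):

```python
# Posterior for the new student X = (G=0, M=1, H=0), using the
# numerator and denominator quoted in the example above.
num = 0.025                      # P(X=(0,1,0)|Y=1) * P(Y=1)
den = 0.125                      # sum over both classes
r = num / den
prediction = 1 if r > 0.5 else 0
print(r, prediction)             # r = 0.2 < 1/2, so predict class 0 (fail)
```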
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods, because the quantities in the Bayes equation discussed before, such as <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, are generally unknown, so the value of <math>\,r(X)</math> cannot be computed; this makes the Bayes rule inconvenient in practice.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the Naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first considers probability as changing based on observation, while the second considers probability as an objective quantity. They represent two different schools of thought in statistics.<br />
<br />
Throughout the history of statistics there have been two major approaches: Bayesian and frequentist. The two represent different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot assign a probability to tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degree of confidence or belief. For example, a Bayesian can assign a probability to tomorrow's weather, such as a <math>\,50\%</math> chance of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, one does not see the man (the object) directly, but judges from photos (labels) of him whether a given man is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2, 2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although the Bayes classifier represents the optimal method, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Choose a set of classifiers <math>\,H</math> and find <math>\,h^* \in H</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal, but unfortunately the prior and conditional densities of most data are not known. Some estimate of these must be made if we want to classify new data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, equal to the mean of the covariances <math>\Sigma_k \forall k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> by cancelling the common factor <math>\tfrac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}</math>, which is the same on both sides under the LDA assumption of equal covariances.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in <math>\,x</math> with the general form <math>\,ax+b=0</math>. <br />
<br />
This shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. the set of points where <math>Pr(Y=k|X=x)=Pr(Y=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, the decision boundary is always linear; in <math>d</math> dimensions, the regions are separated by hyperplanes. <br />
<br />
In the special case where the priors are equal (<math>\,\pi_k=\pi_l</math>, e.g. equal numbers of samples from each class), the boundary line or surface lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
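The boundary coefficients can be checked numerically. Below is a minimal numpy sketch (the means, covariance and priors are assumed example values, not from the notes) that builds the linear boundary <math>\,a^\top x+b=0</math> from the formula above and verifies that, with equal priors, the midpoint of the two means lies on it:<br />

```python
import numpy as np

# LDA boundary from the derivation above:
#   x' S^-1 (mu_k - mu_l) + log(pi_k/pi_l)
#     - 1/2 (mu_k' S^-1 mu_k - mu_l' S^-1 mu_l) = 0,  i.e.  a'x + b = 0.
# Means, covariance and priors below are made-up example values.
mu_k = np.array([0.0, 0.0])
mu_l = np.array([4.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # common covariance (LDA assumption)
pi_k, pi_l = 0.5, 0.5

Sinv = np.linalg.inv(Sigma)
a = Sinv @ (mu_k - mu_l)                      # direction vector of the boundary
b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sinv @ mu_k - mu_l @ Sinv @ mu_l)

# With equal priors the midpoint of the means lies on the boundary.
midpoint = (mu_k + mu_l) / 2
print(a @ midpoint + b)                       # numerically ~0
```

With unequal priors the direction <math>\,a</math> is unchanged; only the offset <math>\,b</math> shifts, moving the boundary towards the mean of the less probable class.<br />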
<br />
===QDA===<br />
The idea is the same: find the boundary where the error rates for classification between the classes are equal, except that the assumption of a common covariance matrix <math>\,\Sigma</math> is dropped; each class keeps its own covariance <math>\,\Sigma_k</math>.<br />
<br />
<br />
Starting from the point where the QDA derivation diverges from the LDA one:<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form ax<sup>2</sup>+bx+c=0.<br />
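As a sanity check on the algebra, the following numpy sketch (with assumed example parameters) confirms that the boundary expression above equals the difference of the two per-class scores <math>\,-\frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log(\pi_k)</math> at an arbitrary point:<br />

```python
import numpy as np

# Per-class score: -1/2 log|S_k| - 1/2 (x-mu_k)' S_k^-1 (x-mu_k) + log(pi_k).
def delta(x, mu, Sigma, pi):
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * (x - mu) @ Sinv @ (x - mu) + np.log(pi))

# Assumed example parameters (not from the notes).
mu_k, Sk, pi_k = np.array([0.0, 0.0]), np.array([[1.0, 0.2], [0.2, 1.0]]), 0.4
mu_l, Sl, pi_l = np.array([3.0, 1.0]), np.array([[2.0, -0.3], [-0.3, 0.5]]), 0.6

x = np.array([1.5, 0.5])
Ski, Sli = np.linalg.inv(Sk), np.linalg.inv(Sl)
# The expanded QDA boundary expression derived above.
lhs = (np.log(pi_k / pi_l) - 0.5 * np.log(np.linalg.det(Sk) / np.linalg.det(Sl))
       - 0.5 * (x @ (Ski - Sli) @ x + mu_k @ Ski @ mu_k - mu_l @ Sli @ mu_l
                - 2 * x @ (Ski @ mu_k - Sli @ mu_l)))
print(np.isclose(lhs, delta(x, mu_k, Sk, pi_k) - delta(x, mu_l, Sl, pi_l)))
```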
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If the class-conditional density <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is:<br />
<br />
<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
<br />
where <br />
<br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes:<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
<br />
===In practice===<br />
In practice the true parameters are unknown, so we use the sample estimates of <math>\,\pi_k,\mu_k,\Sigma_k</math> in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
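These estimates are straightforward to compute. Here is a short numpy sketch (on made-up synthetic data, purely for illustration) of the sample priors, class means, per-class ML covariances, and the pooled covariance used by LDA:<br />

```python
import numpy as np

# Synthetic two-class data (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(60, 2)),
               rng.normal([3, 1], 1.0, size=(40, 2))])
y = np.array([0] * 60 + [1] * 40)
n = len(y)

pi_hat, mu_hat, Sigma_hat = {}, {}, {}
for k in (0, 1):
    Xk = X[y == k]
    pi_hat[k] = len(Xk) / n                 # n_k / n
    mu_hat[k] = Xk.mean(axis=0)             # class mean
    D = Xk - mu_hat[k]
    Sigma_hat[k] = D.T @ D / len(Xk)        # ML covariance (divide by n_k)

# Pooled covariance: weighted average of the class covariances.
Sigma_pooled = sum(len(X[y == k]) * Sigma_hat[k] for k in (0, 1)) / n
print(pi_hat[0], pi_hat[1])
```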
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data in each class is distributed symmetrically around its centre <math>\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
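This case reduces to nearest-mean classification adjusted by the log prior. A minimal sketch (means and priors are assumed example values): classify <math>\,x</math> to the class maximizing <math>\,-\frac{1}{2}\|x-\mu_k\|^2+\log(\pi_k)</math>:<br />

```python
import numpy as np

# Case 1 (Sigma_k = I): the discriminant reduces to
#   delta_k(x) = -1/2 ||x - mu_k||^2 + log(pi_k).
# Means and priors below are made-up examples.
mus = np.array([[0.0, 0.0], [4.0, 0.0]])
pis = np.array([0.5, 0.5])

def classify(x):
    deltas = [-0.5 * np.sum((x - mu) ** 2) + np.log(pi)
              for mu, pi in zip(mus, pis)]
    return int(np.argmax(deltas))           # index of the winning class

print(classify(np.array([0.5, 0.1])))       # near the first mean -> 0
print(classify(np.array([3.5, -0.2])))      # near the second mean -> 1
```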
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (in general the columns of <math>\,U</math> are eigenvectors of <math>\,\Sigma_k\Sigma_k^\top</math> and the columns of <math>\,V</math> are eigenvectors of <math>\,\Sigma_k^\top\Sigma_k</math>; since <math>\,\Sigma_k</math> is symmetric these coincide, so <math>\,U=V</math>)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math> \, (x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^\top US^{-1}U^\top(x-\mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
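As a concrete check of this transformation, here is a small numpy sketch (with an assumed example covariance, not from the notes): after mapping to <math>\,x^* = S^{-\frac{1}{2}}U^\top x</math>, the Mahalanobis distance under <math>\,\Sigma</math> equals the squared Euclidean distance between the starred points:<br />

```python
import numpy as np

# Example covariance (assumed values, for illustration only).
Sigma = np.array([[3.0, 1.0], [1.0, 2.0]])
U, s, _ = np.linalg.svd(Sigma)              # Sigma = U S U^T (symmetric PD)
W = np.diag(s ** -0.5) @ U.T                # maps x to x* = S^{-1/2} U^T x

x = np.array([1.0, 2.0])
mu = np.array([0.5, -1.0])
# Mahalanobis distance under Sigma vs. Euclidean distance after whitening.
mahalanobis = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
euclidean = np.sum((W @ x - W @ mu) ** 2)
print(np.isclose(mahalanobis, euclidean))
```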
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise, we would have to decide ahead of time which class a data point belongs to in order to choose its transformation. All classes therefore need to have the same shape for this method to be applicable, which is exactly the LDA setting.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose the two classes have different shapes and we are given a new data point to classify. Which transformation should we use? If we apply the transformation of class A, we have already assumed that the point belongs to class A.<br />
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one reference class against each of the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> pairwise boundaries. Each boundary <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the <math>\,K-1</math> boundaries, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters (the symmetric matrix <math>\,a</math> alone has <math>\frac{d(d+1)}{2}</math> free entries). Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
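The two counting formulas can be tabulated directly. A short Python sketch (formulas taken from the text above):<br />

```python
# Parameter counts from the text:
#   LDA needs (K-1)(d+1) parameters, QDA needs (K-1)(d(d+3)/2 + 1).
def lda_params(K, d):
    return (K - 1) * (d + 1)

def qda_params(K, d):
    return (K - 1) * (d * (d + 3) // 2 + 1)

# Two classes (K = 2) in increasing dimension: QDA's count grows quadratically.
for d in (2, 10, 64):
    print(d, lda_params(2, d), qda_params(2, d))
```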
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is only correct on 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file gives the details of this function, but here we analyze the source code of <code>princomp()</code> to see how it differs from the SVD method. The following is the code of princomp with explanations of some key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the length of the rows and columns of matrix x. <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, note the following differences from the SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimated parameters make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a <math>d</math>-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in <math>d</math> dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> (with <math>v</math> a diagonal matrix) that we cannot estimate with a linear method.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,\sin(x)</math> dimension.<br />
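The augmentation itself is a one-liner. A numpy sketch (Python rather than Matlab, with made-up data and a hypothetical weight vector, purely for illustration):<br />

```python
import numpy as np

# Augment each point x with its squared coordinates, as described above,
# so that a linear rule in x* is quadratic in the original x.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))                # five points in d = 2 dimensions
X_star = np.hstack([X, X ** 2])            # x* = [x1, x2, x1^2, x2^2]

w_star = np.array([1.0, -0.5, 0.3, 0.2])   # hypothetical [w1, w2, v1, v2]
y_star = X_star @ w_star                   # linear in x*, quadratic in x
print(X_star.shape)                        # (5, 4)
```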
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, with each class possibly having a different size. To separate the two classes we must determine the class whose mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
If we sum these two quantities, we have<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive-definite matrices, so it is invertible.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
Therefore <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math>.<br />
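The two-class result above can be sketched numerically. The following Python/NumPy snippet (not part of the original Matlab notes; the synthetic data and variable names are illustrative assumptions) computes the FDA direction as <math>\underline{w} \propto S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> and checks that the projected means are well separated relative to the projected spread:

```python
import numpy as np

# Illustrative sketch (synthetic data, assumed for illustration):
# the two-class FDA direction is w proportional to Sw^{-1} (mu1 - mu2).
rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([1, 1], [[1, 1.5], [1.5, 3]], size=300)
X2 = rng.multivariate_normal([5, 3], [[1, 1.5], [1.5, 3]], size=300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)  # within-class covariance
w = np.linalg.solve(Sw, mu1 - mu2)   # direction of maximum class separation
w = w / np.linalg.norm(w)

# Distance between projected means, relative to the projected within-class spread
sep = abs(w @ (mu1 - mu2))
spread = np.sqrt(w @ Sw @ w)
print(sep / spread)
```

On this well-separated synthetic data, the ratio of projected-mean separation to projected spread is large, which is exactly the criterion FDA maximizes.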
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the first principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1,1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(1,201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
Basically, the within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \frac{1}{n_{i}}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>.<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain. One simplification<br />
is to assume that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
constant. Since <math>\mathbf{S}_{W}</math> is easy to compute, we can then get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
There is also a more general expression for <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
The first term in this decomposition is the within class covariance <math>\mathbf{S}_{W}</math>,<br />
so we denote the second term as<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
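The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically. This Python/NumPy sketch (synthetic data, assumed for illustration; it uses the unnormalized scatter sums that appear in the derivation above) checks the identity:

```python
import numpy as np

# Numerical check of S_T = S_W + S_B on synthetic 3-class data,
# using the *unnormalized* scatter sums from the derivation.
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 3))
y = np.repeat([0, 1, 2], 30)

mu = X.mean(axis=0)
St = (X - mu).T @ (X - mu)                       # total scatter

Sw = np.zeros((3, 3))
Sb = np.zeros((3, 3))
for k in range(3):
    Xk = X[y == k]
    mk = Xk.mean(axis=0)
    Sw += (Xk - mk).T @ (Xk - mk)                # within-class scatter
    Sb += len(Xk) * np.outer(mk - mu, mk - mu)   # between-class scatter

print(np.allclose(St, Sw + Sb))
```

The identity holds exactly (up to floating point), regardless of the data.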
<br />
Recall that in the two class case problem, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
Since <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math>, we have<br />
<math>\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math> and<br />
<math>\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})</math>.<br />
<br />
From the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \left(\frac{n_{1}n_{2}^{2}}{n^{2}}+\frac{n_{2}n_{1}^{2}}{n^{2}}\right)(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}\,\mathbf{S}_{B^{\ast}}<br />
\end{align}<br />
</math><br />
So the general between class covariance matrix is proportional to the two-class version, and both lead to the same optimal direction.<br />
<br />
Now, we are trying to find the optimal transformation. Basically, we have<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this optimization problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues in<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
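The trace identity just derived, <math>\sum_{i=1}^{k}n_{i}\|\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu}\|^{2} = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math>, is easy to check numerically. A small Python/NumPy sketch (random class means, sizes, and projection matrix — all assumed for illustration):

```python
import numpy as np

# Check that the weighted sum of squared projected mean differences
# equals Tr[W^T S_B W], for arbitrary means, class sizes, and W.
rng = np.random.default_rng(2)
d, k = 4, 3
mus = rng.normal(size=(k, d))                    # class means (rows)
n = np.array([20, 30, 50])                       # class sizes
mu = (n[:, None] * mus).sum(axis=0) / n.sum()    # overall mean
W = rng.normal(size=(d, k - 1))                  # arbitrary projection matrix

Sb = sum(n[i] * np.outer(mus[i] - mu, mus[i] - mu) for i in range(k))

lhs = sum(n[i] * np.linalg.norm(W.T @ (mus[i] - mu)) ** 2 for i in range(k))
rhs = np.trace(W.T @ Sb @ W)
print(np.isclose(lhs, rhs))
```

Since this is an algebraic identity, it holds for any choice of the inputs.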
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have the following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two-class problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which is a <math>d \times d</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that the <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices, thus set the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{d}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B})\leq k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors corresponding to the largest <math>k-1</math><br />
eigenvalues in<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
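The multi-class procedure above can be sketched end to end. This Python/NumPy example (synthetic 3-class data in 4 dimensions — an illustrative assumption, not from the course notes) builds <math>\mathbf{S}_{W}</math> and <math>\mathbf{S}_{B}</math>, takes the top <math>k-1</math> eigenvectors of <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math>, and projects:

```python
import numpy as np

# Multi-class FDA sketch: columns of W are the eigenvectors of
# Sw^{-1} Sb with the largest eigenvalues. Synthetic data, for illustration.
rng = np.random.default_rng(3)
k, d, n_per = 3, 4, 50
centers = np.array([[0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]], dtype=float)
X = np.vstack([c + rng.normal(size=(n_per, d)) for c in centers])
y = np.repeat(np.arange(k), n_per)

mu = X.mean(axis=0)
Sw = np.zeros((d, d))
Sb = np.zeros((d, d))
for c in range(k):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)                 # within-class scatter
    Sb += len(Xc) * np.outer(mc - mu, mc - mu)    # between-class scatter

evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(evals.real)[::-1]
W = evecs.real[:, order[:k - 1]]     # d x (k-1) projection matrix
Z = X @ W                            # projected data, n x (k-1)
print(np.sum(evals.real > 1e-8))     # at most k-1 nonzero eigenvalues
```

The count of (numerically) nonzero eigenvalues confirms the rank statement above: <math>\mathbf{S}_{W}^{-1}\mathbf{S}_{B}</math> has at most <math>k-1</math> of them.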
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math>, our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
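The least-squares solution and hat matrix can be illustrated with a short sketch. This Python/NumPy example (synthetic design matrix and noise, assumed for illustration) computes <math>\hat\beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}</math> and verifies that <math>\mathbf{H}</math> is an idempotent projection:

```python
import numpy as np

# Least-squares sketch: beta_hat = (X^T X)^{-1} X^T y, hat matrix
# H = X (X^T X)^{-1} X^T. Synthetic data, assumed for illustration.
rng = np.random.default_rng(4)
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # column of 1s for intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least-squares estimate
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
y_hat = H @ y                                  # fitted values

print(np.allclose(H @ H, H))                   # H is idempotent (a projection)
print(np.allclose(X @ beta_hat, y_hat))        # H y equals X beta_hat
```

The hat matrix "puts the hat on" <math>\mathbf{y}</math>: it orthogonally projects the observations onto the column space of <math>\mathbf{X}</math>.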
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{j}f_{j}(x)\pi_{j}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, so that each column is an input vector with a constant term.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the data points coloured by the fitted class (threshold at 0.5).<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modeling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1.<br />
<br />
This logistic regression model for the two class case is defined as: <br/><br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br/> <br />
[[File:Picture1.png |frame|center]]<br />
Then we have that <math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
[[File:Picture2.png |frame|center]]<br />
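The key properties of these two posterior formulas can be checked directly. A small Python sketch (illustrative only; `t` stands for <math>\underline{\beta}^T \underline{x}</math>) verifies that the probabilities lie in (0, 1), sum to 1, and that their log odds recover <math>\underline{\beta}^T \underline{x}</math>:

```python
import numpy as np

# Illustrative check of the two-class logistic model:
# p1 + p0 = 1, both in (0,1), and log(p1/p0) = beta^T x.
def p1(t):  # P(Y=1 | X=x), with t = beta^T x
    return np.exp(t) / (1 + np.exp(t))

def p0(t):  # P(Y=0 | X=x)
    return 1 / (1 + np.exp(t))

t = np.linspace(-5, 5, 11)
print(np.allclose(p1(t) + p0(t), 1))          # probabilities sum to one
print(bool(((p1(t) > 0) & (p1(t) < 1)).all()))  # always strictly between 0 and 1
print(np.allclose(np.log(p1(t) / p0(t)), t))  # log odds recovers beta^T x
```

This is exactly what the linear regression model of the previous section fails to guarantee.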
<br />
Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the data <math>\displaystyle{x_{1},...,x_{n}}</math> from the known distribution. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, we take the log of both sides to get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)-(1-y_{i})\log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n \left(y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
\end{align}<br />
</math><br />
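The simplification above can be verified numerically: the final expression must agree with the direct Bernoulli log-likelihood <math>\sum_i [y_i \log p_i + (1-y_i)\log(1-p_i)]</math>. A Python/NumPy check on random data (all values assumed for illustration):

```python
import numpy as np

# Check that the simplified log-likelihood equals the direct
# Bernoulli form sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ].
rng = np.random.default_rng(5)
n, d = 40, 3
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n)
beta = rng.normal(size=d)

t = X @ beta                         # beta^T x_i for each observation i
p = np.exp(t) / (1 + np.exp(t))      # P(Y=1 | x_i)

direct = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
simplified = np.sum(y * t - np.log(1 + np.exp(t)))
print(np.isclose(direct, simplified))
```

The two forms agree for any data and any <math>\underline\beta</math>, since the simplification is purely algebraic.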
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check this [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))(1+\exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math><br />
<br />
we have <math>\underline{\beta}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X^T\underline{\beta})^{T}W(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted error sum of squares<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
<br />
Now apply this to the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
with weights <math>w_{i}=P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math>. Then we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
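The closed-form solve above can be checked numerically. The following Python/NumPy sketch (hypothetical toy data; note that, as in the notes, <math>X</math> is <math>d\times n</math> with observations as columns) verifies that the matrix form <math>(XWX^T)^{-1}XWz</math> agrees with the summation form of <math>\hat\beta^{WLS}</math>:<br />

```python
import numpy as np

# Hypothetical toy data: X is d x n as in the notes (columns are observations).
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.0, 1.0, 2.0, 2.0]])   # d = 2, n = 4
z = np.array([1.1, 1.9, 3.2, 3.9])     # adjusted response
w = np.array([0.5, 1.0, 1.0, 0.5])     # positive weights
W = np.diag(w)

# Matrix form: beta = (X W X^T)^{-1} X W z
beta = np.linalg.solve(X @ W @ X.T, X @ W @ z)

# Summation form: [sum_i w_i x_i x_i^T]^{-1} [sum_i w_i x_i z_i]
A = sum(w[i] * np.outer(X[:, i], X[:, i]) for i in range(4))
b = sum(w[i] * X[:, i] * z[i] for i in range(4))
beta_sum = np.linalg.solve(A, b)
```

Both expressions are the same normal equations, so the two estimates coincide up to floating-point error.<br />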
<br />
<br />
'''note:'''Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model like <math>\underline{\beta}^T\underline{x}</math>. If we construct the model like <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then similar to linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case; however, it does not guarantee convergence. The procedure will usually converge, since the log-likelihood function is concave. In the case that it does not, only local convergence of the method can be proven, meaning that the iteration converges only if the initial point is close enough to the exact solution. In practice, however, choosing an appropriate initial value is rarely a problem: it is uncommon for an initial value to be so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving can be used to address this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
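The pseudo-code above can be sketched in Python/NumPy as follows (synthetic, non-separable toy data; the data, tolerance, and iteration cap are assumptions for illustration):<br />

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=100):
    """IRLS for logistic regression. X is d x n (columns are points), y in {0,1}^n."""
    d, n = X.shape
    beta = np.zeros(d)                                       # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(beta @ X)))                # step 3: P(x_i; beta)
        w = p * (1.0 - p)                                    # step 4: diagonal of W
        z = X.T @ beta + (y - p) / w                         # step 5: adjusted response Z
        beta_new = np.linalg.solve(X * w @ X.T, X * w @ z)   # step 6: weighted LS solve
        if np.max(np.abs(beta_new - beta)) < tol:            # step 7: stopping rule
            return beta_new
        beta = beta_new
    return beta

# Non-separable toy data (first row is the intercept feature).
X = np.array([[1.0] * 6, [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat = irls_logistic(X, y)
```

At convergence the score equations <math>X(\underline{Y}-\underline{P})=0</math> hold, which is a convenient check of the fit. Note that if the classes are perfectly separable the likelihood has no finite maximizer and the iteration will not converge.<br />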
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#They both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression, we have so far only considered the case <math>\,k=0</math> or <math>\,k=1</math>).<br />
#They both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is linear function of <math>\,x</math>, <math>\,P(Y=k|X=x)</math> is not guaranteed to fall between 0 and 1 and to sum up to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>, and it is guaranteed to range from 0 to 1 and to sum up to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model only considers the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,2d+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#As logistic regression relies on fewer assumptions, it seems to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
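The parameter counts above are easy to tabulate; a small sketch (two-class case, following the counting convention used in these notes):<br />

```python
def logistic_params(d):
    # Coefficients of beta^T x for the two-class model (add 1 if an intercept is used).
    return d

def lda_params(d):
    # Two d-dimensional means, a shared symmetric d x d covariance, and two priors,
    # as counted in the notes: 2d + d(d+1)/2 + 2.
    return 2 * d + d * (d + 1) // 2 + 2

growth = {d: (logistic_params(d), lda_params(d)) for d in (2, 10, 100)}
```

For <math>\,d=100</math> this gives 100 parameters for logistic regression versus 5252 for LDA, illustrating the linear versus quadratic growth.<br />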
<br />
====By example====<br />
<br />
Now we compare LDA and logistic regression with an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to apply logistic regression to classify the data. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2>=0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
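The fitted rule above can also be evaluated outside MATLAB. The following Python sketch hard-codes the coefficients from the mnrfit output and applies the posterior and decision rule derived above:<br />

```python
import numpy as np

# Intercept and coefficients copied from the mnrfit output above.
b0, b1, b2 = 0.1861, -5.5917, -3.0547

def posterior_class1(x1, x2):
    # P(Y=1|X=x) = exp(beta^T x) / (1 + exp(beta^T x))
    t = b0 + b1 * x1 + b2 * x2
    return np.exp(t) / (1.0 + np.exp(t))

def classify(x1, x2):
    # Predict class 1 when the linear score is nonnegative (posterior >= 0.5).
    return 1 if b0 + b1 * x1 + b2 * x2 >= 0 else 2
```

On the decision boundary the posterior is exactly 0.5, which matches the derivation of the linear boundary for logistic regression.<br />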
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, the posteriors are no longer complements of each other, as is true in the 2-class problem. Fitting a logistic model for the K>2 class problem isn't as 'nice' as in the 2-class problem, since we don't have the same simplification.<br />
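The posteriors above are straightforward to compute; a Python sketch with hypothetical coefficient vectors (class <math>K</math> is the reference class):<br />

```python
import numpy as np

def multiclass_posteriors(betas, x):
    """betas: list of K-1 coefficient vectors (class K is the reference class).
    Returns the K posterior probabilities at x."""
    scores = np.array([b @ x for b in betas])
    denom = 1.0 + np.exp(scores).sum()
    # Classes 1..K-1 first, then the reference class K.
    return np.append(np.exp(scores) / denom, 1.0 / denom)

# Hypothetical 3-class example in 2 dimensions.
betas = [np.array([1.0, -0.5]), np.array([-0.3, 0.8])]
p = multiclass_posteriors(betas, np.array([0.2, 0.4]))
```

By construction the <math>K</math> posteriors are positive and sum to 1, as noted above.<br />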
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of least squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of a data point's features and assign a label equivalent to 1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods such as least squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. While linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Owing to the iterative nature of the solution, the objective has no unique global minimum (it is not convex), and the algorithm does not converge to a unique hyperplane. If the classes are separable, the algorithm can be shown to converge to a separating hyperplane; the proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>x_{0}, x_{1},\ldots,x_{d}</math> represent the input data, <math>\sum_{j=1}^d \beta_{j}x_{j}</math> is a weighted linear combination of these features, and <math>sgn(\sum_{j=1}^d \beta_{j}x_{j})</math> returns the sign of the linear combination. <br />
<br />
<br />
The perceptron seeks a linear separator between two classes. Since it is linear, the decision boundary can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The perceptron algorithm begins with a random hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The goal is to minimize the distance between the decision boundary and the misclassified data points, as illustrated in Figure 2. The algorithm attempts to find a <math>\underline\beta</math> by iteratively rotating the decision boundary until all points are on the correct side of the boundary, and it terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
*'''Derivation''' ''of the distance between the decision boundary and the misclassified points''. <br />
<br />
:Let <math>\underline{x_{i}}</math> be the misclassified point. <br />
<br />
:Assume <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math>both lie on the decision boundary. <br />
<br />
:Then <math>\underline{\beta}^T\underline{x_{1}}+\beta_{0} = \underline{\beta}^T\underline{x_{2}}+\beta_{0}</math><br />
<br />
:which implies that <math>\underline{\beta}^T (x_{1}-x_{2})=0</math>.<br />
<br />
::Since <math> \underline{\beta}^T</math> is a vector and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary then <math>\underline{\beta}</math> is a vector orthogonal to the decision boundary. <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is proportional to <math>\underline{\beta}^T\underline{x_{i}}</math> (with proportionality factor <math>1/\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is, up to this scaling, the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that repeatedly takes a step in the direction of the negative gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this approach is the possibility of getting stuck in a local minimum. Rosenblatt proposed a simple algorithm to overcome this problem. To continue, the following derivatives are needed: <br />
<br />
<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
<math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] + \rho [y_{i}\underline{x_{i}}\ \ y_{i}] </math> <br />
<br />
where <math> \displaystyle\rho</math> is called the learning rate. The algorithm continues until it converges or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
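The update rule above can be sketched in Python as follows (hypothetical linearly separable data with labels in {-1, +1}; convergence is guaranteed only in the separable case):<br />

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_iter=1000):
    """X: n x d data matrix, y: labels in {-1, +1}. Returns (beta, beta0)."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(max_iter):
        updated = False
        for i in range(n):
            if y[i] * (X[i] @ beta + beta0) <= 0:   # misclassified point
                beta += rho * y[i] * X[i]           # beta_new = beta_old + rho*y_i*x_i
                beta0 += rho * y[i]                 # beta0_new = beta0_old + rho*y_i
                updated = True
        if not updated:                             # no misclassified points: done
            return beta, beta0
    return beta, beta0

# Hypothetical linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
beta, beta0 = perceptron(X, y)
```

After convergence every training point satisfies <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0})>0</math>, i.e., no point is misclassified.<br />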
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Notes==<br />
<references/></div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841&diff=4430stat8412009-10-24T21:22:09Z<p>Ipargaru: /* The Perceptron (Lecture October 23, 2009) */</p>
<hr />
<div>==[[statf09841Proposal|Proposal]] ==<br />
<br />
==[http://spreadsheets.google.com/ccc?key=0Avbf0U1TJOcfdFFQR3NIc1pYUEdWeFdwbnNTUlRYZ3c&hl=en| Mark your contribution here]==<br />
==[[statf09841Scribe|Scribe sign up]] ==<br />
<br />
== ''' Classification-2009.9.30''' ==<br />
<br />
=== Classification ===<br />
<br />
With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to ''learn from data''. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing and so on. <br />
<br />
In classification, we attempt to approximate a function <math>\,h</math>, using a training data set, which will then be able to accurately classify new data inputs.<br />
<br />
Given <math>\mathcal{X} \subset \mathbb{R}^{d}</math>, a subset of the <math>d</math>-dimensional real vectors, and <math> \mathcal{Y} </math>, a finite set of labels, we try to determine a ''''classification rule'''' <math>\,h</math> such that,<br />
:<math>\,h: \mathcal{X} \mapsto \mathcal{Y} </math><br />
<br />
We use <math>\,n</math> ordered pairs of training data, <math>\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}</math> where <math>\,X_{i} \in \mathcal{X}</math>,<math>\,Y_{i} \in \mathcal{Y} </math>, to approximate <math>\,h</math>.<br />
<br />
<br />
Thus, given a new input <math>\,X \in \mathcal{X} </math>, we can use the classification rule to predict a corresponding <math>\,\hat{Y}=h(X)</math>.<br />
<br />
:'''Example''' Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, e.g., colour, diameter, and weight.<br>Let <math>\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})</math> and <math>\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}</math>. The goal is to find a classification rule such that when a new fruit <math>\,X</math> is presented based on its features, <math>(\,X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math>, our classification rule <math>\,h</math> can classify it as either an apple or an orange, i.e., <math>\,h(X_{\mathrm{colour}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})</math> gives the fruit type of <math>\,X</math>.<br />
<br />
=== Error rate ===<br />
<br />
:''''True error rate'''' of a classifier <math>\,h</math> is defined as the probability that <math>\,h</math> does not correctly classify a point of <math>\,\mathcal{X}</math>, i.e.,<br />
::<math>\, L(h)=P(h(X) \neq Y)</math><br />
<br />
:''''Empirical error rate (training error rate)'''' of a classifier <math>\,h</math> is defined as the frequency with which <math>\,h</math> does not correctly classify the points in the training set, i.e.,<br />
::<math>\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})</math>, where <math>\,I</math> is an indicator that <math>\, I= \left\{\begin{matrix} <br />
1 & h(X_i) \neq Y_i \\ <br />
0 & h(X_i)=Y_i \end{matrix}\right.</math>.<br />
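The empirical error rate can be computed directly from its definition; a minimal Python sketch with a hypothetical one-dimensional rule and labels:<br />

```python
def empirical_error_rate(h, X, Y):
    """Fraction of training points that the rule h misclassifies."""
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / len(X)

# Hypothetical 1-d rule: classify as 1 when x > 0.
h = lambda x: 1 if x > 0 else 0
X = [-2.0, -1.0, 0.5, 1.5, -0.3]
Y = [0, 1, 1, 1, 0]
rate = empirical_error_rate(h, X, Y)
```

Here the rule misclassifies one of the five training points, giving <math>\,L_{h}=0.2</math>.<br />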
<br />
=== Bayes Classifier ===<br />
<br />
The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' formula, and then to assign the object to the class with the largest posterior probability. Intuitively speaking, to classify <math>\,x\in \mathcal{X}</math> we find <math>y \in \mathcal{Y}</math> such that <math>\,P(Y=y|X=x)</math> is maximum over all the members of <math>\mathcal{Y}</math>.<br />
<br />
Mathematically, for <math>\,k</math> classes and a given object <math>\,X=x</math>, we find the <math>\,y_{i}\in \mathcal{Y}</math> that <br />
maximizes <math>\,P(Y=y_i|X=x)</math>, and classify <math>\,X</math> into class <math>\,y_{i}</math>. In order to calculate the value of <math>\,P(Y=y_{i}|X=x)</math>, we use ''Bayes' formula''<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\<br />
&=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
where <math>\,P(Y=y|X=x)</math> is referred to as the posterior probability, <math>\,P(Y=y)</math> as the prior probability, <math>\,P(X=x|Y=y)</math> as the likelihood, and <math>\,P(X=x)</math> as the evidence.<br />
<br />
For the special case where <math>\,Y</math> has only two possible values, that is, <math>\, \mathcal{Y}=\{0, 1\}</math>, consider the probability <math>\,r(X)=P(Y=1|X=x)</math>. Given <math>\,X=x</math>, by ''Bayes' formula'', we have<br />
<br />
:<math><br />
\begin{align}<br />
r(X)&=P(Y=1|X=x) \\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\<br />
&=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Definition''':<br />
<br />
The Bayes classification rule <math>\,h</math> is<br />
<br />
:<math>\, h(X)= \left\{\begin{matrix} <br />
1 & r(x)>\frac{1}{2} \\ <br />
0 & \mathrm{otherwise} \end{matrix}\right.</math><br />
<br />
'''Bayes classification rule optimality Theorem''': The Bayes rule is optimal in true error rate, that is for any other classification rule <math>\, \overline{h}</math>, we have <math>\,L(\overline{h}) \le L(h)</math>. Intuitively speaking this theorem is saying we cannot do better than classifying <math>\,x\in \mathcal{X}</math> to <math>\,y</math> when the probability of being of type <math>\,y</math> for <math>\,x</math> is more than probability of being any other type.<br />
<br />
'''Definition''':<br />
<br />
The set <math>\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}</math> is called the ''decision boundary''.<br />
<br />
'''Example''':<br /><br />
We’re going to predict if a particular student will pass STAT441/841.<br />
We have data on past student performance. For each student we know:<br />
If student’s GPA > 3.0 (G)<br />
If student had a strong math background (M)<br />
If student is a hard worker (H)<br />
If student passed or failed course<br /><br />
[[File:裁剪.jpg]]<br /><br />
<math>\, \mathcal{Y}= \{ 0,1 \} </math>, where 1 refers to ''pass'' and 0 refers to ''fail''. Assume that <math>\,P(Y=1)=P(Y=0)=0.5</math><br /><br />
For a new student comes along with values <math>\,G=0, M=1, H=0</math>, we calculate <math>\,r(X)=P(Y=1|X=(0,1,0))</math> as<br /><br />
<br />
<math>\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2<\frac{1}{2}</math><br /><br />
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.<br />
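The computation above can be reproduced numerically. In the sketch below, the class-conditional values <math>\,P(X=(0,1,0)|Y=1)=0.05</math> and <math>\,P(X=(0,1,0)|Y=0)=0.2</math> are assumptions back-solved from the ratio 0.025/0.125 in the example, not values read from the table:<br />

```python
def bayes_posterior(lik1, lik0, prior1=0.5, prior0=0.5):
    """Posterior P(Y=1 | X=x) via Bayes' formula."""
    return lik1 * prior1 / (lik1 * prior1 + lik0 * prior0)

# Assumed class-conditional probabilities (back-solved, not from the table).
r = bayes_posterior(lik1=0.05, lik0=0.2)   # numerator 0.025, denominator 0.125
predicted = 1 if r > 0.5 else 0            # Bayes classification rule
```

Since <math>\,r(X)=0.2<\frac{1}{2}</math>, the rule predicts class 0, agreeing with the example.<br />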
<br />
<br />
:''Notice'': Although the Bayes rule is optimal, we still need other methods. The reason is that, in the Bayes formula discussed above, it is generally impossible to know <math>\,P(Y=1)</math> and <math>\,P(X=x|Y=1)</math>, and hence to calculate the value of <math>\,r(X)</math>, which makes the Bayes rule impractical.<br />
<br />
Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[http://en.wikipedia.org/wiki/Naive_Bayes_classifier], TAN, BAN and GBN.<br /><br />
''useful link'':[http://moodle.cs.ualberta.ca/file.php/127/SDTheory.ppt#256,1,Statistical Decision Theory, Bayes Classifier]<br />
<br />
=== Bayes VS Frequentist ===<br />
<br />
Intuitively, to solve a two-class problem, we may have the following two approaches:<br />
<br />
1) If <math>\,P(Y=1|X=x)>P(Y=0|X=x)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
2) If <math>\,P(X=x|Y=1)>P(X=x|Y=0)</math>, then <math>\,h(x)=1</math>, otherwise <math>\,h(x)=0</math>.<br />
<br />
One obvious difference between these two methods is that the first treats probability as changing based on observation, while the second treats probability as an objective quantity. In fact, they represent two different schools of statistics.<br />
<br />
Throughout the history of statistics, there have been two major schools: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.<br />
<br />
'''Frequentist'''<br />
#Probability is objective. <br />
#Data is a repeatable random sample (there is a frequency).<br />
#Parameters are fixed, unknown constants.<br />
#Not applicable to single events. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be assigned a frequency over many samples.<br />
<br />
'''Bayes'''<br />
#Probability is subjective.<br />
#Data are fixed.<br />
#Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them. <br />
#Can be applied to single events based on degrees of confidence or belief. For example, a Bayesian can predict tomorrow's weather, e.g., a 50% chance of rain.<br />
<br />
'''Example'''<br />
<br />
Suppose there is a man named Jack. In the Bayesian approach, one first sees the man (the object) and then judges whether his name is Jack (the label). In the frequentist approach, by contrast, one does not see the man himself, but judges from photos (data) of this man whether he is Jack.<br />
<br />
== '''Linear and Quadratic Discriminant Analysis - October 2,2009''' ==<br />
<br />
===Introduction===<br />
<br />
====Notation====<br />
<br />
Let us first introduce some new notation for the following sections.<br />
<br />
Recall that in the discussion of the Bayes Classifier, we introduced ''Bayes Formula'':<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)}<br />
\end{align}<br />
</math><br />
<br />
We will use new labels for the following equivalent formula:<br />
<br />
:<math><br />
\begin{align}<br />
P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k}<br />
\end{align}<br />
</math><br />
<br />
* <math>\,f_k</math> is called the '''class conditional density'''; also referred to previously as the [http://en.wikipedia.org/wiki/Likelihood_function likelihood function]. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.<br />
* <math>\,\pi_k</math> is called the [http://en.wikipedia.org/wiki/Prior_probability '''prior probability''']. This is a probability distribution that represents what we know (or believe we know) about a population.<br />
* <math>\,\Sigma_k</math> is the sum with respect to all <math>\,k</math> classes.<br />
<br />
====Approaches====<br />
<br />
Although it represents the optimal method, the Bayes classifier cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.<br />
<br />
# Empirical risk minimization: choose a class of classifiers <math>\,H</math> and find <math>\,h^* \in H</math> that minimizes some estimate of <math>\,L(h)</math>.<br />
# Regression<br />
# Density estimation, estimate <math>P(X = x | Y = 0)</math> and <math>P(X = x | Y = 1)</math> <br />
<br />
The third approach, in this form, is not popular because density estimation does not work very well in more than two dimensions.<br />
Linear discriminant analysis and quadratic discriminant analysis are examples of the third approach, density estimation.<br />
<br />
===LDA===<br />
<br />
====Motivation====<br />
The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.<br />
<br />
The simplest way to achieve this is to assume that all the class densities are approximately a [http://en.wikipedia.org/wiki/Multivariate_normal_distribution multivariate normal distribution], find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.<br />
<br />
====History====<br />
The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, [http://pages.stern.nyu.edu/~ealtman/Zscores.pdf revisited in 2000], has shown an 85-90% success rate predicting bankruptcy, and is still in use today.<br />
<br />
====Definition====<br />
To perform LDA we make two assumptions.<br />
<br />
# The clusters belonging to all classes each follow a multivariate normal distribution. <br /><math>x \in \mathbb{R}^d</math> <math>f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)</math><br />
# Each cluster has the same covariance matrix <math>\,\Sigma</math>, taken to be the mean of the <math>\Sigma_k</math> over all <math>\,k</math>.<br />
<br />
<br />
We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.<br />
<br />
So we solve <math>\,r_k(x)=r_l(x)</math> for all the pairwise combinations of classes.<br />
<br />
<br />
<math>\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}</math> using Bayes' Theorem<br />
<br />
<br />
<math>\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)</math> by canceling denominators<br />
<br />
<br />
<math>\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l</math> Since both <math>\Sigma</math> are equal based on the assumptions specific to LDA.<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)</math> taking the log of both sides.<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0</math> after canceling out like terms and factoring.<br />
<br />
We can see that this is a linear function in x with general form ax+b=0. <br />
<br />
Actually, this linear log function shows that the decision boundary between class <math>k</math> and class <math>l</math>, i.e. <math>Pr(G=k|X=x)=Pr(G=l|X=x)</math>, is linear in <math>x</math>. Given any pair of classes, decision boundaries are always linear. In <math>p</math> dimensions, we separate regions by hyperplanes. <br />
<br />
In the special case where the classes have equal priors (<math>\,\pi_k=\pi_l</math>, e.g. equal numbers of samples from each class), the boundary surface or line lies halfway between <math>\,\mu_l</math> and <math>\,\mu_k</math>.<br />
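As a check, the linear form of this boundary can be verified numerically. The following is a minimal NumPy sketch; the means, shared covariance, and priors are hypothetical illustrative values, not from the course data:<br />

```python
import numpy as np

# Hypothetical parameters for two classes sharing one covariance (LDA setting)
mu_k = np.array([0.0, 0.0])
mu_l = np.array([2.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
pi_k, pi_l = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)

def boundary(x):
    # The final expression of the derivation above, evaluated at x
    return (np.log(pi_k / pi_l)
            - 0.5 * (mu_k @ Sigma_inv @ mu_k - mu_l @ Sigma_inv @ mu_l
                     - 2 * x @ Sigma_inv @ (mu_k - mu_l)))

# The same expression written as a'x + b, exposing its linearity in x
a = Sigma_inv @ (mu_k - mu_l)
b = np.log(pi_k / pi_l) - 0.5 * (mu_k @ Sigma_inv @ mu_k - mu_l @ Sigma_inv @ mu_l)
x = np.array([1.3, -0.7])
```

Any point <code>x</code> gives the same value from both forms, confirming the boundary is a line (a hyperplane in higher dimensions).<br />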
<br />
===QDA===<br />
The idea is the same: find the boundary where the classification error rates between classes are equal, except that the assumption that each class has the same covariance matrix <math>\,\Sigma</math> equal to the mean of <math>\Sigma_k \forall k</math> is dropped.<br />
<br />
<br />
Following along from where QDA diverges from LDA.<br />
<br />
<math>\,f_k(x)\pi_k=f_l(x)\pi_l</math><br />
<br />
<math>\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math><br />
<br />
<br />
<math>\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l</math> by cancellation<br />
<br />
<br />
<math>\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)</math> by taking the log of both sides<br />
<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0</math> by expanding out<br />
<br />
<math>\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0</math> this time there are no cancellations, so we can only factor<br />
<br />
<br />
The final result is a quadratic equation specifying a curved boundary between classes with general form ax<sup>2</sup>+bx+c=0.<br />
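The quadratic form can likewise be checked numerically. This NumPy sketch (hypothetical class parameters, chosen only for illustration) matches the final expression of the derivation against the general form x'Ax + b'x + c:<br />

```python
import numpy as np

# Hypothetical class parameters with distinct covariances (QDA setting)
mu_k, mu_l = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma_k = np.array([[2.0, 0.3], [0.3, 1.0]])
Sigma_l = np.array([[1.0, -0.2], [-0.2, 1.5]])
pi_k, pi_l = 0.6, 0.4

Sk, Sl = np.linalg.inv(Sigma_k), np.linalg.inv(Sigma_l)

def qda_boundary(x):
    # Final expression from the derivation above
    return (np.log(pi_k / pi_l)
            - 0.5 * np.log(np.linalg.det(Sigma_k) / np.linalg.det(Sigma_l))
            - 0.5 * (x @ (Sk - Sl) @ x
                     + mu_k @ Sk @ mu_k - mu_l @ Sl @ mu_l
                     - 2 * x @ (Sk @ mu_k - Sl @ mu_l)))

# The same expression grouped into the general quadratic form x'Ax + b'x + c
A = -0.5 * (Sk - Sl)
bvec = Sk @ mu_k - Sl @ mu_l
c = (np.log(pi_k / pi_l)
     - 0.5 * np.log(np.linalg.det(Sigma_k) / np.linalg.det(Sigma_l))
     - 0.5 * (mu_k @ Sk @ mu_k - mu_l @ Sl @ mu_l))
x = np.array([0.8, -0.4])
```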
<br />
== '''Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009''' ==<br />
<br />
<br />
===Summarizing LDA and QDA===<br />
<br />
We can summarize what we have learned on LDA and QDA so far into the following theorem.<br />
<br />
'''Theorem''': <br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes classifier rule is:<br />
<br />
<math>\,h(X) = \arg\max_{k} \delta_k(x)</math> <br />
<br />
where <br />
<br />
:::<math> \,\delta_k(x) = - \frac{1}{2}\log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + \log (\pi_k) </math> (quadratic)<br />
<br />
*'''Note''' The decision boundary between classes <math>k</math> and <math>l</math> is quadratic in <math>x</math>. <br />
<br />
If the covariance of the Gaussians are the same, this becomes:<br />
<br />
:::<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log (\pi_k) </math> (linear)<br />
<br />
*'''Note''' <math>\,\arg\max_{k} \delta_k(x)</math> returns the value of <math>k</math> for which <math>\,\delta_k(x)</math> attains its largest value.<br />
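The rule in the theorem can be sketched directly. Here is a small NumPy example with two hypothetical Gaussian classes (illustrative means, covariances, and priors), using the quadratic discriminant:<br />

```python
import numpy as np

# Hypothetical 2-class parameters: (mu_k, Sigma_k, pi_k) per class
params = [
    (np.array([0.0, 0.0]), np.eye(2), 0.5),
    (np.array([3.0, 3.0]), 2.0 * np.eye(2), 0.5),
]

def delta(x, mu, Sigma, pi):
    # The quadratic delta_k(x) from the theorem
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * (x - mu) @ Sinv @ (x - mu)
            + np.log(pi))

def h(x):
    # Bayes classifier rule: h(x) = argmax_k delta_k(x)
    return int(np.argmax([delta(x, *p) for p in params]))

label_near_first = h(np.array([0.1, -0.2]))
label_near_second = h(np.array([3.2, 2.9]))
```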
<br />
===In practice===<br />
The true values of <math>\,\pi_k,\mu_k,\Sigma_k</math> are unknown in practice, so we use their sample estimates in place of the true values, i.e.<br />
[[File:estimation.png|250px|thumb|right|Estimation of the probability of belonging to either class k or l]] <br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
In the case where we have a common covariance matrix, we get the ML estimate to be<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)} </math><br />
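These estimators are straightforward to compute. A NumPy sketch on synthetic two-class data (illustrative only, not the course data) follows the formulas above:<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic classes: 60 points near 0, 40 points near 3
X = np.vstack([rng.normal(0, 1, size=(60, 2)), rng.normal(3, 1, size=(40, 2))])
y = np.array([0] * 60 + [1] * 40)

n = len(y)
pi_hat, mu_hat, Sigma_hat, n_k = {}, {}, {}, {}
for k in (0, 1):
    Xk = X[y == k]
    n_k[k] = len(Xk)
    pi_hat[k] = n_k[k] / n                         # prior: class proportion
    mu_hat[k] = Xk.mean(axis=0)                    # class mean
    centered = Xk - mu_hat[k]
    Sigma_hat[k] = centered.T @ centered / n_k[k]  # ML covariance (divide by n_k)

# Pooled (common) covariance for LDA: weighted average of class covariances
Sigma_pooled = sum(n_k[k] * Sigma_hat[k] for k in (0, 1)) / n
```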
<br />
<br />
===Computation===<br />
<br />
<br />
'''Case 1: (Example) <math>\, \Sigma_k = I </math>'''<br />
<br />
This means that the data is distributed symmetrically around its center <math>\,\mu_k</math>, i.e. the isocontours are all circles.<br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
We see that the first term in the above equation, <math>\,\frac{1}{2}log(|I|)</math>, is zero. The second term contains <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math>, which is the squared Euclidean distance between <math>\,x</math> and <math>\,\mu_k</math>. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, <math>\,log(\pi_k)</math>. The class that has the minimum distance will maximise <math>\,\delta_k</math>. According to the theorem, we can then classify the point to a specific class <math>\,k</math>. In addition, <math>\, \Sigma_k = I </math> implies that our data is spherical. <br />
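A minimal sketch of Case 1, with two hypothetical class centers and unequal priors: classification reduces to nearest mean, adjusted by the log prior:<br />

```python
import numpy as np

# Case 1: Sigma_k = I, so delta_k depends only on Euclidean distance and prior
mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]   # hypothetical centers
pis = [0.7, 0.3]

def delta(x, mu, pi):
    return -0.5 * np.sum((x - mu) ** 2) + np.log(pi)

def classify(x):
    return int(np.argmax([delta(x, m, p) for m, p in zip(mus, pis)]))

# The midpoint (2, 0) is equidistant from both means, so the larger prior wins
label_at_midpoint = classify(np.array([2.0, 0.0]))
```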
<br />
<br />
'''Case 2: (General Case) <math>\, \Sigma_k \ne I </math>'''<br />
<br />
We can decompose this as:<br />
<br />
<math> \, \Sigma_k = USV^\top = USU^\top </math> (since <math>\, \Sigma_k </math> is symmetric positive semi-definite, its singular value decomposition coincides with its eigendecomposition, so <math>\, U=V</math>)<br />
<br />
and the inverse of <math>\,\Sigma_k</math> is<br />
<br />
<math> \, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top </math> (since <math>\,U</math> is orthonormal)<br />
<br />
So from the formula for <math>\,\delta_k</math>, the second term is<br />
<br />
:<math> \, (x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) </math><br />
:<math> \, = (x-\mu_k)^\top US^{-1}U^T(x-\mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k) </math><br />
:<math> \, = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
:<math> \, = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) </math><br />
<br />
where we have the Euclidean distance between <math> \, S^{-\frac{1}{2}}U^\top x </math> and <math>\, S^{-\frac{1}{2}}U^\top\mu_k</math>.<br />
<br />
A transformation of all the data points can be done from <math>\,x</math> to <math>\,x^*</math> where <math> \, x^* \leftarrow S^{-\frac{1}{2}}U^\top x </math>.<br />
<br />
It is now possible to do classification with <math>\,x^*</math>, treating it as in Case 1 above.<br />
<br />
Note that when we have multiple classes, they must all share the same transformation; otherwise, we would have to assume ahead of time which class a data point belongs to in order to transform it. All classes therefore need to have the same shape for this method to be applicable, which is exactly the LDA setting.<br />
<br />
If the classes have different shapes, in other words different covariances <math>\,\Sigma_k</math>, can we use the same method to transform all data points <math>\,x</math> to <math>\,x^*</math>?<br />
<br />
The answer is no. Suppose we have two classes with different shapes, and we transform each to the same shape. Given a new data point, which transformation should we use to decide its class? If we use the transformation of class A, we have already assumed that the point belongs to class A.<br />
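The Case 2 transformation can be sketched as follows. With Sigma = U S U', the map x* = S^(-1/2) U' x turns the Mahalanobis distance into a plain Euclidean distance (the covariance below is an illustrative choice):<br />

```python
import numpy as np

# Illustrative symmetric positive-definite covariance
Sigma = np.array([[3.0, 1.0], [1.0, 2.0]])
U, s, _ = np.linalg.svd(Sigma)           # symmetric PSD: SVD = eigendecomposition
W = np.diag(s ** -0.5) @ U.T             # the S^(-1/2) U' transform

x = np.array([1.0, 2.0])
mu = np.array([-1.0, 0.5])

# Mahalanobis distance in the original space ...
mahalanobis_sq = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
# ... equals Euclidean distance between the transformed points
euclidean_sq_transformed = np.sum((W @ x - W @ mu) ** 2)
```

This works because W'W = U S^(-1) U' is exactly the inverse covariance.<br />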
<br />
===The Number of Parameters in LDA and QDA===<br />
<br />
Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.<br />
<br />
LDA: We only need to compare one given class against the remaining <math>\,K-1</math> classes, so there are <math>\,K-1</math> decision boundaries in total. Each boundary <math>\,a^{T}x+b</math> requires <math>\,d+1</math> parameters. Therefore, there are <math>\,(K-1)\times(d+1)</math> parameters.<br />
<br />
QDA: For each of the differences, <math>\,x^{T}ax + b^{T}x + c</math> requires <math>\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1</math> parameters. Therefore, there are <math>(K-1)(\frac{d(d+3)}{2}+1)</math> parameters.<br />
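The two counts can be expressed as simple formulas; this small sketch tabulates them for a few dimensions, showing how quickly QDA's parameter count grows:<br />

```python
# Parameter counts from the text, for K classes in d dimensions
def lda_params(K, d):
    return (K - 1) * (d + 1)

def qda_params(K, d):
    # d(d+3)/2 + 1 parameters per boundary; d(d+3) is always even
    return (K - 1) * (d * (d + 3) // 2 + 1)

# (dimension, LDA count, QDA count) for K = 3 classes
counts = [(d, lda_params(3, d), qda_params(3, d)) for d in (2, 10, 50)]
```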
<br />
[[File:Lda-qda-parameters.png|frame|center|A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]]<br />
<br />
== LDA and QDA in Matlab - October 7, 2009 ==<br />
<br />
We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] that allows us to perform LDA and QDA quickly and easily.<br />
<br />
In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.<br />
<br />
>> plot (sample(1:200,1), sample(1:200,2), '.');<br />
>> hold on;<br />
>> plot (sample(201:400,1), sample(201:400,2), 'r.');<br />
<br />
:Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.<br />
<br />
[[File:2-3-pca.png|frame|center|See [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/title.html <code>title</code>] and [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/legend.html <code>legend</code>] for information on adding the title and legend.]]<br />
<br />
:Before using [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/classify.html <code>classify</code>] we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the <code>group</code> vector should be an empty string or <code>NaN</code>. (See [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/bqziops.html grouping data] for more information.)<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
<br />
:We can now classify our data.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
<br />
:The full details of this line can be examined in the Matlab help file linked above. What we care about are <code>class</code>, which contains the labels that the algorithm thinks that each data point belongs to, and <code>coeff</code>, which contains information about the line that algorithm created to separate the data into each class.<br />
<br />
:We can see the efficacy of the algorithm by comparing <code>class</code> to <code>group</code>.<br />
<br />
>> sum (class==group)<br />
ans =<br />
369<br />
<br />
:This compares the value in <code>class</code> to the value in <code>group</code>. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an ''empirical error rate'' of 0.0775.<br />
<br />
:We can see the line produced by LDA using <code>coeff</code>.<br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
:Those familiar with the programming language C will find the <code>sprintf</code> line refreshingly familiar; those with no exposure to C are directed to Matlab's [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/sprintf.html <code>sprintf</code>] page. Essentially, this code sets up the equation of the line in the form <code>0 = a + bx + cy</code>. We then use the [http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/ref/ezplot.html <code>ezplot</code>] function to plot the line.<br />
<br />
[[File:2-3-lda.png|center|frame|The 2-3 data after LDA is performed. The line shows where the two classes are split.]]<br />
<br />
:Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to <code>classify</code>, and a more complicated procedure to plot the line.<br />
<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');<br />
>> sum (class==group)<br />
ans =<br />
371<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> q = coeff(1,2).quadratic;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));<br />
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File:2-3-qda.png|center|frame|The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is correct on only 2 more data points than LDA; we can see one blue point and one red point that lie on the correct side of the curve but not on the correct side of the line.]]<br />
<br />
<code>classify</code> can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.<br />
<br />
'''Recall: An analysis of the function of <code>princomp</code> in matlab.'''<br />
<br />In Assignment 1 we learned how to perform Principal Component Analysis using the SVD method. Matlab also offers a function called [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/princomp.html <code>princomp</code>] which performs PCA conveniently. The Matlab help file on <code>princomp</code> gives the details of this function; here we analyze the code of <code>princomp()</code> itself to see how it differs from the SVD method. The following is the code of princomp with explanations of the key steps.<br />
<br />
function [pc, score, latent, tsquare] = princomp(x);<br />
% PRINCOMP Principal Component Analysis (centered and scaled data).<br />
% [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and<br />
% returns the principal components in PC, the so-called Z-scores in SCORES,<br />
% the eigenvalues of the covariance matrix of X in LATENT,<br />
% and Hotelling's T-squared statistic for each data point in TSQUARE.<br />
% Reference: J. Edward Jackson, A User's Guide to Principal Components<br />
% John Wiley & Sons, Inc. 1991 pp. 1-25.<br />
% B. Jones 3-17-94<br />
% Copyright 1993-2002 The MathWorks, Inc.<br />
% $Revision: 2.9 $ $Date: 2002/01/17 21:31:45 $<br />
[m,n] = size(x); % get the number of rows and columns of matrix x <br />
r = min(m-1,n); % max possible rank of X <br />
avg = mean(x); % the mean of every column of X<br />
centerx = (x - avg(ones(m,1),:)); <br />
% centers X by subtracting off column means <br />
[U,latent,pc] = svd(centerx./sqrt(m-1),0); <br />
% "economy size" decomposition<br />
score = centerx*pc; <br />
% the representation of X in the principal component space<br />
if nargout < 3<br />
return;<br />
end<br />
latent = diag(latent).^2;<br />
if (r < n)<br />
latent = [latent(1:r); zeros(n-r,1)];<br />
score(:,r+1:end) = 0;<br />
end<br />
if nargout < 4<br />
return;<br />
end<br />
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';<br />
tsquare = sum(tmp.*tmp)';<br />
<br />
From the above code, we should pay attention to the following aspects when comparing with SVD method:<br />
<br />
First, rows of <math>\,X</math> correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of <math>\,X</math>.<br />
>> load 2_3;<br />
>> [U, score] = princomp(X');<br />
<br />
Second, princomp centers X by subtracting off column means.<br />
<br />
Third, when <math>\,X=UdV'</math>, princomp uses <math>\,V</math> as the coefficients for the principal components, rather than <math>\,U</math>.<br />
<br />
The following is an example to perform PCA using princomp and SVD respectively to get the same results.<br />
:SVD method<br />
>> load 2_3<br />
>> mn=mean(X,2);<br />
>> X1=X-repmat(mn,1,400);<br />
>> [s d v]=svd(X1');<br />
>> y=X1'*v;<br />
<br />
:princomp<br />
>>[U score]=princomp(X');<br />
<br />
Then we can see that y=score, v=U.<br />
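The same equivalence can be demonstrated in NumPy on illustrative random data (rather than the 2_3 set): the scores obtained by projecting the centered data onto V equal U times the singular values:<br />

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))            # 400 observations, 5 variables

Xc = X - X.mean(axis=0)                  # center columns, as princomp does
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores_via_projection = Xc @ Vt.T        # analogue of y = X1' * v above
scores_via_svd = U * s                   # equivalently U @ diag(s)
```

Both score matrices are identical because Xc = U diag(s) V' and V'V = I.<br />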
<br />
== Trick: Using LDA to do QDA - October 7, 2009 ==<br />
<br />
There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the [http://en.wikipedia.org/wiki/Kernel_trick Kernel trick] that will be discussed later in the course.<br />
<br />
Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.<br />
<br />
=== Motivation ===<br />
<br />
Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.<br />
<br />
If we look back at the equations for LDA and QDA, we see that in LDA we must estimate <math>\,\mu_1</math>, <math>\,\mu_2</math> and <math>\,\Sigma</math>. In QDA we must estimate all of those, plus another <math>\,\Sigma</math>; the extra <math>\,\frac{d(d+1)}{2}</math> estimated parameters (a symmetric <math>\,d \times d</math> matrix) make QDA less robust with fewer data points.<br />
<br />
=== Theoretically ===<br />
<br />
Suppose we can estimate some vector <math>\underline{w}^T</math> such that<br />
<br />
<math>y = \underline{w}^Tx</math><br />
<br />
where <math>\underline{w}</math> is a d-dimensional column vector, and <math>x \in \mathbb{R}^d</math> (a vector in d dimensions).<br />
<br />
We also have a non-linear function <math>g(x) = y = x^Tvx + \underline{w}^Tx</math> that we cannot estimate.<br />
<br />
Using our trick, we create two new vectors, <math>\,\underline{w}^*</math> and <math>\,x^*</math> such that:<br />
<br />
<math>\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]</math><br />
<br />
and<br />
<br />
<math>x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]</math><br />
<br />
We can then estimate a new function, <math>g^*(x,x^2) = y^* = \underline{w}^{*T}x^*</math>.<br />
<br />
Note that we can do this for any <math>x</math> and in any dimension; we could extend a <math>D \times n</math> matrix to a quadratic dimension by appending another <math>D \times n</math> matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a <math>\,sin(x)</math> dimension.<br />
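The augmentation itself can be sketched in a few lines, assuming a diagonal quadratic term as in the text (w* = [w, v] and x* = [x, x²]): a purely linear function of x* is quadratic in the original x:<br />

```python
import numpy as np

d = 3
rng = np.random.default_rng(2)
w = rng.normal(size=d)                   # linear coefficients
v = rng.normal(size=d)                   # coefficients of the squared terms

x = rng.normal(size=d)
x_star = np.concatenate([x, x ** 2])     # augmented point [x, x^2]
w_star = np.concatenate([w, v])          # augmented weights [w, v]

linear_in_augmented = w_star @ x_star    # linear in x*
quadratic_in_original = w @ x + v @ (x ** 2)  # quadratic in x
```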
<br />
=== By Example ===<br />
<br />
Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.<br />
<br />
>> load 2_3;<br />
>> [U, sample] = princomp(X');<br />
>> sample = sample(:,1:2);<br />
<br />
:We start off the same way, by using PCA to reduce the dimensionality of our data to 2.<br />
<br />
>> X_star = zeros(400,4);<br />
>> X_star(:,1:2) = sample(:,:);<br />
>> for i=1:400<br />
for j=1:2<br />
X_star(i,j+2) = X_star(i,j)^2;<br />
end<br />
end<br />
<br />
:This projects our sample into two more dimensions by squaring our initial two dimensional data set.<br />
<br />
>> group = ones(400,1);<br />
>> group(201:400) = 2;<br />
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');<br />
>> sum (class==group)<br />
ans =<br />
375<br />
<br />
:We can now display our results. <br />
<br />
>> k = coeff(1,2).const;<br />
>> l = coeff(1,2).linear;<br />
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));<br />
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
<br />
[[File: 2_3LDA.png|center|frame| The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.]]<br />
<br />
:Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with <math>x^4</math> (i.e. we set <code>X_star(i,j+2) = X_star(i,j)^4</code>) we can correctly classify 376 points.<br />
<br />
== Introduction to Fisher's Discriminant Analysis - October 7, 2009 ==<br />
<br />
'''Fisher's Discriminant Analysis (FDA)''', also known as '''Fisher's Linear Discriminant Analysis (LDA)''' in some sources, is a classical feature extraction technique. It was originally described in 1936 by [http://en.wikipedia.org/wiki/Ronald_A._Fisher Sir Ronald Aylmer Fisher], an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found [http://digital.library.adelaide.edu.au/dspace/handle/2440/15227 here]; a Wikipedia article summarizing the algorithm can be found [http://en.wikipedia.org/wiki/Linear_discriminant_analysis#Fisher.27s_linear_discriminant here].<br />
<br />
=== Contrasting FDA with PCA ===<br />
<br />
The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).<br />
* In PCA, we map data to lower dimensions to maximize the variation in those dimensions.<br />
* In FDA, we map data to lower dimensions to best separate data in different classes.<br />
<br />
[[File:Fda.png|frame|center|2 clouds of data, and the lines that might be produced by PCA and FDA.]]<br />
<br />
Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.<br />
<br />
Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.<br />
<br />
=== Intuitive Description of FDA ===<br />
<br />
An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.<br />
<br />
<br />
=== Example in R ===<br />
[[File:Pca-fda1_low.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using [http://www.r-project.org R].]]<br />
<br />
>> library(MASS)<br />
>> X = matrix(nrow=400,ncol=2)<br />
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))<br />
>> Y = c(rep("red",200),rep("blue",200))<br />
: Create 2 multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. Create <code>Y</code>, an index indicating which class they belong to.<br />
<br />
>> s <- svd(X,nu=1,nv=1)<br />
: Calculate the singular value decomposition of X. The most significant direction is in <code>s$v[,1]</code>, and is displayed as a black line.<br />
<br />
>> s2 <- lda(X,grouping=Y)<br />
: The <code>lda</code> function, given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in <code>s2$scaling</code>.<br />
<br />
Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.<br />
>> plot(X,col=Y,main="PCA vs. FDA example")<br />
: Plot the set of points, according to colours given in Y.<br />
>> slope = s$v[2]/s$v[1]<br />
>> intercept = mean(X[,2])-slope*mean(X[,1])<br />
>> abline(a=intercept,b=slope)<br />
: Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.<br />
>> slope2 = s2$scaling[2]/s2$scaling[1]<br />
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])<br />
>> abline(a=intercept2,b=slope2,col="red")<br />
: Plot the FLDA direction, again through the mean.<br />
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)<br />
: Labeling the lines directly on the graph makes it easier to interpret.<br />
<br />
== Fisher's Discriminant Analysis (FDA) ==<br />
<br />
The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.<br />
We can consider two kinds of problems:<br />
* 2-class problem<br />
* multi-class problem<br />
<br />
=== Two-class problem - October 9, 2009 ===<br />
In the two-class problem, we know in advance that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the class mean, with each class possibly having a different size. To separate the two classes we must determine which class mean is closest to a given point, while also accounting for the different size of each class, which is represented by the covariance of each class.<br />
<br />
Assume <math>\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{1}</math>,<br />
represent the mean and covariance of the 1st class, and <br />
<math>\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}</math> and <math>\displaystyle\Sigma_{2}</math> represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:<br />
<br />
1.''To make the means of these two classes as far apart as possible''<br />
:In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math> and the projected points are <math>\underline{w}^T \underline{x_{i}}</math>, then the means of the projected points will be <math>\underline{w}^T \underline{\mu_{1}}</math> and <math>\underline{w}^T \underline{\mu_{2}}</math> for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, <math>(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})</math>. The steps of this maximization are given below. <br />
<br />
2.''We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes''<br />
: Notice that the variance of the projected classes 1 and 2 are given by <math>\underline{w}^T\Sigma_{1}\underline{w}</math> and <math>\underline{w}^T\Sigma_{2}\underline{w}</math>. The second goal is to minimize the sum of these two covariances. <br />
<br />
As is demonstrated below, both of these goals can be accomplished simultaneously.<br />
<br/><br />
<br/><br />
Original points are <math>\underline{x_{i}} \in \mathbb{R}^{d}</math><br /><br />
Projected points are <math>\underline{z_{i}} \in \mathbb{R}^{1}</math> with <math>\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}</math><br />
<br />
==== Between class covariance ====<br />
<br />
In this particular case, we want to project all the data points in one dimensional space.<br />
<br />
<br />
We want to maximize the Euclidean distance between projected means, which is<br />
:<math><br />
\begin{align}<br />
(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \, \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\<br />
&= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}<br />
\end{align}<br />
</math><br />
<br />
<br />
The quantity <math>(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math> is called '''between class covariance''' or <math>\,S_{B}</math>.<br />
<br />
The goal is to maximize : <math>\underline{w}^T S_{B} \underline{w}</math><br />
<br />
==== Within class covariance ====<br />
<br />
Covariance of class 1 is <math>\,\Sigma_{1}</math><br />
Covariance of class 2 is <math>\,\Sigma_{2}</math><br />
So covariance of projected points will be <math>\,\underline{w}^T \Sigma_{1} \underline{w}</math> and <math>\underline{w}^T \Sigma_{2} \underline{w}</math><br />
<br />
Summing these two quantities gives<br />
:<math><br />
\begin{align}<br />
\underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w}<br />
\end{align}<br />
</math><br />
<br />
The quantity <math>\,(\Sigma_{1} + \Sigma_{2})</math> is called '''within class covariance''' or <math>\,S_{W}</math><br />
<br />
The goal is to minimize <math>\underline{w}^T S_{W} \underline{w}</math><br />
<br />
==== Objective Function ====<br />
<br />
Instead of maximizing <math>\underline{w}^T S_{B} \underline{w}</math> and minimizing <math>\underline{w}^T S_{W} \underline{w}</math> we can define the following objective function:<br />
<br />
:<math>\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}</math><br />
<br />
This maximization problem is equivalent to <math>\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}</math> subject to constraint <math>\underline{w}^T S_{W} \underline{w} = 1</math>. We can use the Lagrange multiplier method to solve it:<br />
<br />
:<math>L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)</math><br />
<br /><br /><br />
With <math>\frac{\part L}{\part \underline{w}} = 0</math> we get:<br />
:<math><br />
\begin{align}<br />
&\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\<br />
&\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\<br />
&\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}<br />
\end{align}<br />
</math><br />
Note that <math>\, S_{W}=\Sigma_1+\Sigma_2</math> is the sum of two positive semi-definite covariance matrices; it is invertible provided it is positive definite, which is the typical case in practice.<br />
<br />
Here <math>\underline{w}</math> is the eigenvector of <math>S_{w}^{-1}\ S_{B}</math> corresponding to the largest eigenvalue.<br />
<br />
In fact, this expression can be simplified even further.<br><br />
:<math>\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}</math> with <math>S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T</math><br />
:<math>\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}</math><br />
<br />
The quantities <math>(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}</math> and <math>\lambda</math> are scalars.<br><br />
So we can say the quantity <math>S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})</math> is proportional to <math>\underline{w}</math><br />
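As a numerical sanity check, the closed form <math>\underline{w} \propto S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})</math> can be compared with the top eigenvector of <math>S_{W}^{-1}S_{B}</math>. The following is an illustrative sketch in Python with NumPy (the course examples use Matlab; all variable names here are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 1.5], [1.5, 3.0]])
X1 = rng.multivariate_normal([1, 1], cov, size=300)  # class 1
X2 = rng.multivariate_normal([5, 3], cov, size=300)  # class 2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = np.cov(X1.T) + np.cov(X2.T)          # within class covariance
Sb = np.outer(mu1 - mu2, mu1 - mu2)       # between class covariance

# Direct formula: w is proportional to Sw^{-1} (mu1 - mu2)
w = np.linalg.solve(Sw, mu1 - mu2)
w /= np.linalg.norm(w)

# Cross-check: top eigenvector of Sw^{-1} Sb gives the same direction
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
w_eig = np.real(vecs[:, np.argmax(np.real(vals))])
w_eig /= np.linalg.norm(w_eig)

print(abs(np.dot(w, w_eig)))  # close to 1: same direction up to sign
```

Both computations recover the same direction up to sign, as the derivation predicts.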
<br />
====FDA vs. PCA Example in Matlab ====<br />
<br />
We can compare PCA and FDA through a figure produced in Matlab.<br />
<br />
The following code produces the figure step by step, with an explanation of each step.<br />
<br />
>>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);<br />
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);<br />
>>X=[X1;X2];<br />
: Create two multivariate normal random variables with <math>\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)</math>. <br />
<br />
>>plot(X(1:300,1),X(1:300,2),'.');<br />
>>hold on<br />
>>p1=plot(X(301:600,1),X(301:600,2),'r.');<br />
: Plot the data of the two classes respectively.<br />
<br />
>>[U Y]=princomp(X);<br />
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);<br />
: Use PCA to find the principal component and plot it.<br />
<br />
>>sw=2*[1 1.5;1.5 3];<br />
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';<br />
>>g =inv(sw)*sb;<br />
>>[v w]=eigs(g);<br />
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')<br />
: Use FDA to find the discriminant direction and plot it.<br />
<br />
Now we can compare them through the figure.<br />
<br />
[[File:PCA-VS-FDA.png|frame|center|PCA and FDA primary dimension for normal multivariate data, using matlab]]<br />
<br />
==== Practical example of 2_3 ====<br />
<br />
In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.<br />
<br />
X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".<br />
<br />
>>load 2_3<br />
>>X1 = X(:, 1:200);<br />
>>X2 = X(:, 201:400);<br />
<br />
Next we calculate within class covariance and between class covariance as before.<br />
<br />
>>mu1 = mean(X1, 2);<br />
>>mu2 = mean(X2, 2);<br />
>>sb = (mu1 - mu2) * (mu1 - mu2)';<br />
>>sw = cov(X1') + cov(X2');<br />
<br />
We use the first two eigenvectors to project the data into a two-dimensional space.<br />
<br />
>>[v d] = eigs( inv(sw) * sb );<br />
>>w = v(:, 1:2);<br />
>>X_hat = w'*X;<br />
<br />
Finally we plot the data and visualize the effect of FDA.<br />
<br />
>> scatter(ones(1,200),X_hat(1:200))<br />
>> hold on<br />
>> scatter(ones(1,200),X_hat(201:400),'r')<br />
<br />
[[File:fda2-3.jpg|frame|center|FDA projection of data 2_3, using [http://www.mathwork.com Matlab].]]<br />
<br />
== FDA for Multi-class Problems - October 14, 2009 ==<br />
<br />
For the <math>k</math>-class problem, we need to find a projection from<br />
<math>d</math>-dimensional space to a <math>(k-1)</math>-dimensional space.<br />
<br />
The within class covariance matrix <math>\mathbf{S}_{W}</math> is easy to obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{S}_{W,i} = \sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i})^{T}</math> and <math>\mathbf{\mu}_{i} = \frac{\sum_{j:<br />
y_{j}=i}\mathbf{x}_{j}}{n_{i}}</math>. (Note that the class scatter matrices are not normalized by <math>n_{i}</math> here, so that the decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> derived below holds exactly.)<br />
<br />
However, the between class covariance matrix<br />
<math>\mathbf{S}_{B}</math> is not as easy to obtain directly. One simplification<br />
is to note that the total covariance <math>\mathbf{S}_{T}</math> of the data is<br />
fixed; since <math>\mathbf{S}_{W}</math> is easy to compute, we can get<br />
<math>\mathbf{S}_{B}</math> using the following relationship:<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W}<br />
\end{align}<br />
</math><br />
<br />
In fact, there is a more general derivation of <math>\mathbf{S}_{B}</math>. Denote the<br />
total mean vector <math>\mathbf{\mu}</math> by<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} =<br />
\frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j}<br />
\end{align}<br />
</math><br />
<br />
Thus the total covariance matrix <math>\mathbf{S}_{T}</math> is<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} =<br />
\sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} -<br />
\mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} <br />
\\&<br />
= \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+<br />
\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\\&<br />
= \mathbf{S}_{W} + \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
The first term in the last line is the within class covariance <math>\mathbf{S}_{W}</math>; we therefore define the second term to be<br />
the general between class covariance matrix <math>\mathbf{S}_{B}</math>, so that <math>\mathbf{S}_{T}</math> decomposes as the sum of the two. Thus we obtain<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B} = \sum_{i=1}^{k}<br />
n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}<br />
\end{align}<br />
</math><br />
<br />
Therefore,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}<br />
\end{align}<br />
</math><br />
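The decomposition <math>\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}</math> can be verified numerically. Here is an illustrative Python/NumPy sketch (not part of the original lecture; it uses unnormalized scatter matrices, for which the identity holds exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
# Three classes in 4 dimensions, 50 points each (illustrative data)
means = [np.zeros(4), np.full(4, 3.0), np.array([0, 3, 0, 3.0])]
Xs = [rng.normal(m, 1.0, size=(50, 4)) for m in means]
X = np.vstack(Xs)
mu = X.mean(axis=0)                                  # total mean

# S_W: sum over classes of the scatter about each class mean
Sw = sum((Xi - Xi.mean(0)).T @ (Xi - Xi.mean(0)) for Xi in Xs)
# S_B: sum over classes of n_i (mu_i - mu)(mu_i - mu)^T
Sb = sum(len(Xi) * np.outer(Xi.mean(0) - mu, Xi.mean(0) - mu) for Xi in Xs)
# S_T: total scatter about the overall mean
St = (X - mu).T @ (X - mu)

print(np.allclose(St, Sw + Sb))  # True
```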
<br />
Recall that in the two class case problem, we defined<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B^{\ast}} =<br />
(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\end{align}<br />
</math><br />
<br />
To relate this to the general form, note that <math>\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}</math>, so<br />
:<math><br />
\begin{align}<br />
\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2}), \qquad<br />
\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})<br />
\end{align}<br />
</math><br />
<br />
Substituting into the general form,<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B} =<br />
n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}<br />
+<br />
n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T}<br />
\\ & = \left(\frac{n_{1}n_{2}^{2}}{n^{2}}+\frac{n_{2}n_{1}^{2}}{n^{2}}\right)(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T}<br />
\\ & = \frac{n_{1}n_{2}}{n}\,\mathbf{S}_{B^{\ast}}<br />
\end{align}<br />
</math><br />
Thus, for two classes, the general <math>\mathbf{S}_{B}</math> is simply a positive multiple of <math>\mathbf{S}_{B^{\ast}}</math>, so both lead to the same discriminant direction.<br />
<br />
Now we look for the optimal transformation. For each data point we set<br />
:<math><br />
\begin{align}<br />
\mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i},<br />
\qquad i=1,2,...,n<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{z}_{i}</math> is a <math>(k-1)\times 1</math> vector, <math>\mathbf{W}</math><br />
is a <math>d\times (k-1)</math> transformation matrix, i.e. <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>, and <math>\mathbf{x}_{i}</math><br />
is a <math>d\times 1</math> column vector.<br />
<br />
Thus we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T}<br />
\\ & = \sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j:<br />
y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
Similarly, we obtain<br />
:<math><br />
\begin{align}<br />
& \mathbf{S}_{B}^{\ast} =<br />
\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\left[<br />
\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W}<br />
\\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Now, we use the determinant of the matrix, i.e. the product of the<br />
eigenvalues of the matrix, as our measure.<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} =<br />
\frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|}<br />
\end{align}<br />
</math><br />
<br />
The solution to this optimization problem is that the columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
<br />
Also, note that we can use<br />
:<math><br />
\begin{align}<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\end{align}<br />
</math><br />
as our measure.<br />
<br />
Recall that<br />
:<math><br />
\begin{align}<br />
\|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X})<br />
\end{align}<br />
</math><br />
<br />
Thus we obtain that<br />
:<math><br />
\begin{align}<br />
&<br />
\sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2}<br />
\\ & =<br />
\sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}]<br />
\\ & =<br />
Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & =<br />
Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}]<br />
\\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]<br />
\end{align}<br />
</math><br />
<br />
Similarly, we can get <math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]</math>. Thus we have following criterion function<br />
:<math><br />
\begin{align}<br />
\phi(\mathbf{W}) =<br />
\frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}<br />
\end{align}<br />
</math><br />
Similar to the two class case problem, we have:<br />
<br />
max <math>Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]</math> subject to<br />
<math>Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1</math><br />
<br />
To solve this optimization problem a Lagrange multiplier <math>\Lambda</math>, which is in fact a <math>(k-1) \times (k-1)</math> diagonal matrix, is introduced:<br />
:<math><br />
\begin{align}<br />
L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\}<br />
\end{align}<br />
</math><br />
<br />
Differentiating with respect to <math>\mathbf{W}</math> we obtain:<br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}<br />
\end{align}<br />
</math><br />
<br />
Note that <math>\mathbf{S}_{B}</math> and <math>\mathbf{S}_{W}</math> are both symmetric matrices; thus, setting the first derivative to zero, we obtain:<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0<br />
\end{align}<br />
</math><br />
<br />
Thus,<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}<br />
\end{align}<br />
</math><br />
where<br />
:<math><br />
\mathbf{\Lambda} =<br />
\begin{pmatrix}<br />
\lambda_{1} & & 0\\<br />
&\ddots&\\<br />
0 & &\lambda_{k-1}<br />
\end{pmatrix}<br />
</math><br />
and <math>\mathbf{W} =<br />
[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]</math>.<br />
<br />
As a matter of fact, <math>\mathbf{\Lambda}</math> has at most <math>\,k-1</math> nonzero eigenvalues, because <math>rank({S}_{W}^{-1}\mathbf{S}_{B}) \le k-1</math>.<br />
<br />
Therefore, the solution is the same as in the previous case. The columns of the transformation matrix<br />
<math>\mathbf{W}</math> are exactly the eigenvectors that correspond to the largest <math>k-1</math><br />
eigenvalues with respect to<br />
:<math><br />
\begin{align}<br />
\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =<br />
\lambda_{i}\mathbf{w}_{i}<br />
\end{align}<br />
</math><br />
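The whole multi-class procedure can be sketched compactly. The following Python/NumPy code is an illustrative implementation of the derivation above (the function name and the synthetic data are our own, not from the lecture):

```python
import numpy as np

def fda(X, y, k_out=None):
    """Project rows of X onto the leading eigenvectors of S_W^{-1} S_B.

    X: (n, d) data matrix (rows are samples), y: (n,) integer labels.
    Returns the (d, k-1) transformation matrix W.
    """
    classes = np.unique(y)
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)           # within class scatter
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)  # between class scatter
    k_out = k_out or len(classes) - 1
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(np.real(vals))[::-1]     # largest eigenvalues first
    return np.real(vecs[:, order[:k_out]])

# Usage: three well-separated Gaussian classes in 5 dimensions
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.3, size=(40, 5))
               for m in (np.zeros(5), np.eye(5)[0] * 4, np.eye(5)[1] * 4)])
y = np.repeat([0, 1, 2], 40)
W = fda(X, y)        # shape (5, 2): projects to k-1 = 2 dimensions
Z = X @ W
```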
<br />
== Linear Regression Models - October 14, 2009 ==<br />
<br />
Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, <math>\,y</math>, changes according to changes in <math>\,X</math>.<br />
<br />
We will start by considering a very simple regression model, the linear regression model.<br />
<br />
General information on linear regression can be found at the [http://numericalmethods.eng.usf.edu/topics/linear_regression.html University of South Florida] and [http://academicearth.org/lectures/applications-to-linear-estimation-least-squares this MIT lecture].<br />
<br />
For the purpose of classification, the linear regression model assumes<br />
that the regression function <math>\,E(Y|X)</math> is linear in the inputs<br />
<math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}</math>.<br />
<br />
The linear regression model has the general form:<br />
<br />
:<math><br />
\begin{align}<br />
f(\mathbf{x}) = \beta^{T}\mathbf{x}+\beta_{0}<br />
\end{align}<br />
</math><br />
where <math>\,\beta</math> is a <math>d \times 1</math> vector.<br />
<br />
Given input data <math>\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}</math> and <math>\,y_{1}, ..., y_{n}</math> our goal is to find <math>\,\beta</math> and <math>\,\beta_0</math> such that the linear model fits the data while minimizing the sum of squared errors, using the Least Squares method.<br />
<br />
Note that vectors <math>\mathbf{x}_{i}</math> could be numerical inputs,<br />
transformations of the original data, i.e. <math>\log \mathbf{x}_{i}</math> or <math>\sin<br />
\mathbf{x}_{i}</math>, or basis expansions, i.e. <math>\mathbf{x}_{i}^{2}</math> or<br />
<math>\mathbf{x}_{i}\times \mathbf{x}_{j}</math>.<br />
<br />
Denote <math>\mathbf{X}</math> as a <math>n\times(d+1)</math> matrix with each row an input<br />
vector (with 1 in the first position), <math>\,\beta = (\beta_0,<br />
\beta_1,..., \beta_{d})^{T}</math> and <math>\mathbf{y}</math> as a <math>n \times 1</math><br />
vector of outputs. We then try to minimize the residual<br />
sum-of-squares<br />
<br />
:<math><br />
\begin{align}<br />
\mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
This is a quadratic function in the <math>\,d+1</math> parameters. Differentiating<br />
with respect to <math>\,\beta</math> we obtain<br />
:<math><br />
\begin{align}<br />
\frac{\partial \mathrm{RSS}}{\partial \beta} =<br />
-2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)<br />
\end{align}<br />
</math><br />
<br />
:<math><br />
\begin{align}<br />
\frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial<br />
\beta^{T}}=2\mathbf{X}^{T}\mathbf{X}<br />
\end{align}<br />
</math><br />
<br />
Set the first derivative to zero<br />
:<math><br />
\begin{align}<br />
\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0<br />
\end{align}<br />
</math><br />
<br />
we obtain the solution<br />
:<math><br />
\begin{align}<br />
\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
Thus the fitted values at the inputs are<br />
:<math><br />
\begin{align}<br />
\mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}<br />
\end{align}<br />
</math><br />
<br />
where <math>\mathbf{H} = \mathbf{X}<br />
(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}</math> is called the hat matrix.<br />
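The closed-form solution and the hat matrix can be illustrated with a short sketch in Python with NumPy (an illustration of the formulas above, not code from the course; all names are our own):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 3
# Design matrix with a leading column of 1s for the intercept beta_0
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
beta_true = np.array([2.0, 1.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T maps y to the fitted values
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print(np.allclose(y_hat, X @ beta_hat))  # True: H y equals X beta_hat
```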
<br />
<br/><br />
*'''Note''' For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:<br/><br />
<math>r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}</math><br/><br />
It is clear that to make sense mathematically, <math>\displaystyle r(x)</math> must be a value between 0 and 1. If this is estimated with the <br />
regression function <math>\displaystyle r(x)=E(Y|X=x)</math> and <math>\mathbf{\hat\beta} </math> is learned as above, then there is nothing that would restrict <math>\displaystyle r(x)</math> to taking values between 0 and 1.<br />
<br />
====A linear regression example in Matlab====<br />
<br />
We can see how linear regression works through the following Matlab example; the code and an explanation are given for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample;ones(1,400)];<br />
Construct x by appending a row of ones to the data.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
==Logistic Regression- October 16, 2009==<br />
===Intuition behind Logistic Regression===<br />
Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force <math>\,r(x)</math> to be between 0 and 1 and sum to 1. Consider the following [http://en.wikipedia.org/wiki/Logit log odds] model (for two classes):<br />
<br />
:<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx</math><br />
<br />
Calculating <math>\,P(Y=1|X=x)</math> leads us to the logistic regression model, which as opposed to the linear regression model, allows the modeling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1.<br />
<br />
This logistic regression model for the two class case is defined as: <br/><br />
:<math>P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}</math> <br />
<br/> <br />
[[File:Picture1.png |frame|center]]<br />
Then we have that <math>P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}</math><br />
[[File:Picture2.png |frame|center]]<br />
<br />
Logistic regression fits a distribution to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood: the maximum likelihood estimate of <math>\underline\beta</math> maximizes the probability of obtaining the observed data <math>\displaystyle{x_{1},...,x_{n}}</math> under the assumed model. Combining <math>\displaystyle P(Y=1 | X=x)</math> and <math>\displaystyle P(Y=0 | X=x)</math> as follows, we can consider the two classes at the same time:<br />
<br />
:<math>\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}</math><br />
<br />
Assuming the data <math>\displaystyle {x_{1},...,x_{n}}</math> is drawn independently, the likelihood function is <br />
<br />
:<math><br />
\begin{align}<br />
\mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\<br />
&=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\<br />
&= \prod_{i=1}^n p(x_{i};\theta)<br />
\end{align}<br />
</math><br />
<br />
Since it is more convenient to work with the log-likelihood function, taking the log of both sides, we get<br />
:<math>\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)</math><br />
<br />
So,<br />
:<math><br />
\begin{align}<br />
l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\<br />
&= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
&=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\<br />
\end{align}<br />
</math><br />
<br />
<br />
To maximize the log-likelihood, set its derivative to 0.<br />
:<math><br />
\begin{align}<br />
\frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\<br />
&=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right]<br />
\end{align}<br />
</math><br />
<br />
To solve this equation, the [http://numericalmethods.eng.usf.edu/topics/newton_raphson.html Newton-Raphson algorithm] is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.<br />
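Before moving on, the gradient formula <math>\sum_{i}\left[y_i - p(\underline{x}_i;\underline{\beta})\right]\underline{x}_i</math> can be checked against a finite-difference approximation of the log-likelihood. The following Python/NumPy sketch is our own illustration (rows of <math>X</math> are samples):

```python
import numpy as np

def log_likelihood(beta, X, y):
    # l(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ]
    s = X @ beta
    return np.sum(y * s - np.log1p(np.exp(s)))

def gradient(beta, X, y):
    # dl/dbeta = sum_i (y_i - p_i) x_i  with p_i = sigmoid(beta^T x_i)
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
beta = rng.normal(size=3)

# Central finite-difference check of the analytic gradient
eps = 1e-6
g_num = np.array([(log_likelihood(beta + eps * e, X, y)
                   - log_likelihood(beta - eps * e, X, y)) / (2 * eps)
                  for e in np.eye(3)])
print(np.allclose(gradient(beta, X, y), g_num, atol=1e-4))  # True
```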
<br />
== Logistic Regression(2) - October 19, 2009 ==<br />
<br />
===Logistic Regression Model===<br />
<br />
Recall that in the last lecture, we learned the logistic regression model.<br />
<br />
* <math>P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
* <math>P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}</math><br />
<br />
===Find <math>\underline{\beta}</math> ===<br />
<br />
'''Criteria''': find a <math>\underline{\beta}</math> that maximizes the conditional likelihood of Y given X using the training data.<br />
<br />
From above, we have the first derivative of the log-likelihood:<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] </math><br />
<math>=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]</math><br />
<br />
The [http://en.wikipedia.org/wiki/Newton%27s_method Newton-Raphson algorithm] requires the second-derivative or [http://en.wikipedia.org/wiki/Hessian_matrix Hessian matrix].<br />
<br />
<math>\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }=<br />
\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T(1+exp(\underline{\beta}^T \underline{x}_i))-exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))^2}</math> <br />
<br />
('''note''': <math>\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T</math>; you can check it [http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html here], a very useful Matrix Reference Manual with information about linear algebra and the properties of real and complex matrices.)<br />
<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \frac{exp(\underline{\beta}^T\underline{x}_i)\underline{x}_i^T}{(1+exp(\underline{\beta}^T \underline{x}_i))(1+exp(\underline{\beta}^T \underline{x}_i))}</math> (by cancellation)<br />
<br />
::<math>=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> (since <math>P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x}_i)}{1+exp(\underline{\beta}^T \underline{x}_i)}</math> and <math>1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}</math>)<br />
<br />
Starting with <math>\,\underline{\beta}^{old}</math>, the Newton-Raphson update is<br />
<br />
<math>\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})</math> where the derivatives are evaluated at <math>\,\underline{\beta}^{old}</math><br />
<br />
The iteration will terminate when <math>\underline{\beta}^{new}</math> is very close to <math>\underline{\beta}^{old}</math>.<br />
<br />
The iteration can be described in matrix form.<br />
<br />
* Let <math>\,\underline{Y}</math> be the column vector of <math>\,y_i</math>. (<math>n\times1</math>)<br />
* Let <math>\,X</math> be the <math>{d}\times{n}</math> input matrix.<br />
* Let <math>\,\underline{P}</math> be the <math>{n}\times{1}</math> vector with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})</math>.<br />
* Let <math>\,W</math> be an <math>{n}\times{n}</math> diagonal matrix with <math>i</math>th element <math>P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]</math><br />
<br />
then<br />
<br />
<math>\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})</math><br />
<br />
<math>\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T</math><br />
<br />
The Newton-Raphson step is<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})</math><br />
<br />
This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.<br />
<br />
<math><br />
\begin{align}<br />
\underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\<br />
&=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\<br />
&=(XWX^T)^{-1}XWZ<br />
\end{align}</math><br />
<br />
where <math>Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})</math><br />
<br />
Recall that linear regression by least squares solves <math>\min_{\underline{\beta}}(\underline{y}-X^T\underline{\beta})^T(\underline{y}-X^T\underline{\beta})</math>, for which<br />
<br />
we have <math>\underline{\beta}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{new} \leftarrow \arg\min_{\underline{\beta}}(Z-X^T\underline{\beta})^TW(Z-X^T\underline{\beta})</math><br />
<br />
====WLS====<br />
Actually, the weighted least squares estimator minimizes the weighted error sum of squares<br />
<math><br />
S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}<br />
</math><br />
where <math>w_{i}>0</math>.<br />
Hence the WLS estimator is given by <br />
<math><br />
\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]<br />
</math><br />
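As an illustrative check (a Python/NumPy sketch of our own, with rows of <math>X</math> as samples), the WLS closed form coincides with ordinary least squares applied to the data scaled by <math>\sqrt{w_i}</math>:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 40, 3
X = rng.normal(size=(n, d))          # rows are x_i^T
y = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)    # positive weights w_i > 0

# Closed form: beta = [sum w_i x_i x_i^T]^{-1} [sum w_i x_i y_i]
A = (X * w[:, None]).T @ X
b = X.T @ (w * y)
beta_wls = np.linalg.solve(A, b)

# Equivalent: OLS on rows and responses scaled by sqrt(w_i)
sw = np.sqrt(w)
beta_scaled, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

print(np.allclose(beta_wls, beta_scaled))  # True
```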
<br />
In our setting, the weights are <math>w_{i}=P(\underline{x}_i;\beta^{old})[1-P(\underline{x}_i;\beta^{old})]</math>, and the weighted regression is performed on the iteratively computed response<br />
<math><br />
\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})<br />
</math><br />
<br />
Therefore, we obtain<br />
:<math><br />
\begin{align}<br />
& \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] <br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right]<br />
\\&<br />
= \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\&<br />
= \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p})<br />
\end{align}<br />
</math><br />
<br />
<br />
'''Note:''' Here we obtain <math>\underline{\beta}</math>, which is a <math>d\times{1}</math> vector, because we construct the model as <math>\underline{\beta}^T\underline{x}</math>. If we construct the model as <math>\underline{\beta}_0+ \underline{\beta}^T\underline{x}</math>, then, as in linear regression, <math>\underline{\beta}</math> will be a <math>(d+1)\times{1}</math> vector.<br />
<br/><br />
:Choosing <math>\displaystyle\beta=0</math> is a suitable starting value for the Newton-Raphson iteration in this case. However, this does not guarantee convergence. The procedure usually converges, since the log-likelihood function is concave. When global convergence cannot be shown, only local convergence of the method can be proved, meaning the iteration converges only if the initial point is close enough to the exact solution. In practice, choosing an appropriate initial value is rarely a problem: an initial point is seldom so far from the exact solution that the iteration fails. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Moreover, step-size halving can be used to deal with non-convergence. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009), 121.</ref><br />
<br />
====Pseudo Code====<br />
#<math>\underline{\beta} \leftarrow 0</math><br />
#Set <math>\,\underline{Y}</math>, the label associated with each observation <math>\,i=1...n</math>.<br />
#Compute <math>\,\underline{P}</math> according to the equation <math>P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}</math> for all <math>\,i=1...n</math>.<br />
#Compute the diagonal matrix <math>\,W</math> by setting <math>\,w_{i,i}</math> to <math>P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]</math> for all <math>\,i=1...n</math>.<br />
#<math>Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})</math>.<br />
#<math>\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ</math>.<br />
#If the new <math>\underline{\beta}</math> value is sufficiently close to the old value, stop; otherwise go back to step 3.<br />
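The pseudo code above can be turned into a short program. The following is an illustrative Python sketch (the course examples use Matlab; this is a translation, not lecture code), specialized to <math>d=2</math> so the <math>2\times 2</math> system in steps 5–6 can be solved directly, and run on hypothetical simulated data:

```python
import math
import random

random.seed(1)

def solve2(a11, a12, a22, b1, b2):
    # Solve the symmetric 2x2 system [[a11,a12],[a12,a22]] [d1,d2] = [b1,b2].
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def irls(points, labels, n_iter=10):
    # Follows the pseudo code above for d = 2 (model beta^T x, no intercept).
    beta = [0.0, 0.0]                                      # step 1
    for _ in range(n_iter):
        p = [1 / (1 + math.exp(-(beta[0] * x1 + beta[1] * x2)))
             for x1, x2 in points]                         # step 3
        w = [pi * (1 - pi) for pi in p]                    # step 4
        # Steps 5-6 collapse to the Newton update
        #   beta <- beta + (X W X^T)^{-1} X (y - p).
        r = [yi - pi for yi, pi in zip(labels, p)]
        a11 = sum(wi * x1 * x1 for wi, (x1, _) in zip(w, points))
        a12 = sum(wi * x1 * x2 for wi, (x1, x2) in zip(w, points))
        a22 = sum(wi * x2 * x2 for wi, (_, x2) in zip(w, points))
        b1 = sum(x1 * ri for (x1, _), ri in zip(points, r))
        b2 = sum(x2 * ri for (_, x2), ri in zip(points, r))
        d1, d2 = solve2(a11, a12, a22, b1, b2)
        beta = [beta[0] + d1, beta[1] + d2]
    return beta

# Two overlapping Gaussian clouds; labels are 0/1 as in the pseudo code.
pts = [(random.gauss(-1, 1), random.gauss(0, 1)) for _ in range(50)] + \
      [(random.gauss(1, 1), random.gauss(0, 1)) for _ in range(50)]
ys = [0] * 50 + [1] * 50
beta_hat = irls(pts, ys)
```

Since the first coordinate separates the two clouds, the fitted first coefficient comes out positive; Newton's method typically settles in well under ten iterations on data like this.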
<br />
===Comparison with Linear Regression===<br />
*'''Similarities'''<br />
#Both attempt to estimate <math>\,P(Y=k|X=x)</math> (for logistic regression we have so far considered only the two-class case, <math>\,k=0</math> or <math>\,k=1</math>).<br />
#Both have linear boundaries.<br />
:'''note:'''For linear regression, we assume the model is linear. The boundary is <math>P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5</math> (linear)<br />
<br />
::For logistic regression, the boundary is <math>P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0</math> (linear) <br />
<br />
*'''Differences'''<br />
#Linear regression: <math>\,P(Y=k|X=x)</math> is a linear function of <math>\,x</math>; it is not guaranteed to fall between 0 and 1 or to sum to 1.<br />
#Logistic regression: <math>\,P(Y=k|X=x)</math> is a nonlinear function of <math>\,x</math>; it is guaranteed to range from 0 to 1 and to sum to 1.<br />
<br />
===Comparison with LDA===<br />
#The linear logistic model considers only the conditional distribution <math>\,P(Y=k|X=x)</math>. No assumption is made about <math>\,P(X=x)</math>.<br />
#The LDA model specifies the joint distribution of <math>\,X</math> and <math>\,Y</math>.<br />
#Logistic regression maximizes the conditional likelihood of <math>\,Y</math> given <math>\,X</math>: <math>\,P(Y=k|X=x)</math><br />
#LDA maximizes the joint likelihood of <math>\,Y</math> and <math>\,X</math>: <math>\,P(Y=k,X=x)</math>.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in logistic regression is <math>\,d</math>. The number of parameters grows linearly w.r.t. dimension.<br />
#If <math>\,\underline{x}</math> is d-dimensional, the number of adjustable parameters in LDA is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>. The number of parameters grows quadratically w.r.t. dimension.<br />
#Since logistic regression relies on fewer assumptions, it tends to be more robust.<br />
#In practice, logistic regression and LDA often give similar results.<br />
<br />
====By example====<br />
<br />
Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.<br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
>>plot (sample(1:200,1), sample(1:200,2), '.');<br />
>>hold on;<br />
>>plot (sample(201:400,1), sample(201:400,2), 'r.'); <br />
:First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.<br />
<br />
>>group = ones(400,1);<br />
>>group(201:400) = 2;<br />
:Group the data points.<br />
<br />
>>[B,dev,stats] = mnrfit(sample,group);<br />
>>x=[ones(1,400); sample'];<br />
:Now we use [http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/mnrfit.html mnrfit] to classify the data with logistic regression. This function returns B, a <math>(d+1)\times{(k-1)}</math> matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a <math>3\times{1}</math> matrix.<br />
<br />
>> B<br />
B =0.1861<br />
-5.5917<br />
-3.0547<br />
<br />
:This is our <math>\underline{\beta}</math>. So the posterior probabilities are:<br />
:<math>P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math>.<br />
:<math>P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}</math><br />
<br />
:The classification rule is:<br />
:<math>\hat Y = 1</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2\geq 0</math><br />
:<math>\hat Y = 2</math>, if <math>\,0.1861-5.5917X_1-3.0547X_2<0</math><br />
<br />
>>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));<br />
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])<br />
:Plot the decision boundary by logistic regression.<br />
[[File:Boundary-lr.png|frame|center|This is a decision boundary by logistic regression.The line shows how the two classes split.]]<br />
<br />
>>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');<br />
>>k = coeff(1,2).const;<br />
>>l = coeff(1,2).linear;<br />
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));<br />
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);<br />
:Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.<br />
<br />
[[File:Boundary-lda.png|frame|center| From this figure, we can see that the results of Logistic Regression and LDA are very similar.]]<br />
<br />
== ''' 2009.10.21''' ==<br />
<br />
=== Multi-Class Logistic Regression ===<br />
<br />
Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx</math><br /><br /><br />
<br />
We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.<br /><br /><br />
<br />
<math>\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\} </math><br /><br /><br />
<br />
The posteriors for each class are given by,<br /><br /><br />
<br />
<br />
<math>P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}</math><br /><br /><br />
<br />
<math>P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}</math><br /><br /><br />
<br />
Note that we still retain the property that the posteriors sum to 1. In general, however, the posteriors are no longer complements of each other, as is true in the 2-class problem. Fitting a logistic model for the K&gt;2 class problem isn't as 'nice' as in the 2-class problem since we don't have the same simplification.<br />
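The two displayed formulas can be computed directly from the K − 1 coefficient vectors. The following is a small Python sketch (an illustration, not course code) with hypothetical coefficients for a 3-class problem in 2 dimensions; it also confirms the posteriors sum to 1:

```python
import math

def multiclass_posteriors(x, betas):
    # Posterior probabilities for a K-class logistic model, given the
    # K-1 coefficient vectors beta_1, ..., beta_{K-1}; class K is the
    # arbitrarily chosen reference class in the denominator.
    scores = [sum(b * xj for b, xj in zip(beta, x)) for beta in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]   # classes 1 .. K-1
    probs.append(1.0 / denom)                       # reference class K
    return probs

# Hypothetical coefficients for a 3-class problem in 2 dimensions.
p = multiclass_posteriors([1.0, -0.5], [[0.8, 0.2], [-0.3, 0.6]])
```

Each entry of `p` lies strictly between 0 and 1 and the entries sum to 1, matching the property noted above.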
<br />
=== The Perceptron ===<br />
[[Image:Simpleperceptron.jpg|thumb|right|325px|Figure 1: Diagram of a linear perceptron.]]<br />
Recall the use of least squares regression as a classifier, shown earlier to be equivalent to LDA. To classify points with least squares we take the sign of a linear combination of the data point's features and assign a label equivalent to 1 or -1.<br />
<br />
In the 1950s [http://en.wikipedia.org/wiki/Frank_Rosenblatt Frank Rosenblatt] developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of [http://en.wikipedia.org/wiki/Biological_neural_network biological neurons]. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref><br />
<br />
As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>. <br />
<br />
Due to the iterative nature of the solution, the problem has no global minimum (the objective is not convex), and the method does not give a unique hyperplane. If the classes are separable then the algorithm can be shown to converge to a local minimum; the proof of this convergence is known as the ''perceptron convergence theorem''. However, for overlapping classes convergence cannot be guaranteed.<br /><br /><br />
<br />
As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.<br />
<br />
==The Perceptron (Lecture October 23, 2009)==<br />
[[File:misclass.png|300px|thumb|right|Figure 2: This figure shows a misclassified point and the movement of the decision boundary.]]<br />
The perceptron can be modeled as shown in Figure 1 of the previous lecture, where <math>x_{0}, x_{1},\ldots,x_{d}</math> represent the input data, <math>\sum_{j=0}^d \beta_{j}x_{j}</math> is a linear combination of these features with some weights, and <math>sgn(\sum_{j=0}^d \beta_{j}x_{j})</math> returns the sign of the linear combination. <br />
<br />
<br />
The perceptron seeks a linear boundary between two classes, which can be represented by <math> \underline{\beta}^T\underline{x}+\beta_{0}=0. </math> The Perceptron algorithm begins with a random hyperplane <math>\underline{\beta}^T\underline{x}+\beta_{0}. </math> The goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find a <math>\underline\beta</math> by iteratively rotating the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points. <br />
<br/><br />
<br/><br />
[[File:distance2.jpg|300px|thumb|right|Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points]]<br />
*'''Derivation''' ''of the distance between the decision boundary and the misclassified points''. <br />
<br />
:Let <math>\underline{x_{i}}</math> be the misclassified point. <br />
<br />
:Assume <math>\underline{x_{1}}</math> and <math>\underline{x_{2}}</math> both lie on the decision boundary. <br />
<br />
:Then <math>\underline{\beta}^T\underline{x_{1}}+\beta_{0} = \underline{\beta}^T\underline{x_{2}}+\beta_{0}</math><br />
<br />
:which implies that <math>\underline{\beta}^T (x_{1}-x_{2})=0</math>.<br />
<br />
::Since <math> \underline{\beta}^T</math> is a vector and <math>(\underline{x_{1}}-\underline{x_{2}})</math> is a vector lying on the decision boundary then <math>\underline{\beta}</math> is a vector orthogonal to the decision boundary. <br />
<br />
Then the projection of the vector <math> \underline{x_{i}}</math> onto the direction orthogonal to the decision boundary is <math>\underline{\beta}^T\underline{x_{i}}</math> (up to the scaling factor <math>\|\underline{\beta}\|</math>). <br />
Now, if <math>\underline{x_{0}}</math> is also on the decision boundary, then <math>\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0</math> and so <math>\underline{\beta}^T\underline{x_{0}}= -\beta_{0}</math>. Looking at Figure 3, it can be seen that the distance between <math>\underline{x_{i}}</math> and the decision boundary is proportional to the absolute value of <math>\underline{\beta}^T\underline{x_{i}}+\beta_{0}. </math> <br />
<br/><br />
<br/><br />
Consider <math>y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).</math><br />
:Notice that if <math>\underline{x_{i}}</math> is classified ''correctly'' then this product is positive. This is because if it is classified correctly, then either both (<math>\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and<math>\displaystyle y_{i}</math> are positive or they are both negative. However, if <math>\underline{x_{i}}</math> is classified ''incorrectly'' then one of <math>(\underline{\beta}^T\underline{x_{i}}+\beta_{0})</math> and <math>\displaystyle y_{i}</math> is positive and the other is negative. The result is that the above product is negative for a point that is misclassified. <br />
<br/><br />
<br />
For the algorithm, we need only consider the distance between the misclassified points and the decision boundary. <br />
<br />
:Consider <math>\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}) </math> <br />
which is a summation of positive numbers and where <math>\displaystyle M</math> is the set of all misclassified points. <br />
<br/><br />
The goal now becomes to <math>\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}). </math> <br />
<br />
This can be done using a [http://en.wikipedia.org/wiki/Gradient_descent gradient descent approach], a numerical method that takes a predetermined step in the direction of the negative gradient, getting closer to a minimum at each step until the gradient is zero. To continue, the following derivatives are needed: <br />
<br />
<math>\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} <br />
\ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}</math><br />
<br/><br />
<br />
Then the gradient descent type algorithm (Perceptron Algorithm) is<br />
<math>[\underline{\beta}^{new}\ \ \beta_{0}^{new}]= [\underline{\beta}^{old}\ \ \beta_{0}^{old}] + \rho [y_{i}\underline{x_{i}}\ \ y_{i}] </math> <br />
<br />
where <math> \displaystyle\rho</math> is called the learning rate. The algorithm continues until it converges or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, i.e., there are no misclassified points. <br />
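The update rule above can be written out as a short program. The following is an illustrative Python sketch (the course uses Matlab; this is not lecture code) run on hypothetical separable toy data, where a point inside a small margin around the true line is discarded so that convergence is guaranteed:

```python
import random

random.seed(2)

def perceptron(data, labels, rho=0.1, max_iter=10000):
    # Gradient-descent-style perceptron updates: pick a misclassified point
    # and move [beta, beta0] by rho * [y_i x_i, y_i]; labels are +1 / -1.
    beta, beta0 = [0.0] * len(data[0]), 0.0
    for _ in range(max_iter):
        mis = [(x, y) for x, y in zip(data, labels)
               if y * (sum(b * xj for b, xj in zip(beta, x)) + beta0) <= 0]
        if not mis:                      # converged: no misclassified points
            return beta, beta0
        x, y = random.choice(mis)
        beta = [b + rho * y * xj for b, xj in zip(beta, x)]
        beta0 += rho * y
    return beta, beta0

# Separable toy data: class +1 above the line x2 = x1, class -1 below,
# with points inside a small margin removed so the classes are separable.
raw = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(100)]
pts = [p for p in raw if abs(p[1] - p[0]) > 0.1]
ys = [1 if x2 > x1 else -1 for x1, x2 in pts]
beta, beta0 = perceptron(pts, ys)
```

Because the classes are separable with a margin, the perceptron convergence theorem guarantees the loop terminates with every point on the correct side of the boundary.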
<br/><br />
<br/><br />
*'''Problems with the Algorithm and Issues Affecting Convergence:'''<br />
#If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly. <br />
#Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. <br />
#If the classes are separable, there exist infinitely many solutions to Perceptron, all of which are hyperplanes. <br />
#The speed of convergence of the algorithm is also dependent on the value of <math>\displaystyle\rho</math>, the learning rate. A larger value of <math>\displaystyle\rho</math> could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find.<br />
<br />
<br/><br />
<br/><br />
*A perceptron applet can be found at http://isl.ira.uka.de/neuralNetCourse/2004/VL_11_5/Perceptron.html .<br />
<br />
==Notes==<br />
<references/></div>
<hr />
<div>{| class="wikitable"<br />
<br />
{| border="1" cellpadding="2"<br />
|-<br />
|width="100pt"|Date<br />
|width="200pt"|Name<br />
|-<br />
|Sep 30 || Liang Jiaxi <br />
|-<br />
|Oct 2|| Mark Stuart<br />
|-<br />
|Oct 5|| Nirvan Singh<br />
|-<br />
|Oct 7|| Trevor Bekolay<br />
|-<br />
|Oct 9|| Aurélien Quévenne<br />
|-<br />
|Oct 12|| Thanksgiving<br />
|-<br />
|Oct 14|| Mohammad Derakhshani<br />
|-<br />
|Oct 16|| Oana Suteu<br />
|-<br />
|Oct 19|| Weibei Li<br />
|-<br />
|Oct 21|| Mathieu Zerter<br />
|-<br />
|Oct 23|| Sabrina Bernardi<br />
|-<br />
|Oct 26|| Jiheng Wang<br />
|-<br />
|Oct 28|| Iulia Pargaru<br />
|-<br />
|Oct 30|| Nick Murdoch<br />
|-<br />
|Nov 2|| Joycelin Karel<br />
|-<br />
|Nov 4|| <br />
|-<br />
|Nov 6|| <br />
|-<br />
|Nov 9|| Min Chen<br />
|-<br />
|Nov 11|| Yao Yao<br />
|-<br />
|Nov 13|| zhenghui Wu<br />
|-<br />
|Nov 16|| <br />
|-<br />
|Nov 18|| <br />
|-<br />
|Nov 20|| <br />
|-
|}
|}</div>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges: computers are deterministic, so instead of true randomness we generate ''pseudo-random'' sequences, deterministic sequences of numbers whose values appear to be drawn at random from the desired distribution. Outside a computational setting, producing a uniform distribution is fairly easy (for example, rolling a fair die repeatedly produces a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as they involve both an additive and a muliplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' behave in this way; poorly chosen values are not suitable for use in generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
For an ideal situation, we want ''m'' to be a very large prime number, <math>x_{n}\not= 0</math> for any ''n'', and the period equal to ''m'' - 1. In practice, Park and Miller (1988) found that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum size of a signed integer in a 32-bit system) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
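A mixed congruential generator takes only a few lines of code. The following Python sketch (an illustration, not course code) reproduces the worked example above with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1:

```python
from itertools import islice

def lcg(a, b, m, seed):
    # Mixed congruential generator: x_{i+1} = (a * x_i + b) mod m.
    x = seed
    while True:
        x = (a * x + b) % m
        yield x

# The worked example above: a = 13, b = 0, m = 31, x0 = 1 gives 13, 14, 27, ...
print(list(islice(lcg(13, 0, 31, 1), 5)))

# Scaling by m - 1 gives pseudo-uniform values in [0, 1], as described above.
uniforms = [x / 30 for x in islice(lcg(13, 0, 31, 1), 5)]
```

With the bad parameters ''a'' = 3, ''b'' = 2, ''m'' = 4 the same generator immediately gets stuck at 1, matching the second example.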
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of a cumulative density function (cdf) of some distribution, the result is a random sample of that distribution. This is shown by this theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is a continuous, monotonic cumulative distribution function (cdf) of some distribution, then the cdf of the random variable X is F.<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt</math><br />
<br />
<br />
So F is monotonically non-decreasing, since the probability that X is at most a larger number must be at least as great as the probability that X is at most a smaller number.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \lambda e^{-\lambda x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \lambda e^{-\lambda u} du = 1 - e^{-\lambda x} </math><br />
<br />
:<math> F^{-1}(u) = \frac{-\log(1-u)}{\lambda} </math>, and since <math>\,1-U</math> is also <math>Unif[0, 1]</math>, we may equivalently use <math> \frac{-\log(u)}{\lambda} </math><br />
<br />
Now we can generate our random sample <math>i=1\dots n</math> from <math>f(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\lambda}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula and letting <math>\theta=1</math>, plot the histogram of <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
<br />
The histogram shows exponential pattern as expected.<br />
<br />
[[File:HistRandNum.jpg]]<br />
<br />
The major problem with this approach is that we have to find <math>F^{-1}</math> and for many distributions it is too difficult (or impossible) to find the inverse of <math>F(x)</math>. Further, for some distributions it is not even possible to find <math>F(x)</math> (i.e. a closed form expression for the distribution function, or otherwise; even if the closed form expression exists, it's usually difficult to find <math>F^{-1}</math>).<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method to find pseudo random numbers in the discrete case (note that the less-than signs from class have been changed to less-than-or-equal-to signs by me, since otherwise the case of <math>U = 1</math> is missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U < p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 \leq U < p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} \leq U < p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative density function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
p=[0.3,0.2,0.5];<br />
for i=1:1000;<br />
u=rand;<br />
if u <= p(1)<br />
x(i)=0;<br />
elseif u <= p(1)+p(2)<br />
x(i)=1;<br />
else<br />
x(i)=2;<br />
end<br />
end<br />
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly. Suppose also that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). Then, if there is a constant <math>c \ |\ c \cdot g(x) \geq f(x)\ \forall x</math>, we can repeatedly draw samples from <math>g(x)</math> and accept each draw with probability <math> \frac {f(x)}{c \cdot g(x)} </math>; the accepted draws follow the target distribution <math>f(x)</math>, while the remaining draws are rejected.<br />
<br />
The following graph shows the pdf of <math>f(x)</math> (target distribution) and <math> c \cdot g(x)</math> (proposal distribution)<br />
<br />
[[File:fxcgx.JPG]]<br />
<br />
At x = 7, <math>f(x)</math> is close to <math> c \cdot g(x)</math>, so a sample drawn from <math> c \cdot g(x)</math> is accepted with high probability.<br />
<br />
At x = 9, the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is small, so most samples drawn from <math> c \cdot g(x)</math> are rejected.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x)\frac{}{}</math><br />
<br />
So,<br />
<math> Pr(accept) = \int^{}_x Pr(accept|X) \times Pr(X) dx </math><br />
<math> = \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx </math><br />
<math> = \frac{1}{c} \int^{}_x f(x) dx </math><br />
<math> = \frac{1}{c} </math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that :<math> c \cdot g(x) \geq f(x) </math><br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 in this example since <math> \max \left(\frac{f(x)}{g(x)}\right) </math> in this example is 2. This condition is important since it helps us in finding a suitable <math>c</math> to apply the Acceptance/Rejection Method.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
i = 0;<br />
j = 1;<br />
while i < 1000<br />
y = rand;<br />
u = rand;<br />
if u <= y<br />
x(j) = y;<br />
j = j + 1;<br />
i = i + 1;<br />
end<br />
end<br />
hist(x);<br />
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c\,g(Y)} = \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
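The discrete procedure above can be sketched in Python (the lectures use MATLAB; this stdlib-only version is an equivalent illustration):<br />

```python
import random

random.seed(1)
f = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}   # target pmf
g = 0.25                                    # Unif{1,2,3,4} proposal pmf
c = 0.55 / 0.25                             # c = max f(x)/g(x) = 2.2

def sample():
    while True:
        y = random.randint(1, 4)            # Y ~ Unif{1,2,3,4}
        u = random.random()                 # U ~ Unif[0,1]
        if u <= f[y] / (c * g):             # accept with prob f(Y)/(c*g(Y))
            return y

draws = [sample() for _ in range(20000)]
freq = {k: draws.count(k) / len(draws) for k in f}
```

With 20,000 draws, the empirical frequencies match the target pmf to within sampling error.<br />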
<br />
'''Limitations'''<br />
<br />
If the proposed distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size of the target distribution can be established. It may also be difficult to find such <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (in other words, <math> X_i\sim~ Exp (\lambda) </math>), then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
Thus, if we want to sample from the Gamma distribution, we can sample from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases, Gamma distributions and Normal distributions, where such practical methods exist.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method on independent uniform samples, as seen before, to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
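As a numerical sanity check of the procedure (a Python sketch mirroring the MATLAB above), the sample mean of draws generated this way should be close to <math>t/\lambda = 5</math>:<br />

```python
import math
import random

random.seed(0)
t, lam, N = 5, 1.0, 10000

samples = []
for _ in range(N):
    # sum of t Exp(lam) draws, each via the inverse transform x = -(1/lam)*log(u)
    x = -sum(math.log(1.0 - random.random()) for _ in range(t)) / lam
    samples.append(x)

sample_mean = sum(samples) / N   # should be close to t/lam = 5
```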
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far; e.g., the inverse of F(Z) is not easy to obtain. We may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If (x, y) is a point in the Cartesian plane, r is its distance from the origin (the pole), and θ is the angle it makes with the polar axis, then<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Since X and Y are independent, their joint density is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that the joint distribution of r and θ is given by,<br />
<br />
:<math>\begin{matrix} f(r,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi},\quad d = r^2 \end{matrix} </math><br />
Note that <math> \begin{matrix}f(r,\theta)\end{matrix}</math> factors into the product of two density functions, an Exponential and a Uniform, so <math>d = r^2</math> and <math>\theta</math> are independent:<br />
<math> \begin{matrix} \Rightarrow d \sim~ Exp(1/2), \quad \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal random variables, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
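The transform can also be checked numerically: the resulting x values should have mean near 0 and variance near 1 (a stdlib Python sketch, equivalent to the MATLAB above):<br />

```python
import math
import random

random.seed(2)
N = 20000
xs = []
for _ in range(N):
    u1 = 1.0 - random.random()      # in (0, 1], so log(u1) is defined
    u2 = random.random()
    d = -2.0 * math.log(u1)         # d ~ Exp(1/2)
    theta = 2.0 * math.pi * u2      # theta ~ Unif[0, 2*pi]
    r = math.sqrt(d)
    xs.append(r * math.cos(theta))  # x ~ N(0, 1)

mean_x = sum(xs) / N
var_x = sum(v * v for v in xs) / N - mean_x ** 2
```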
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
The Box-Muller method can be used to sample from the standard normal distribution. Moreover, several properties of Normal distributions allow us to use samples from the Box-Muller method to sample from any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with <math>\mu=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> and <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
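For the 2&times;2 case, the Cholesky factor can be computed by hand, which is a quick way to verify the ss matrix reported above (a Python sketch; the closed-form entries below follow from multiplying out an upper-triangular factor):<br />

```python
import math

# Sigma = [[1, 0.5], [0.5, 1]]. MATLAB's chol returns the upper-triangular
# factor R with R' * R = Sigma. For a 2x2 matrix the entries are:
s11, s12, s22 = 1.0, 0.5, 1.0
r11 = math.sqrt(s11)               # 1
r12 = s12 / r11                    # 0.5
r22 = math.sqrt(s22 - r12 ** 2)    # sqrt(0.75) = 0.8660...
```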
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentists and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N -\frac{1}{2}log (2\pi) - log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \mu = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
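A minimal numerical check of this result (a Python sketch; the data below are made up for illustration): the score <math>\sum_i (x_i - \mu)</math> vanishes at the sample mean.<br />

```python
data = [1.2, -0.7, 3.1, 0.4, 2.0]          # made-up sample
mu_hat = sum(data) / len(data)             # MLE of mu: the sample mean

# The score sum(x_i - mu), evaluated at mu_hat, should vanish.
score = sum(x - mu_hat for x in data)
```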
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior is Gaussian:<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\theta-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\theta}{\sigma})^2} \times \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\theta-\mu_0}{\tau})^2}</math><br />
By completing the square we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\theta-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
The posterior mean differs from the estimate obtained by the MLE method.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
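The shrinkage formula and its two limiting cases can be checked directly (a Python sketch; the argument values are illustrative):<br />

```python
def posterior_mean(xbar, mu0, sigma, tau, N):
    # mu_tilde = (N/sigma^2 * xbar + 1/tau^2 * mu0) / (N/sigma^2 + 1/tau^2)
    a = N / sigma ** 2          # precision contributed by the data
    b = 1.0 / tau ** 2          # precision of the prior
    return (a * xbar + b * mu0) / (a + b)

m_no_data = posterior_mean(xbar=2.0, mu0=-1.0, sigma=1.0, tau=1.0, N=0)
m_big_data = posterior_mean(xbar=2.0, mu0=-1.0, sigma=1.0, tau=1.0, N=10 ** 8)
```

With no data the posterior mean is the prior mean; with a large sample it approaches the sample mean.<br />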
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to a random process in which its future evolution is described by probability distributions (counterpart to a deterministic process), and these simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \approx E_f[w(x)] = I</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_n \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
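The three-step process, together with the SE and CI formulas, can be sketched in Python for the test integral <math>\int_0^2 e^{-x}\,dx = 1 - e^{-2} \approx 0.8647</math> (an illustrative choice, not one of the lecture's examples):<br />

```python
import math
import random

random.seed(3)
a, b, N = 0.0, 2.0, 10000
h = lambda x: math.exp(-x)

ys = [(b - a) * h(random.uniform(a, b)) for _ in range(N)]  # Y_i = w(x_i)
I_hat = sum(ys) / N
s2 = sum((y - I_hat) ** 2 for y in ys) / (N - 1)
se = math.sqrt(s2 / N)
ci = (I_hat - 1.96 * se, I_hat + 1.96 * se)                 # 95% CI
```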
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
u = rand(100,1);<br />
x = -log(u); %inverse transform: x ~ Exp(1)<br />
h = sqrt(x);<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value of the integral computed numerically by Matlab<br />
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, \quad h(x) = x^3 </math><br />
# Analytically, <math> \displaystyle I = \left[\frac{x^4}{4}\right]^1_0 = \frac{1}{4} </math><br />
# <math>\displaystyle w(x) = x^3(1-0) = x^3</math><br />
# <math> x_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N x_i^3</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);<br />
mean(x.^3)<br />
<br />
'''Example 3'''<br />
To estimate an infinite integral<br />
such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute in <math> y=\frac{1}{x-5+1}=\frac{1}{x-4} \Rightarrow dy=-\frac{1}{(x-4)^2}dx = -y^2\,dx </math><br />
# <math> I = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math> (flipping the limits of integration absorbs the minus sign)<br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2}</math><br />
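Since <math>\int_5^\infty 3e^{-x}\,dx = 3e^{-5} \approx 0.0202</math> analytically, the estimator can be checked numerically (a Python sketch):<br />

```python
import math
import random

random.seed(4)
N = 20000
total = 0.0
for _ in range(N):
    y = 1.0 - random.random()                            # y in (0, 1]
    total += 3.0 * math.exp(-(1.0 / y + 4.0)) / y ** 2   # w(y_i)
I_hat = total / N
```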
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and begun talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)f(p_1,p_2)}{f(x,y)}</math>, where the prior <math>\displaystyle f(p_1,p_2)</math> is taken to be flat, and the expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional (posterior) distributions for <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior of a Binomial success probability given <math>x</math> successes in <math>n</math> trials is a Beta distribution. Let<br><br />
:<math>\displaystyle f(p_1 | X) \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle f(p_2 | Y) \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(n)}</math>;<br />
# Compute <math>\displaystyle \delta = p_1 - p_2</math> in order to get n values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta);<br />
<br />
In one run, the mean obtained was 0.1938.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The 95% interval is approximately <math> (0.06606,\ 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
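The same simulation can be written with Python's standard library, which provides a Beta sampler (random.betavariate); this sketch mirrors the MATLAB code above:<br />

```python
import random

random.seed(5)
n, m = 100, 100        # numbers of trials
x, y = 80, 60          # observed successes
M = 20000              # posterior draws

deltas = [random.betavariate(x + 1, n - x + 1) - random.betavariate(y + 1, m - y + 1)
          for _ in range(M)]
delta_hat = sum(deltas) / M

deltas.sort()
q1 = deltas[int(0.025 * M)]    # 2.5% sample quantile
q2 = deltas[int(0.975 * M)]    # 97.5% sample quantile
```

The posterior mean should be close to the analytic value <math>\frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961</math>.<br />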
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math>.<br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
Assuming flat priors as before:<br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math><br />
# Keep sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard and try again.<br />
# Compute <math>\displaystyle \delta</math> by finding the first <math>\displaystyle p_i</math> sample with over 50% deaths.<br />
# Repeat the process to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, for each dose level <math>X_i</math>, <math>1 \leq i \leq 10</math>, 10 rats are used and the number that die is observed as <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math>etc.<br />
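A Python sketch of this process is below. Only <math>Y_1 = 4</math> and <math>Y_2 = 3</math> are given in the notes; the remaining counts in the Y list are hypothetical, chosen only to make the example runnable.<br />

```python
import random

random.seed(6)
n = 10                                    # rats per dose
doses = list(range(1, 11))                # dose labels x_1 < ... < x_10
# Only Y_1 = 4 and Y_2 = 3 are given in the notes; the rest are hypothetical.
Y = [4, 3, 4, 5, 6, 7, 8, 8, 9, 9]        # deaths observed at each dose

def draw_delta():
    while True:
        p = [random.betavariate(y + 1, n - y + 1) for y in Y]
        if any(p[i] > p[i + 1] for i in range(9)):
            continue                      # reject: p's not monotone
        for i in range(10):
            if p[i] >= 0.5:
                return doses[i]           # first dose with >= 50% deaths
        # all p_i < 0.5: discard this draw too

deltas = [draw_delta() for _ in range(100)]
delta_bar = sum(deltas) / len(deltas)
```

Note the acceptance rate of the monotonicity constraint is low, so this brute-force rejection step is the slow part of the simulation.<br />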
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling is a technique for estimating properties of a particular distribution. As with the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the given random variables. The main difference in importance sampling, however, is that certain values of the input random variables have more impact on the quantity being estimated than others, and the method weights them accordingly. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is a proposal distribution to sample from and g(x) is the target function. Here we cast the integral <math>\int_{0}^1 g(x)\,dx</math> as an expectation with respect to U, <math>\int_{0}^1 g(x)\,dx = E(g(U))</math>, which we can approximate by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability distribution function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x) and therefore <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br />
:: <math>= \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the law of large numbers, <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
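As a concrete illustration (not one of the lecture's examples), the sketch below estimates the tail probability <math>P(X>3)</math> for <math>X \sim Exp(1)</math>, whose true value is <math>e^{-3} \approx 0.0498</math>, using a heavier-tailed exponential proposal:<br />

```python
import math
import random

random.seed(7)
N = 20000
total = 0.0
for _ in range(N):
    x = random.expovariate(0.25)          # x ~ g = Exp(1/4), mean 4
    if x > 3:                             # h(x) = 1{x > 3}
        f_x = math.exp(-x)                # target density Exp(1)
        g_x = 0.25 * math.exp(-0.25 * x)  # proposal density
        total += f_x / g_x                # weight w(x) = f(x)/g(x)
I_hat = total / N                         # estimates P(X > 3) = e^{-3}
```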
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> that are closer to the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult).<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left[\frac{f(x)}{g(x)}h(x)\right]</math>, which is the same as applying the regular Monte Carlo method to samples drawn from <math>\displaystyle g</math>, weighting each <math>\displaystyle h(x_i)</math> by <math>\displaystyle \frac{f(x_i)}{g(x_i)}</math>, so that points where this ratio is high count for more.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)*h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted sum of <math> h(x_i) </math> instead of an unweighted sum.<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math> We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(h(x)) \rightarrow</math>the expectation of h(x) with respect to g(x), where <math>\displaystyle \frac{f(x)}{g(x)} </math> is a weight <math>\displaystyle\beta(x)</math>. In the case where <math>\displaystyle f > g</math>, a greater weight for <math>\displaystyle\beta(x)</math> will be assigned. Thus, the points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>\approx \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math>, then <math>\displaystyle E[(w(x))^2] \rightarrow \infty </math>. In particular, if <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>, then <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can be unbounded. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be similar to <math>\displaystyle \left| h(x) \right|f(x)</math> and have a thicker tail. We take the absolute value of <math>\displaystyle h(x)</math> because <math>\displaystyle g</math> must be a nonnegative density even where <math>\displaystyle h(x)</math> is negative. Analytically, we can show that the best <math>\displaystyle g</math> is the one that minimizes the variance of the estimator.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
:'''Aside: Jensen's Inequality'''<br />
:If <math>\displaystyle g </math> is a convex function (twice differentiable and <math>\displaystyle g''(x) \geq 0 </math>), then <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math>.<br /><br />
:Essentially, the definition of convexity implies that the line segment between two points on the curve lies above the curve, which generalizes to convex combinations of more points:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Therefore<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. Substituting <math>\displaystyle g^*(x) </math> into <math>\displaystyle E[(w(x))^2]</math> shows that the bound is attained, so <math>\displaystyle g^*</math> minimizes the variance. Details omitted.<br /><br />
<br />
However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>, since its normalizing constant involves the very integral we are trying to estimate.<br />
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \ h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and the <math>x_i</math> are sampled from <math>g</math><br />
:: <math>=\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x)\, dx}</math> since <math>\displaystyle \int f(x)\,dx=1</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int\ h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
This is very important and useful, especially when f is known only up to a proportionality constant, since the unknown constant cancels in the ratio. Often this is the case in the Bayesian approach, where f is a posterior density function.<br />
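Since the normalized weights <math>b'(x_i)</math> do not require the normalizing constant of <math>f</math>, the estimator can be run with an unnormalized target. A minimal Python sketch (the unnormalized target, proposal, and <math>h</math> below are toy choices for illustration, not from the lecture):<br />

```python
import math, random

# Self-normalized importance sampling when f is known only up to a constant.
# Toy target: f(x) ∝ exp(-x^2/2) (a standard normal with its normalizer dropped).
# Proposal: g = N(0, 2^2), easy to sample from and heavier-tailed than f.
# We estimate E_f[h(X)] with h(x) = x^2, whose true value is 1.
random.seed(1)

def f_unnorm(x):
    return math.exp(-x * x / 2)

sigma = 2.0
def g_pdf(x):
    return math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def h(x):
    return x * x

N = 200000
xs = [random.gauss(0, sigma) for _ in range(N)]
b = [f_unnorm(x) / g_pdf(x) for x in xs]               # unnormalized weights b(x_i)
I_hat = sum(h(x) * w for x, w in zip(xs, b)) / sum(b)  # sum of h(x_i) b'(x_i)
print(I_hat)  # close to 1; the unknown normalizer of f cancels in the ratio
```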
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>X \sim~ N(0,1) </math><br />
The idea here is to sample from the normal distribution and count the number of observations greater than 3.<br />
<br />
The variability will be high in this case if using Monte Carlo, since this is a low probability event (a tail event), and different runs may give significantly different values. For example: the first run may give only 3 occurrences (i.e. if we generate 1000 samples, the estimated probability will be .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (we will be generating 100 samples in this case):<br />
<br />
format long<br />
for i = 1:100<br />
a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it puts its mass where the integrand <math>h(x)f(x)</math> is large.<br />
<br />
:Let <math>g(x) \sim~ N(4,1) </math><br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
N = 100;<br />
x = randn (N,1) + 4;<br />
for ii = 1:N<br />
h = x(ii)>=3;<br />
b = exp(8-4*x(ii));<br />
w(ii) = h*b;<br />
end<br />
I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 * 10^{-8} </math>, which is very close to 0 and is much less than the variability observed when using Monte Carlo.<br />
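For comparison, the same two approaches can be sketched in Python (mirroring the Matlab above; the seed and run counts are arbitrary):<br />

```python
import math, random

# Estimate Pr(Z > 3) for Z ~ N(0,1) by plain Monte Carlo and by importance
# sampling with proposal g = N(4,1), repeating each estimator over many runs
# to compare their variability.
random.seed(0)
N, RUNS = 100, 100

def mc_run():
    # plain Monte Carlo: fraction of N(0,1) draws that are >= 3
    return sum(random.gauss(0, 1) >= 3 for _ in range(N)) / N

def is_run():
    # importance sampling: b(x) = f(x)/g(x) = exp(8 - 4x) for g = N(4,1)
    total = 0.0
    for _ in range(N):
        x = random.gauss(4, 1)
        if x >= 3:                          # h(x) = 1 iff x >= 3
            total += math.exp(8 - 4 * x)    # accumulate h(x) * b(x)
    return total / N

def var(a):
    m = sum(a) / len(a)
    return sum((v - m) ** 2 for v in a) / (len(a) - 1)

mc = [mc_run() for _ in range(RUNS)]
imp = [is_run() for _ in range(RUNS)]
print(sum(imp) / RUNS, var(mc), var(imp))  # IS mean close to 0.00135, far smaller variance
```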
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space Set:''' <math>\displaystyle X </math> is the set that the random variables <math>\displaystyle X_t</math> take values from.<br />
*'''Indexed Set:''' <math>\displaystyle T </math> is the set that <math>t</math> takes values from, which could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Indexed Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...,X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...,X_n)= \prod_{i = 1}^n f(X_i|\text{Past}_i)</math> where <math>\displaystyle \text{Past}_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
For example,<br />
:<math>\displaystyle f(X_1,...,X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state in one step.<br />
:<math>p_{ij} = \Pr(X_{n+1}=j\mid X_{n}= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For <math>N</math> states, the transition matrix <math>P</math> is an <math>N \times N</math> matrix with entries <math>\displaystyle P_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>, and the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&\dots&0\\<br />
q&0&p&0&\dots\\<br />
0&q&0&p&\dots\\<br />
\vdots&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
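A short simulation of this absorbing random walk (a Python sketch; the boundary, start state, and trial count are arbitrary choices):<br />

```python
import random

# Random walk on {0, 1, ..., M}: step right w.p. p, left w.p. q = 1 - p,
# stopping as soon as an absorbing boundary state (0 or M) is reached.
random.seed(2)

def walk(start, M, p):
    state, steps = start, 0
    while 0 < state < M:
        state += 1 if random.random() < p else -1
        steps += 1
    return state, steps

# With p = 1 the walk marches straight to the right boundary:
print(walk(1, 10, 1.0))  # (10, 9)

# With p = 1/2 and a start in the middle, each boundary is hit about half the time:
hits_right = sum(walk(5, 10, 0.5)[0] == 10 for _ in range(2000))
print(hits_right / 2000)  # close to 0.5
```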
<br /><br /><br /><br />
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general concept for the application of this lies within having a set of generated <math>x_i</math> which approach a distribution <math>f(x)</math>, so that the Monte Carlo estimate can be used to approximate an integral of the form<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries in a particular row sum to 1, as each row itself must be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The probability matrix <math>P</math> is the same for all indices <math>n\in T</math>.<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\displaystyle \mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\displaystyle \mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that,<br />
<math>\displaystyle \mu_{n-1}P=\mu_{n}</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}\mu_{n-1}(i)P_{ij}=\sum_{\forall i}Pr(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}Pr(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{Pr(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\displaystyle \mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>\displaystyle P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\displaystyle \mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\displaystyle \mu_n=\mu_{n-1}P</math> from the previous conclusion<br /><br />
<math>\displaystyle =\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> and since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math>, so<br /><br />
<math>\displaystyle =\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>\displaystyle P(n)=P^n</math><br />
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' also NxN with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)'''<math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
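The two theorems are easy to check numerically on a small chain (a Python sketch; the 2-state matrix is a toy choice):<br />

```python
# Verify mu_n = mu_0 P^n on a toy 2-state chain by comparing n single steps
# of mu <- mu P against one application of the matrix power P^n.
P = [[0.9, 0.1],
     [0.4, 0.6]]

def vec_mat(mu, M):
    # one step: mu_n(j) = sum_i mu_{n-1}(i) M[i][j]
    return [sum(mu[i] * M[i][j] for i in range(len(M))) for j in range(len(M))]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

mu0, n = [1.0, 0.0], 5

mu = mu0
for _ in range(n):            # n applications of the one-step update
    mu = vec_mat(mu, P)

Pn = P                        # P^n by repeated multiplication
for _ in range(n - 1):
    Pn = mat_mul(Pn, P)
mu_direct = vec_mat(mu0, Pn)  # mu_0 P^n

print(mu, mu_direct)          # the two vectors agree
```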
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means <math>j</math> is accessible from <math>i</math>: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A set is '''Irreducible''' if all states communicate with each other (has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that states 1 and 2 communicate with each other and cannot reach states 3 or 4<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> State 3 can reach every other state, but none of the other states can reach it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state, since it is a closed class containing only one element<br />
::<br />
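The decomposition can be computed mechanically from the transition matrix by taking the transitive closure of the one-step relation and intersecting it with its reverse. A Python sketch for the 4-state example above (states 1 to 4 are indexed 0 to 3):<br />

```python
# Communicating classes of the 4-state example: i and j are in the same class
# iff j is reachable from i and i is reachable from j.
P = [[1/3, 2/3, 0,   0],
     [2/3, 1/3, 0,   0],
     [1/4, 1/4, 1/4, 1/4],
     [0,   0,   0,   1]]
n = len(P)

# reach[i][j]: can the chain get from i to j in some number of steps?
reach = [[P[i][j] > 0 or i == j for j in range(n)] for i in range(n)]
for k in range(n):                        # transitive closure (Floyd-Warshall style)
    for i in range(n):
        for j in range(n):
            reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])

classes = {frozenset(j for j in range(n) if reach[i][j] and reach[j][i])
           for i in range(n)}
print(sorted(sorted(c) for c in classes))  # [[0, 1], [2], [3]]
```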
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i</math> for some <math>\displaystyle n\geq 1 | x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain should have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous preposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, states one and two form a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate, if one of them is recurrent, then all states will be recurrent. On the other hand, if one of them is transient, then all the others will be transient too.<br />
*Consider now the case in which we are in state <math>\displaystyle 0</math>. To return to <math>\displaystyle 0</math> after <math>2n</math> steps, we must take exactly <math>n</math> steps in each direction.<br />
<math>\displaystyle Pr(x_{2n}=0\mid X_{0}=0)=P_{00}(2n)={2n \choose n} p^{n}q^{n}</math><br />
We now want to know whether this state is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{00}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use <math>\emph{Stirling's formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}e^{-n}\sqrt{2\pi n}</math><br />
<br />
The probability can therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
And the formula becomes:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(2n)=\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We can conclude that if <math>\displaystyle 4pq < 1</math> then the state is transient; otherwise it is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(2n)=\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}} = \begin{cases}<br />
\infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
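The dichotomy is visible in the partial sums of the series (a Python sketch; the truncation points are arbitrary):<br />

```python
import math

# Partial sums of sum_n (4pq)^n / sqrt(pi n): the series diverges when
# p = 1/2 (recurrent) and converges when p != 1/2 (transient).
def partial_sum(p, N):
    q = 1 - p
    return sum((4 * p * q) ** n / math.sqrt(math.pi * n) for n in range(1, N + 1))

print(partial_sum(0.5, 10000), partial_sum(0.5, 40000))  # keeps growing with N
print(partial_sum(0.6, 10000), partial_sum(0.6, 20000))  # already settled
```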
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math><math>\displaystyle T_{i,j}</math> as the minimum time to go from the state i to the state j. It is also possible that the state j is not reachable from the state i.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n\geq 1: x_{n}=j \mid x_{0}=i\}, & \text{if such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_n nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j\mid x_{0}=i)</math><br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive<br />
<br />
====Periodic and aperiodic Markov chain====<br />
A Markov chain is called <math>\emph{periodic}</math> with period <math>\displaystyle d>1</math> if, starting from a state, we can return to it only at multiples of <math>\displaystyle d</math> steps; formally, the period of state <math>i</math> is <math>\displaystyle d = \gcd\{n\geq 1 : P_{ii}(n)>0\}</math>.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we return to it on every <math>3^{rd}</math> step, and the same holds for the other two states. The chain is therefore periodic with period <math>d=3</math><br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic by definition: you can only get back to state 1 after an even number of steps <math>\Rightarrow</math> period <math>d=2</math><br />
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i > 0 \ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balance is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof:'''<br><br />
We need to show <math>\pi = \pi P</math><br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required, since each row of <math>P</math> sums to 1<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
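This can be verified numerically (a Python sketch):<br />

```python
# The cyclic chain has stationary distribution (1/3, 1/3, 1/3), but P^n never
# converges: P^3 is the identity, so the powers of P just cycle forever.
P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

P3 = mat_mul(mat_mul(P, P), P)
print(P3)  # identity matrix: the chain returns to its start every 3 steps

pi = [1/3, 1/3, 1/3]
pi_P = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
print(pi_P)  # equals pi, so pi is stationary even though P^n has no limit
```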
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\pi = \pi P</math>.<br />
For example, a chain with stationary distribution <math>[0.5,\ 0,\ 0.5]</math> will, in the long run, spend half of its time in the first state and half of its time in the third state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n)\longrightarrow E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math> as <math>N \longrightarrow \infty</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
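The result can be double-checked numerically by iterating <math>\mu \leftarrow \mu P</math> from an arbitrary starting distribution (a Python sketch):<br />

```python
# Iterating mu <- mu P converges to the limiting distribution (4/11, 4/11, 3/11).
P = [[1/2, 1/2, 0],
     [1/2, 1/4, 1/4],
     [0,   1/3, 2/3]]

mu = [1.0, 0.0, 0.0]  # arbitrary starting pmf
for _ in range(200):
    mu = [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]

print(mu)  # close to [4/11, 4/11, 3/11] = [0.3636..., 0.3636..., 0.2727...]
```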
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br />1. Initialize <math>\displaystyle X_0</math>; this is the starting point of the chain, choose it randomly and set index <math>\displaystyle i=0</math><br />
:<br />2. <math>Y \sim~ q(y\mid{X_i})</math><br />
:<br />3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=\min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br />4. <math> U \sim~ Unif [0,1] </math><br />
:<br />5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br />6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
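The procedure can be sketched generically in Python (rather than the Matlab used later in these notes). The target below is a toy unnormalized density chosen so the output is easy to check; the proposal follows Remark 1 below:<br />

```python
import math, random

# Minimal Metropolis sketch: symmetric proposal q(y|x) = N(x, b^2), so the
# ratio q(x|y)/q(y|x) cancels and r = min(f(y)/f(x), 1).
# Toy target: f(x) ∝ exp(-x^2/2), i.e. N(0,1) without its normalizing constant.
random.seed(3)

def metropolis(f, x0, b, n):
    xs = [x0]                                 # step 1: initialize X_0
    for _ in range(n - 1):
        y = random.gauss(xs[-1], b)           # step 2: Y ~ q(y|x)
        r = min(f(y) / f(xs[-1]), 1.0)        # step 3: acceptance probability
        u = random.random()                   # step 4: U ~ Unif[0,1]
        xs.append(y if u < r else xs[-1])     # step 5: accept or stay put
    return xs

f = lambda x: math.exp(-x * x / 2)            # normalizer not needed
xs = metropolis(f, 0.0, 2.0, 50000)
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, var)  # close to 0 and 1, the moments of the N(0,1) target
```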
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant and <math>x</math> is the current state; i.e. the proposal distribution is a normal centered at the current state.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=\min{\{\frac{f(y)}{f(x)},1}\}</math>. This is the original Metropolis Algorithm (1953); the Metropolis-Hastings algorithm generalizes it to asymmetric proposals.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric: the probability of setting the mean to <math>x</math> and sampling <math>y</math> equals the probability of setting the mean to <math>y</math> and sampling <math>x</math>.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then <math>Y</math> tends to land far in the tails of <math>f</math>, so the fraction <math>\frac{f(y)}{f(x)}</math> will typically be very small <math>\Rightarrow</math> <math>r=\min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r</math>; the probability of rejecting <math>\displaystyle Y</math> is high, so the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution in any reasonable number of steps.<br />
<br />
::It is easy to observe this by looking at the histogram of <math>\displaystyle X</math>: its shape will not resemble the shape of the target <math>\displaystyle f(x)</math><br />
<br />
:: Most likely we reject y and the chain will get stuck.<br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain will not have a chance to explore the sample state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b will help avoid the issues mentioned above, and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
We have talked about <math>\emph{discrete}</math> Markov chains so far. <br />
<br />
<br />We have seen that: <br />- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br />- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math>, then it is a stationary distribution: <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br /><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
That is to show: <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)\Rightarrow </math> <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
:<math>f(x)=\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math><br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math> <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring the case where it equals 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math> if proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math> if proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math> from <math>\displaystyle x</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math> from <math>\displaystyle y</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
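The detailed balance argument above can also be checked numerically on a small discrete state space. The following is an illustrative Python sketch (not from the lecture); the target f and the uniform proposal are arbitrary choices.<br />

```python
# Numeric check that a Metropolis-Hastings chain on a small discrete
# state space satisfies detailed balance: f(x) P(x,y) = f(y) P(y,x).
# The target f and the uniform proposal are illustrative choices.
f = [0.1, 0.3, 0.6]          # target distribution on states {0, 1, 2}
q = 1.0 / 3.0                # symmetric proposal: pick any state uniformly

n = len(f)
P = [[0.0] * n for _ in range(n)]
for x in range(n):
    for y in range(n):
        if x != y:
            P[x][y] = q * min(1.0, f[y] / f[x])  # propose y, accept with prob r
    P[x][x] = 1.0 - sum(P[x])                    # probability of staying put

for x in range(n):
    for y in range(n):
        assert abs(f[x] * P[x][y] - f[y] * P[y][x]) < 1e-12
print("detailed balance holds")
```

Since each row of P sums to 1 and detailed balance holds, f is the stationary distribution of this chain, matching the proof above.<br />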
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm is named after Nicholas Metropolis (1915-1999), also a co-author of the simulated annealing method (introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm; it is more efficient, although less widely applicable.<br />
<br />
In the last class, we showed that Metropolis-Hastings satisfies the detailed-balance equation, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math> including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis-Hastings.<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is current state & <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br\>3. <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> (since <math>\displaystyle q </math> is symmetric, the proposal ratio cancels)<br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
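The random walk sampler described above can be sketched in Python as well as MATLAB. This is a minimal illustrative version targeting the standard Cauchy density with the <math>N(x,b^2)</math> proposal from the example; b = 2 and the chain length are arbitrary choices.<br />

```python
import math
import random

# Random-walk Metropolis-Hastings targeting the standard Cauchy density
# f(x) = 1/(pi (1 + x^2)) with proposal Y = X_i + b*eps, eps ~ N(0,1).
# b = 2 and the chain length are illustrative choices.
random.seed(1)

def f(x):
    return 1.0 / (math.pi * (1.0 + x * x))

b, x = 2.0, 0.0
chain, accepted = [], 0
for _ in range(20000):
    y = x + b * random.gauss(0.0, 1.0)   # symmetric proposal
    r = min(1.0, f(y) / f(x))            # q cancels since q(y|x) = q(x|y)
    if random.random() < r:
        x, accepted = y, accepted + 1
    chain.append(x)

print("acceptance rate:", accepted / len(chain))
```

With b = 2 the acceptance rate comes out roughly in the range suggested by the rule of thumb above; making b much larger or smaller pushes it toward 0 or 1 and the chain mixes poorly.<br />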
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
And, this does not work well unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math> (it is usually used when <math>\displaystyle f </math> is known up to a proportionality constant, i.e. the form of the distribution is known, but its normalizing constant is not)<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, the acceptance probability <math>\displaystyle r </math> does depend on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
y, & \text{with probability r} \\<br />
x_i, & \text{with probability 1-r} \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
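A minimal Python sketch of the independence sampler (not from the lecture): the fixed proposal here is a <math>N(0,2^2)</math> density and the target is <math>f(x) \propto e^{-x^2/2}</math>, a standard normal known only up to a constant. The proposal width and chain length are illustrative choices.<br />

```python
import math
import random

# Independence Metropolis-Hastings: the proposal q is a fixed N(0, 2^2)
# density, independent of the current state; the target f is an
# unnormalized standard normal. Both are illustrative choices.
random.seed(2)

f = lambda x: math.exp(-x * x / 2.0)   # target, up to a constant
q = lambda x: math.exp(-x * x / 8.0)   # N(0, 4) density, up to a constant

x, chain = 0.0, []
for _ in range(20000):
    y = random.gauss(0.0, 2.0)                    # draw ignores current x
    r = min(1.0, (f(y) / f(x)) * (q(x) / q(y)))   # but r depends on x
    if random.random() < r:
        x = y
    chain.append(x)

mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
print("mean ~", round(mean, 2), " var ~", round(var, 2))
```

The sample mean and variance come out near 0 and 1, the moments of the target, even though the proposal never looks at the current state; the dependence enters only through r.<br />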
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find the x that minimizes <math>\displaystyle h(x) </math>. This is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for any constant T > 0 (since the exponential function is monotone)<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T and go to Step 3 (keeping the current state)<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)}<br />
<br />
= \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}}<br />
= e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math> and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math><br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = 0 </math> and we therefore reject <math>\displaystyle y </math><br />
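The effect of T on the acceptance probability for an uphill move can be seen with a one-line computation; the values of h(x), h(y) and the two temperatures below are arbitrary illustrative choices.<br />

```python
import math

# How temperature T controls r = exp((h(x) - h(y))/T) for an uphill
# proposal h(y) > h(x). The numbers are illustrative choices.
h_x, h_y = 1.0, 2.0                      # uphill move: h(y) > h(x)

r_hot = math.exp((h_x - h_y) / 100.0)    # large T: r close to 1
r_cold = math.exp((h_x - h_y) / 0.01)    # small T: r close to 0

print("hot:", round(r_hot, 3), " cold:", r_cold)
```

At high temperature the chain accepts almost every move and explores freely; as T shrinks, uphill moves are almost never accepted and the chain settles into a minimum.<br />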
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{\frac{-f(x)}{T}}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for large T and contracts for small T - i.e. T plays the role of a variance, making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
At the end, we get T to be pretty small, our distribution that we're sampling from becomes sharper, and the points that we sample are close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2;<br />
X(1) = 0;<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); <br />
r = min(1 , exp(-obj(Y)/T)/exp(-obj(X(i-1))/T) );<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, computer science, chemistry, physics, and psychology. Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience, to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
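Simulated annealing can be applied directly to a toy instance of this problem. The sketch below (Python, not from the lecture) uses a city swap as the proposal and the total tour length as h; the city coordinates, cooling schedule and iteration counts are all arbitrary illustrative choices.<br />

```python
import math
import random

# Simulated annealing for a toy travelling salesman instance: the state
# is a tour (permutation of cities), the proposal swaps two cities, and
# h is the tour length. All parameters are illustrative choices.
random.seed(3)
cities = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]

def length(tour):
    total = 0.0
    for i in range(len(tour)):
        (x1, y1) = cities[tour[i]]
        (x2, y2) = cities[tour[(i + 1) % len(tour)]]
        total += math.hypot(x2 - x1, y2 - y1)
    return total

tour = list(range(len(cities)))
random.shuffle(tour)
start = length(tour)

T = 10.0
while T > 0.001:
    for _ in range(200):
        i, j = random.sample(range(len(tour)), 2)
        cand = tour[:]
        cand[i], cand[j] = cand[j], cand[i]     # propose a swap
        r = min(1.0, math.exp((length(tour) - length(cand)) / T))
        if random.random() < r:
            tour = cand
    T *= 0.9                                    # cool down

print("initial:", round(start, 2), " final:", round(length(tour), 2))
```

For these six cities on a grid the shortest circuit has length 6, and the annealed tour ends at or very near that value, while a random starting tour is typically much longer.<br />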
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies the detailed balance equation, similar to Metropolis-Hastings:<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The proof of this balance equation is similar to the Metropolis-Hastings proof.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when the random variables are strongly correlated, since the chain then mixes slowly.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle <br />
<br />
x_1 \sim~ f(x_1|x_2, x_3, ... , x_n)</math><br />
<br />
<math>x_2 \sim~ f(x_2|x_1, x_3, ... , x_n)</math><br />
<br />
<math><br />
\vdots<br />
</math><br />
<br />
<math><br />
x_n \sim~ f(x_n|x_1, x_2, ... , x_{n-1})<br />
</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y_i|X_i}(y|x) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X_{i}|Y_{i+1}}(x|y)</math> (note: it must be <math>\displaystyle Y_{i+1}</math> not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually takes a long time before convergence; this initial period is called the "burn-in" time. For this reason the distribution is better sampled using only some of the last <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2];<br />
rho = 0.01;<br />
sigma = sqrt(1 - rho^2);<br />
X(1,:) = [0 0];<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2));<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); %must use the newly drawn X(i,1), not X(i-1,1)<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots the last 4000 points;<br />
%this demonstrates that convergence occurs after a while<br />
%(this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
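The same sampler can be written in Python; this illustrative sketch uses <math>\rho = 0.8</math> (instead of 0.01) so the correlation is clearly visible, and discards the first 1000 draws as burn-in. These values are arbitrary choices.<br />

```python
import random

# Gibbs sampler for the bivariate normal with mean (1, 2), unit variances
# and correlation rho, using the normal conditional distributions above.
# rho = 0.8 and the chain length are illustrative choices.
random.seed(4)
mu1, mu2, rho = 1.0, 2.0, 0.8
sd = (1.0 - rho ** 2) ** 0.5

x1, x2 = 0.0, 0.0
xs, ys = [], []
for i in range(20000):
    x1 = random.gauss(mu1 + rho * (x2 - mu2), sd)   # draw x1 | x2
    x2 = random.gauss(mu2 + rho * (x1 - mu1), sd)   # draw x2 | the NEW x1
    if i >= 1000:                                   # discard burn-in
        xs.append(x1)
        ys.append(x2)

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
vx = sum((a - mx) ** 2 for a in xs) / n
vy = sum((b - my) ** 2 for b in ys) / n
corr = cov / (vx * vy) ** 0.5
print("means ~", round(mx, 2), round(my, 2), " corr ~", round(corr, 2))
```

The sample means come out near (1, 2) and the sample correlation near 0.8, confirming the chain targets the intended bivariate normal.<br />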
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some random variables <math>\displaystyle X_0, Y_0, n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
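The steps above can be sketched in Python for a bivariate target known only up to a constant; here the target is an unnormalized bivariate normal with correlation <math>\rho</math>, and both proposals are symmetric random walks, so the q-ratios in r cancel. <math>\rho</math>, the step size b and the chain length are illustrative choices.<br />

```python
import math
import random

# Metropolis-Hastings within Gibbs for an unnormalized bivariate normal
# f(x, y) ~ exp(-(x^2 - 2 rho x y + y^2) / (2 (1 - rho^2))), correlation
# rho, zero means. Symmetric random-walk proposals, so q cancels in r.
random.seed(5)
rho, b = 0.5, 1.0

def f(x, y):
    return math.exp(-(x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho ** 2)))

x, y = 0.0, 0.0
xs = []
for _ in range(20000):
    z = x + b * random.gauss(0.0, 1.0)          # MH step for x, y fixed
    if random.random() < min(1.0, f(z, y) / f(x, y)):
        x = z
    z = y + b * random.gauss(0.0, 1.0)          # MH step for y, x fixed
    if random.random() < min(1.0, f(x, z) / f(x, y)):
        y = z
    xs.append(x)

mean_x = sum(xs) / len(xs)
print("mean of x ~", round(mean_x, 2))
```

Note that only the ratio of f values is ever needed, so the normalizing constant of the target never has to be computed; this is exactly the situation where Metropolis-Hastings within Gibbs is useful.<br />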
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, named after Larry Page, a computer scientist and one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However, the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1+ d \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} <br />
\text{ where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>(which is just a coefficient between 0 and 1 used to balance the objective function).<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> matrices <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots&\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots&\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, if we make an assumption that <br />
<br />
<math>\sum_{i=1}^N P_i = 1</math> <br />
<br />
then we can solve for P (in matrix form this is <math>\displaystyle e^T \times P = 1</math>)<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times e^T \times P + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times e \times e^T + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since the chain converges to its stationary distribution, for large n, <math>\displaystyle P_n \approx P</math>.<br />
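The iteration can be carried out in pure Python for a tiny three-page web. In this sketch the constant term is divided by N so that the ranks sum to 1, matching the constraint above; the link matrix and d = 0.85 are illustrative choices.<br />

```python
# Power iteration for PageRank on a 3-page web: page 0 links to 1 and 2,
# page 1 links to 2, page 2 links to 0. d = 0.85 and the link matrix
# are illustrative choices; the constant is scaled so sum(P) = 1.
d = 0.85
N = 3
L = [[0, 0, 1],      # L[i][j] = 1 if page j links to page i
     [1, 0, 0],
     [1, 1, 0]]
c = [sum(L[i][j] for i in range(N)) for j in range(N)]   # outgoing links of j

# A = (1-d)/N * e*e^T + d * L * D^{-1}; each column of A sums to 1
A = [[(1 - d) / N + d * L[i][j] / c[j] for j in range(N)] for i in range(N)]

P = [1.0 / N] * N                 # initial guess
for _ in range(100):
    P = [sum(A[i][j] * P[j] for j in range(N)) for i in range(N)]

print([round(p, 3) for p in P])
```

Page 2, which is linked to by two pages, ends up with the highest rank, and the ranks still sum to 1 after iterating since the columns of A each sum to 1.<br />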
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math>, then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ cos\theta</math><br />
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthonormal matrix is also commonly referred to as an orthogonal matrix.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{nxn} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that an alternate definition for the trace is:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix<br />
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns). Alternatively, a matrix is non-singular if it is invertible (i.e. its determinant is NOT zero).<br />
A matrix that is not invertible is sometimes called a <b>singular matrix</b>.<br />
<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>,then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used whenever <math>A^{-1}</math> does not exist because A is either not square or singular: <math>A^{\dagger} = (A^TA)^{-1}A^T</math> with <math>A^{\dagger}A = I</math> (this requires <math>A^TA</math> to be invertible, i.e. A must have full column rank).<br />
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given matrix <math>A_{N \times N}</math>, we say that <math>v</math> is an ''eigenvector'' of <math>A</math> if there exists a scalar <math>\lambda</math>, called the corresponding ''eigenvalue'', such that <math>Av = \lambda v</math>.<br />
<br />
Computation of eigenvalues:<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_0 = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular, all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and the associated eigenvectors are orthogonal.<br />
*If A is positive-definite, all eigenvalues are positive.<br />
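These properties can be checked directly on a small example. The sketch below (in Python rather than the course's Matlab; the helper `eig2x2_symmetric` is illustrative, not part of any library) computes the eigenvalues of a symmetric <math>2 \times 2</math> matrix from its characteristic equation:<br />

```python
import math

def eig2x2_symmetric(a, b, c):
    """Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, c]],
    via the characteristic equation lambda^2 - tr*lambda + det = 0."""
    tr = a + c
    det = a * c - b * b
    disc = math.sqrt(tr * tr - 4 * det)  # always real for a symmetric matrix
    return (tr + disc) / 2, (tr - disc) / 2

# A = [[2, 1], [1, 2]] is symmetric and positive-definite
lam1, lam2 = eig2x2_symmetric(2, 1, 2)
print(lam1, lam2)   # 3.0 and 1.0 -- both positive, as the properties predict
print(lam1 + lam2)  # 4.0 = tr(A), the sum-of-eigenvalues property
```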
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> onto a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^T A^T A x} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
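As a sketch of the magnitude-preserving property, the Python snippet below (the course code is Matlab; the helper names here are illustrative) applies a 2-D rotation matrix, a standard orthonormal transformation, to a vector and checks that its length is unchanged:<br />

```python
import math

def rotation(theta):
    # 2x2 rotation matrix: an orthonormal transformation
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta),  math.cos(theta)]]

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

A = rotation(0.7)
x = [3.0, 4.0]
y = matvec(A, x)
print(norm(x), norm(y))  # both 5.0 (up to floating point): magnitude preserved
```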
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in [http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis]).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
Keeping two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second Principal Component is the direction of second greatest variance (orthogonal to the first component), etc.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
This is the variance of the projection <math>\displaystyle u </math> defined by the weight vector <math>\displaystyle w </math>.<br />
<br />
The first principal component is the vector that maximizes the variance<br />
<br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function.<br />
<br />
Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint. The problem then becomes,<br />
<br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \|^2 = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method on this example; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain 2 stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we just need to substitute them into <math>\displaystyle f(x,y)</math> and see which one gives the bigger value. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
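The stationary-point condition can be verified numerically. The Python sketch below (illustrative; the course uses Matlab) checks that at both stationary points the gradients of <math>f</math> and <math>g</math> are parallel, and that the first point gives the larger value of <math>f</math>:<br />

```python
import math

def f(x, y): return x - y
def grad_f(x, y): return (1.0, -1.0)
def grad_g(x, y): return (2 * x, 2 * y)   # g(x, y) = x^2 + y^2 - 1

p1 = ( math.sqrt(2) / 2, -math.sqrt(2) / 2)
p2 = (-math.sqrt(2) / 2,  math.sqrt(2) / 2)

# At each stationary point the gradients are parallel: grad f = lambda * grad g
for (x, y) in (p1, p2):
    gx, gy = grad_g(x, y)
    lam = grad_f(x, y)[0] / gx            # lambda = 1 / (2x)
    assert abs(grad_f(x, y)[1] - lam * gy) < 1e-12

print(f(*p1), f(*p2))  # sqrt(2) and -sqrt(2): p1 is the maximum
```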
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, then <math> \textbf{w}^T \textbf{w} = 1 </math> and the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the penalty term <math> \lambda (\textbf{w}^T \textbf{w} - 1) </math> is non-zero and reduces the overall value of <math>\displaystyle L(\textbf{w},\lambda)</math>. Maximization therefore happens when <math> \textbf{w}^T \textbf{w} =1 </math><br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the <math>2S\textbf{w}</math> below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Set <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''' to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
For D-dimensional data, '''S''' has D eigenvalues (and corresponding eigenvectors)<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
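This decomposition can be checked on a small data set. The Python sketch below (illustrative toy data, not the course data; the class examples use Matlab) computes the sample covariance matrix of 2-D data and its eigenvalues, and confirms that the eigenvalues sum to the trace of S, i.e. to the total variance:<br />

```python
import math

# Toy 2-D data (one (x, y) observation per tuple), made up for illustration
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

n = len(data)
mx = sum(p[0] for p in data) / n
my = sum(p[1] for p in data) / n

# Sample covariance matrix S (divide by n - 1)
sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)

# Eigenvalues of the 2x2 matrix S = [[sxx, sxy], [sxy, syy]]
tr, det = sxx + syy, sxx * syy - sxy * sxy
disc = math.sqrt(tr * tr - 4 * det)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2

# The eigenvalues decompose the total variance: lam1 + lam2 = tr(S)
print(lam1 + lam2, sxx + syy)  # the two numbers agree
```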
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
The noisy face:<br />
<br />
[[File:face1.jpg]]<br />
<br />
The de-noised face:<br />
<br />
[[File:face2.jpg]]<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
<br />
===Principal Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (images etc). There are generally two methods to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).<br />
<br />
Generally speaking, we could do this with the entire data set (if we have an 8&times;8 picture, we can use all 64 pixels). However, this is hard, and it is easier to work with the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example of classification.<br />
<br />
The Matlab code is as follows.<br />
 load 2_3 %the size of this file is 64 X 400<br />
 [coefs , scores ] = princomp (X') <br />
 % performs principal components analysis on the data matrix X<br />
 % returns the principal component coefficients and scores<br />
 % scores is the low dimensional representation of the data X<br />
 plot(scores(:,1),scores(:,2)) <br />
 % plots the first most variant dimension on the x-axis <br />
 % and the second highest on the y-axis <br />
 plot(scores(1:200,1),scores(1:200,2))<br />
 % same graph as above, only with the 2s (not 3s)<br />
 hold on % this command allows us to add to the current plot<br />
 plot (scores(201:400,1),scores(201:400,2),'ro')<br />
 % this adds the data for the 3s<br />
 % the 'ro' command makes them red Os on the plot<br />
 % If we classify based on the position in this plot (feature), <br />
 % it's easier than looking at each of the 64 data pieces<br />
 gname() % displays a figure window and <br />
 % waits for you to press a mouse button or a keyboard key<br />
 figure<br />
 subplot(1,2,1)<br />
 imagesc(reshape(X(:,45),8,8)')<br />
 subplot(1,2,2)<br />
 imagesc(reshape(X(:,93),8,8)')<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
[[File:PCAalgorithm.JPG]]<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean.<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using the mapping <math>\, Y = U^T X </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using the mapping <math>\, \hat{X} = UY </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
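The encode/reconstruct steps can be sketched for d = 1 in a few lines of Python (illustrative only; the class examples do the equivalent in Matlab with `svd`, and both the data and the direction <math>u</math> below are assumed for the sketch):<br />

```python
import math

# Centered toy 2-D data, one (x, y) observation per tuple (assumed data)
X = [(0.69, 0.49), (-1.31, -1.21), (0.39, 0.99), (0.09, 0.29), (1.29, 1.09),
     (0.49, 0.79), (0.19, -0.31), (-0.81, -0.81), (-0.31, -0.31), (-0.71, -1.01)]

# Top principal direction u (unit vector), assumed known for this sketch
u = (1 / math.sqrt(2), 1 / math.sqrt(2))

# Encode: y_i = u^T x_i  (D = 2 dimensions down to d = 1)
Y = [u[0] * x + u[1] * y for (x, y) in X]

# Reconstruct: x_hat_i = u * y_i  (back up to D = 2 dimensions)
Xhat = [(u[0] * yi, u[1] * yi) for yi in Y]

# Reconstruction error is small when u captures most of the variance
err = sum((x - xh) ** 2 + (y - yh) ** 2
          for (x, y), (xh, yh) in zip(X, Xhat))
print(err)
```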
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data in a lower dimension. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class is close together (i.e. collapse to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> : the within-class covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected data after projection====<br />
<br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 </math><br />
<br />
<math>\displaystyle = (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) </math><br />
<br />
which is a scalar. Therefore,<br />
<br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br />
(using the property <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br />
Thus, we get our original problem equivalent to<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br />
Let <math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math> : between classes covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br />
(1) <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
(2) <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br />
<math>\displaystyle [(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) ] / [(w^T(\Sigma_1 + \Sigma_2)w)] </math><br />
<br />
or<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)/(w^Ts_ww)</math><br />
<br />
<br />
<br />
This is a very famous problem. We can solve this using Lagrange Multipliers. Since '''w''' is a directional vector, we do not care about the size of '''w'''. Therefore we can solve the following constrained optimization problem<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
<br />Subject To: <br />
<math>\displaystyle (w^T(\Sigma_1 + \Sigma_2)w)=1 </math> or <math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
<br />
<br />
<br />
Therefore, the function that we want to maximize is<br />
<br />
<math>\displaystyle (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) - \lambda [(w^T(\Sigma_1 + \Sigma_2)w)-1] </math><br />
<br />
or <br />
<br />
<math>\displaystyle (w^Ts_Bw) - \lambda [(w^Ts_ww)-1] </math><br />
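Setting the derivative of this Lagrangian to zero leads to the generalized eigenproblem <math>\displaystyle s_Bw = \lambda s_ww</math>; since <math>\displaystyle s_Bw</math> is always proportional to <math>\displaystyle \mu_1 - \mu_2</math>, the optimal two-class direction is proportional to <math>\displaystyle s_w^{-1}(\mu_1 - \mu_2)</math>. A Python sketch with made-up numbers (the course code is Matlab; all values below are assumptions for illustration):<br />

```python
# Two-class Fisher direction sketch: w proportional to s_w^{-1} (mu1 - mu2).
mu1, mu2 = (1.0, 2.0), (3.0, 5.0)       # illustrative class means
s_w = [[2.0, 0.5], [0.5, 1.0]]          # within-class covariance (Sigma1 + Sigma2)

# Invert the 2x2 matrix s_w
det = s_w[0][0] * s_w[1][1] - s_w[0][1] * s_w[1][0]
inv = [[ s_w[1][1] / det, -s_w[0][1] / det],
       [-s_w[1][0] / det,  s_w[0][0] / det]]

d = (mu1[0] - mu2[0], mu1[1] - mu2[1])  # mu1 - mu2
w = (inv[0][0] * d[0] + inv[0][1] * d[1],
     inv[1][0] * d[0] + inv[1][1] * d[1])
print(w)

def ratio(v):
    """The FDA objective (v^T s_B v) / (v^T s_w v) for direction v."""
    num = (v[0] * d[0] + v[1] * d[1]) ** 2                      # v^T s_B v
    den = (v[0] * (s_w[0][0] * v[0] + s_w[0][1] * v[1])
           + v[1] * (s_w[1][0] * v[0] + s_w[1][1] * v[1]))      # v^T s_w v
    return num / den

# w beats the coordinate axes as a discriminant direction
assert ratio(w) >= ratio((1.0, 0.0)) and ratio(w) >= ratio((0.0, 1.0))
```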
<br />
<br />
<br />
<br />
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
Example: FDA<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
<br />
<br />
<br />
<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given test points, how can we find the error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points.<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (Y_i \neq h(x_i)) </math><br />
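This count is easy to sketch in code. The Python snippet below (illustrative; the classifier and test points are made up) computes the empirical error rate for a trivial sign-based classifier:<br />

```python
# Empirical error rate: fraction of test points the classifier h gets wrong.
# h is a stand-in classifier (sign of x), chosen only for illustration.
def h(x):
    return 1 if x >= 0 else 0

test_points = [(-2.0, 0), (-0.5, 0), (0.3, 1), (1.7, 1), (-0.1, 1)]  # (x, y) pairs

errors = sum(1 for (x, y) in test_points if h(x) != y)
error_rate = errors / len(test_points)
print(error_rate)  # 1 of 5 points misclassified -> 0.2
```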
<br />
=== Bayes Classification Rule ===<br />
<br />
Considering the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math><br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)P(Y=1)} {Pr(X=x)} </math><br />
<br />
Where the denominator <math>\, Pr(X=x) = Pr(X=x \mid Y=1)P(Y=1)+Pr(X=x \mid Y=0)P(Y=0) </math><br />
<br />
So our classifier function<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & o/w\\<br />
\end{cases}</math><br />
<br />
This function is considered the best classifier in terms of error rate. <br />
<br />
A problem is that we do not know the joint and marginal probability distributions to calculate <math>\, r(x)</math><br />
<br />
This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques.<br />
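When the joint distribution ''is'' fully known, the rule can be computed directly. The Python sketch below (all probabilities are made up for illustration) evaluates <math>\, r(x)</math> via Bayes' rule for a discrete <math>X</math> and applies the threshold at 1/2:<br />

```python
# Bayes classification with a fully known (discrete) joint distribution.
# The probabilities below are assumptions made for illustration.
p_y1 = 0.4                                   # P(Y = 1)
p_x_given_y1 = {"a": 0.7, "b": 0.3}          # P(X = x | Y = 1)
p_x_given_y0 = {"a": 0.2, "b": 0.8}          # P(X = x | Y = 0)

def r(x):
    # r(x) = P(Y=1 | X=x), with the denominator expanded as in the text
    num = p_x_given_y1[x] * p_y1
    den = num + p_x_given_y0[x] * (1 - p_y1)
    return num / den

def h(x):
    return 1 if r(x) >= 0.5 else 0

print(r("a"), h("a"))  # r("a") = 0.28 / 0.40 = 0.7 -> classify as 1
print(r("b"), h("b"))  # r("b") = 0.12 / 0.60 = 0.2 -> classify as 0
```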
<br />
One such technique is Decision Boundary<br />
<br />
=== Decision Boundary ===</div>Ipargaruhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341_/_CM_361&diff=3341stat341 / CM 3612009-07-21T18:27:07Z<p>Ipargaru: /* General PCA Algorithm */</p>
<hr />
<div>'''Computational Statistics and Data Analysis''' is a course offered at the University of Waterloo<br /><br />
Spring 2009<br /><br />
Instructor: Ali Ghodsi <br />
<br />
<br />
<br />
==Sampling (Generating random numbers)==<br />
<br />
===[[Generating Random Numbers]] - May 12, 2009===<br />
<br />
Generating random numbers in a computational setting presents challenges: computers are deterministic machines, so they cannot produce truly random numbers. Instead, computational statistics relies on pseudo-random number generators, deterministic algorithms whose output is designed so that each possible number appears to occur with equal probability (i.e. the sequence looks uniform). Outside a computational setting, producing a uniform distribution is fairly easy (for example, rolling a fair die repetitively to produce a series of random numbers from 1 to 6).<br />
<br />
We begin by considering the simplest case: the uniform distribution.<br />
<br />
====Multiplicative Congruential Method====<br />
<br />
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:<br />
<br />
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math><br />
<br />
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = 13x_{i} \mod{31}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
The above generator of pseudorandom numbers is called a '''Mixed Congruential Generator''' or '''Linear Congruential Generator''', as it involves both an additive and a multiplicative term. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m'' - 1, we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.<br />
<br />
Of course, not all values of ''a'', ''b'', and ''m'' will behave in this way, and will not be suitable for use in generating pseudo random numbers. <br />
<br />
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:<br />
<br />
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math><br />
<br />
So,<br />
<br />
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math><br />
:<math>\begin{align}<br />
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
:<math>\begin{align}<br />
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\<br />
\end{align}</math><br />
<br />
etc.<br />
<br />
For an ideal situation, we want ''m'' to be a very large prime number, <math>x_{n}\not= 0</math> for any ''n'', and the period equal to ''m'' - 1. In practice, Park and Miller (1988) found that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum size of a signed integer in a 32-bit system) are good values for the Multiplicative Congruential Method.<br />
<br />
Java's Random class is based on a generator with ''a'' = 25214903917, ''b'' = 11, and ''m'' = 2<sup>48</sup><ref>http://java.sun.com/javase/6/docs/api/java/util/Random.html#next(int)</ref>. The class returns at most 32 leading bits from each ''x<sub>i</sub>'', so it is possible to get the same value twice in a row, (when ''x<sub>0</sub>'' = 18698324575379, for instance) without repeating it forever.<br />
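The recurrence above is a few lines of code in any language. A Python sketch (the course code is Matlab; the function name `lcg` is illustrative) that reproduces both worked examples:<br />

```python
def lcg(a, b, m, seed, n):
    """Generate n pseudo-random integers with x_{i+1} = (a*x_i + b) mod m."""
    xs, x = [], seed
    for _ in range(n):
        x = (a * x + b) % m
        xs.append(x)
    return xs

# The worked example from above: a = 13, b = 0, m = 31, x0 = 1
print(lcg(13, 0, 31, 1, 3))        # [13, 14, 27]

# Scale by m - 1 to get numbers in [0, 1], approximating Unif[0, 1]
u = [x / 30 for x in lcg(13, 0, 31, 1, 30)]

# The badly chosen parameters a = 3, b = 2, m = 4, x0 = 1 get stuck at 1
print(lcg(3, 2, 4, 1, 3))          # [1, 1, 1]
```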
<br />
====General Methods====<br />
<br />
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.<br />
<br />
=====Inverse Transform Method=====<br />
<br />
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of a cumulative density function (cdf) of some distribution, the result is a random sample of that distribution. This is shown by this theorem:<br />
<br />
'''Theorem''':<br />
<br />
If <math>U \sim~ \mathrm{Unif}[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative distribution function (cdf) for some distribution, then X is distributed according to F (i.e. F is the cdf of X).<br />
<br />
'''Proof''':<br />
<br />
Recall that, if ''f'' is the pdf corresponding to F, then,<br />
<br />
:<math>F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\,dt</math><br />
<br />
So F is monotonically non-decreasing: the probability that X lies below a larger value is at least the probability that X lies below a smaller one.<br />
<br />
Note also that in the uniform distribution on [0, 1], we have for all ''a'' within [0, 1], <math>P(U \leq a) = a</math>.<br />
<br />
So,<br />
<br />
:<math>\begin{align}<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:<br />
<br />
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.<br />
*Step 2: Compute <math> X = F^{-1}(U) </math>.<br />
<br />
'''Example''':<br />
<br />
Suppose we want to draw a sample from <math>f(x) = \lambda e^{-\lambda x} </math> where <math>x > 0</math> (the exponential distribution).<br />
<br />
We need to first find <math>F(x)</math> and then its inverse, <math>F^{-1}</math>.<br />
<br />
:<math> F(x) = \int^x_0 \lambda e^{-\lambda u} du = 1 - e^{-\lambda x} </math><br />
<br />
:<math> F^{-1}(y) = \frac{-\log(1-y)}{\lambda} </math>; since <math>1-U \sim~ Unif[0, 1]</math> whenever <math>U \sim~ Unif[0, 1]</math>, we may equivalently compute <math> \frac{-\log(u)}{\lambda} </math><br />
<br />
Now we can generate our random sample <math>i=1\dots n</math> from <math>f(x)</math> by:<br />
<br />
:<math>1)\ u_i \sim Unif[0, 1]</math><br />
:<math>2)\ x_i = \frac{-\log(u_i)}{\lambda}</math><br />
<br />
The <math>x_i</math> are now a random sample from <math>f(x)</math>.<br />
<br />
<br />
This example can be illustrated in Matlab using the code below. Generate <math>u_i</math>, calculate <math>x_i</math> using the above formula with <math>\lambda=1</math>, and plot the histogram of the <math>x_i</math>'s for <math>i=1,...,100,000</math>.<br />
<br />
u=rand(1,100000);<br />
x=-log(1-u)/1;<br />
hist(x)<br />
<br />
The histogram shows the expected exponential shape.<br />
<br />
[[File:HistRandNum.jpg]]<br />
<br />
The major problem with this approach is that we must be able to find <math>F^{-1}</math>, and for many distributions this is too difficult (or impossible). Further, for some distributions no closed form expression for <math>F(x)</math> exists at all, and even when one does, it is usually still difficult to find <math>F^{-1}</math>.<br />
<br />
'''Procedure (Discrete Case)'''<br />
<br />
The above method can be easily adapted to work on discrete distributions as well.<br />
<br />
In general in the discrete case, we have <math>x_0, \dots , x_n</math> where:<br />
<br />
:<math>\begin{align}P(X = x_i) &{}= p_i \end{align}</math><br />
:<math>x_0 \leq x_1 \leq x_2 \dots \leq x_n</math><br />
:<math>\sum p_i = 1</math><br />
<br />
Thus we can define the following method for generating pseudo random numbers in the discrete case (less-than-or-equal-to signs are used so that the case <math>U = 1</math> is not missed):<br />
<br />
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.<br />
*Step 2:<br />
**If <math>U < p_0</math>, return <math>X = x_0</math><br />
**If <math>p_0 \leq U < p_0 + p_1</math>, return <math>X = x_1</math><br />
** ...<br />
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} \leq U < p_0 + \dots + p_k</math>, return <math>X = x_k</math><br />
<br />
'''Example''' (from class):<br />
<br />
Suppose we have the following discrete distribution:<br />
<br />
:<math>\begin{align}<br />
P(X = 0) &{}= 0.3 \\<br />
P(X = 1) &{}= 0.2 \\<br />
P(X = 2) &{}= 0.5<br />
\end{align}</math><br />
<br />
The cumulative distribution function (cdf) for this distribution is then:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0, & \text{if } x < 0 \\<br />
0.3, & \text{if } 0 \leq x < 1 \\<br />
0.5, & \text{if } 1 \leq x < 2 \\<br />
1, & \text{if } 2 \leq x<br />
\end{cases}</math><br />
<br />
Then we can generate numbers from this distribution like this, given <math>u_0, \dots, u_n</math> from <math>U \sim~ Unif[0, 1]</math>:<br />
<br />
:<math><br />
x_i = \begin{cases}<br />
0, & \text{if } u_i \leq 0.3 \\<br />
1, & \text{if } 0.3 < u_i \leq 0.5 \\<br />
2, & \text{if } 0.5 < u_i \leq 1<br />
\end{cases}</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
p=[0.3,0.2,0.5];<br />
for i=1:1000;<br />
u=rand;<br />
if u <= p(1)<br />
x(i)=0;<br />
elseif u <= p(1)+p(2)<br />
x(i)=1;<br />
else<br />
x(i)=2;<br />
end<br />
end<br />
<br />
===[[Acceptance-Rejection Sampling]] - May 14, 2009===<br />
<br />
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the '''Acceptance/Rejection Method'''.<br />
<br />
====Acceptance/Rejection Method====<br />
<br />
Suppose we wish to sample from a target distribution <math>f(x)</math> that is difficult or impossible to sample from directly, but that we have a proposal distribution <math>g(x)</math> from which we have a reasonable method of sampling (e.g. the uniform distribution). If there is a constant <math>c</math> such that <math> c \cdot g(x) \geq f(x)</math> for all <math>x</math>, then we can draw a candidate <math>Y</math> from <math>g(x)</math> and accept it with probability <math> \frac {f(Y)}{c \cdot g(Y)} </math>; the accepted samples follow the target distribution <math>f(x)</math>.<br />
<br />
The following graph shows the pdf of <math>f(x)</math> (target distribution) and <math> c \cdot g(x)</math> (proposal distribution)<br />
<br />
[[File:fxcgx.JPG]]<br />
<br />
At x = 7, <math>f(x)</math> is close to <math> c \cdot g(x)</math>, so a candidate drawn from <math>g(x)</math> is very likely to be accepted.<br />
<br />
At x = 9, the ratio <math> \frac {f(x)}{c \cdot g(x)} </math> is small, so a candidate drawn from <math>g(x)</math> is very likely to be rejected.<br />
<br />
'''Proof'''<br />
<br />
Note the following:<br />
*<math> Pr(X|accept) = \frac{Pr(accept|X) \times Pr(X)}{Pr(accept)} </math> (Bayes' theorem)<br />
*<math> Pr(accept|X) = \frac{f(x)}{c \cdot g(x)} </math><br />
*<math> Pr(X) = g(x) </math><br />
<br />
So,<br />
<math> Pr(accept) = \int^{}_x Pr(accept|X) \times Pr(X) dx </math><br />
<math> = \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx </math><br />
<math> = \frac{1}{c} \int^{}_x f(x) dx </math><br />
<math> = \frac{1}{c} </math><br />
<br />
Therefore,<br />
<math> Pr(X|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)} \times g(x)}{\frac{1}{c}} = f(x) </math> as required.<br />
<br />
'''Procedure (Continuous Case)'''<br />
<br />
*Choose <math>g(x)</math> (a density function that is simple to sample from)<br />
*Find a constant c such that :<math> c \cdot g(x) \geq f(x) </math><br />
#Let <math>Y \sim~ g(y)</math> <br />
#Let <math>U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else reject and go to step 1<br />
<br />
'''Example:'''<br />
<br />
Suppose we want to sample from Beta(2,1), for <math> 0 \leq x \leq 1 </math>.<br />
Recall:<br />
:<math> Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x </math><br />
*Choose <math> g(x) \sim~ Unif [0,1] </math><br />
*Find c: c = 2 (see notes below)<br />
#Let <math> Y \sim~ Unif [0,1] </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{2Y}{2} = Y </math>, then X=Y; else go to step 1<br />
<br />
<math>c</math> was chosen to be 2 here since <math> \max \left(\frac{f(x)}{g(x)}\right) = 2 </math>. Taking <math>c = \max \left(\frac{f(x)}{g(x)}\right)</math> is the standard way of finding a suitable <math>c</math> for the Acceptance/Rejection Method.<br />
<br />
<br />
In MATLAB, the code that demonstrates the result of this example is:<br />
<br />
i = 0;<br />
j = 1;<br />
while i < 1000<br />
y = rand;<br />
u = rand;<br />
if u <= y<br />
x(j) = y;<br />
j = j + 1;<br />
i = i + 1;<br />
end<br />
end<br />
hist(x);<br />
<br />
<br />
The histogram produced here should follow the target distribution, <math>f(x) = 2x</math>, for the interval <math> 0 \leq x \leq 1 </math>.<br />
<br />
The histogram shows the PDF of a Beta(2,1) distribution as expected.<br />
<br />
[[File:BetaDistn.jpg]]<br />
<br />
<br />
'''The Discrete Case'''<br />
<br />
The Acceptance/Rejection Method can be extended to discrete target distributions. The difference compared to the continuous case is that the proposal distribution <math>g(x)</math> must also be a discrete distribution. For our purposes, we can take g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.<br />
<br />
'''Example'''<br />
<br />
Suppose we want to sample from a distribution with the following probability mass function (pmf):<br />
P(X=1) = 0.15<br />
P(X=2) = 0.55<br />
P(X=3) = 0.20<br />
P(X=4) = 0.10 <br />
*Choose <math>g(x)</math> to be the discrete uniform distribution on the set <math>\{1,2,3,4\}</math><br />
*Find c: <math> c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2 </math><br />
#Generate <math> Y \sim~ Unif \{1,2,3,4\} </math><br />
#Let <math> U \sim~ Unif [0,1] </math><br />
#If <math>U \leq \frac{f(Y)}{2.2 \times 0.25} </math>, then X=Y; else go to step 1<br />
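The steps above can be sketched in Python (variable names are ours; <code>random</code> is the standard library module):<br />

```python
import random

pmf = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}   # target f(x)
g = 0.25                                     # proposal: discrete uniform on {1,2,3,4}
c = 0.55 / 0.25                              # c = max f(x)/g(x) = 2.2

def draw():
    while True:
        y = random.randint(1, 4)             # Y ~ Unif{1,2,3,4}
        u = random.random()                  # U ~ Unif[0,1]
        if u <= pmf[y] / (c * g):            # accept with probability f(Y)/(c*g(Y))
            return y

random.seed(0)
samples = [draw() for _ in range(20000)]
freq = {k: samples.count(k) / len(samples) for k in pmf}
```

On average <math>c = 2.2</math> proposals are needed per accepted sample, and the empirical frequencies should be close to the target pmf.<br />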
<br />
'''Limitations'''<br />
<br />
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size from the target distribution can be established. It may also be difficult to find a <math>g(x)</math> that satisfies all the conditions of the procedure.<br />
<br />
We will now begin to discuss sampling from specific distributions.<br />
<br />
====Special Technique for sampling from Gamma Distribution====<br />
<br />
Recall that the cdf of the Gamma distribution is:<br />
<br />
<math> F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.<br />
<br />
<br />
'''Additive Property of Gamma Distribution'''<br />
<br />
Recall that if <math>X_1, \dots, X_t</math> are independent exponential random variables with rate <math> \lambda </math> (mean <math> 1/\lambda </math>), in other words <math> X_i\sim~ Exp (\lambda) </math>, then <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
Thus, if we want to sample from the Gamma distribution, we can sample from t independent exponential distributions with rate <math> \lambda </math> (using the Inverse Method) and add them up. Details will be discussed in the next lecture.<br />
<br />
<br />
===[[Techniques for Normal and Gamma Sampling]] - May 19, 2009===<br />
<br />
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two such cases: Gamma distributions and Normal distributions.<br />
<br />
====Gamma Distribution====<br />
<br />
<br />
Given the additive property of the gamma distribution,<br />
<br />
<br />
If <math>X_1, \dots, X_t</math> are independent random variables with <math> X_i\sim~ Exp (\lambda) </math> then,<br />
: <math> \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda) </math><br />
<br />
We can use the Inverse Transform Method and sample from independent uniform distributions seen before to generate a sample following a Gamma distribution.<br />
<br />
<br />
:'''Procedure '''<br />
<br />
:#Sample independently from a uniform distribution <math>t</math> times, giving <math> u_1,\dots,u_t</math> <br />
:#Sample independently from an exponential distribution <math>t</math> times, giving <math> x_1,\dots,x_t</math> such that,<br> <math> \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}<br />
</math> <br><br> Using the Inverse Transform Method, <br> <math> \begin{align} x_i = -\frac {1}{\lambda}\log(u_i) \end{align}</math><br />
:#Using the additive property,<br><math> \begin{align} X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2) \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda) \end{align} </math><br />
<br />
<br><br />
This procedure can be illustrated in Matlab using the code below with <math>t = 5, \lambda = 1 </math> : <br />
<br />
U = rand(10000,5);<br />
X = sum( -log(U), 2);<br />
hist(X)<br />
<br />
[[File:Gamma1.jpg]]<br />
<br />
====Normal Distribution====<br />
[[Image:Box_Muller.png|right|thumb|150px|"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]]]<br />
<br />
The cdf for the Standard Normal distribution is:<br />
<br />
:<math> F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx </math><br />
<br />
We can see that the normal distribution is difficult to sample from using the general methods seen so far, eg. the inverse is not easy to obtain from F(Z); we may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal Distribution.<br />
<br />
=====Box-Muller Method===== <br />
<br />
<br />
<br />
The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.<br />
<br />
<br />
:'''Properties of Polar and Cartesian Coordinates'''<br />
: If x and y are points on the Cartesian plane, r is the length of the radius from a point in the polar plane to the pole, and θ is the angle formed with the polar axis then,<br />
::* <math> \begin{align} r^2 = x^2 + y^2 \end{align} </math><br />
::* <math> \tan(\theta) = \frac{y}{x} </math><br />
<br><br />
::* <math> \begin{align} x = r \cos(\theta) \end{align}</math><br />
::* <math> \begin{align} y = r \sin(\theta) \end{align}</math><br />
<br />
<br />
<br />
Let X and Y be independent random variables with a standard normal distribution,<br />
:<math> X \sim~ N(0,1) </math><br />
:<math> Y \sim~ N(0,1) </math><br />
<br />
also, let <math>r</math> and <math>\theta</math> be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,<br />
<br />
:<math>\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ <br />
&{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}<br />
</math><br />
<br />
It can also be shown that, letting <math>d = r^2</math>, the joint distribution of d and θ is given by,<br />
<br />
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}} \cdot \frac{1}{2\pi} \end{matrix} </math><br />
Note that <math> \begin{matrix}f(d,\theta)\end{matrix}</math> factors into the product of two density functions, an Exponential and a Uniform, so d and <math>\theta</math> are independent with<br />
<math> \begin{matrix} \Rightarrow d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math><br />
<br />
<br><br />
:'''Procedure for using Box-Muller Method'''<br />
<br />
:# Sample independently from a uniform distribution twice, giving <math> \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
:# Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,<br><math> \begin{align}<br />
d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2 \end{align} </math><br />
:# Transform r and θ back to x and y, <br> <math> \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align} </math><br />
<br><br />
Notice that the Box-Muller Method generates a pair of independent Standard Normal distributions, x and y.<br />
<br />
This procedure can be illustrated in Matlab using the code below:<br />
<br />
u1 = rand(5000,1);<br />
u2 = rand(5000,1);<br />
<br />
d = -2*log(u1);<br />
theta = 2*pi*u2;<br />
<br />
x = d.^(1/2).*cos(theta);<br />
y = d.^(1/2).*sin(theta);<br />
<br />
figure(1);<br />
<br />
subplot(2,1,1);<br />
hist(x);<br />
title('X');<br />
subplot(2,1,2);<br />
hist(y);<br />
title('Y');<br />
<br />
[[File:Stdnorm.jpg]]<br />
<br />
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:<br />
<br />
subplot(2,1,1);<br />
hist(d);<br />
title('d follows an exponential distribution');<br />
subplot(2,1,2);<br />
hist(theta);<br />
title('theta follows a uniform distribution over [0, 2*pi]');<br />
<br />
[[File:BothMay19.jpg]]<br />
<br />
=====Useful Properties (Single and Multivariate)=====<br />
<br />
Box-Muller can be used to sample a standard normal distribution. However, there are many properties of Normal distributions that allow us to use the samples from Box-Muller method to sample any Normal distribution in general.<br />
<br />
<br />
:'''Properties of Normal distributions ''' <br />
::* <math> \begin{align} \text{If } & X = \mu + \sigma Z, & Z \sim~ N(0,1) \\ &\text{then } X \sim~ N(\mu,\sigma ^2) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, & Z_1,\dots,Z_d \sim~ N(0,1) \\ &\text{then } \vec{Z} \sim~ N(\vec{0},I) \end{align} </math><br />
<br />
::* <math> \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, & \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then } \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align} </math><br />
<br><br />
These properties can be illustrated through the following example in Matlab using the code below:<br />
<br />
Example: For a Multivariate Normal distribution with mean <math>u=\begin{bmatrix} -2 \\ 3 \end{bmatrix}</math> and covariance <math>\Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}</math><br />
<br />
<br />
u = [-2; 3];<br />
sigma = [ 1 1/2; 1/2 1];<br />
<br />
r = randn(15000,2);<br />
ss = chol(sigma);<br />
<br />
X = ones(15000,1)*u' + r*ss;<br />
plot(X(:,1),X(:,2), '.');<br />
<br />
[[File:MultiVariateMay19.jpg]]<br />
<br />
Note: In the example above, we had to generate the square root of <math>\Sigma</math> using the Cholesky decomposition, <br />
<br />
ss = chol(sigma);<br />
<br />
which gives <math>ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}</math>. Matlab also has the sqrtm function:<br />
<br />
ss = sqrtm(sigma);<br />
<br />
which gives a different matrix, <math>ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}</math>, but the plot looks about the same (X has the same distribution).<br />
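The same experiment can be reproduced in Python with NumPy as a quick check (note that NumPy's <code>cholesky</code> returns the lower-triangular factor, whereas Matlab's <code>chol</code> returns the upper-triangular one):<br />

```python
import numpy as np

mu = np.array([-2.0, 3.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])

L = np.linalg.cholesky(sigma)          # lower-triangular, L @ L.T == sigma

rng = np.random.default_rng(0)
z = rng.standard_normal((15000, 2))    # rows are draws from N(0, I)
x = mu + z @ L.T                       # rows are draws from N(mu, sigma)

emp_mean = x.mean(axis=0)
emp_cov = np.cov(x.T)
```

The empirical mean and covariance of the 15000 rows should be close to <math>\vec{\mu}</math> and <math>\Sigma</math>, whichever square root of <math>\Sigma</math> is used.<br />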
<br />
===[[Bayesian and Frequentist Schools of Thought]] - May 21, 2009===<br />
<br />
==[[Monte Carlo Integration]] - May 26, 2009==<br />
Today's lecture completes the discussion on the Frequentists and Bayesian schools of thought and introduces '''Basic Monte Carlo Integration'''.<br><br><br />
<br />
====Frequentist vs Bayesian Example - Estimating Parameters====<br />
<br />
Estimating parameters of a univariate Gaussian:<br />
<br />
Frequentist: <math>f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}</math><br><br />
Bayesian: <math>f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}</math><br />
<br />
=====Frequentist Approach=====<br />
<br />
Let <math>X^N</math> denote <math>(x_1, x_2, ..., x_N)</math>. Using the Maximum Likelihood Estimation approach for estimating parameters we get:<br><br />
:<math>L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}</math><br />
:<math>l(X^N; \theta) = \sum_{i=1}^N -\frac{1}{2}log (2\pi) - log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2 </math><br />
:<math>\frac{dl}{d\mu} = \frac{1}{\sigma^2}\displaystyle\sum_{i=1}^N(x_i-\mu)</math><br />
Setting <math>\frac{dl}{d\mu} = 0</math> we get<br />
:<math>\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu</math><br />
:<math>\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \mu = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i</math><br><br />
<br />
=====Bayesian Approach=====<br />
<br />
Assuming the prior on <math>\theta</math> (the mean) is Gaussian:<br />
:<math>P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\theta-\mu_0}{\tau})^2}</math><br />
:<math>f(\theta|x) \propto \prod_{i=1}^N\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\theta}{\sigma})^2} \cdot \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{\theta-\mu_0}{\tau})^2}</math><br />
By completing the square we conclude that the posterior is Gaussian as well:<br />
:<math>f(\theta|x)=\frac{1}{\sqrt{2\pi}\tilde{\sigma}}e^{-\frac{1}{2}(\frac{\theta-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
Where<br />
:<math>\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0</math><br />
The expectation from the posterior is different from the MLE method.<br />
Note that <math>\displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}</math>. Also note that when <math>N = 0</math> we get <math>\tilde{\mu} = \mu_0</math>.<br />
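These two limits can be checked numerically with a minimal Python sketch (the values of <math>\sigma</math>, <math>\tau</math>, <math>\mu_0</math> and <math>\bar{x}</math> below are arbitrary):<br />

```python
def posterior_mean(N, xbar, sigma=1.0, tau=2.0, mu0=0.0):
    """Posterior mean of a Gaussian mean with a Gaussian prior (formula above)."""
    w_data = N / sigma**2        # precision contributed by the data
    w_prior = 1 / tau**2         # precision contributed by the prior
    return (w_data * xbar + w_prior * mu0) / (w_data + w_prior)

no_data = posterior_mean(0, 5.0)        # N = 0: the prior mean dominates
much_data = posterior_mean(10**6, 5.0)  # large N: the sample mean dominates
```

With no data the posterior mean equals <math>\mu_0</math>, and as N grows it converges to <math>\bar{x}</math>.<br />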
<br />
====Basic Monte Carlo Integration====<br />
<br />
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term '''Monte Carlo''' is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. ''Stochastic simulation'' refers to a random process in which its future evolution is described by probability distributions (counterpart to a deterministic process), and these simulation methods are known as ''Monte Carlo methods''. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [http://www.chem.unl.edu/zeng/joy/mclab/mcintro.html]<br />
<br />
<br />
We start with a simple example:<br />
:<math>I = \displaystyle\int_a^b h(x)\,dx</math><br />
::<math> = \displaystyle\int_a^b w(x)f(x)\,dx</math><br />
where<br />
:<math>\displaystyle w(x) = h(x)(b-a)</math><br />
:<math>f(x) = \frac{1}{b-a} \rightarrow</math> the p.d.f. is Unif<math>(a,b)</math><br />
:<math>I = E_f[w(x)] \approx \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
If <math>x_i \sim~ Unif(a,b)</math> then by the '''Law of Large Numbers''' <math>\frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]</math><br />
<br />
=====Process=====<br />
Given <math>I = \displaystyle\int^b_ah(x)\,dx</math><br />
# <math>\begin{matrix} w(x) = h(x)(b-a)\end{matrix}</math><br />
# <math> \begin{matrix} x_1, x_2, ..., x_n \sim UNIF(a,b)\end{matrix}</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
From this we can compute other statistics, such as<br />
# <math> SE=\frac{s}{\sqrt{N}}</math> where <math>s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} </math> with <math> \begin{matrix}Y_i=w(x_i)\end{matrix}</math><br />
# <math>\begin{matrix} 1-\alpha \end{matrix}</math> CI's can be estimated as <math> \hat{I}\pm Z_\frac{\alpha}{2}*SE</math><br />
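The three-step process and the error estimates above can be combined into one Python sketch (the integrand <math>e^x</math> on <math>[0,2]</math> is an arbitrary test case whose true value <math>e^2 - 1</math> is known):<br />

```python
import math, random

def mc_integrate(h, a, b, n=100000, seed=0):
    """Estimate I = int_a^b h(x) dx; return (I_hat, SE, 95% CI)."""
    random.seed(seed)
    ys = [h(a + (b - a) * random.random()) * (b - a) for _ in range(n)]  # Y_i = w(x_i)
    i_hat = sum(ys) / n
    s2 = sum((y - i_hat) ** 2 for y in ys) / (n - 1)
    se = math.sqrt(s2 / n)
    return i_hat, se, (i_hat - 1.96 * se, i_hat + 1.96 * se)

i_hat, se, ci = mc_integrate(math.exp, 0.0, 2.0)   # true value: e^2 - 1 ≈ 6.389
```

The standard error shrinks like <math>1/\sqrt{N}</math>, so quadrupling N halves the width of the confidence interval.<br />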
<br />
'''Example 1'''<br />
<br />
Find <math> E[\sqrt{x}]</math> for <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
<br />
# We need to draw from <math>\begin{matrix} f(x) = e^{-x}\end{matrix} </math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
<br />
u=rand(100,1);<br />
x=-log(u);<br />
h=sqrt(x);<br />
mean(h)<br />
%The value obtained using the Monte Carlo method<br />
F = @(x) sqrt(x).*exp(-x);<br />
quad(F,0,50)<br />
%The value of the integral computed numerically by Matlab<br />
<br />
'''Example 2'''<br />
Find <math> I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3 </math><br />
# <math> \displaystyle I = \left[\frac{x^4}{4}\right]_0^1 = \frac{1}{4} </math><br />
# <math>\displaystyle w(x) = x^3\times(1-0)</math><br />
# <math> x_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N x_i^3</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
x = rand(1000,1);<br />
mean(x.^3)<br />
<br />
'''Example 3'''<br />
To estimate an improper integral such as <math> I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x} </math><br />
# Substitute <math> y=\frac{1}{x-5+1}=\frac{1}{x-4} \Rightarrow x=\frac{1}{y}+4,\ dx=-\frac{1}{y^2}\,dy </math>; as x runs from 5 to <math>\infty</math>, y runs from 1 down to 0, and the minus sign from dx cancels with flipping the limits of integration<br />
# <math> I = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{y^2}\,dy </math><br />
# <math> w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{y^2}(1-0)</math><br />
# <math> y_i \sim~Unif(0,1)</math><br />
# <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N\frac{3e^{-(\frac{1}{y_i}+4)}}{y_i^2}</math><br />
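A Python sketch of this example; since <math> I = 3e^{-5} \approx 0.0202 </math> analytically, the estimate can be checked directly:<br />

```python
import math, random

random.seed(1)
n = 200000
total = 0.0
for _ in range(n):
    y = 1.0 - random.random()                    # y_i ~ Unif(0,1], avoids y = 0
    total += 3 * math.exp(-(1 / y + 4)) / y**2   # w(y_i)
i_hat = total / n                                # estimate of 3*exp(-5)
```
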
<br />
==[[Importance Sampling and Monte Carlo Simulation]] - May 28, 2009==<br />
<br />
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.<br />
<br />
====Binomial Probability Monte Carlo Simulations====<br />
<br />
=====Example 1:=====<br />
You are given two independent Binomial distributions with probabilities <math>\displaystyle p_1\text{, }p_2</math>. Using a Monte Carlo simulation, approximate the value of <math>\displaystyle \delta</math>, where <math>\displaystyle \delta = p_1 - p_2</math>.<br><br />
:<math>\displaystyle X \sim BIN(n, p_1)</math>; <math>\displaystyle Y \sim BIN(n, p_2)</math>; <math>\displaystyle \delta = p_1 - p_2</math><br><br><br />
<br />
So <math>\displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2) f(p_1,p_2)}{f(x,y)}</math> where <math>\displaystyle f(p_1,p_2)</math> is a flat (uniform) prior, and the expected value of <math>\displaystyle \delta</math> is as follows:<br><br />
:<math>\displaystyle \hat{\delta} = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Since X, Y are independent, we can split the conditional probability distribution:<br><br />
:<math>\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)</math><br><br><br />
<br />
We need to find conditional distributions for <math>\displaystyle p_1, p_2</math> to draw samples from. With a flat prior, the posterior distribution of the success probability of a Binomial is a Beta distribution. Let<br><br />
:<math>\displaystyle f(p_1 | X) \sim \text{Beta}(x+1, n-x+1)</math> and <math>\displaystyle f(p_2 | Y) \sim \text{Beta}(y+1, n-y+1)</math>, where<br><br />
:<math>\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}</math><br><br><br />
<br />
'''Process:'''<br />
# Draw samples for <math>\displaystyle p_1</math> and <math>\displaystyle p_2</math>: <math>\displaystyle (p_1,p_2)^{(1)}</math>, <math>\displaystyle (p_1,p_2)^{(2)}</math>, ..., <math>\displaystyle (p_1,p_2)^{(n)}</math>;<br />
# Compute <math>\displaystyle \delta = p_1 - p_2</math> in order to get n values for <math>\displaystyle \delta</math>;<br />
# <math>\displaystyle \hat{\delta}=\frac{\displaystyle\sum_{\forall i}\delta^{(i)}}{N}</math>.<br><br><br />
<br />
'''Matlab Code:'''<br><br />
:The Matlab code for recreating the above example is as follows:<br />
n=100; %number of trials for X<br />
m=100; %number of trials for Y<br />
x=80; %number of successes for X trials<br />
y=60; %number of successes for y trials<br />
p1=betarnd(x+1, n-x+1, 1, 1000);<br />
p2=betarnd(y+1, m-y+1, 1, 1000);<br />
delta=p1-p2;<br />
mean(delta);<br />
<br />
The mean in this example is given by 0.1938.<br />
<br />
A 95% confidence interval for <math>\delta</math> is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:<br />
q1=quantile(delta,0.025);<br />
q2=quantile(delta,0.975);<br />
<br />
The 95% confidence interval is approximately <math> (0.06606,\ 0.32204) </math><br />
<br />
The histogram of delta is:<br><br />
[[File:Delta_hist.jpg]]<br />
<br />
Note: In this case, we can also find <math>E(\delta)</math> analytically since <br />
<math>E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 </math>. Compare this with the maximum likelihood estimate for <math>\delta</math>: <math>\frac{x}{n} - \frac{y}{m} = 0.2</math>.<br />
<br />
=====Example 2:=====<br />
Bayesian Inference for Dose Response<br />
<br />
We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:<br />
:<math>\displaystyle x_1<x_2<...<x_{10}</math><br><br />
For each dose <math>\displaystyle x_i</math> we test n rats and observe <math>\displaystyle Y_i</math>, the number of rats that die. Therefore,<br><br />
:<math>\displaystyle Y_i \sim~ BIN(n, p_i)</math>.<br />
We can assume that the probability of death grows with the concentration of drug given, i.e. <math>\displaystyle p_1<p_2<...<p_{10}</math>. Estimate the dose at which the animals have at least a 50% chance of dying.<br><br />
:Let <math>\displaystyle \delta=x_j</math> where <math>\displaystyle j=min\{i|p_i\geq0.5\}</math><br />
:We are interested in <math>\displaystyle \delta</math> since any higher concentrations are known to have a higher death rate.<br><br><br />
<br />
'''Solving this analytically is difficult:'''<br />
:<math>\displaystyle \delta = g(p_1, p_2, ..., p_{10})</math> where g is an unknown function<br />
:<math>\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}</math><br><br />
:: where <math>\displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}</math><br><br><br />
<br />
'''Process: Monte Carlo'''<br><br />
# Draw <math>\displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)</math><br />
# Keep sample only if it satisfies <math>\displaystyle p_1\leq p_2\leq ...\leq p_{10}</math>, otherwise discard and try again.<br />
# Compute <math>\displaystyle \delta</math> by finding the first dose <math>\displaystyle x_i</math> whose sampled <math>\displaystyle p_i</math> is at least 0.5.<br />
# Repeat the process N times to get N estimates <math>\displaystyle \delta_1, \delta_2, ..., \delta_N </math>.<br />
# <math>\displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}</math>.<br />
<br />
For instance, suppose that at each dose level <math>X_i</math>, for <math>1 \leq i \leq 10</math>, 10 rats are used, and the number that die is observed to be <math>Y_i</math>, where <math>Y_1 = 4, Y_2 = 3, </math> etc.<br />
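A Python sketch of the five steps; the death counts <code>y</code> below are hypothetical (the notes only specify <math>Y_1 = 4, Y_2 = 3</math>), chosen to be monotone for illustration, with <math>n = 10</math> rats per dose:<br />

```python
import random

doses = list(range(1, 11))             # x_1 < x_2 < ... < x_10
y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]     # hypothetical death counts at each dose
n = 10

random.seed(0)
deltas = []
while len(deltas) < 500:
    # Step 1: draw p_i ~ Beta(y_i + 1, n - y_i + 1)
    p = [random.betavariate(yi + 1, n - yi + 1) for yi in y]
    # Step 2: keep the sample only if p_1 <= p_2 <= ... <= p_10
    if any(p[i] > p[i + 1] for i in range(9)):
        continue
    # Step 3: delta = first dose whose p_i is at least 0.5 (if any)
    j = next((i for i, pi in enumerate(p) if pi >= 0.5), None)
    if j is not None:
        deltas.append(doses[j])
# Steps 4-5: average the kept deltas
delta_hat = sum(deltas) / len(deltas)
```

Most candidate samples are discarded by the ordering constraint, which is the main cost of this simple rejection approach.<br />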
<br />
====Importance Sampling====<br />
<br />
In statistics, Importance Sampling is a technique for estimating properties of a particular distribution. As with the Acceptance/Rejection method, we choose a distribution that is easy to simulate from. The main difference in importance sampling, however, is that certain values of the input random variables have more impact on the parameter being estimated than others, and the samples are weighted accordingly. [Importance Sampling, Wikipedia] The following diagram illustrates a Monte Carlo approximation for g(x):<br />
<br><br />
<br><br />
[[File:ImpSampling.PNG]] <br />
<br />
As the figure above shows, the uniform distribution <math>U\sim~Unif[0,1]</math> is a proposal distribution to sample from and g(x) is the target function. Here we cast the integral <math>\int_{0}^1 g(x)\,dx</math> as the expectation with respect to U, <math>\int_{0}^1 g(x)\,dx = E(g(U))</math>, which we approximate by <math>\frac{1}{n}\displaystyle\sum_{i=1}^{n} g(u_i)</math>. <br><br />
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]<br />
<br><br />
<br><br />
In <math>I = \displaystyle\int h(x)f(x)\,dx</math>, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability distribution function g(x) which is easy to sample from, then <math>I</math> can be written as<br><br />
:: <math>I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>= \displaystyle E_g(w(x)) \rightarrow</math>the expectation of w(x) with respect to g(x) and therefore <math>\displaystyle x_1,x_2,...,x_N \sim~ g</math><br />
:: <math>\approx \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math> where <math>\displaystyle w(x) = \frac{h(x)f(x)}{g(x)}</math><br><br><br />
<br />
'''Process'''<br><br />
# Choose <math>\displaystyle g(x)</math> such that it's easy to sample from.<br />
# Compute <math>\displaystyle w(x)=\frac{h(x)f(x)}{g(x)}</math><br />
# <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N}</math><br><br><br />
<br />
<br />
Note: By the law of large numbers, we can say that <math>\hat{I}</math> converges in probability to <math>I </math>.<br />
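As a toy illustration of the process above (an invented example, not from the lecture): take <math>f</math> to be the Exponential(1) density, <math>h(x)=x^2</math> (so <math>I = E_f[X^2] = 2</math>), and sample from the thicker-tailed proposal <math>g</math>, an Exponential with rate 1/2:<br />

```python
import math
import random

random.seed(0)

def f(x):           # target density: Exponential(1)
    return math.exp(-x)

def g(x):           # proposal density: Exponential(rate 1/2), thicker tail than f
    return 0.5 * math.exp(-0.5 * x)

def h(x):
    return x * x

N = 20000
# Sample x_1, ..., x_N from g
xs = [random.expovariate(0.5) for _ in range(N)]
# w(x) = h(x) f(x) / g(x)
ws = [h(x) * f(x) / g(x) for x in xs]
I_hat = sum(ws) / N   # estimates E_f[X^2] = 2
print(I_hat)
```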
<br />
'''"Weighted" average'''<br><br />
:The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from <math>\displaystyle g(x)</math> that are closer to the original distribution <math>\displaystyle f(x)</math>, which we would ideally like to sample from (but cannot because it is too difficult).<br><br />
:<math>\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx</math><br />
:<math>=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx</math><br />
:<math>=\displaystyle E_g\left(\frac{f(x)}{g(x)}h(x)\right)</math> which is the same thing as saying that we are applying a regular Monte Carlo Simulation method to <math>\displaystyle\int h(x)g(x)\,dx </math>, taking each result from this process and weighting the more accurate ones (i.e. the ones for which <math>\displaystyle \frac{f(x)}{g(x)}</math> is high) higher.<br />
<br />
One can view <math> \frac{f(x)}{g(x)}\ = B(x)</math> as a weight. <br />
<br />
Then <math>\displaystyle \hat{I} = \frac{\displaystyle\sum_{i=1}^{N} w(x_i)}{N} = \frac{\displaystyle\sum_{i=1}^{N} B(x_i)h(x_i)}{N}</math><br><br><br />
<br />
i.e. we are computing a weighted sum of <math> h(x_i) </math> instead of an unweighted sum<br />
<br />
===[[A Deeper Look into Importance Sampling]] - June 2, 2009 ===<br />
From last class, we have determined that an integral can be written in the form <math>I = \displaystyle\int h(x)f(x)\,dx </math> <math>= \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx</math> We continue our discussion of Importance Sampling here.<br />
<br />
====Importance Sampling====<br />
<br />
We can see that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(h(x)) \rightarrow</math>the expectation of h(x) with respect to g(x), where <math>\displaystyle \frac{f(x)}{g(x)} </math> is a weight <math>\displaystyle\beta(x)</math>. In the case where <math>\displaystyle f > g</math>, a greater weight for <math>\displaystyle\beta(x)</math> will be assigned. Thus, the points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.<br />
<br />
=====Problem=====<br />
The method of Importance Sampling is simple but can lead to some problems. The <math> \displaystyle \hat I </math> estimated by Importance Sampling could have infinite standard error.<br />
<br />
Given <math>\displaystyle I= \int w(x) g(x) dx </math><br />
<math>= \displaystyle E_g(w(x)) </math><br />
<math>= \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) </math><br />
where <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math>.<br />
<br />
Obtaining the second moment,<br />
::<math>\displaystyle E[(w(x))^2] </math><br />
::<math>\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx</math><br />
::<math>\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx </math><br />
::<math>\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx </math><br />
<br />
We can see that if <math>\displaystyle g(x) \rightarrow 0 </math>, then <math>\displaystyle E[(w(x))^2] \rightarrow \infty </math>. This occurs when <math>\displaystyle g </math> has a thinner tail than <math>\displaystyle f </math>, in which case <math>\frac{h^2(x)f^2(x)}{g(x)} </math> can become arbitrarily large. The general idea here is that <math>\frac{f(x)}{g(x)} </math> should not be large.<br />
<br />
=====Remark 1=====<br />
It is evident that <math>\displaystyle g(x) </math> should be chosen such that it has a thicker tail than <math>\displaystyle f(x) </math>.<br />
If <math>\displaystyle f</math> is large over set <math>\displaystyle A</math> but <math>\displaystyle g</math> is small, then <math>\displaystyle \frac{f}{g} </math> would be large and it would result in a large variance.<br />
<br />
=====Remark 2=====<br />
It is useful if we can choose <math>\displaystyle g </math> to be similar to <math>\displaystyle f</math> in terms of shape. Ideally, the optimal <math>\displaystyle g </math> should be proportional to <math>\displaystyle \left| h(x) \right|f(x)</math>, and have a thicker tail. It's important to take the absolute value of <math>\displaystyle h(x)</math>, since a variance can't be negative. Analytically, the best <math>\displaystyle g</math> is the one that minimizes the variance of the estimator.<br />
<br />
=====Remark 3=====<br />
Choose <math>\displaystyle g </math> such that it is similar to <math>\displaystyle \left| h(x) \right| f(x) </math> in terms of shape. That is, we want <math>\displaystyle g \propto \displaystyle \left| h(x) \right| f(x) </math><br />
<br />
<br />
====Theorem (Minimum Variance Choice of <math>\displaystyle g</math>) ====<br />
The choice of <math>\displaystyle g</math> that minimizes variance of <math>\hat I</math> is <math>\displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}</math>.<br />
<br />
=====Proof:=====<br />
We know that <math>\displaystyle w(x)=\frac{f(x)h(x)}{g(x)} </math><br />
<br />
The variance of <math>\displaystyle w(x) </math> is<br />
:: <math>\displaystyle Var[w(x)] </math><br />
:: <math>\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2 </math><br />
:: <math>\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) \right]^2 </math><br />
<br />
As we can see, the second term does not depend on <math>\displaystyle g(x) </math>. Therefore to minimize <math>\displaystyle Var[w(x)] </math> we only need to minimize the first term. In doing so we will use '''Jensen's Inequality'''.<br />
<br />
<br />
======Aside: Jensen's Inequality======<br />
If <math>\displaystyle g </math> is a convex function ( twice differentiable and <math>\displaystyle g''(x) \geq 0 </math> ) then <math>\displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)</math><br /><br />
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to higher dimensions:<br />
::<math>\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) </math> where <math>\displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1 </math><br />
<br />
=====Proof (cont)=====<br />
Using Jensen's Inequality, <br /><br />
::<math>\displaystyle g(E[x]) \leq E[g(x)] </math> as <math>\displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)] </math><br />
Therefore<br />
::<math>\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2 </math><br />
::<math>\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math> <br /><br />
and<br />
::<math>\displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 </math><br />
::<math>\displaystyle = \left(\int \left| h(x) \right| f(x) dx \right)^2 </math> since <math>\displaystyle f </math> and <math>\displaystyle g</math> are density functions, <math>\displaystyle f, g </math> cannot be negative. <br /><br />
<br />
Thus, this is a lower bound on <math>\displaystyle E[(w(x))^2]</math>. If we substitute <math>\displaystyle g^*(x) </math> into <math>\displaystyle E[(w(x))^2]</math>, we can see that this lower bound is attained, so <math>\displaystyle g^* </math> minimizes the variance. Details omitted.<br /><br />
<br />
However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute <math>\displaystyle g^*</math>.<br />
<br />
Note: Jensen's inequality is actually unnecessary here. We just use it to get <math>E[(w(x))^2] \geq (E[|w(x)|])^2</math>, which could be derived using variance properties: <math>0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2</math>.<br />
<br />
===[[Importance Sampling and Markov Chain Monte Carlo (MCMC)]] - June 4, 2009 ===<br />
Remark 4:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
:: <math>= \displaystyle\int h(x)\frac{f(x)}{g(x)}g(x)\,dx</math><br />
:: <math>\approx \displaystyle\frac{1}{N}\sum_{i=1}^{N} h(x_i)b(x_i)</math> where <math>\displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}</math> and <math>\displaystyle x_1,...,x_N \sim~ g</math><br />
Alternatively, since <math>\displaystyle \int f(x)\,dx = 1</math>,<br />
:: <math>I =\displaystyle \frac{\int h(x)f(x)\,dx}{\int f(x)\,dx}</math><br />
Apply the idea of importance sampling to both the numerator and denominator:<br />
:: <math>=\displaystyle \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x)\,dx}</math><br />
:: <math>\approx \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
:: <math>= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i)</math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
The above results in the following form of Importance Sampling:<br />
::<math> \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) </math> where <math>\displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)}</math><br />
This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in the Bayesian approach when f is a posterior density function.<br />
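A minimal Python sketch of this normalized form (an invented example, not from the lecture): suppose <math>f</math> is known only up to a constant, say <math>f(x) \propto e^{-x^2/2}</math> (a standard normal without its normalizing constant), <math>h(x)=x^2</math> (so the true value is 1), and the proposal is <math>g \sim N(0,2^2)</math>:<br />

```python
import math
import random

random.seed(0)

def f_unnorm(x):    # target known only up to a constant: f(x) proportional to exp(-x^2/2)
    return math.exp(-x * x / 2.0)

def g(x):           # proposal density: N(0, 2^2)
    return math.exp(-x * x / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))

def h(x):
    return x * x

N = 20000
xs = [random.gauss(0.0, 2.0) for _ in range(N)]
b = [f_unnorm(x) / g(x) for x in xs]   # unnormalized weights b(x_i)
b_sum = sum(b)
# Normalized weights b'(x_i) = b(x_i) / sum(b); the unknown constant cancels
I_hat = sum(bi * h(x) for bi, x in zip(b, xs)) / b_sum
print(I_hat)   # estimates E_f[X^2] = 1 for a standard normal
```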
==== Example of Importance Sampling ====<br />
Estimate <math> I = \displaystyle\ Pr (Z>3) </math> when <math>Z \sim~ N(0,1) </math><br />
::<math> I = \displaystyle\int^\infty_3 f(x)\,dx \approx 0.0013 </math><br />
::<math> = \displaystyle\int^\infty_{-\infty} h(x)f(x)\,dx </math><br />
:Define <math><br />
h(x) = \begin{cases}<br />
0, & \text{if } x < 3 \\<br />
1, & \text{if } 3 \leq x<br />
\end{cases}</math><br />
<br />
<br>'''Approach I: Monte Carlo'''<br><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math> where <math>X \sim~ N(0,1) </math><br />
The idea here is to sample from the normal distribution and count the number of observations greater than 3.<br />
<br />
The variability will be high in this case if using Monte Carlo since this is a low probability event (a tail event), and different runs may give significantly different values. For example: the first run may give only 3 occurrences (i.e. if we generate 1000 samples, the estimated probability will be .003), the second run may give 5 occurrences (probability .005), etc.<br />
<br />
This example can be illustrated in Matlab using the code below (we will be generating 100 samples in this case):<br />
<br />
format long<br />
for i = 1:100<br />
    a(i) = sum(randn(100,1)>=3)/100;<br />
end<br />
meanMC = mean(a)<br />
varMC = var(a)<br />
<br />
On running this, we get <math> meanMC = 0.0005 </math> and <math> varMC \approx 1.31313 * 10^{-5} </math><br />
<br />
<br>'''Approach II: Importance Sampling'''<br><br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)}</math> where <math>f(x)</math> is standard normal and <math>g(x)</math> needs to be chosen wisely so that it is similar to the target distribution.<br />
<br />
:Let <math>g(x) \sim~ N(4,1) </math><br />
:<math>b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}</math><br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)</math><br />
<br />
This example can be illustrated in Matlab using the code below:<br />
for j = 1:100<br />
    N = 100;<br />
    x = randn(N,1) + 4;<br />
    for ii = 1:N<br />
        h = x(ii)>=3;<br />
        b = exp(8-4*x(ii));<br />
        w(ii) = h*b;<br />
    end<br />
    I(j) = sum(w)/N;<br />
end<br />
MEAN = mean(I)<br />
VAR = var(I)<br />
<br />
Running the above code gave us <math> MEAN \approx 0.001353 </math> and <math> VAR \approx 9.666 * 10^{-8} </math>, which is very close to 0 and much less than the variability observed when using Monte Carlo.<br />
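For comparison, here is a rough Python translation of the importance sampling approach above, using the same <math>g \sim N(4,1)</math> and <math>b(x)=e^{8-4x}</math>:<br />

```python
import math
import random

random.seed(0)

N = 10000
# Sample from the proposal g ~ N(4, 1)
xs = [random.gauss(4.0, 1.0) for x in range(N)]
# h(x) is the indicator of x >= 3; b(x) = f(x)/g(x) = exp(8 - 4x)
ws = [(1.0 if x >= 3 else 0.0) * math.exp(8.0 - 4.0 * x) for x in xs]
I_hat = sum(ws) / N
print(I_hat)   # close to Pr(Z > 3), approximately 0.0013
```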
<br />
==== Markov Chain Monte Carlo (MCMC) ==== <br />
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain. <br />
<br />
Consider the same integral:<br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
Idea: If <math>\displaystyle X_1, X_2,...X_N</math> is a Markov Chain with stationary distribution f(x), then under some conditions<br />
<br />
:<math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int h(x)f(x)\,dx = I</math><br />
<br>'''Stochastic Process:'''<br><br />
A Stochastic Process is a collection of random variables <math>\displaystyle \{ X_t : t \in T \}</math><br />
*'''State Space Set:''' <math>\displaystyle X </math> is the set from which the random variables <math>\displaystyle X_t</math> take their values.<br />
*'''Indexed Set:''' <math>\displaystyle T </math> is the set from which <math>t</math> takes its values; it could be discrete or continuous in general, but we are only interested in the discrete case in this course.<br />
<br />
<br>'''Example 1'''<br><br />
i.i.d random variables<br />
:<math> \{ X_t : t \in T \}, X_t \in X </math><br />
:<math> X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrow</math>'''State Space'''<br />
:<math> T = \{1, 2, 3, 4, 5\} \rightarrow</math>'''Indexed Set'''<br />
<br />
<br>'''Example 2'''<br><br />
:<math>\displaystyle X_t</math>: price of a stock<br />
:<math>\displaystyle t</math>: opening date of the market<br />
::<br />
'''Basic Fact:''' In general, if we have random variables <math>\displaystyle X_1,...X_n</math><br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_{n-1},...,X_1)</math><br />
:<math>\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i)</math> where <math>\displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)</math><br />
<br>'''Markov Chain:'''<br><br />
A Markov Chain is a special form of stochastic process in which <math>\displaystyle X_t</math> depends only on <math> X_{t-1}</math>.<br />
<br />
For example,<br />
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math><br />
<br />
<br>'''Transition Probability:'''<br><br />
The probability of going from one state to another state.<br />
:<math>p_{ij} = \Pr(X_{n}=j\mid X_{n-1}= i). \,</math><br />
<br />
<br>'''Transition Matrix:'''<br><br />
For N states, the transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle P_{ij}</math> as below:<br />
<br />
:<math>P=\left(\begin{matrix}p_{1,1}&p_{1,2}&\dots&p_{1,j}&\dots\\<br />
p_{2,1}&p_{2,2}&\dots&p_{2,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots\\<br />
p_{i,1}&p_{i,2}&\dots&p_{i,j}&\dots\\<br />
\vdots&\vdots&\ddots&\vdots&\ddots<br />
\end{matrix}\right)</math><br />
<br />
<br>'''Example:'''<br><br />
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is <math>\displaystyle Pr(heads) = p</math>. And the probability of moving to the left is <math>\displaystyle Pr(tails) = q = 1-p </math>. Once the first or the last state is reached, we stop. The transition matrix that expresses this process is shown below:<br />
:<math>P=\left(\begin{matrix}1&0&0&\dots&0\\<br />
q&0&p&0&\dots\\<br />
0&q&0&p&\dots\\<br />
\vdots&\ddots&\ddots&\ddots&\vdots\\<br />
0&\dots&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br /><br /><br /><br />
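This random walk can also be simulated directly. The Python sketch below uses assumed parameters (5 states with the first and last absorbing, <math>p=q=1/2</math>, start in the middle state); by symmetry, absorption at either end should occur with probability about 1/2:<br />

```python
import random

random.seed(0)

p = 0.5          # probability of stepping right (heads)
n_states = 5     # states 0..4; states 0 and 4 are absorbing
start = 2
trials = 20000

absorbed_right = 0
for _ in range(trials):
    state = start
    # Walk until we hit an absorbing boundary
    while 0 < state < n_states - 1:
        state += 1 if random.random() < p else -1
    if state == n_states - 1:
        absorbed_right += 1

prob_right = absorbed_right / trials
print(prob_right)   # by symmetry, should be close to 0.5
```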
<br />
===<big>'''[[Markov Chain Definitions]]''' - June 9, 2009</big>===<br />
Practical application for estimation:<br />
The general concept behind this application is to generate a set of <math>x_i</math> which approach a distribution <math>f(x)</math>, so that a variant of importance sampling can be used to estimate an integral of the form<br /><br />
<math> I = \displaystyle\int h(x)f(x)\,dx </math> by <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)</math><br /><br />
All that is required is a Markov chain which eventually converges to <math>f(x)</math>.<br />
<br /><br /><br />
In the previous example, the entries <math>p_{ij}</math> in the transition matrix <math>P</math> represent the probability of reaching state <math>j</math> from state <math>i</math> after one step. For this reason, the entries <math>p_{ij}</math> over all <math>j</math> in a particular row sum to 1, as each row itself must be a pmf if a transition from <math>i</math> is to lead to a state still within the state set for <math>X_t</math>.<br />
<br />
'''Homogeneous Markov Chain'''<br /><br />
The probability matrix <math>P</math> is the same for all indices <math>n\in T</math>.<br />
<math>\displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)</math><br />
<br />
If we denote the pmf of <math>X_n</math> by a probability vector <math>\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]</math> <br /><br />
where <math>i</math> denotes an ordered index of all possible states of <math>X</math>.<br /><br />
Then we have a definition for the<br /><br />
'''marginal probability''' <math>\mu_n(i) = P(X_n=i)</math><br /><br />
where we simplify <math>X_n</math> to represent the ordered index of a state rather than the state itself.<br />
<br /><br /><br />
From this definition it can be shown that<br />
<math>\mu_{n}=\mu_{n-1}P</math><br />
<br />
<big>'''Proof:'''</big><br />
<br />
<math>\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}]</math><br />
and since<br />
<blockquote><br />
<math>\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)}</math><br />
<math>=\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)</math> <br />
<br />
</blockquote><br />
Therefore,<br /><br />
<math>\mu_{n-1}P=[\mu_{n}(1),\mu_{n}(2),...,\mu_{n}(i)]=\mu_{n}</math><br />
<br />
With this, it is possible to define <math>P(n)</math> as an n-step transition matrix where <math>P_{ij}(n) = Pr(X_n=j|X_0=i)</math><br /><br />
<br />
'''Theorem''': <math>\mu_n=\mu_0P^n</math><br /><br />
'''Proof''': <math>\mu_n=\mu_{n-1}P</math> From the previous conclusion<br /><br />
<math>=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP</math> And since this is a homogeneous Markov chain, <math>P</math> does not depend on <math>i</math> so<br /><br />
<math>=\mu_0P^n</math><br />
<br />
From this it becomes easy to define the n-step transition matrix as <math>P(n)=P^n</math><br />
<br />
====Summary of definitions====<br />
*'''transition matrix''' is an <math>N \times N</math> matrix, where <math>N=|X|</math>, with <math>P_{ij}=Pr(X_1=j|X_0=i)</math> where <math>i,j \in X</math><br /><br />
*'''n-step transition matrix''' also NxN with <math>P_{ij}(n)=Pr(X_n=j|X_0=i)</math><br /><br />
*'''marginal (probability of X)'''<math>\mu_n(i) = Pr(X_n=i)</math><br /><br />
*'''Theorem:''' <math>P(n)=P^n</math><br /><br />
*'''Theorem:''' <math>\mu_n=\mu_0P^n</math><br /><br />
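The relation <math>\mu_n=\mu_0P^n</math> can be checked numerically. The Python sketch below uses a small invented two-state chain and compares stepping <math>\mu_{n}=\mu_{n-1}P</math> with computing <math>\mu_0P^3</math> directly:<br />

```python
def mat_mult(A, B):
    # Multiply two matrices given as lists of rows
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def vec_mat(v, M):
    # Row vector times matrix
    return [sum(v[k] * M[k][j] for k in range(len(v))) for j in range(len(M[0]))]

# A two-state transition matrix (each row sums to 1)
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu0 = [1.0, 0.0]

# Compute mu_3 two ways: stepping mu_n = mu_{n-1} P, and mu_0 P^3 directly
mu = mu0
for _ in range(3):
    mu = vec_mat(mu, P)

P3 = mat_mult(mat_mult(P, P), P)
mu_direct = vec_mat(mu0, P3)
print(mu, mu_direct)   # the two results agree
```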
<br />
====Definitions of different types of state sets====<br />
Define <math>i,j \in</math> State Space<br /><br />
If <math>P_{ij}(n) > 0</math> for some <math>n</math> , then we say <math>i</math> reaches <math>j</math> denoted by <math>i\longrightarrow j</math> <br /><br />
This also means <math>j</math> is accessible from <math>i</math>: <math>j\longleftarrow i</math> <br /><br />
If <math>i\longrightarrow j</math> and <math>j\longrightarrow i</math> then we say <math>i</math> and <math>j</math> communicate, denoted by <math>i\longleftrightarrow j</math><br />
<br /><br /><br />
'''Theorems'''<br /><br />
1) <math>i\longleftrightarrow i</math><br /><br />
2) <math>i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math><br /><br />
3) If <math>i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k</math><br /><br />
4) The set of states of <math>X</math> can be written as a unique disjoint union of subsets (equivalence classes) <math>X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 </math> where two states <math>i</math> and <math>j</math> communicate if and only if they belong to the same subset<br />
<br /><br /><br />
'''More Definitions'''<br /><br />
A set is '''Irreducible''' if all states communicate with each other (has only one equivalence class).<br /><br />
A subset of states is '''Closed''' if once you enter it, you can never leave.<br /><br />
A subset of states is '''Open''' if once you leave it, you can never return.<br /><br />
An '''Absorbing Set''' is a closed set with only 1 element (i.e. consists of a single state).<br /><br />
<br />
<b>Note</b><br />
*We cannot have <math>\displaystyle i\longleftrightarrow j</math> with i recurrent and j transient since <math>\displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i</math>.<br />
*All states in an open class are transient.<br />
*A Markov Chain with a finite number of states must have at least 1 recurrent state.<br />
*A closed class with an infinite number of states has all transient or all recurrent states.<br /><br />
<br />
===[[Again on Markov Chain]] - June 11, 2009===<br />
<br />
<br />
====Decomposition of Markov chain====<br />
<br />
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.<br />
<br />
<br>'''Example:'''<br><br />
Let <math>\displaystyle X = \{1, 2, 3, 4\}</math> and the transition matrix be:<br />
<br />
<br />
:<math>P=\left(\begin{matrix}1/3&2/3&0&0\\<br />
2/3&1/3&0&0\\<br />
1/4&1/4&1/4&1/4\\<br />
0&0&0&1<br />
\end{matrix}\right)</math><br />
<br />
<br />
The decomposition in classes is:<br />
::::class 1: <math>\displaystyle \{1, 2\} \rightarrow </math> From the matrix we see that the states 1 and 2 have only <math>\displaystyle P_{12}</math> and <math>\displaystyle P_{21}</math> as nonzero transition probability<br />
::::class 2: <math>\displaystyle \{3\} \rightarrow </math> The state 3 can go to every other state but none of the others can go to it<br />
::::class 3: <math>\displaystyle \{4\} \rightarrow </math> This is an absorbing state since it is a close class and there is only one element<br />
::<br />
<br />
====Recurrent and Transient states====<br />
<br />
A state i is called <math>\emph{recurrent}</math> or <math>\emph{persistent}</math> if<br />
:<math>\displaystyle Pr(x_{n}=i</math> for some <math>\displaystyle n\geq 1 | x_{0}=i)=1 </math><br />
That means that the probability to come back to the state i, starting from the state i, is 1.<br />
<br />
If it is not the case (ie. probability less than 1), then state i is <math>\emph{transient} </math>.<br />
<br />
It is straightforward to prove that a finite irreducible chain is recurrent.<br />
::<br />
<br>'''Theorem'''<br><br />
Given a Markov chain, <br />
<br>A state <math>\displaystyle i</math> is <math>\emph{recurrent}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math><br />
<br>A state <math>\displaystyle i</math> is <math>\emph{transient}</math> if and only if <math>\displaystyle \sum_{\forall n}P_{ii}(n)< \infty</math><br />
<br />
<br>'''Properties'''<br><br />
*If <math>\displaystyle i</math> is <math>\emph{recurrent}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{recurrent}</math><br />
*If <math>\displaystyle i</math> is <math>\emph{transient}</math> and <math>i\longleftrightarrow j</math> then <math>\displaystyle j</math> is <math>\emph{transient}</math><br />
*In an equivalence class, either all states are recurrent or all states are transient<br />
*A finite Markov chain should have at least one recurrent state<br />
*The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)<br />
<br />
In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.<br />
<br />
<br>'''Example'''<br><br />
Let <math>\displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\}</math> and suppose that <math>\displaystyle P_{i,i+1}=p </math>, <math>\displaystyle P_{i,i-1}=q=1-p</math> and <math>\displaystyle P_{i,j}=0</math> otherwise.<br />
This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.<br />
<br />
We now see other properties of this particular Markov chain:<br />
*Since all states communicate if one of them is recurrent, then all states will be recurrent. On the other hand, if one of them is transient, then all the other will be transient too.<br />
*Consider now the case in which we are in state <math>\displaystyle 0</math>. If we move <math>n</math> steps to the right or to the left, the only way to go back to <math>\displaystyle 0</math> is to take <math>n</math> steps in the opposite direction.<br />
<math>\displaystyle Pr(X_{2n}=0\mid X_{0}=0)=P_{00}(2n)={2n \choose n}p^{n}q^{n}</math><br />
We now want to know whether state <math>\displaystyle 0</math> is transient or recurrent or, equivalently, whether <math>\displaystyle \sum_{\forall n}P_{ii}(n)=\infty</math> or not.<br />
<br />
To proceed with the analysis, we use the <math>\emph{Stirling }</math> <math>\displaystyle\emph{formula}</math>:<br />
<br />
<math>\displaystyle n!\sim n^{n}\sqrt{n}\,e^{-n}\sqrt{2\pi}</math><br />
<br />
The probability could therefore be approximated by:<br />
<br />
<math>\displaystyle P_{00}(2n)\sim\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
And the formula becomes:<br />
<br />
<math>\displaystyle \sum_{\forall n}P_{00}(2n)=\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}</math><br />
<br />
We can conclude that if <math>\displaystyle 4pq < 1</math> then the state is transient; otherwise it is recurrent.<br />
<br />
<math>\displaystyle<br />
\sum_{\forall n}P_{00}(2n)=\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}} = \begin{cases}<br />
\infty, & \text{if } p = \frac{1}{2} \\<br />
< \infty, & \text{if } p\neq \frac{1}{2} <br />
\end{cases}</math><br />
<br />
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n<br />
</math><br />
<br />
Then substitute in <math>x = pq</math>.<br />
<br />
<math><br />
\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)<br />
</math><br />
<br />
So we reach the same conclusion: all states are recurrent iff <math>p = q = \frac{1}{2}</math>.<br />
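The behaviour of <math>\sum_{n}P_{00}(2n)</math> can also be checked numerically with partial sums (a Python sketch; the term-to-term ratio is used to avoid computing huge binomial coefficients):<br />

```python
def partial_sum(p, terms):
    # Partial sum of sum_{n>=1} C(2n, n) p^n q^n = sum_{n>=1} P_00(2n),
    # using the ratio t_{n+1} = t_n * 2(2n+1)/(n+1) * p*q
    q = 1.0 - p
    t = 2.0 * p * q        # n = 1 term: C(2, 1) p q
    total = t
    for n in range(1, terms):
        t *= 2.0 * (2 * n + 1) / (n + 1) * p * q
        total += t
    return total

# For p = 0.4: 4pq = 0.96 < 1, so the series converges (to 1/sqrt(1-4pq) - 1 = 4)
s_transient = partial_sum(0.4, 2000)
# For p = 0.5: 4pq = 1, terms behave like 1/sqrt(pi*n), so partial sums keep growing
s_recurrent_1k = partial_sum(0.5, 1000)
s_recurrent_4k = partial_sum(0.5, 4000)
print(s_transient, s_recurrent_1k, s_recurrent_4k)
```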
<br />
====Convergence of Markov chain====<br />
We define the <math>\displaystyle \emph{Recurrence}</math> <math>\emph{time}</math> <math>\displaystyle T_{i,j}</math> as the minimum time to go from state i to state j. It is also possible that state j is never reached from state i.<br />
<br />
<math>\displaystyle T_{ij}=\begin{cases}<br />
\min\{n \geq 1: X_{n}=j\}, & \text{if such an } n \text{ exists} \\<br />
\infty, & \text{otherwise } <br />
\end{cases}</math><br />
<br />
The mean recurrence time <math>\displaystyle m_{i}</math> is defined as:<br />
<br />
<math>m_{i}=\displaystyle E(T_{ii})=\sum_{n} nf_{ii}(n) </math><br />
<br />
where <math>\displaystyle f_{ij}(n)=Pr(X_{1}\neq j,X_{2}\neq j,\cdots,X_{n-1}\neq j,X_{n}=j\mid X_{0}=i)</math> is the probability that the first visit to state <math>j</math>, starting from <math>i</math>, occurs at time <math>n</math>.<br />
<br />
<br />
Using the objects we just introduced, we say that:<br />
<br />
<math>\displaystyle \text{state } i=\begin{cases}<br />
\text{null}, & \text{if } m_{i}=\infty \\<br />
\text{non-null or positive} , & \text{otherwise } <br />
\end{cases}</math><br />
<br />
<br>'''Lemma'''<br><br />
In a finite state Markov Chain, all the recurrent states are positive (non-null)<br />
<br />
====Periodic and aperiodic Markov chain====<br />
A Markov chain is called <math>\emph{periodic}</math> of period <math>\displaystyle n</math> if, starting from a state, we will return to it every <math>\displaystyle n</math> steps with probability <math>\displaystyle 1</math>.<br />
<br />
<br>'''Example'''<br><br />
Consider the three-state chain:<br />
<br />
<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math><br />
<br />
It's evident that, starting from state 1, we will return to it on every <math>3^{rd}</math> step, and similarly for the other two states. The chain is therefore periodic with period <math>d=3</math><br />
<br />
<br />
An irreducible Markov chain is called <math>\emph{aperiodic}</math> if:<br />
<br />
<math>\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{ for some } n\ge 0 </math><br />
<br />
<br>'''Another Example'''<br><br />
Consider the chain<br />
<math>P=\left(\begin{matrix}<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
0&0.5&0&0.5\\<br />
0.5&0&0.5&0\\<br />
\end{matrix}\right)</math><br />
<br />
This chain is periodic: you can only get back to state 1 after an even number of steps (at least 2) <math>\Rightarrow</math> period <math>d=2</math><br />
<br />
<br />
==Markov Chains and their Stationary Distributions - June 16, 2009==<br />
====New Definition:Ergodic====<br />
A state is '''Ergodic''' if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.<br />
<br />
Define a vector <math>\pi</math> where <math>\pi_i > 0\ \forall i</math> and <math>\sum_i \pi_i = 1</math> (i.e. <math>\pi</math> is a pmf)<br />
<br />
<math>\pi</math> is a stationary distribution if <math>\pi=\pi P</math> where P is a transition matrix.<br />
<br />
====Limiting Distribution====<br />
If as <math>n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}<br />
\pi\\<br />
\pi\\<br />
\vdots\\<br />
\pi\\<br />
\end{matrix}\right]</math><br />
then <math>\pi</math> is the limiting distribution of the Markov Chain represented by P.<br /><br />
'''Theorem:''' An irreducible, ergodic Markov Chain has a unique stationary distribution <math>\pi</math> and there exists a limiting distribution which is also <math>\pi</math>.<br />
<br />
====Detailed Balance====<br />
<br />
The condition for detailed balanced is <math>\displaystyle \pi_i p_{ij} = p_{ji} \pi_j </math><br />
<br />
=====Theorem=====<br />
If <math>\pi</math> satisfies detailed balance then it is a stationary distribution.<br />
<br />
'''Proof:''' <br><br />
We need to show <math>\pi = \pi P</math><br />
<math>\displaystyle [\pi P]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j </math> as required<br />
<br />
Warning! A chain that has a stationary distribution does not necessarily converge.<br />
<br />
For example,<br />
<math>P=\left(\begin{matrix}<br />
0&1&0\\<br />
0&0&1\\<br />
1&0&0<br />
\end{matrix}\right)</math> has a stationary distribution <math>\left(\begin{matrix}<br />
1/3&1/3&1/3<br />
\end{matrix}\right)</math> but it will not converge.<br />
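This non-convergence is easy to check numerically. Below is a minimal sketch (in Python/NumPy, rather than the MATLAB used elsewhere in these notes) verifying that <math>\pi=(1/3,1/3,1/3)</math> satisfies <math>\pi=\pi P</math> while the powers <math>P^n</math> cycle with period 3 instead of converging:<br />

```python
import numpy as np

# The cyclic 3-state chain from the example above
P = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
pi = np.array([1/3, 1/3, 1/3])

# pi is stationary: pi P = pi
print(np.allclose(pi @ P, pi))                               # True

# ...but P^n does not converge: the powers cycle with period 3
print(np.allclose(np.linalg.matrix_power(P, 3), np.eye(3)))  # True: P^3 = I
print(np.allclose(np.linalg.matrix_power(P, 4), P))          # True: P^4 = P
```

Because <math>P^3=I</math>, the chain revisits its starting state every three steps, so <math>P^n</math> oscillates among three matrices forever.<br />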
<br />
====Stationary Distribution====<br />
<math>\pi</math> is a stationary (or invariant) distribution if <math>\displaystyle \pi = \pi P</math>.<br />
For example, if <math>\pi = (0.5,\ 0,\ 0.5)</math>, the chain will, in the long run, spend half of its time in the 1st state and half of its time in the 3rd state.<br />
<br />
====Theorem====<br />
<br />
An irreducible ergodic Markov Chain has a unique stationary distribution <math>\pi</math>.<br />
The limiting distribution exists and is equal to <math>\pi</math>.<br />
<br />
If g is any bounded function, then with probability 1:<br />
<math>\lim_{N \rightarrow \infty} \frac{1}{N}\displaystyle\sum_{n=1}^Ng(x_n) = E_\pi(g)=\displaystyle\sum_{j}g(j)\pi_j</math><br />
<br />
<br />
====Example====<br />
<br />
Find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/2&1/2&0\\<br />
1/2&1/4&1/4\\<br />
0&1/3&2/3<br />
\end{matrix}\right)</math><br />
<br />
Solve <math>\pi=\pi P</math><br />
<br />
<math>\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1</math><br /><br />
<math>\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2</math><br /><br />
<math>\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2</math><br /><br />
<br />
Also <math>\displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
We can solve the above system of equations and obtain <br /> <br />
<math>\displaystyle \pi_2 = 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_0 = \pi_1</math><br /><br />
<br />
Thus, <math>\displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1</math><br />
and we get <math>\displaystyle \pi_1 = 4/11</math><br />
<br />
Subbing <math>\displaystyle \pi_1 = 4/11</math> back into the system of equations we obtain <br /><br />
<math>\displaystyle \pi_0 = 4/11</math> and <math>\displaystyle \pi_2 = 3/11</math><br />
<br />
Therefore the limiting distribution is <math>\displaystyle \pi = (4/11, 4/11, 3/11)</math><br />
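The hand computation above can be checked numerically: the stationary <math>\pi</math> is a left eigenvector of P for eigenvalue 1. A minimal sketch in Python/NumPy (rather than the MATLAB used elsewhere in these notes):<br />

```python
import numpy as np

P = np.array([[1/2, 1/2, 0],
              [1/2, 1/4, 1/4],
              [0,   1/3, 2/3]])

# Left eigenvector of P for eigenvalue 1 (i.e. right eigenvector of P^T),
# normalized so the entries sum to 1
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()

print(pi)                       # approx [0.3636, 0.3636, 0.2727] = (4/11, 4/11, 3/11)
print(np.allclose(pi @ P, pi))  # True
```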
<br />
==Monte Carlo using Markov Chain - June 18, 2009==<br />
<br />
Consider the problem of computing <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
<math>\bullet</math> Generate <math>\displaystyle X_1</math>, <math>\displaystyle X_2</math>,... from a Markov Chain with stationary distribution <math>\displaystyle f(x)</math><br />
<br />
<math>\bullet</math> <math>\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))= I</math><br />
<br />
====''''' Metropolis Hastings Algorithm''''' ====<br />
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.<br />
<br />
Suppose we wish to sample from the distribution <math>\displaystyle f(x)</math>. Let <math>q(y\mid{x})</math> be a distribution that is easy to sample from; we call it the "Proposal Distribution".<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Initialize <math>\displaystyle X_0</math>, this is the starting point of the chain, choose it randomly and set index <math>\displaystyle i=0</math><br />
:<br\>2. <math>Y~ \sim~ q(y\mid{x})</math><br />
:<br\>3. Compute <math>\displaystyle r(X_i,Y)</math>, where <math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
:<br\>4. <math> U~ \sim~ Unif [0,1] </math><br />
:<br\>5. If <math>\displaystyle U<r </math> <br />
:::then <math>\displaystyle X_{i+1}=Y </math> <br />
:::else <math>\displaystyle X_{i+1}=X_i </math><br />
:<br\>6. Update index <math>\displaystyle i=i+1</math>, and go to step 2<br />
<br />
<br />
'''''A couple of remarks about the algorithm'''''<br />
<br />
'''Remark 1:''' A good choice for <math>q(y\mid{x})</math> is <math>\displaystyle N(x,b^2)</math> where <math>\displaystyle b>0 </math> is a constant, i.e. the proposal distribution is a normal centered at the current state <math>x</math>.<br />
<br />
'''Remark 2:''' If the proposal distribution is symmetric, <math>q(y\mid{x})=q(x\mid{y})</math>, then <math>r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}</math>. This special case is called the Metropolis Algorithm, the original algorithm of Metropolis (1953), which Hastings later generalized.<br />
<br />
'''Remark 3:''' <math>\displaystyle N(x,b^2)</math> is symmetric: the probability of setting the mean to x and sampling y is equal to the probability of setting the mean to y and sampling x.<br />
<br />
<br />
<br />
'''Example:''' The Cauchy distribution has density <math> f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}</math><br />
<br />
Let the proposal distribution be <math>q(y\mid{x})=N(x,b^2) </math><br />
<br />
<math>r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}</math><br />
::<math>=min{\{\frac{f(y)}{f(x)},1}\}</math> since <math>q(y\mid{x})</math> is symmetric <math>\Rightarrow</math> <math>\frac{q(x\mid{y})}{q(y\mid{x})}=1</math><br />
::<math>=min{\{\frac{ \frac{1}{\pi}\frac{1}{1+y^2} }{ \frac{1}{\pi} \frac{1}{1+x^2} },1}\}</math><br />
::<math>=min{\{\frac{1+x^2 }{1+y^2},1}\}</math><br />
<br />
Now, having calculated <math>\displaystyle r(x,y)</math>, we complete the problem in Matlab using the following code:<br />
b=2; % let b=2 for now, we will see what happens when b is smaller or larger<br />
X(1)=randn;<br />
for i=2:10000<br />
Y=b*randn+X(i-1); % we want to decide whether we accept this Y<br />
r=min( (1+X(i-1)^2)/(1+Y^2),1); <br />
u=rand;<br />
if u<r<br />
X(i)=Y; % accept Y<br />
else<br />
X(i)=X(i-1); % reject Y remaining in the current state<br />
end;<br />
end;<br />
<br />
'''''We need to be careful about choosing b!'''''<br />
<br />
:'''If b is too large'''<br />
<br />
::Then the fraction <math>\frac{f(y)}{f(x)}</math> would be very small <math>\Rightarrow</math> <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. <br />
<br />
::It is highly unlikely that <math>\displaystyle u<r </math>; the probability of rejecting <math>\displaystyle Y</math> is high, so the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution.<br />
<br />
::It is easy to observe this by looking at the histogram of <math>\displaystyle X</math>: the shape will not resemble the shape of the target <math>\displaystyle f(x)</math><br />
<br />
:: Most likely we reject y and the chain will get stuck.<br />
<br />
::For the above example, the following output occurs when choosing b too large (b=1000)<br />
<br />
[[File:Blarge.jpg]]<br />
<br />
:'''If b is too small'''<br />
<br />
::Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain will not have a chance to explore the state space and visit the majority of the states of the target <math>\displaystyle f(x)</math>.<br />
<br />
::For the above example, the following output occurs when choosing b too small (b=0.001)<br />
<br />
[[File:Bsmall.JPG]]<br />
<br />
:'''If b is just right'''<br />
::A well-chosen b will help avoid the issues mentioned above, and we can say that the chain is "mixing well".<br />
<br />
::For the above example, the following output occurs when choosing a good value for b (b=2)<br />
<br />
[[File:Bgood.JPG]]<br />
<br />
'''Mathematical explanation for why this algorithm works:'''<br />
<br />
We have talked about <math>\emph{discrete}</math> Markov Chains so far. <br />
<br />
<br\> We have seen that: <br\>- <math>\displaystyle \pi</math> satisfies detailed balance if <math>\displaystyle \pi_iP_{ij}=P_{ji}\pi_j</math> and <br\>- if <math>\displaystyle\pi</math> satisfies <math>\emph{detailed}</math> <math>\emph{balance}</math> then it is a stationary distribution <math>\displaystyle \pi=\pi P</math><br />
<br />
<br />
In the <math>\emph{continuous}</math> case we write the Detailed Balance as <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> and say that <br\><math>\displaystyle f(x)</math> is a <math>\emph{stationary}</math> <math>\emph{distribution}</math> if <math>f(x)=\int f(y)P(y,x)dy</math>. <br />
<br />
<br />
We want to show that if Detailed Balance holds (i.e. assume <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>) then <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
That is to show: <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)\Rightarrow </math> <math>\displaystyle f(x)</math> is stationary distribution.<br />
<br />
:<math>\int f(y)P(y,x)dy</math><br />
:::<math>=\int f(x)P(x,y)dy</math> by detailed balance<br />
:::<math>=f(x)\int P(x,y)dy</math> and since <math>\int P(x,y)dy=1</math><br />
:::<math>=\displaystyle f(x)</math>, so <math>\displaystyle f(x)</math> is indeed stationary <br />
<br />
<br />
'''''Now, we need to show that detailed balance holds in the Metropolis-Hastings...'''''<br />
<br />
Consider 2 points <math>\displaystyle x</math> and <math>\displaystyle y</math>:<br />
<br />
:'''Either''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1</math> '''OR''' <math>\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1</math> (ignoring that it might equal to 1)<br />
<br />
Without loss of generality, suppose that the product is <math>\displaystyle<1</math>. <br />
<br />
<br />
In this case <math>r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}</math> and <math>\displaystyle r(y,x)=1</math><br />
<br />
<br />
:''Some intuitive meanings before we continue:'' <br />
:<math>\displaystyle P(x,y)</math> is the probability of jumping from <math>\displaystyle x</math> to <math>\displaystyle y</math>: the proposal distribution generates <math>\displaystyle y</math> '''and''' <math>\displaystyle y</math> is accepted<br />
:<math>\displaystyle P(y,x)</math> is the probability of jumping from <math>\displaystyle y</math> to <math>\displaystyle x</math>: the proposal distribution generates <math>\displaystyle x</math> '''and''' <math>\displaystyle x</math> is accepted<br />
:<math>q(y\mid{x})</math> is the probability of generating <math>\displaystyle y</math><br />
:<math>q(x\mid{y})</math> is the probability of generating <math>\displaystyle x</math><br />
:<math>\displaystyle r(x,y)</math> probability of accepting <math>\displaystyle y</math><br />
:<math>\displaystyle r(y,x)</math> probability of accepting <math>\displaystyle x</math>.<br />
<br />
<br />
With that in mind we can show that <math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as follows:<br />
<br />
<br />
<br />
<math>P(x,y)=q(y\mid{x})\,r(x,y)=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})}</math> Cancelling out <math>\displaystyle q(y\mid{x})</math> and bringing <math>\displaystyle f(x)</math> to the other side we get<br />
<br\><math>f(x)P(x,y)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit</math><br />
<br />
<br />
<br />
<math>P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1</math> Multiplying both sides by <math>\displaystyle f(y)</math> we get<br />
<br\><math>f(y)P(y,x)=f(y)q(x\mid{y})</math> <math>\Leftarrow</math> '''equation''' <math>\clubsuit\clubsuit</math><br />
<br />
<br />
<br />
Noticing that the right hand sides of the '''equation''' <math>\clubsuit</math> and '''equation''' <math>\clubsuit\clubsuit</math> are equal we conclude that:<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math> as desired and thus showing that Metropolis-Hastings satisfies detailed balance. <br />
<br />
<br />
Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.<br />
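Detailed balance can also be verified numerically for a small discrete state space, by building the Metropolis-Hastings transition matrix explicitly and checking <math>f_i P_{ij} = f_j P_{ji}</math>. A minimal sketch in Python/NumPy (the 5-state target and the uniform proposal are illustrative assumptions, not from the lecture):<br />

```python
import numpy as np

# Target pmf on 5 states (unnormalized would also be fine for MH)
f = np.array([1., 2., 3., 2., 1.])
f = f / f.sum()

n = len(f)
q = np.full((n, n), 1.0 / n)    # proposal: pick any state uniformly (symmetric)

# Build the Metropolis-Hastings transition matrix explicitly
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            r = min(1.0, (f[j] * q[j, i]) / (f[i] * q[i, j]))
            P[i, j] = q[i, j] * r
    P[i, i] = 1.0 - P[i].sum()  # rejection mass (and self-proposals) stay at i

# Detailed balance: f_i P_ij = f_j P_ji for all pairs i, j
D = f[:, None] * P
print(np.allclose(D, D.T))      # True
print(np.allclose(f @ P, f))    # True: f is therefore stationary
```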
<br />
==Metropolis Hastings Algorithm Continued - June 25 ==<br />
<br />
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm is named after Nicholas Metropolis (1915-1999), also a co-author of the Simulated Annealing method (introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm; it is more efficient, although less generally applicable.<br />
<br />
In the last class, we showed that Metropolis-Hastings satisfies the detailed balance equations, i.e.<br />
<br />
<br\><math>\displaystyle f(x)P(x,y)=P(y,x)f(y)</math>, which means <math>\displaystyle f(x) </math> is the stationary distribution of the chain.<br />
<br />
But this is not enough, we want the chain to converge to the stationary distribution as well.<br />
<br />
Thus, we also need it to be:<br />
<br />
<b>Irreducible:</b> There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choices of <math>\emph{q}</math>, including the one that we used in the example in the previous lecture (which was normally distributed)<br />
<br />
<b>Aperiodic:</b> The chain will not oscillate between different sets of states. In the previous example, <math> q(y\mid{x}) </math> is <math> \displaystyle N(x,b^2)</math>, which will clearly not oscillate.<br />
<br />
Next we discuss a couple of variations of Metropolis Hastings<br />
<br />
====''''' Random Walk Metropolis Hastings''''' ====<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:<br\>1. Draw <math>\displaystyle Y = X_i + \epsilon</math>, where <math>\displaystyle \epsilon </math> has distribution <math>\displaystyle g </math>; <math>\epsilon = Y-X_i \sim~ g </math>; <math>\displaystyle X_i </math> is current state & <math>\displaystyle Y </math> is going to be close to <math>\displaystyle X_i </math> <br />
:<br\>2. It means <math>q(y\mid{x}) = g(y-x)</math>. (Note that <math>\displaystyle g </math> is a function of distance between the current state and the state the chain is going to travel to, i.e. it's of the form <math>\displaystyle g(|y-x|) </math>. Hence we know in this version that <math>\displaystyle q </math> is symmetric <math>\Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y})</math>)<br />
:<br\>3. <math>r=min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was <math> q(y\mid{x}) </math> <math>\sim~ N(x,b^2) </math><br />
<br />
In matlab, we defined this as <br />
<br />
<math>\displaystyle Y = b* randn + x </math> (i.e <math>\displaystyle Y = X_i + randn*b) </math><br />
<br />
In this case, we need <math>\displaystyle \epsilon \sim~ N(0,b^2) </math><br />
<br />
The hard problem is to choose b so that the chain will mix well.<br />
<br />
<b>Rule of thumb: </b> choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)<br />
<br />
<b> Example </b><br />
<br />
[[File:Figure.JPG]]<br />
<br />
<br />
<br />
If we draw <math>\displaystyle y_1 </math> then <math>{\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1</math>, accept <math>\displaystyle y_1</math> with probability 1<br />
<br />
If we draw <math>\displaystyle y_2 </math> then <math>{\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}</math>, accept <math>\displaystyle y_2</math> with probability <math>\frac{f(y_2)}{f(x)}</math><br />
<br />
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.<br />
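The rule of thumb above can be checked empirically by tracking the acceptance rate of the random walk chain for the Cauchy example. The sketch below (Python rather than MATLAB; the chain lengths are arbitrary) reuses the three values of b from the earlier plots:<br />

```python
import numpy as np

rng = np.random.default_rng(0)

def rw_metropolis_accept_rate(b, n=20000):
    """Random-walk Metropolis for the standard Cauchy; returns the acceptance rate."""
    x = 0.0
    accepted = 0
    for _ in range(n):
        y = x + b * rng.standard_normal()             # propose Y ~ N(x, b^2)
        r = min(1.0, (1 + x * x) / (1 + y * y))       # f(y)/f(x) for the Cauchy
        if rng.random() < r:
            x = y
            accepted += 1
    return accepted / n

for b in (0.001, 2.0, 1000.0):
    print(b, rw_metropolis_accept_rate(b))
# b too small: acceptance near 1 (tiny steps, poor exploration)
# b too large: acceptance near 0 (chain gets stuck)
# b around 2: acceptance in a moderate range, i.e. the chain mixes well
```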
<br />
====''''' Independence Metropolis Hastings''''' ====<br />
<br />
In this case, the proposal distribution is independent of the current state, i.e. <math>\displaystyle q(y\mid{x}) = q(y)</math><br />
<br />
We draw from a fixed distribution<br />
<br />
And define <math>r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}</math><br />
<br />
This does not work unless <math>\displaystyle q </math> is very similar to the target distribution <math>\displaystyle f </math> (it is usually used when <math>\displaystyle f </math> is known up to a proportionality constant - the form of the distribution is known, but the normalizing constant is not)<br />
<br />
Now, we pose the question: if <math>\displaystyle q(y\mid{x}) </math> does not depend on <math>\displaystyle X</math>, does it mean that the sequence generated from this chain is really independent?<br />
<br />
Answer: Even though <math> Y \sim~ q(y) </math> does not depend on <math>\displaystyle X </math>, the acceptance probability <math>\displaystyle r </math> does depend on <math>\displaystyle X </math>. So it's not really an independent sequence! <br />
<br />
:<math>x_{i+1} = \begin{cases}<br />
x_i, & \text{with probability 1-r} \\<br />
y, & \text{with probability r} \\<br />
\end{cases}</math><br />
<br />
Thus, the sequence is not really independent because acceptance probability <math>\displaystyle r </math> depends on the state <math>\displaystyle X_i </math><br />
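A minimal sketch of an independence sampler (in Python; the standard normal target, known only up to a constant, and the fixed N(0, 2^2) proposal are illustrative assumptions): even though Y is drawn from a fixed q(y), the acceptance ratio still involves the current state x.<br />

```python
import numpy as np

rng = np.random.default_rng(1)

# Target known only up to a constant: f(x) proportional to exp(-x^2/2)
f = lambda x: np.exp(-0.5 * x * x)
# Fixed proposal q(y) = N(0, 2^2) density, independent of the current state
# (normalizing constants cancel in the ratio, so they are omitted)
q = lambda y: np.exp(-0.5 * (y / 2.0) ** 2)

x = 0.0
samples = []
for _ in range(50000):
    y = 2.0 * rng.standard_normal()                # draw from the fixed proposal
    r = min(1.0, f(y) * q(x) / (f(x) * q(y)))      # r still depends on the current x
    if rng.random() < r:
        x = y
    samples.append(x)

s = np.array(samples[5000:])   # drop an initial stretch of the chain
print(s.mean(), s.var())       # close to 0 and 1, the target's mean and variance
```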
<br />
====''''' Simulated Annealing ''''' ====<br />
<br />
This is essentially a method for optimization and an application of Metropolis Hastings.<br />
<br />
Consider the problem of <math>\displaystyle \min_{x}(h(x)) </math>, i.e. we need to find x that minimizes <math>\displaystyle h(x) </math>. But this is the same problem as <math>\displaystyle \max_{x}(e^{\frac{-h(x)}{T}})</math> for some constant <math>T > 0</math> (since the exponential function is monotone)<br />
<br />
We then consider some distribution function <math>\displaystyle f</math> such that <math>\displaystyle f \propto e^{\frac{-h(x)}{T}}</math>, where T is called the temperature, and define the following procedure:<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:#Set T to be a large number <br />
:#Start with some random <math>\displaystyle X_0,</math> <math>\displaystyle i = 0</math><br />
:#<math> Y \sim~ q(y\mid{x}) </math> (note that <math>\displaystyle q </math> is usually chosen to be symmetric)<br />
:#<math> U \sim~ Unif[0,1] </math><br />
:#Define <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math> (when <math>\displaystyle q </math> is symmetric)<br />
:#<math>X_{i+1} = \begin{cases}<br />
Y, & \text{with probability r} \\<br />
X_i & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:#Decrease T, keep the current state, and go to Step 3<br />
<br />
Now, we know that <math>r = min{\{\frac{f(y)}{f(x)},1}\}</math><br />
<br />
Consider <math> \frac{f(y)}{f(x)}<br />
<br />
= \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}}<br />
= e^{\frac{h(x)-h(y)}{T}} </math><br />
<br />
<b>Now, suppose T is large,</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 </math> and we therefore accept <math>\displaystyle y </math> with probability <math>\displaystyle <1 </math><br />
<br />
<b>On the other hand, suppose T is small (<math> T \rightarrow 0 </math>),</b><br />
<br />
<math>\rightarrow </math> if <math> h(y)<h(x) \Rightarrow r = 1 </math> and we therefore accept <math>\displaystyle y </math> with probability 1<br />
<br />
<math>\rightarrow </math> if <math> h(y)>h(x) \Rightarrow r = 0 </math> and we therefore reject <math>\displaystyle y </math><br />
<br />
<br />
<br />
<b>Example 1</b><br />
<br />
Consider the problem of minimizing the function <math>\displaystyle f(x) = -2x^3 - x^2 + 40x + 3 </math><br />
<br />
We can plot this function and observe that it has a local minimum near <math>\displaystyle x = -3 </math><br />
<br />
[[File:ezplotf0.jpg]]<br />
<br />
We then plot the graphs of <math>\displaystyle e^{-f(x)/T}</math> for <math>\displaystyle T = 100, 0.1</math> and observe that the distribution expands for a large T and contracts for a small T - i.e. T plays the role of variance, making the distribution expand and contract accordingly.<br />
<br />
[[File:ezplotf1.jpg]]<br />
<br />
[[File:ezplotf2.jpg]]<br />
<br />
At the end, we get T to be pretty small, our distribution that we're sampling from becomes sharper, and the points that we sample are close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).<br />
<br />
====''''' Example 2 (from June 30th lecture) ''''' ====<br />
<br />
Suppose we want to minimize the function <math>\displaystyle f(x) = (x - 2)^2 </math><br />
<br />
<br />
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use <math>\displaystyle Y \sim~ N(x, b^2)</math> and we begin with <math>\displaystyle T = 10</math><br />
<br />
Then the problem may be solved in MATLAB using the following:<br />
function v = obj(x)<br />
v = (x - 2).^2;<br />
<br />
T = 10; %this is the initial value of T, which we must gradually decrease<br />
b = 2;<br />
X(1) = 0;<br />
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)<br />
Y = b*randn + X(i-1); <br />
r = min(1 , exp(-obj(Y)/T)/exp(-obj(X(i-1))/T) );<br />
U = rand;<br />
if U < r<br />
X(i) = Y; %accept Y<br />
else<br />
X(i) = X(i-1); %reject Y<br />
end;<br />
end;<br />
<br />
The first run (with <math>\displaystyle T = 10 </math>) gives us <math>\displaystyle X = 1.2792 </math><br />
<br />
Next, if we let <math>\displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001}</math> in the order displayed, then we get the following graph when we plot X:<br />
<br />
[[File:SA_Example2.jpg]]<br />
<br />
<br />
<br />
i.e. it converges to the minimum of the function<br />
<br />
<b>Travelling Salesman Problem</b><br />
<br />
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).<br />
<br />
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.<ref><br />
Applegate, D.L., Bixby, R.E., Chvátal, V., Cook, W.J., ''The Travelling Salesman Problem: A Computational Study'' Copyright 2007 Princeton University Press<br />
</ref><br />
<br />
<br /><br />
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled ''Introduction to TSP - Infrastructure for the Travelling Salesman Problem''. [http://cran.r-project.org/web/packages/TSP/vignettes/TSP.pdf]<br />
The examples are particularly useful because they are implemented using R (a statistical computing software environment).<br />
<br />
<br /><br />
<br />
==Gibbs Sampling - June 30, 2009==<br />
<br />
This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).<ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref> In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies detailed balance equation, similar to Metropolis-Hastings<br />
:<math><br />
\,f(x) p_{xy} = f(y) p_{yx}<br />
</math><br />
<br />
This implies that the chain is reversible. The procedure for proving this balance equation is similar to the Metropolis-Hastings proof.<br />
<br />
<br /><br />
<b>Advantages</b><br />
<br />
*The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.<br />
*It is useful for high-dimensional distributions.<br />
<br />
<br /><br />
<b>Disadvantages</b><ref><br />
Gentle, James E. ''Elements of Computational Statistics'' Copyright 2002 Springer Science +Business Media, LLC <br />
</ref><br />
<br />
*We rarely know how to sample from the conditional distributions.<br />
*The algorithm can be extremely slow to converge.<br />
*It is often difficult to know when convergence has occurred.<br />
*The method is not practical when the correlations between the random variables are strong.<br />
<br />
<b> Example: </b> Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. <math>\displaystyle f(x_1, x_2, ... , x_n) </math><br />
<br />
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing<br />
<br />
<math>\displaystyle x_1 \sim~ f(x_1|x_2, x_3, \dots , x_n)</math><br />
<br />
<math>\displaystyle x_2 \sim~ f(x_2|x_1, x_3, \dots , x_n)</math><br />
<br />
<math>\vdots</math><br />
<br />
<math>\displaystyle x_n \sim~ f(x_n|x_1, x_2, \dots , x_{n-1})</math><br />
<br />
and the resulting set of points drawn <math>\displaystyle (x_1, x_2, \ldots, x_n) </math> follows the required multivariate distribution.<br />
<br />
<br /><br />
Suppose we want to sample from a bivariate distribution <math>\displaystyle f(x,y) </math> with initial point <math>\displaystyle(x_i, y_i) = (0,0) </math>, i = 0 <br /><br />
Furthermore, suppose that we know how to sample from the conditional distributions <math>\displaystyle f_{X|Y}(x|y)</math> and <math>\displaystyle f_{Y|X}(y|x)</math><br />
<br />
<math>\emph{Procedure:}</math><br />
<br />
# <math>\displaystyle Y_{i+1} \sim~ f_{Y_i|X_i}(y|x) </math> (i.e. given the previous point, sample a new point)<br />
# <math>\displaystyle X_{i+1} \sim~ f_{X_{i}|Y_{i+1}}(x|y)</math> (note: it must be <math>\displaystyle Y_{i+1}</math> not <math>Y_{i}</math>, otherwise detailed balance may not hold)<br />
# Repeat Steps 1 and 2<br />
<br />
<b>Note</b> This method usually requires a long time before convergence, known as the "burn-in time". For this reason the distribution is sampled better using only some of the last <math>\displaystyle X_i </math> rather than all of them.<br />
<br />
<b>Example</b><br />
Suppose we want to generate samples from a bivariate normal distribution where <math>\displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right]</math> and <math>\Sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]</math><br />
<br />
<br /><br />
Note that for a bivariate normal distribution it may be shown that the conditional distributions are normal. So, <math>\displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2)</math> and <math>\displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)</math><br />
<br />
The problem (for a specified value <math>\displaystyle \rho</math>) may be solved in MATLAB using the following:<br />
Y = [1 ; 2];<br />
rho = 0.01;<br />
sigma = sqrt(1 - rho^2);<br />
X(1,:) = [0 0];<br />
<br />
for i = 2:5000<br />
mu = Y(1) + rho*(X(i-1,2) - Y(2));<br />
X(i,1) = mu + sigma*randn;<br />
mu = Y(2) + rho*(X(i,1) - Y(1)); % use the updated X(i,1), not X(i-1,1) (see the note in the procedure above)<br />
X(i,2) = mu + sigma*randn;<br />
end;<br />
%plot(X(:,1),X(:,2),'.') plots all of the points<br />
%plot(X(1000:end,1),X(1000:end,2),'.') plots the last 4000 points -> <br />
% this demonstrates that convergence occurs after a while<br />
% (this is called the burn-in time)<br />
<br />
The output of plotting all points is:<br />
<br />
[[File:Gibbs_Sampling.jpg]]<br />
<br />
==Metropolis-Hastings within Gibbs Sampling - July 2==<br />
<br />
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.<br />
<br />
*Choose <math>\displaystyle q </math> as a proposal distribution for X (assuming Y fixed).<br />
*Choose <math>\displaystyle \tilde{q} </math> as a proposal distribution for Y (assuming X fixed).<br />
*Do a Metropolis-Hastings step for X, treating Y as fixed.<br />
*Do a Metropolis-Hastings step for Y, treating X as fixed.<br />
<br />
:'''<math>\emph{Procedure:}</math> '''<br />
<br />
:# Start with some random variables <math>\displaystyle X_0, Y_0, n = 0</math><br />
:# Draw <math>Z~ \sim~ q(Z\mid{X_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)} \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \} </math><br />
:# <math>X_{n+1} = \begin{cases}<br />
Z, & \text{with probability r}\\ <br />
X_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Draw <math>Z~ \sim~ \tilde{q}(Z\mid{Y_n})</math><br />
:# Set <math>r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)} \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}</math><br />
:# <math>Y_{n+1} = \begin{cases}<br />
Z, & \text{with probability r} \\<br />
Y_n, & \text{with probability 1-r}\\<br />
\end{cases}</math><br />
:# Set <math>\displaystyle n = n + 1 </math>, return to step 2 and repeat the same procedure<br />
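The procedure above can be sketched as follows (in Python; the bivariate normal target with mean (1, 2) and the symmetric random walk proposals, for which the q-ratio cancels, are illustrative assumptions):<br />

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.5

def log_f(x, y):
    """Unnormalized log density of a bivariate normal, mean (1, 2), correlation rho."""
    u, v = x - 1.0, y - 2.0
    return -(u * u - 2 * rho * u * v + v * v) / (2 * (1 - rho ** 2))

x, y = 0.0, 0.0
X = np.empty((20000, 2))
for i in range(20000):
    # MH step for x, treating y as fixed (symmetric proposal, so the q-ratio is 1)
    z = x + rng.standard_normal()
    if np.log(rng.random()) < log_f(z, y) - log_f(x, y):
        x = z
    # MH step for y, treating the (possibly updated) x as fixed
    z = y + rng.standard_normal()
    if np.log(rng.random()) < log_f(x, z) - log_f(x, y):
        y = z
    X[i] = (x, y)

print(X[2000:].mean(axis=0))   # close to (1, 2) after discarding burn-in
```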
<br />
==Page Ranking and Review of Linear Algebra - July 7==<br />
<br />
===Page Ranking===<br />
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.<br />
<br />
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page. <br />
<br />
We rank pages based on the weighted number of links to that particular page. A web page is important if many pages point to it.<br />
<br />
====Factors relating to importance of links====<br />
1) Importance (rank) of linking web page (higher importance is better).<br />
<br />
2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).<br />
<br />
====Definitions====<br />
<math>L_{i,j} = \begin{cases}<br />
1 , & \text{if j links to i}\\<br />
0 , & \text{else}\\ \end{cases}</math><br />
<br />
<br />
<math>c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j} </math><br />
<br />
<math>P_{i} = (1-d)\times 1 + d \times \sum_{j=1}^N \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i, where } 0 \leq d \leq 1 </math> <br />
<br />
Under this formula, <math>\displaystyle P_i</math> is never zero. We weight the sum and the constant using <math>\displaystyle d </math>, a damping coefficient between 0 and 1 that balances the two terms.<br />
<br />
<br />
'''In Matrix Form'''<br />
<br />
<br />
<math>\displaystyle P = (1-d)\times e + d \times L \times D^{-1} \times P </math><br />
<br />
<br />
where <br />
<math>P=\left(\begin{matrix}P_{1}\\<br />
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right)</math><br />
<math>e=\left(\begin{matrix} 1\\<br />
1\\ \vdots \\1 \end{matrix}\right)</math><br />
<br />
are both <math>\displaystyle N \times 1</math> vectors <br />
<br />
<math>L=\left(\begin{matrix}L_{1,1}&L_{1,2}&\dots&L_{1,N}\\<br />
L_{2,1}&L_{2,2}&\dots&L_{2,N}\\<br />
\vdots&\vdots&\ddots&\vdots\\<br />
L_{N,1}&L_{N,2}&\dots&L_{N,N}<br />
\end{matrix}\right)</math><br />
<br />
<math>D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\<br />
0 & c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots&\\<br />
0&0&\dots&c_{N} \end{matrix}\right)</math><br />
<br />
<math>D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\<br />
0 & 1/c_{2}&\dots&0\\<br />
\vdots&\vdots&\ddots&\vdots&\\<br />
0&0&\dots&1/c_{N} \end{matrix}\right)</math><br />
<br />
====Solving for P====<br />
Since rank is a relative term, if we make an assumption that <br />
<br />
<math>\sum_{i=1}^N P_i = 1</math> <br />
<br />
then we can solve for P (in matrix form this constraint is <math>\displaystyle e^T \times P = 1</math>).<br />
<br />
<math>\displaystyle P = (1-d)\times e \times 1 + d \times L \times D^{-1} \times P </math><br />
<br />
<math>\displaystyle P = (1-d)\times e \times e^T \times P + d \times L \times D^{-1} \times P \text{ by replacing 1 with } e^T \times P </math> <br />
<br />
<math>\displaystyle P = [(1-d) \times e \times e^T + d \times L \times D^{-1}] \times P \text{ by factoring out the } P </math><br />
<br />
<math>\displaystyle P = A \times P \text{ by defining A (notice that everything in A is known )} </math><br />
<br />
<br />
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.<br />
<br />
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.<br />
<br />
<br />
To solve for P, we start with a random guess <math>\displaystyle P_0</math> and repeatedly apply<br />
<br />
<math>\displaystyle P_i \leftarrow A \times P_{i-1} </math><br />
<br />
Since this iteration converges to the stationary distribution, for large <math>\displaystyle n</math>, <math>\displaystyle P_n \approx P</math>.<br />
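The power iteration above can be sketched in NumPy. The 4-page link matrix below is hypothetical, and the iterate is renormalized at every step to enforce the constraint <math>\sum_i P_i = 1</math>; this is a sketch of the formulation given here, not Google's actual implementation.<br />

```python
import numpy as np

# Hypothetical 4-page web: L[i, j] = 1 if page j links to page i.
L = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

d = 0.85                              # damping factor, 0 <= d <= 1
N = L.shape[0]
c = L.sum(axis=0)                     # c_j = number of outgoing links from page j
A = (1 - d) * np.ones((N, N)) + d * L @ np.diag(1.0 / c)

# Power iteration: start from a uniform guess and repeatedly apply P <- A P,
# renormalizing so that the entries of P sum to 1.
P = np.ones(N) / N
for _ in range(100):
    P = A @ P
    P = P / P.sum()
```

At convergence, P is the dominant eigenvector of A, normalized to sum to 1.<br />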
<br />
===Linear Algebra Review===<br />
<br />
<br />
<b>Inner Product</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Note that the inner product is also referred to as the dot product.<br />
If <math> \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{ and } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] </math> then the inner product is :<br />
<br />
<math> \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n</math><br />
<br />
<br />
The <b>length (or norm)</b> of <math>\displaystyle \vec{v} </math> is the non-negative scalar <math>\displaystyle||\vec{v}||</math> defined by<br />
<math>\displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} </math><br />
<br />
<br />
For <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> in <math>\mathbf{R}^n</math> , the <b>distance between <math>\displaystyle \vec{u} </math> and <math>\displaystyle \vec{v} </math> </b>written as <math> \displaystyle dist(\vec{u},\vec{v}) </math>, is the length of the vector <math> \vec{u} - \vec{v}</math>. That is,<br />
<math> \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||</math><br />
<br />
<br />
If <math> \vec{u} </math> and <math> \vec{v} </math> are non-zero vectors in <math>\mathbf{R}^2</math> or <math>\mathbf{R}^3</math>, then the angle <math>\theta</math> between <math> \vec{u} </math> and <math> \vec{v} </math> satisfies <math>\vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ \cos\theta</math><br />
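The inner product, norm, distance, and angle formulas above can be checked with a small NumPy sketch (the vectors are arbitrary examples):<br />

```python
import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 0.0, 0.0])

inner = u @ v                     # u . v = u^T v
norm_u = np.sqrt(u @ u)           # ||u|| = sqrt(u . u)
dist = np.linalg.norm(u - v)      # dist(u, v) = ||u - v||
cos_theta = inner / (np.linalg.norm(u) * np.linalg.norm(v))
```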
<br />
<br />
<br />
<b>Orthogonal and Orthonormal</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthogonal</b> (to each other) if <math>\vec{u} \cdot \vec{v} = 0</math><br />
<br />
By the Pythagorean Theorem, it may also be said that two vectors <math> \vec{u} </math> and <math> \vec{v} </math> are orthogonal if and only if <math> ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2 </math><br />
<br />
Two vectors <math> \vec{u} </math> and <math> \vec{v} </math> in <math>\mathbf{R}^n</math> are <b>orthonormal</b> if <math>\vec{u} \cdot \vec{v} = 0</math> and <math>||\vec{u}||=||\vec{v}||=1</math><br />
<br />
<br />
An <b>orthonormal matrix <math>\displaystyle U</math></b> is a ''square invertible'' matrix, such that <math>\displaystyle U^{-1} = U^T</math> or alternatively <math>\displaystyle U^T \ U = U \ U^T = I</math><br />
<br />
Note that an orthonormal matrix, as defined here, is usually called an ''orthogonal'' matrix in the linear algebra literature.<br />
<br />
<br />
<br />
<b>Dependence and Independence</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly independent</b> if the vector equation <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math> has only the trivial solution (i.e. <math>\displaystyle a_k = 0 \ \forall k </math> ).<br />
<br />
<br />
The set of vectors <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is said to be <b>linearly dependent</b> if there exists a set of coefficients <math> \{ a_1, \dots , a_p \} </math> (not all zero), such that <math>\displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}</math>.<br />
<br />
<br />
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent. <br />
<br />
That is, any vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> is linearly dependent if p > n.<br />
<br />
<br />
If a vector set <math> \{ \vec{v_1}, \dots , \vec{v_p} \} </math> in <math>\mathbf{R}^n</math> contains the zero vector, then the set is linearly dependent.<br />
<br />
<br />
<br />
<b>Trace and Rank</b><ref><br />
Lay, David, ''Linear Algebra and its Applications'', Copyright 2006, Pearson Education Inc., Boston, MA, USA.<br />
</ref><br />
<br />
The <b>trace</b> of a ''square matrix'' <math>\displaystyle A_{nxn} </math>, denoted by <math>\displaystyle tr(A)</math>, is the sum of the diagonal entries in <math>\displaystyle A </math>. That is, <math>\displaystyle tr(A) = \sum_{i = 1}^n a_{ii}</math><br />
<br />
Note that an alternate definition for the trace is:<br />
<br />
<math>\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{i}</math><br />
<br />
i.e. it is the sum of all the eigenvalues of the matrix<br />
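This identity is easy to verify numerically; a small NumPy sketch with an arbitrary example matrix:<br />

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

diag_sum = np.trace(A)                  # sum of the diagonal entries
eig_sum = np.linalg.eigvals(A).sum()    # sum of the eigenvalues
```

For this (upper-triangular) example the eigenvalues are just the diagonal entries, so both sums equal 5.<br />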
<br />
The <b>rank</b> of a matrix <math>\displaystyle A </math>, denoted by <math>\displaystyle rank(A) </math>, is the dimension of the column space of A. That is, the rank of a matrix is number of linearly independent rows (or columns) of A.<br />
<br />
<br />
A ''square matrix'' is <b>non-singular</b> if and only if its <b>rank</b> equals the number of rows (or columns); equivalently, if and only if it is invertible (i.e. its determinant is non-zero).<br />
A matrix that is not invertible is called a <b>singular matrix</b>.<br />
<br />
A square matrix is said to be ''orthogonal'' if <math> AA^T=A^TA=I</math>.<br />
<br />
For a square matrix A,<br />
*if <math> x^TAx > 0 </math> for all <math>x \neq 0</math>, then A is said to be ''positive-definite''.<br />
*if <math> x^TAx \geq 0</math> for all <math>x \neq 0</math>, then A is said to be ''positive-semidefinite''.<br />
<br />
The ''inverse'' of a square matrix A is denoted by <math>A^{-1}</math> and is such that <math>AA^{-1}=A^{-1}A=I</math>. The inverse of a matrix A exists if and only if A is non-singular.<br />
<br />
The ''pseudo-inverse'' matrix <math>A^{\dagger}</math> is typically used when <math>A^{-1}</math> does not exist because A is not square. If A has full column rank (so that <math>A^TA</math> is invertible), it is given by <math>A^{\dagger} = (A^TA)^{-1}A^T</math>, which satisfies <math>A^{\dagger}A = I</math>.<br />
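The formula <math>(A^TA)^{-1}A^T</math> can be checked against NumPy's built-in pseudo-inverse; the tall example matrix below is arbitrary but has full column rank, as the formula requires:<br />

```python
import numpy as np

# Tall matrix with full column rank, so A^T A is invertible.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

A_dag = np.linalg.inv(A.T @ A) @ A.T   # (A^T A)^{-1} A^T
```

`np.linalg.pinv(A)` computes the same matrix via the SVD.<br />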
<br />
<br />
<b>Vector Spaces</b><br />
<br />
The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.<br />
<br />
A set of vectors <math>\{u_1, u_2, u_3, ... u_n\}</math> is said to form a ''basis'' for a vector space if any arbitrary vector x can be represented by a linear combination of the <math>\{u_i\}</math>:<br />
<math>x = a_1u_1 + a_2u_2 + ... + a_nu_n</math><br />
*The coefficients <math>\{a_1, a_2, ... a_n\}</math> are called the ''components'' of vector x with the basis <math>\{u_i\}</math>.<br />
*In order to form a basis, it is necessary and sufficient that the <math>\{u_i\}</math> vectors be linearly independent.<br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthogonal'' if <br />
<math>u^T_i u_j\begin{cases}<br />
\neq 0, & \text{ if }i=j\\<br />
= 0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
A basis <math>\{u_i\}</math> is said to be ''orthonormal'' if<br />
<math>u^T_i u_j = \begin{cases}<br />
1, & \text{ if }i=j\\<br />
0, & \text{ if } i\neq j\\<br />
\end{cases}</math><br />
<br />
<br />
<b>Eigenvectors and Eigenvalues</b><br />
<br />
Given a matrix <math>A_{N \times N}</math>, we say that <math>v \neq 0</math> is an ''eigenvector'' of A if there exists a scalar <math>\lambda</math> (the corresponding ''eigenvalue'') such that <math>Av = \lambda v</math>.<br />
<br />
Computation of eigenvalues<br />
<math>Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}<br />
v = 0, & \text{trivial solution}\\<br />
|A-\lambda I| = 0, & \text{non-trivial solution}\\<br />
\end{cases}</math><br />
<math>|A-\lambda I| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_N = 0 \leftarrow</math> Characteristic Equation<br />
<br />
Properties<br />
*If A is non-singular all eigenvalues are non-zero.<br />
*If A is real and symmetric, all eigenvalues are real and the associated eigenvectors are orthogonal.<br />
*If A is positive-definite all eigenvalues are positive<br />
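These properties can be verified numerically; a NumPy sketch on an arbitrary real, symmetric, positive-definite matrix:<br />

```python
import numpy as np

S = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # real, symmetric, positive-definite

# eigh is specialized for symmetric matrices: it returns real eigenvalues
# (ascending) and orthonormal eigenvectors as the columns of V.
lam, V = np.linalg.eigh(S)
v1, v2 = V[:, 0], V[:, 1]
```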
<br />
<br />
<b>Linear Transformations</b><br />
<br />
A ''linear transformation'' is a mapping from a vector space <math>X^N</math> onto a vector space <math>Y^M</math>, and it is represented by a matrix<br />
*Given vector <math>x \in X^N</math>, the corresponding vector y on <math>Y^M</math> is computed as <math> y = Ax</math>.<br />
*The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).<br />
<br />
A linear transformation represented by a square matrix A is said to be ''orthonormal'' when <math>AA^T=A^TA=I</math><br />
*implies that <math>A^T=A^{-1}</math><br />
*An orthonormal transformation has the property of preserving the magnitude of the vectors:<br />
<math>|y| = \sqrt{y^Ty} = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|</math><br />
*An orthonormal matrix can be thought of as a rotation of the reference frame<br />
*The ''row vectors'' of an orthonormal transformation form a set of orthonormal basis vectors.<br />
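The magnitude-preserving property can be illustrated with a rotation matrix, the canonical orthonormal transformation (the angle below is arbitrary):<br />

```python
import numpy as np

theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation: A A^T = A^T A = I

x = np.array([3.0, 4.0])
y = A @ x           # rotated vector; its length should equal |x| = 5
```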
<br />
<br />
<b>Interpretation of Eigenvalues and Eigenvectors</b><br />
<br />
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.<br />
*When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.<br />
<br />
Given the covariance matrix <math>\Sigma</math> of a Gaussian distribution<br />
*The eigenvectors of <math>\Sigma</math> are the principal directions of the distribution<br />
*The eigenvalues are the variances of the corresponding principal directions<br />
<br />
The linear transformation defined by the eigenvectors of <math>\Sigma</math> leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in [http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis]).<br />
*If the distribution is Gaussian, then the transformed vectors will be statistically independent.<br />
<br />
==Principal Component Analysis - July 9==<br />
[http://en.wikipedia.org/wiki/Principal_component_analysis Principal Component Analysis] (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.<br />
<br />
===Rough definition===<br />
<br />
Keeping two important aspects of data analysis in mind:<br />
* Reducing covariance in data<br />
* Preserving information stored in data (variance is a source of information)<br />
<br /><br />
PCA takes a sample of ''d''-dimensional vectors and produces an orthogonal (zero covariance) set of ''d'' 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second Principal Component is the direction of second greatest variance (orthogonal to the first component), and so on.<br />
<br />
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first ''k'' Principal Components and approximating the data in ''k''-dimensional space, which is easier to analyze and plot.<br />
<br />
===Principal Components of handwritten digits===<br />
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset. <br />
<br />
[[File:threes_dataset.png]]<br />
<br />
We can represent each image as a vector of length 644 (<math>644 = 23 \times 28</math>) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).<br />
<br />
[[File:matrix_decomp_PCA.png]]<br />
<br />
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call <math>V \in M_{644,2}</math> and <math>W \in M_{2,103}</math>. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: <math> \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 </math>, where <math>\lambda = [\lambda_1, \lambda_2]^T</math> is a column of W.<br />
<br />
[[File:linear_comb_PCA.png]]<br />
<br />
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.<br />
<br />
[[File:PCA_plot.png]]<br />
<br />
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.<br />
<br />
===More examples===<br />
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction Nonlinear dimensionality reduction]).<br />
*Handwritten digits.<br />
*Recognition of hand orientation. (Isomap??)<br />
*Recognition of facial expressions. (LLE - Locally Linear Embedding?)<br />
*Arranging words based on semantic meaning.<br />
*Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)<br />
<br />
===Derivation of the first Principal Component===<br />
We want to find the direction of maximum variation. Let <math>\boldsymbol{w}</math> be an arbitrary direction, <math>\boldsymbol{x}</math> a data point and <math>\displaystyle u</math> the length of the projection of <math>\boldsymbol{x}</math> in direction <math>\boldsymbol{w}</math>.<br />
<br />
<math>\begin{align}<br />
\textbf{w} &= [w_1, \ldots, w_D]^T \\<br />
\textbf{x} &= [x_1, \ldots, x_D]^T \\<br />
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
The direction <math>\textbf{w}</math> is the same as <math>c\textbf{w}</math> so without loss of generality,<br><br />
<math><br />
\begin{align}<br />
|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\<br />
u &= \textbf{w}^T \textbf{x}<br />
\end{align}<br />
</math><br />
<br /><br /><br />
Let <math>x_1, \ldots, x_D</math> be random variables, then our goal is to maximize the variance of <math>u</math>,<br />
<br />
<math><br />
\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}, <br />
</math><br />
<br /><br /><br />
For a finite data set we replace the covariance matrix <math>\Sigma</math> by <math>s</math>, the sample covariance matrix, <br />
<br />
<math>\textrm{var}(u) = \textbf{w}^T s\textbf{w} </math><br />
<br /><br /><br />
is the variance of any vector <math>\displaystyle u </math>, formed by the weight vector <math>\displaystyle w </math><br />
<br />
The first principal component is the vector that maximizes the variance<br />
<br />
<math><br />
\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
<br />
where [http://en.wikipedia.org/wiki/Arg_max arg max] denotes the value of <math>w</math> that maximizes the function.<br />
<br />
Our goal is to find the weights <math>\displaystyle w </math> that maximize this variability, subject to a constraint. The problem then becomes,<br />
<br />
<math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
such that<br />
<math>\textbf{w}^T \textbf{w} = 1</math><br />
<br /><br /><br />
Notice,<br /><br />
<math><br />
\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \| = \| s \|<br />
</math><br />
<br /><br /><br />
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.<br />
<br />
===Principal Component Analysis Continued - July 14===<br />
From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: <math><br />
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right) <br />
</math><br />
with constraint<br />
<math>\textbf{w}^T \textbf{w} = 1</math>.<br />
<br />
Before we can proceed, we must review Lagrange Multipliers.<br />
<br />
====Lagrange Multiplier====<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
To find the maximum (or minimum) of a function <math>\displaystyle f(x,y)</math> subject to constraints <math>\displaystyle g(x,y) = 0 </math>, we define a new variable <math>\displaystyle \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle (x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example====<br />
Suppose we wish to maximise the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method on this example; the lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system, we obtain two stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we substitute each into <math>\displaystyle f(x,y)</math> and see which one has the larger value. In this case the maximum is <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
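This conclusion can be sanity-checked numerically, by evaluating <math>\displaystyle f</math> at many points on the constraint circle and comparing against the two stationary points:<br />

```python
import numpy as np

# f(x, y) = x - y on the constraint circle x^2 + y^2 = 1
f = lambda x, y: x - y

# stationary points found with the Lagrange multiplier method
p_max = (np.sqrt(2) / 2, -np.sqrt(2) / 2)
p_min = (-np.sqrt(2) / 2, np.sqrt(2) / 2)

# sample the constraint set densely via (cos t, sin t)
t = np.linspace(0.0, 2.0 * np.pi, 10001)
vals = f(np.cos(t), np.sin(t))
```

The maximum value is <math>f(\sqrt{2}/2, -\sqrt{2}/2) = \sqrt{2}</math>, which no sampled point on the circle exceeds.<br />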
<br />
====Determining '''W''' ====<br />
Back to the original problem, from the Lagrangian we obtain,<br />
<br /><br /><br />
<math>\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)</math><br />
<br /><br /><br />
<br />
If <math> \textbf{w} </math> is a unit vector, then <math> \textbf{w}^T \textbf{w} = 1 </math> and the second term of the Lagrangian is 0. <br />
<br />
If <math> \textbf{w} </math> is not a unit vector, the second term penalizes the objective, decreasing the overall <math>\displaystyle L(\textbf{w},\lambda)</math>. Maximization therefore happens when <math> \textbf{w}^T \textbf{w} =1 </math><br />
<br />
<br />
(Note that to take the derivative with respect to '''w''' below, <math> \textbf{w}^T S \textbf{w} </math> can be thought of as a quadratic function of '''w''', hence the '''2sw''' below. For more matrix derivatives, see section 2 of the [http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf Matrix Cookbook])<br />
<br />
Taking the derivative with respect to '''w''', we get:<br />
<br><br /><br />
<math>\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w} </math><br />
<br><br /><br />
Set <math> \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 </math>, we get<br />
<br><br /><br />
<math>\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^* </math><br />
<br><br /><br />
From the eigenvalue equation <math>\, \textbf{w}^*</math> is an eigenvector of '''S''' and <math>\, \lambda^*</math> is the corresponding eigenvalue of '''S'''. If we substitute <math>\displaystyle\textbf{w}^*</math> in <math>\displaystyle \textbf{w}^T S\textbf{w}</math> we obtain, <br /><br /><br />
<math>\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^* </math><br />
<br><br /><br />
In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, '''u<sub>1</sub>''', to have the maximum variance<br /> (i.e. capturing as much variability in <math>\displaystyle x_1, x_2,...,x_D </math> as possible.) Subsequent principal components will take up successively smaller parts of the total variability.<br />
<br />
<br />
D dimensional data will have D eigenvectors<br />
<br />
<math>\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D </math> where each <math>\, \lambda_i</math> represents the amount of variation in direction <math>\, i </math><br />
<br />
so that <br />
<br />
<math>Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)</math><br />
<br />
<br />
Note that the Principal Components decompose the total variance in the data:<br />
<br /><br /><br />
<math>\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)</math><br />
<br /><br /><br />
i.e. the sum of variations in all directions is the variation in the whole data<br />
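This decomposition is straightforward to verify: the eigenvalues of the sample covariance matrix sum to its trace, which is the total variance. A NumPy sketch on synthetic data (the scaling is arbitrary):<br />

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples in R^3, with different variances along each axis
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.5])

S = np.cov(X, rowvar=False)                  # sample covariance matrix (3 x 3)
lam = np.sort(np.linalg.eigvalsh(S))[::-1]   # variances of the PCs, descending
total_variance = np.trace(S)                 # = sum of Var(x_i)
```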
<br /><br /><br />
<b> Example from class </b><br />
<br />
We apply PCA to the noisy data, under the assumption that the intrinsic dimensionality of the data is 10. We then compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images.<br />
<br />
The Matlab code is as follows:<br />
<br />
load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')<br />
who<br />
size(X)<br />
imagesc(reshape(X(:,1),20,28)')<br />
colormap gray<br />
imagesc(reshape(X(:,1),20,28)')<br />
[u s v] = svd(X);<br />
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components<br />
figure<br />
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.<br />
colormap gray<br />
<br />
Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face<br />
<br />
The second one is the denoised image<br />
<br />
The noisy face:<br />
<br />
[[File:face1.jpg]]<br />
<br />
The de-noised face:<br />
<br />
[[File:face2.jpg]]<br />
<br />
As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.<br />
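The same truncated-SVD denoising can be sketched in NumPy. Since the `noisy.mat` file is not available here, the sketch uses synthetic data: a hypothetical rank-10 "image" matrix plus Gaussian noise, with the dimensions chosen arbitrarily.<br />

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: rank-10 signal (560 "pixels" x 200 "images") plus noise.
signal = rng.normal(size=(560, 10)) @ rng.normal(size=(10, 200))
X = signal + 0.1 * rng.normal(size=(560, 200))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 10
# Keep only the top k singular components, as in the Matlab snippet above.
X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Because most of the noise lives in the discarded components, `X_hat` is closer to the underlying signal than the noisy `X` is.<br />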
<br />
===Principle Component Analysis (continued) - July 16 ===<br />
<br />
<br />
====Application of PCA - Feature Abstraction ====<br />
One of the applications of PCA is to group similar data (images etc.). There are generally two approaches: we can classify the data (i.e. give each data point a label and compare different types of data) or cluster it (i.e. leave the data unlabeled and let the algorithm group similar points).<br />
<br />
Generally speaking, we could do this with the entire data set (for an 8 × 8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data. <br />
<br />
=====Example: Comparing Images of 2s and 3s=====<br />
To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.<br />
<br />
The matlab code is as follows.<br />
load 2_3 %the size of this file is 64 X 400<br />
[coefs, scores] = princomp(X') <br />
% performs principal components analysis on the data matrix X'<br />
% returns the principal component coefficients and scores<br />
% scores is the low-dimensional representation of the data X<br />
plot(scores(:,1),scores(:,2)) <br />
% plots the first most variant dimension on the x-axis <br />
% and the second highest on the y-axis <br />
plot(scores(1:200,1),scores(1:200,2))<br />
% same graph as above, only with the 2s (not 3s)<br />
hold on % this command allows us to add to the current plot<br />
plot (scores(201:400,1),scores(201:400,2),'ro')<br />
% this adds the data for the 3s<br />
% the 'ro' command makes them red Os on the plot<br />
% If we classify based on the position in this plot (feature), <br />
% it's easier than looking at each of the 64 data pieces<br />
gname() % displays a figure window and <br />
% waits for you to press a mouse button or a keyboard key<br />
figure<br />
subplot(1,2,1)<br />
imagesc(reshape(X(:,45),8,8)')<br />
subplot(1,2,2)<br />
imagesc(reshape(X(:,93),8,8)')<br />
<br />
====General PCA Algorithm====<br />
<br />
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).<br />
[[File:PCAalgorithm.JPG]]<br />
<br />
Other Notes:<br />
::#The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean.<br />
::#Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. <br />
:::Encoding: <math>X_{D\times n} \longrightarrow Y_{d\times n}</math> using mapping <math>\, U^T X_{d \times n} </math>. <br />
::#When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). <br />
:::Reconstructing: <math> \hat{X}_{D\times n}\longleftarrow Y_{d \times n}</math> using mapping <math>\, UY_{D \times n} </math>. <br />
::#We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.<br />
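The encode/reconstruct steps above can be sketched in NumPy. The data below is synthetic (random, with hypothetical sizes D = 5 and n = 50), and the eigendecomposition of the sample covariance supplies the matrix U of top-d eigenvectors:<br />

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 50))             # D = 5 features, n = 50 samples (columns)
mu = X.mean(axis=1, keepdims=True)
Xc = X - mu                              # step 1: subtract off the mean

S = Xc @ Xc.T / (X.shape[1] - 1)         # sample covariance, D x D
lam, U_full = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]            # sort eigenvalues descending
d = 2
U = U_full[:, order[:d]]                 # top-d eigenvectors

Y = U.T @ Xc                             # encode:      D x n  ->  d x n
X_hat = U @ Y + mu                       # reconstruct: d x n  ->  D x n
```

With d = D the reconstruction is exact; with d < D only the low-variance directions are discarded.<br />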
<br />
==Fisher's Linear Discriminant Analysis (FDA) - July 16(cont) ==<br />
<br />
<br />
Similar to PCA, the goal of FDA is to project the data into a lower dimension. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no-glasses). <br />
<br />
The number of dimensions that we want to reduce the data to depends on the number of classes:<br />
<br><br />
For a 2 class problem, we want to reduce the data to one dimension (a line), <math>\displaystyle Z \in \mathbb{R}^{1}</math> <br />
<br><br />
Generally, for a k class problem, we want k-1 dimensions, <math>\displaystyle Z \in \mathbb{R}^{k-1}</math><br />
<br />
As we will see from our objective function, we want to maximize the separation of the classes. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class is close together (i.e. collapse to a single point).<br />
<br />
The following diagram summarizes this goal.<br />
<br />
[[File:FDA.JPG]]<br />
<br />
In fact, the two examples above may represent the same data projected on two different lines.<br />
<br />
[[File:FDAtwo.PNG]]<br />
<br />
=== Goal: Maximum Separation ===<br />
<br />
====1. Minimize the within class variance====<br />
<br />
<math>\displaystyle \min (w^T\Sigma_1w) </math><br />
<br />
<math>\displaystyle \min (w^T\Sigma_2w) </math><br />
<br />
and this problem reduces to <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math><br />
<br />
Let <math>\displaystyle \ s_w=\Sigma_1 + \Sigma_2</math> : the within-class covariance.<br />
Then, this problem can be rewritten as: <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
====2. Maximize the distance between the means of the projected classes====<br />
<br />
<math>\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 </math><br />
<br />
<math>\displaystyle = (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^Tw - \mu_2^Tw)(w^T \mu_1 - w^T \mu_2) </math><br />
<br />
<math>\displaystyle = (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) </math><br />
<br />
which is a scalar. Therefore,<br />
<br />
<math>\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)] </math><br />
<br />
<math>\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw] </math><br />
<br />
(using the cyclic property of the trace: <math>\displaystyle tr[ABC] = tr[CAB] = tr[BCA] </math>)<br />
<br />
<math>\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw </math><br />
<br />
Thus, our original problem is equivalent to<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math><br />
<br />
Let <math>\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T</math> : the between-class covariance.<br />
Then, this problem can be rewritten: <math>\displaystyle \max (w^Ts_Bw)</math><br />
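Since <math>\,s_B</math> is an outer product of a single vector, it always has rank one in the two-class case. A small numpy sketch with hypothetical class means:<br />

```python
import numpy as np

# Hypothetical class means.
mu1 = np.array([2.0, 8.0 / 3.0])
mu2 = np.array([7.0, 20.0 / 3.0])

# s_B = (mu1 - mu2)(mu1 - mu2)^T: an outer product, hence a rank-one matrix.
d = (mu1 - mu2).reshape(-1, 1)
s_B = d @ d.T
```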
<br />
===Objective Function===<br />
We want an objective function which satisfies both of the goals outlined above (at the same time).<br /><br />
(1) <math>\displaystyle \min (w^T(\Sigma_1 + \Sigma_2)w)</math> or <math>\displaystyle \min (w^Ts_ww)</math><br />
<br />
(2) <math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math> <br />
<br />
We take the ratio of the two -- we wish to maximize<br /><br />
<br />
<math>\displaystyle \frac{w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw}{w^T(\Sigma_1 + \Sigma_2)w} </math><br />
<br />
or<br />
<br />
<math>\displaystyle \max (w^Ts_Bw)/(w^Ts_ww)</math><br />
<br />
<br />
<br />
This ratio is known as the generalized Rayleigh quotient, and the maximization can be solved using Lagrange multipliers. Since <math>\,w</math> is a direction vector, its magnitude does not matter, so we can fix the scale of the denominator and solve the following constrained optimization problem instead:<br />
<br />
<math>\displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) </math> or <math>\displaystyle \max (w^Ts_Bw)</math><br />
<br />
<br />Subject To: <br />
<math>\displaystyle (w^T(\Sigma_1 + \Sigma_2)w)=1 </math> or <math>\displaystyle (w^Ts_ww=1)</math> <br />
<br />
<br />
<br />
<br />
Therefore, the Lagrangian that we want to maximize is<br />
<br />
<math>\displaystyle (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) - \lambda [(w^T(\Sigma_1 + \Sigma_2)w)-1] </math><br />
<br />
or <br />
<br />
<math>\displaystyle (w^Ts_Bw) - \lambda [(w^Ts_ww)-1] </math><br />
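Setting the derivative of this Lagrangian with respect to <math>\,w</math> to zero gives <math>\, s_Bw = \lambda s_ww</math>, a generalized eigenvalue problem; because <math>\,s_B</math> is rank one for two classes, the solution reduces to <math>\, w \propto s_w^{-1}(\mu_1 - \mu_2)</math>. A numpy sketch illustrating both routes (the matrix and means are hypothetical):<br />

```python
import numpy as np

# Hypothetical within-class covariance and class means.
s_w = np.array([[4.0, 3.0], [3.0, 16.0 / 3.0]])
mu1 = np.array([2.0, 8.0 / 3.0])
mu2 = np.array([7.0, 20.0 / 3.0])

# Closed-form two-class solution: w is proportional to s_w^{-1}(mu1 - mu2).
w = np.linalg.solve(s_w, mu1 - mu2)
w = w / np.linalg.norm(w)  # only the direction matters

# Equivalently, the top eigenvector of s_w^{-1} s_B solves s_B w = lambda s_w w.
d = (mu1 - mu2).reshape(-1, 1)
s_B = d @ d.T
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(s_w) @ s_B)
w_eig = eigvecs[:, np.argmax(eigvals.real)].real
w_eig = w_eig / np.linalg.norm(w_eig)
```

Both computations give the same direction, up to sign.<br />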
<br />
<br />
<br />
<br />
<br />
== Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21 ==<br />
<br />
Example: FDA<br />
<br />
== Classification - July 21 (cont) ==<br />
<br />
<br />
'''Error Rate'''<br />
<br />
<math>\, E(h) =Pr(h(X)\neq Y) </math><br />
<br />
Given test points, how can we estimate the error rate?<br />
<br />
We simply count the number of points that have been misclassified and divide by the total number of points.<br />
<br />
<math>\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (Y_i \neq h(x_i)) </math><br />
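A minimal Python sketch of this estimate, with a hypothetical classification rule and test set:<br />

```python
import numpy as np

def empirical_error(h, X, Y):
    """Fraction of test points misclassified by the rule h."""
    preds = np.array([h(x) for x in X])
    return float(np.mean(preds != Y))

# Hypothetical rule and labelled test points.
h = lambda x: 1 if x >= 0 else 0
X = np.array([-2.0, -1.0, 0.5, 3.0])
Y = np.array([0, 1, 1, 1])  # the point at -1.0 is misclassified
print(empirical_error(h, X, Y))  # prints 0.25
```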
<br />
=== Bayes Classification Rule ===<br />
<br />
Consider the case of a two-class problem where <math> \mathcal{Y} = \{0,1\} </math><br />
<br />
<math>\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)P(Y=1)} {Pr(X=x)} </math><br />
<br />
Where the denominator <math>\, Pr(X=x) = Pr(X=x \mid Y=1)P(Y=1)+Pr(X=x \mid Y=0)P(Y=0) </math><br />
<br />
So our classification rule is<br />
<math>h(x) = \begin{cases}<br />
1 & r(x) \geq \frac{1}{2} \\<br />
0 & \text{otherwise}\\<br />
\end{cases}</math><br />
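If the class-conditional densities and the priors were known, <math>\, r(x)</math> and the Bayes rule could be evaluated exactly. A Python sketch with two hypothetical one-dimensional Gaussian classes and equal priors:<br />

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a N(mu, sigma^2) random variable at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def r(x, p1=0.5, mu0=0.0, mu1=3.0, sigma=1.0):
    """r(x) = Pr(Y=1 | X=x) via Bayes' theorem."""
    f1 = normal_pdf(x, mu1, sigma) * p1
    f0 = normal_pdf(x, mu0, sigma) * (1.0 - p1)
    return f1 / (f0 + f1)

def h(x):
    """Bayes classifier: predict class 1 when r(x) >= 1/2."""
    return 1 if r(x) >= 0.5 else 0
```

With equal priors and variances, the decision boundary falls at the midpoint of the two means (here <math>\,x = 1.5</math>).<br />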
<br />
This rule, the Bayes classifier, achieves the smallest possible error rate among all classifiers. <br />
<br />
A problem is that, in practice, we do not know the joint and marginal probability distributions needed to calculate <math>\, r(x)</math><br />
<br />
The Bayes error rate is therefore viewed as a theoretical bound - the best that any classification technique can achieve.<br />
<br />
One family of techniques approximates the decision boundary directly.<br />
<br />
=== Decision Boundary ===</div>Ipargaru