Please contribute to the discussion of splitting up this page into multiple pages on the [[{{TALKPAGENAME}}|talk page]].<br />
<br />
==[[signupformStat341F11| Editor Sign Up]]==<br />
<br />
==Notation==<br />
<br />
The following guidelines on notation were posted on the Wiki Course Note page for [[Stat946f11|STAT 946]]. Add to them as necessary for consistent notation on this page.<br />
<br />
Capital letters will be used to denote random variables and lower case letters denote observations for those random variables:<br />
<br />
* <math>\{X_1,\ X_2,\ \dots,\ X_n\}</math> random variables<br />
* <math>\{x_1,\ x_2,\ \dots,\ x_n\}</math> observations of the random variables<br />
<br />
The joint ''probability mass function'' can be written as:<br />
<center><math> P( X_1 = x_1, X_2 = x_2, \dots, X_n = x_n )</math></center><br />
or as shorthand, we can write this as <math>p( x_1, x_2, \dots, x_n )</math>. In these notes both types of notation will be used.<br />
We can also define a set of random variables <math>X_Q</math> where <math>Q</math> represents a set of subscripts.<br />
<br />
<br />
==Sampling - September 20, 2011==<br />
<br />
Sampling means generating data points (numbers) so that they follow a specified distribution.<br /><br />
i.e. From <math>x \sim~f(x)</math> sample <math>\,x_{1}, x_{2}, ..., x_{1000}</math><br />
<br />
In practice, it may be difficult to find the joint distribution of random variables directly. By simulating the random variables, we can make inferences from the data.<br />
<br />
===Sampling from Uniform Distribution===<br />
Computers cannot generate truly random numbers because they are deterministic; however, they can produce pseudorandom numbers using algorithms. The generated numbers mimic the properties of random numbers, but they are never truly random. One famous algorithm that is considered highly reliable is the Mersenne twister[http://en.wikipedia.org/wiki/Mersenne_twister], which generates numbers in an almost uniform distribution. <br />
<br />
<br />
====Multiplicative Congruential====<br />
*involves four parameters: integers <math>\,a, b, m</math>, and an initial value <math>\,x_0</math> which we call the seed<br />
*a sequence of integers is defined as<br />
:<math>x_{k+1} \equiv (ax_{k} + b) \mod{m}</math><br />
<br />
'''Example:''' <math>\,a=13, b=0, m=31, x_0=1</math> creates a uniform histogram.<br />
<br />
MATLAB code for generating 1000 random numbers using the multiplicative congruential method:<br />
<br />
<pre><br />
a = 13;<br />
b = 0;<br />
m = 31;<br />
x(1) = 1;<br />
<br />
for ii = 2:1000<br />
x(ii) = mod(a*x(ii-1)+b, m);<br />
end<br />
</pre><br />
<br />
MATLAB code for displaying the values of x generated:<br />
<br />
<pre><br />
x<br />
</pre><br />
<br />
MATLAB code for plotting the histogram of x:<br />
<br />
<pre><br />
hist(x)<br />
</pre><br />
<br />
Histogram Output:<br />
<br />
[[File:uniform.jpg]]<br />
<br />
Facts about this algorithm:<br />
*In this example, the first 30 terms in the sequence are a permutation of integers from 1 to 30 and then the sequence repeats itself.<br />
*Values are between <b>0</b> and <b>m-1</b>, inclusive.<br />
*Dividing the numbers by <b> m-1 </b> yields numbers in the interval <b>[0,1]</b>.<br />
*MATLAB's <code>rand</code> function once used this algorithm with <b>a= 7<sup>5</sup></b>, <b>b= 0</b>, <b>m= 2<sup>31</sup>-1</b>, for reasons described in Park and Miller's 1988 paper "Random Number Generators: Good Ones are Hard to Find" (available [http://www.firstpr.com.au/dsp/rand31/p1192-park.pdf online]).<br />
*Visual Basic's <code>RND</code> function also used this algorithm with <b>a= 1140671485</b>, <b>b= 12820163</b>, <b>m= 2<sup>24</sup></b>. ([http://support.microsoft.com/kb/231847 Reference])<br />
<br />
===Inverse Transform Method===<br />
This is a basic method for sampling. Theoretically using this method we can generate sample numbers at random from any probability distribution once we know its cumulative distribution function (cdf).<br />
<br />
====Theorem====<br />
Take <math>U \sim~ \mathrm{Unif}[0, 1]</math> and let <math>X = F^{-1}(U) </math>. Then <math>X</math> has distribution function <math>F(\cdot)</math>, where <math>F(x)=P(X \leq x)</math> and <math>F^{-1}(\cdot)</math> is the inverse of <math>F(\cdot)</math>.<br />
<br />
Therefore <math>F(x)=u\implies x=F^{-1}(u)</math><br />
<br />
'''Proof'''<br />
<br />
Recall that<br />
<br />
:<math>P(a \leq X<b)=\int_a^{b} f(x) dx</math><br />
<br />
:<math>F(x)=P(X \leq x)=\int_{-\infty}^{x} f(u)\, du</math> (the cdf)<br />
<br />
Note that if <math>U \sim~ \mathrm{Unif}[0, 1]</math>, we have <math>P(U \leq a)=a</math><br />
<br />
:<math>\begin{align}<br />
<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
====Continuous Case====<br />
Generally it takes two steps to get random numbers using this method.<br />
<br />
*Step 1. Draw <math>U \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <b><i>X=F <sup>&minus;1</sup>(U)</i></b><br />
<br />
'''Example'''<br />
<br />
Take the exponential distribution for example<br />
:<math>\,f(x)={\lambda}e^{-{\lambda}x}</math><br />
:<math>\,F(x)=\int_0^x {\lambda}e^{-{\lambda}u} du=[-e^{-{\lambda}u}]_0^x=1-e^{-{\lambda}x}</math><br />
<br />
Let: <math>\,F(x)=y</math><br />
:<math>\,y=1-e^{-{\lambda}x}</math><br />
:<math>\,ln(1-y)={-{\lambda}x}</math><br />
:<math>\,x=\frac{ln(1-y)}{-\lambda}</math><br />
:<math>\,F^{-1}(x)=\frac{-ln(1-x)}{\lambda}</math><br />
<br />
Therefore, getting an exponential distribution from a uniform distribution takes two steps.<br />
*Step 1. Draw <math>U \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <math>x=\frac{-ln(1-U)}{\lambda}</math><br />
<br />
Note: If U~Unif[0, 1], then (1 - U) and U have the same distribution. This allows us to slightly simplify step 2 into an alternate form:<br />
*Alternate Step 2. <math>x=\frac{-ln(U)}{\lambda}</math><br />
<br />
'''MATLAB code'''<br />
for exponential distribution case,assuming <math>\lambda=0.5</math><br />
<br />
<pre><br />
for ii = 1:1000<br />
u = rand;<br />
x(ii) = -log(1-u)/0.5;<br />
end<br />
hist(x)<br />
</pre><br />
<br />
MATLAB result<br />
<br />
[[File:MATLAB_Exp.jpg|center|300px]]<br />
<br />
====Discrete Case - September 22, 2011====<br />
This same technique can be applied to the discrete case. Generate a discrete random variable <math>\,x</math> that has probability mass function <math>\,P(X=x_i)=P_i </math> where <math>\,x_0<x_1<x_2...</math> and <math>\,\sum_i P_i=1</math><br />
*Step 1. Draw <math>u \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <math>\,x=x_i</math> if <math>\,F(x_{i-1})<u \leq F(x_i)</math><br />
<br />
'''Example'''<br />
<br />
Let x be a discrete random variable with the following probability mass function:<br />
<br />
:<math>\begin{align}<br />
P(X=0) = 0.3 \\<br />
P(X=1) = 0.2 \\<br />
P(X=2) = 0.5<br />
\end{align}</math><br />
<br />
Given the pmf, we now need to find the cdf.<br />
<br />
We have:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0 & x < 0 \\<br />
0.3 & 0 \leq x < 1 \\<br />
0.5 & 1 \leq x < 2 \\<br />
1 & 2 \leq x<br />
\end{cases}</math><br />
<br />
We can apply the inverse transform method to obtain our random numbers from this distribution.<br />
<br />
'''Pseudo Code for generating the random numbers:'''<br />
<pre><br />
Draw U ~ Unif[0,1] <br />
if U <= 0.3 <br />
return 0 <br />
else if 0.3 < U <= 0.5 <br />
return 1<br />
else if 0.5 < U <= 1 <br />
return 2<br />
</pre><br />
<br />
'''MATLAB code for generating 1000 random numbers in the discrete case:'''<br />
<br />
<pre><br />
for ii = 1:1000<br />
u = rand;<br />
<br />
if u <= 0.3<br />
x(ii) = 0;<br />
elseif u <= 0.5<br />
x(ii) = 1;<br />
else<br />
x(ii) = 2;<br />
end<br />
end<br />
</pre><br />
<br />
Matlab Output:<br />
<br />
[[File:Discreteinv.jpg]]<br />
<br />
'''Pseudo code for the Discrete Case:'''<br />
<br />
1. Draw U ~ Unif [0,1]<br />
<br />
2. If <math> U \leq P_0 </math>, deliver <b><i>X= x<sub>0</sub></i></b><br />
<br />
3. Else if <math> U \leq P_0 + P_1 </math>, deliver <b><i>X= x<sub>1</sub></i></b><br />
<br />
4. In general, else if <math> U \leq P_0 + P_1 + \dots + P_k </math>, deliver <b><i>X= x<sub>k</sub></i></b><br />
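<br />
The same general procedure can be sketched in MATLAB with a cumulative sum of the pmf. This is only a minimal sketch; the vectors <code>p</code> and <code>xs</code> are hypothetical names, here filled in with the values from the example above.<br />
<br />
<pre><br />
p  = [0.3 0.2 0.5];        % pmf: P(X = xs(k)) = p(k)<br />
xs = [0 1 2];              % support of X<br />
F  = cumsum(p);            % cdf evaluated at the support points<br />
<br />
x = zeros(1,1000);<br />
for ii = 1:1000<br />
    u = rand;<br />
    k = find(u <= F, 1);   % smallest k with F(k-1) < u <= F(k)<br />
    x(ii) = xs(k);<br />
end<br />
hist(x)<br />
</pre><br />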
<br />
====Limitations====<br />
<br />
Although this method is useful, it isn't practical in many cases since we can't always obtain <math>F</math> or <math> F^{-1} </math> as some functions are not integrable or invertible, and sometimes even <math>f(x)</math> itself cannot be obtained in closed form. Let's look at some examples:<br />
*Continuous case<br />
If we want to use this method to sample from the '''normal distribution''', we get stuck when working with its ''cdf''. <br />
The simplest case of the '''normal distribution''' is <math>f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}</math>,<br />
whose ''cdf'' is <math>F(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}{e^{-\frac{u^2}{2}}}du</math>. This integral cannot be expressed in terms of elementary functions. So evaluating it and then finding the inverse is a very difficult task.<br />
*Discrete case <br />
It is easy to simulate when the random variable takes only a few values, as in the case above.<br />
It is also easy to simulate the '''binomial distribution''' <math>X \sim~ \mathrm{B}(n,p)</math> when the parameter n is not too large.<br />
But when n takes on very large values, say 50 or more, writing down and inverting the cdf becomes tedious.<br />
<br />
===Acceptance/Rejection Method===<br />
<br />
<br />
The aforementioned difficulties of the inverse transform method motivate a sampling method that does not require analytically calculating cdf's and their inverses: the acceptance/rejection sampling method. Here, <math> \displaystyle f(x)</math> is approximated by another function, say <math>\displaystyle g(x)</math>, with the idea being that <math>\displaystyle g(x)</math> is a "nicer" function to work with than <math>\displaystyle f(x)</math>.<br />
<br />
Suppose we assume the following:<br />
<br />
1. There exists another distribution <math>\displaystyle g(x)</math> that is easier to work with and that you know how to sample from, and<br />
<br />
2. There exists a constant c such that <math>f(x) \leq c \cdot g(x)</math> for all x<br />
<br />
Under these assumptions, we can sample from <math>\displaystyle f(x)</math> by sampling from <math>\displaystyle g(x)</math><br />
<br />
====General Idea====<br />
<br />
Looking at the image below we have graphed <math> c \cdot g(x) </math> and <math>\displaystyle f(x)</math>.<br />
<br />
[[File:Graph_updated.jpg]]<br />
<br />
Using the acceptance/rejection method we will accept some of the points drawn from <math>\displaystyle g(x)</math> and reject the others. The accepted points will have a distribution that follows <math>\displaystyle f(x)</math>. We can see from the image that the values around <math>\displaystyle x_1</math> will be sampled more often under <math>c \cdot g(x)</math> than are needed under <math>\displaystyle f(x)</math>, so we will have to reject more of the samples taken at x<sub>1</sub>. Around <math>\displaystyle x_2</math> the number of samples drawn and the number of samples needed are much closer, so we accept more of the samples drawn at <math>\displaystyle x_2</math>.<br />
<br />
====Procedure====<br />
<br />
1. Draw y ~ g<br />
<br />
2. Draw U ~ Unif [0,1]<br />
<br />
3. If <math> U \leq \frac{f(y)}{c \cdot g(y)}</math> then x=y; else return to 1<br />
<br />
Note that the choice of <math> c </math> plays an important role in the efficiency of the algorithm. We want <math> c \cdot g(x) </math> to be "tightly fit" over <math> f(x) </math> to increase the probability of accepting points, and therefore reducing the number of sampling attempts. Mathematically, we want to minimize <math> c </math> such that <math>f(x) \leq c \cdot g(x) \ \forall x</math>. We do this by setting<br />
<br />
<math> \frac{d}{dx}(\frac{f(x)}{g(x)}) = 0 </math>, solving for a maximum point <math> x_0 </math> and setting <math> c = \frac{f(x_0)}{g(x_0)}. </math><br />
<br />
====Proof====<br />
<br />
Mathematically, we need to show that the accepted sample points follow the target distribution f(x).<br />
<br />
<math>\begin{align} P(y|accepted) &= \frac{P(y, accepted)}{P(accepted)} \\<br />
<br />
&= \frac{P(accepted|y) P(y)}{P(accepted)}\end{align} </math> (Bayes' Rule)<br />
<br />
<br />
<br />
<math>\displaystyle P(y) = g(y)</math><br />
<br />
<math>P(accepted|y) =P(u\leq \frac{f(y)}{c \cdot g(y)}) =\frac{f(y)}{c \cdot g(y)} </math>,where u ~ Unif [0,1]<br />
<br />
<math>P(accepted) = \sum P(accepted|y)\cdot P(y)=\int^{}_y \frac{f(y)}{c \cdot g(y)}g(y) dy=\int^{}_y \frac{f(y)}{c} dy=\frac{1}{c} \cdot\int^{}_y f(y) dy=\frac{1}{c}</math><br />
<br />
So,<br />
<br />
<math> P(y|accepted) = \frac{ \frac {f(y)}{c \cdot g(y)} \cdot g(y)}{\frac{1}{c}} =f(y) </math><br />
<br />
====Continuous Case====<br />
<br />
'''Example'''<br />
<br />
Sample from Beta(2,1)<br />
<br />
In general:<br />
<br />
Beta(<math>\alpha, \beta) = \frac{\Gamma (\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}</math> <math>\displaystyle x^{\alpha-1}</math> <math>\displaystyle(1-x)^{\beta-1}</math>, <math>\displaystyle 0<x<1</math><br />
<br />
Note: <math>\!\Gamma(n) = (n-1)!</math> if n is a positive integer<br />
<br />
<math>\begin{align} f(x) &= Beta(2,1) \\<br />
&= \frac{\Gamma(3)}{\Gamma(2)\Gamma(1)} x^1(1-x)^0 \\<br />
&= \frac{2!}{1! 0!}\cdot (1) x \\<br />
&= 2x \end{align}</math><br />
<br />
We want to choose <math>\displaystyle g(x)</math> that is easy to sample from. So we choose <math>\displaystyle g(x)</math> to be uniform distribution.<br />
<br />
We now want a constant c such that <math>f(x) \leq c \cdot g(x) </math> for all x from Unif(0,1)<br />
<br />
<br />
So,<br /><br />
<br />
<math>c \geq \frac{f(x)}{g(x)}</math>, for all x from (0,1)<br />
<br />
<br />
<math>\begin{align}c &\geq max (\frac {f(x)}{g(x)}, 0<x<1) \\<br />
<br />
<br />
&= max (\frac {2x}{1},0<x<1) \\<br />
<br />
<br />
&= 2 \end{align}</math><br />
<br />
<br />
<br />
Now that we have c =2,<br />
<br />
1. Draw y ~ g(x) => Draw y ~ Unif [0,1] <br />
<br />
2. Draw u ~ Unif [0,1] <br />
<br />
3. if <math>u \leq \frac{2y}{2 \cdot 1}</math> then x=y; else return to 1<br />
<br />
<br />
'''MATLAB code for generating 1000 samples following Beta(2,1):'''<br />
<br />
<pre><br />
close all<br />
clear all<br />
ii=1;<br />
while ii <= 1000<br />
y = rand;<br />
u = rand;<br />
<br />
if u <= y<br />
x(ii)=y;<br />
ii=ii+1;<br />
end<br />
end<br />
hist(x)<br />
</pre><br />
<br />
'''MATLAB result'''<br />
<br />
[[File:MATLAB_Beta.jpg]]<br />
<br />
====Discrete Example====<br />
<br />
Generate random variables according to the p.m.f:<br />
<br />
:<math>\begin{align}<br />
P(Y=1) = 0.15 \\<br />
P(Y=2) = 0.22 \\<br />
P(Y=3) = 0.33 \\<br />
P(Y=4) = 0.10 \\<br />
P(Y=5) = 0.20 <br />
\end{align}</math><br />
<br />
find a g(y) discrete uniform distribution from 1 to 5<br />
<br />
<math>c \geq \frac{P(y)}{g(y)} </math><br><br />
<math>c = \max \left(\frac{P(y)}{g(y)} \right)</math><br><br />
<math>c = \max \left(\frac{0.33}{0.2} \right) = 1.65</math> Since P(Y=3) is the max of P(Y) and g(y) = 0.2 for all y.<br><br />
<br />
1. Generate Y according to the discrete uniform between 1 - 5<br />
<br />
2. U ~ unif[0,1]<br />
<br />
3. If <math>U \leq \frac{P(y)}{1.65 \times 0.2} = \frac{P(y)}{0.33} </math>, then x = y; else return to 1.<br />
<br />
In MATLAB, the code would be:<br />
<br />
<pre><br />
py = [0.15 0.22 0.33 0.1 0.2];<br />
ii = 1;<br />
while ii <= 1000<br />
    y = unidrnd(5);<br />
    u = rand;<br />
    if u <= py(y)/0.33<br />
        x(ii) = y;<br />
        ii = ii+1;<br />
    end<br />
end<br />
hist(x);<br />
</pre><br />
<br />
MATLAB result<br />
<br />
[[File:MATLAB_Y.jpg]]<br />
<br />
====Limitations====<br />
<br />
Most of the time we have to sample many more points from g(x) before we obtain an acceptable number of samples from f(x), so this method may not be computationally efficient; it depends on our choice of g(x). For example, in the example above for sampling from Beta(2,1), we need roughly 2000 samples from g(x) to get 1000 accepted samples from f(x).<br />
<br />
In addition, a discrepancy between the functional behaviors of f(x) and g(x) can render this method unreliable. For example, if g(x) is a normal density and f(x) has a "fat" mid-section and "thin" tails, most of the points drawn near the two ends of f(x) will be rejected, so a tediously large number of draws is needed because of the high rejection rate.<br />
<br />
===Sampling From Gamma and Normal Distribution - September 27, 2011===<br />
<br />
====Sampling From Gamma====<br />
<br />
'''Gamma Distribution'''<br />
<br />
The Gamma distribution is written as <math>X \sim~ Gamma (t, \lambda) </math><br />
<br />
:<math> F(x) = \int_{0}^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If you have t samples of the exponential distribution,<br><br />
<br> <math> \begin{align} X_1 \sim~ Exp(\lambda)\\ \vdots \\ X_t \sim~ Exp(\lambda) \end{align}<br />
</math><br />
<br />
The sum of these t samples has a gamma distribution,<br />
<br />
:<math> X_1+X_2+ ... + X_t \sim~ Gamma (t, \lambda) </math><br><br />
:<math> \sum_{i=1}^{t} X_i \sim~ Gamma (t, \lambda) </math> where <math>X_i \sim~Exp(\lambda)</math><br><br />
<br />
'''Method'''<br />
<br />
We can sample the exponential distribution using the inverse transform method from previous class,<br><br />
:<math>\,f(x)={\lambda}e^{-{\lambda}x}</math><br><br />
:<math>\,F^{-1}(u)=\frac{-ln(1-u)}{\lambda}</math><br><br />
:<math>\,F^{-1}(u)=\frac{-ln(u)}{\lambda}</math> <br />
since <math>\,1 - u</math> has the same distribution as <math>\,u</math> when <math>U \sim~ unif [0,1] </math><br><br />
Each exponential sample comes from its own uniform draw,<br />
:<math>\, x_1=\frac{-ln(u_1)}{\lambda},\ x_2=\frac{-ln(u_2)}{\lambda},\ \dots,\ x_t=\frac{-ln(u_t)}{\lambda}</math><br><br />
and their sum gives the Gamma sample<br />
:<math> x = x_1 + x_2 + \dots + x_t = \frac {-\sum_{i=1}^{t} ln(u_i)}{\lambda}</math><br />
<br />
'''MATLAB code''' for a Gamma(3,1) is<br />
<br />
<pre><br />
x = sum(-log(rand(1000,3)),2); <br />
hist(x)<br />
</pre><br />
<br />
The histogram of x follows a Gamma distribution with a long tail: <br />
<br />
[[File:Hist.PNG|center|500px]]<br />
<br />
We can improve the quality of the histogram by specifying the number of bins we want, as in hist(x, number_of_bins)<br />
<br />
<pre><br />
x = sum(-log(rand(20000,3)),2); <br />
hist(x,40)<br />
</pre><br />
<br />
[[File:untitled.jpg|center|500px]]<br />
<br />
''' R code''' for a Gamma(3,1) is<br />
<pre><br />
a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);<br />
hist(a);<br />
</pre><br />
And the histogram is <br />
<br />
[[File:hist_gamma.png|center|500px]]<br />
<br />
Here is another histogram of Gamma coding with R<br />
<pre><br />
a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);<br />
hist(a,freq=F);<br />
lines(density(a),col="blue");<br />
rug(jitter(a));<br />
</pre><br />
[[File:hist_gamma_2.png|center|500px]]<br />
<br />
====Sampling from Normal Distribution using Box-Muller Transform - September 29, 2011====<br />
<br />
=====Procedure=====<br />
<br />
# Generate <math>\displaystyle u_1</math> and <math>\displaystyle u_2</math>, two values sampled from a uniform distribution between 0 and 1.<br />
# Set <math>\displaystyle R^2 = -2log(u_1)</math> so that <math>\displaystyle R^2</math> is exponential with rate 1/2 (mean 2) <br> Set <math>\!\theta = 2\pi u_2</math> so that <math>\!\theta</math> ~ Unif[0, 2<math>\displaystyle\pi</math>]<br />
# Set <math>\displaystyle X = R cos(\theta)</math> <br> Set <math>\displaystyle Y = R sin(\theta)</math><br />
<br />
=====Justification=====<br />
<br />
Suppose we have X ~ N(0, 1) and Y ~ N(0, 1) where X and Y are independent normal random variables. The joint probability density function of these two random variables in Cartesian coordinates is:<br />
<br />
<math> f(X, Y) dxdy= f(X) f(Y) dxdy= \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \frac{1}{\sqrt{2\pi}}e^{-y^2/2} dxdy= \frac{1}{2\pi}e^{-(x^2+y^2)/2}dxdy </math> <br><br />
<br />
In polar coordinates <math>\displaystyle R^2 = x^2 + y^2</math>, so the same joint density expressed in polar coordinates is:<br />
<br />
<math> f(R, \theta) = \frac{1}{2\pi}e^{-R^2/2} </math> <br><br />
<br />
If we have <math>\displaystyle R^2 \sim exp(1/2)</math> and <math>\!\theta \sim unif[0, 2\pi]</math> we get an equivalent joint density. Notice that for this two-dimensional change of variables, the determinant of the Jacobian must be included, according to the change-of-variables rule, where<br />
<br />
<math> |J|=\left|\frac{\partial(x,y)}{\partial(R,\theta)}\right|= \left|\begin{matrix}\frac{\partial x}{\partial R}&\frac{\partial x}{\partial \theta}\\\frac{\partial y}{\partial R}&\frac{\partial y}{\partial \theta}\end{matrix}\right|=R </math> <br><br />
<br />
<math> f(X, Y) dxdy = f(R, \theta)|J|dRd\theta = \frac{1}{2\pi}e^{-R^2/2}R dRd\theta= \frac{1}{4\pi}e^{-\frac{S}{2}} dSd\theta </math> <br>where <math> S=R^2. </math> <br><br />
<br />
Therefore we can generate a point in polar coordinates using the uniform and exponential distributions, then convert the point to Cartesian coordinates and the resulting X and Y values will be equivalent to samples generated from N(0, 1).<br />
<br />
'''MATLAB code'''<br />
<br />
In MatLab this algorithm can be implemented with the following code, which generates 20,000 samples from N(0, 1):<br />
<br />
<pre><br />
x = zeros(10000, 1);<br />
y = zeros(10000, 1);<br />
for ii = 1:10000<br />
u1 = rand;<br />
u2 = rand;<br />
R2 = -2 * log(u1);<br />
theta = 2 * pi * u2;<br />
x(ii) = sqrt(R2) * cos(theta);<br />
y(ii) = sqrt(R2) * sin(theta);<br />
end<br />
hist(x)<br />
</pre><br />
<br />
In one execution of this script, the following histogram for x was generated:<br />
<br />
[[File:Hist standard normal.jpg|center|500px]]<br />
<br />
=====Non-Standard Normal Distributions=====<br />
<br />
'''Example 1: Single-variate Normal'''<br />
<br />
If X ~ Norm(0, 1) then (a + bX) has a normal distribution with a mean of <math>\displaystyle a</math> and a standard deviation of <math>\displaystyle |b|</math> (which is equivalent to a variance of <math>\displaystyle b^2</math>). Using this information with the Box-Muller transform, we can generate values sampled from some random variable <math>\displaystyle Y\sim N(a,b^2) </math> for arbitrary values of <math>\displaystyle a,b</math>.<br />
<br />
# Generate a sample u from Norm(0, 1) using the Box-Muller transform.<br />
# Set v = a + bu.<br />
<br />
The values for v generated in this way will be equivalent to samples from a <math>\displaystyle N(a, b^2)</math> distribution. We can modify the MatLab code used in the last section to demonstrate this. We just need to add one line before we generate the histogram:<br />
<br />
<pre><br />
x = a + b * x;<br />
</pre><br />
<br />
For instance, this is the histogram generated when b = 15, a = 125:<br />
<br />
[[File:Hist normal.jpg|center|500px]]<br />
<br />
'''Example 2: Multi-variate Normal'''<br />
<br />
The Box-Muller method can be extended to higher dimensions to generate multivariate normals. The objects generated will be nx1 vectors, and their variance will be described by nxn covariance matrices.<br />
<br />
<math>\mathbf{z} \sim N(\mathbf{u}, \Sigma)</math> defines the n by 1 random vector <math>\mathbf{z}</math> such that:<br />
<br />
* <math>\displaystyle u_i</math> is the mean of <math>\displaystyle z_i</math><br />
* <math>\!\Sigma_{ii}</math> is the variance of <math>\displaystyle z_i</math><br />
* <math>\!\Sigma_{ij}</math> is the co-variance of <math>\displaystyle z_i</math> and <math>\displaystyle z_j</math><br />
<br />
If <math>\displaystyle z_1, z_2, ..., z_d</math> are normal variables with mean 0 and variance 1, then the vector <math>\displaystyle (z_1, z_2,..., z_d) </math> has mean 0 and variance <math>\!I</math>, where 0 is the zero vector and <math>\!I</math> is the identity matrix. This fact suggests that the method for generating a multivariate normal is to generate each component individually as single normal variables.<br />
<br />
The mean and the covariance matrix of a multivariate normal distribution can be adjusted in ways analogous to the single variable case. If <math>\mathbf{z} \sim N(0,I)</math>, then <math>\Sigma^{1/2}\mathbf{z}+\mu \sim N(\mu,\Sigma)</math>. Note here that the covariance matrix is symmetric and positive semidefinite, so its square root always exists.<br />
<br />
We can compute <math>\mathbf{z}</math> in the following way:<br />
<br />
# Generate an n by 1 vector <math>\mathbf{x} = \begin{bmatrix}x_{1} & x_{2} & ... & x_{n}\end{bmatrix}</math> where <math>x_{i}</math> ~ Norm(0, 1) using the Box-Muller transform.<br />
# Calculate <math>\!\Sigma^{1/2}</math> using singular value decomposition.<br />
# Set <math>\mathbf{z} = \Sigma^{1/2} \mathbf{x} + \mathbf{u}</math>.<br />
<br />
The following MatLab code provides an example, where a scatter plot of 10000 random points is generated. In this case x and y have a covariance of 0.9, a very strong positive correlation.<br />
<br />
<pre><br />
x = zeros(10000, 1);<br />
y = zeros(10000, 1);<br />
for ii = 1:10000<br />
u1 = rand;<br />
u2 = rand;<br />
R2 = -2 * log(u1);<br />
theta = 2 * pi * u2;<br />
x(ii) = sqrt(R2) * cos(theta);<br />
y(ii) = sqrt(R2) * sin(theta);<br />
end<br />
<br />
E = [1, 0.9; 0.9, 1];<br />
[u s v] = svd(E);<br />
root_E = u * (s ^ (1 / 2));<br />
<br />
z = (root_E * [x y]')';<br />
z(:,1) = z(:,1) + 5;<br />
z(:,2) = z(:,2) + -8;<br />
<br />
scatter(z(:,1), z(:,2))<br />
</pre><br />
<br />
This code generated the following scatter plot:<br />
<br />
[[File:scatter covar.jpg|center|500px]]<br />
<br />
In Matlab, we can also use the function "sqrtm()" or "chol()" (Cholesky Decomposition) to calculate the square root of a matrix directly. Note that the resulting root matrices may be different, but this does not materially affect the simulation.<br />
Here is an example:<br />
<br />
<pre><br />
E = [1, 0.9; 0.9, 1];<br />
r1 = sqrtm(E);<br />
r2 = chol(E);<br />
</pre><br />
<br />
R code for a multivariate normal distribution:<br />
<br />
<pre><br />
n=10000;<br />
r2<--2*log(runif(n));<br />
theta<-2*pi*(runif(n));<br />
x<-sqrt(r2)*cos(theta);<br />
<br />
y<-sqrt(r2)*sin(theta);<br />
a<-matrix(c(x,y),nrow=n,byrow=F);<br />
e<-matrix(c(1,.9,.9,1),nrow=2,byrow=T);<br />
svde<-svd(e);<br />
root_e<-svde$u %*% diag(sqrt(svde$d));<br />
z<-t(root_e %*%t(a));<br />
z[,1]=z[,1]+5;<br />
z[,2]=z[,2]+ -8;<br />
par(pch=19);<br />
plot(z,col=rgb(1,0,0,alpha=0.06))<br />
</pre><br />
<br />
[[File:m_normal.png|center|500px]]<br />
<br />
=====Remarks=====<br />
MATLAB's randn function uses the ziggurat method to generate normally distributed samples. It is an efficient rejection method based on covering the probability density function with a set of horizontal rectangles so as to obtain points within each rectangle. It is reported that an 800 MHz Pentium III laptop can generate over 10 million random numbers from the normal distribution in less than one second. ([http://www.mathworks.com/company/newsletters/news_notes/clevescorner/spring01_cleve.html Reference])<br />
<br />
===Sampling From Binomial Distributions===<br />
<br />
In order to generate a sample x from <math>\displaystyle X \sim Bin(n, p)</math>, we can use the following procedure:<br />
<br />
1. Generate n uniform random numbers sampled from <math>\displaystyle Unif [0, 1] </math>: <math>\displaystyle u_1, u_2, ..., u_n</math>.<br />
<br />
2. Set x to be the number of indices <math>\displaystyle i</math>, <math>\displaystyle 1 \leq i \leq n</math>, for which <math>\displaystyle u_i \leq p</math>.<br />
<br />
In MatLab this can be coded with a single line. The following generates a sample from <math>\displaystyle X \sim Bin(n, p)</math> <br />
<br />
<pre><br />
>> sum(rand(n, 1) <= p, 1)<br />
</pre><br />
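<br />
As a quick sanity check we can repeat this to draw many binomial samples and plot their histogram. This is only a minimal sketch; the values n = 10, p = 0.3 and the sample size of 1000 are chosen purely for illustration.<br />
<br />
<pre><br />
n = 10;<br />
p = 0.3;<br />
x = zeros(1,1000);<br />
for ii = 1:1000<br />
    x(ii) = sum(rand(n,1) <= p);   % one Bin(n,p) sample<br />
end<br />
hist(x)<br />
</pre><br />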
<br />
==Bayesian Inference and Frequentist Inference - October 4, 2011==<br />
<br />
===Bayesian inference vs Frequentist inference===<br />
The Bayesian method has become popular in the last few decades as simulation and computer technology makes it more applicable. For more information about its history and application, please refer to http://en.wikipedia.org/wiki/Bayesian_inference.<br />
As for frequentists, please refer to http://en.wikipedia.org/wiki/Frequentist_inference.<br />
<br />
====Example====<br />
Consider: A person drinks a cup of coffee on a specific day.<br />
<br><br><br />
Frequentist: There is no probability to assign to this situation. It is essentially meaningless since it has only occurred once; a limiting relative frequency cannot be defined, so it is not a probability.<br />
<br><br />
Bayesian: Probability is not just about frequent occurrences; it expresses what you believe about the event, so a probability can be assigned to it.<br />
<br />
<br />
====Example of face identification====<br />
Take the face as the input x and the person as the output y. The person can be either Ali or Tom. If it is Ali, y=1. Otherwise, y=0. We can divide the picture into 100*100 pixels and stack them into a 10,000*1 column vector, which is x.<br />
<br />
If you are a frequentist, you would compare Pr(X=x|y=1) with Pr(X=x|y=0) and see which one is higher. But if you are a Bayesian, you would compare Pr(y=1|X=x) with Pr(y=0|X=x).<br />
<br />
====Summary of differences between two schools====<br />
*Frequentist: Probability refers to limiting relative frequency. (objective)<br />
*Bayesian: Probability describes degree of belief not frequency. (subjective)<br />
e.g. The probability that you drank a cup of tea on May 20, 2001 is 0.62 does not refer to any frequency.<br />
----<br />
*Frequentist: Parameters are fixed, unknown constants.<br />
*Bayesian: Parameters are random variables and we can make probabilistic statement about them.<br />
----<br />
*Frequentist: Statistical procedures should have long run frequency probabilities.<br />
e.g. a 95% confidence interval should contain the true value of the parameter in at least 95% of repeated experiments, in the long run<br />
*Bayesian: It makes inferences about <math>\theta</math> by producing a probability distribution for <math>\theta</math>. Inferences (e.g. point estimates) are then extracted from this distribution.<br />
<br />
====Bayesian inference====<br />
<br />
Bayesian inference is usually carried out in the following way:<br />
<br />
1. Choose a prior probability density function of <math>\!\theta</math> which is <math>f(\!\theta)</math>. This is our belief about <math>\theta</math> before we see any data.<br />
<br />
2. Choose a statistical model <math>\displaystyle f(x|\theta)</math> that reflects our beliefs about X.<br />
<br />
3. After observing data <math>\displaystyle x_1,...,x_n</math>, we update our beliefs and calculate the posterior probability.<br />
<br />
<math>f(\theta|x) = \frac{f(\theta,x)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{\int^{}_\theta f(x|\theta) \cdot f(\theta) d\theta}</math>, where <math>\displaystyle f(\theta|x)</math> is the posterior probability, <math>\displaystyle f(\theta)</math> is the prior probability, <math>\displaystyle f(x|\theta)</math> is the likelihood of observing X=x given <math>\!\theta</math> and f(x) is the marginal probability of X=x.<br />
<br />
If we have i.i.d. observations <math>\displaystyle x_1,...,x_n</math>, we can replace <math>\displaystyle f(x|\theta)</math> with <math>f({x_1,...,x_n}|\theta)=\prod_{i=1}^n f(x_i|\theta)</math> by independence.<br />
<br />
We denote <math>\displaystyle f({x_1,...,x_n}|\theta)</math> by <math>\displaystyle L_n(\theta)</math>, which is called the likelihood, and we use <math>\displaystyle x^n</math> to denote <math>\displaystyle (x_1,...,x_n)</math>.<br />
<br />
<math>f(\theta|x^n) = \frac{f(x^n|\theta) \cdot f(\theta)}{f(x^n)}=\frac{f(x^n|\theta) \cdot f(\theta)}{\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta}</math> , where <math>\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta</math> is a constant <math>\displaystyle c_n</math>. So <math>f(\theta|x^n) \propto f(x^n|\theta) \cdot f(\theta)</math>. The posterior probability is proportional to the likelihood times prior probability.<br />
<br />
<math>E(\theta)=\int^{}_\theta \theta \cdot f(\theta|x^n) d\theta</math> which is the posterior mean of <math>\!\theta</math>.<br />
<br />
Let <math>\tilde{\theta}=(\theta_1,...,\theta_d)^T</math>, then <math>f(\theta_1|x^n) = \int^{} \int^{} \dots \int^{}f(\theta|X)d\theta_2d\theta_3 \dots d\theta_d </math> and <math>E(\theta_1)=\int^{}\theta_1 \cdot f(\theta_1|x^n) d\theta_1</math><br />
<br />
====Example 1: Estimating parameters of a univariate Gaussian distribution====<br />
<br />
Suppose X follows a univariate Gaussian distribution (i.e. a Normal distribution) with parameters <math>\!\mu</math> and <br />
<math>\displaystyle {\sigma^2}</math>.<br />
<br />
(a) For Frequentists:<br />
<br />
<math>f(x|\theta)= \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}</math><br />
<br />
<math>L_n(\theta)= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2}</math><br />
<br />
<br />
<math>\ln L_n(\theta) = l(\theta) = \sum_{i=1}^n -\frac{1}{2}\ln 2\pi-\ln \sigma-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2</math><br />
<br />
To get the maximum likelihood estimator of <math>\!\mu</math> (mle), we find the <math>\hat{\mu}</math> which maximizes <math>\displaystyle L_n(\theta)</math>:<br />
<br />
<math>\frac{\partial l(\theta)}{\partial \mu}= \sum_{i=1}^n \frac{1}{\sigma}(\frac{x_i-\mu}{\sigma})=0 \Rightarrow \sum_{i=1}^n x_i = n\mu \Rightarrow \hat{\mu}_{mle}=\bar{x}</math><br />
<br />
(b) For Bayesians:<br />
<br />
<math>f(\theta|x) \propto f(x|\theta) \cdot f(\theta)</math><br />
<br />
We assume that the mean of the above normal distribution is itself distributed normally with mean <math>\!\mu_0</math> and variance <math>\!\Gamma^2</math>.<br />
<br />
Suppose <math>\!\mu\sim N(\mu_0, \Gamma^2)</math>,<br />
<br />
so <math>f(\mu) = \frac{1}{\sqrt{2\pi}\Gamma} \cdot e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\Gamma})^2}</math><br />
<br />
<math>f(\mu|x) = \frac{1}{\sqrt{2\pi}\tilde{\sigma}} \cdot e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
<br />
<math>\tilde{\mu} = \frac{\frac{n}{\sigma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\bar{x}+\frac{\frac{1}{\Gamma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\mu_0</math>, where <math>\tilde{\mu}</math> is the estimator of <math>\!\mu</math>.<br />
<br />
* If the prior belief about <math>\!\mu_0</math> is strong, then <math>\!\Gamma</math> is small and <math>\frac{1}{\Gamma^2}</math> is large, so <math>\tilde{\mu}</math> stays close to <math>\!\mu_0</math> and the observations do not affect it much. Conversely, if the prior belief about <math>\!\mu_0</math> is weak, <math>\!\Gamma</math> is large, <math>\frac{1}{\Gamma^2}</math> is small, and <math>\tilde{\mu}</math> depends more on the observations. (This is intuitive: when our original belief is reliable, the sample contributes little to improving the result; when the belief is unreliable, we depend heavily on the sample.)<br />
<br />
* When the sample is large (i.e. n <math>\to \infty</math>), <math>\tilde{\mu} \to \bar{x}</math> and the impact of prior belief about <math>\!\mu</math> is weakened.<br />
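<br />
The posterior mean formula above can be sketched in MATLAB as follows. This is only a minimal illustration; the values <math>\sigma = 2</math>, <math>\Gamma = 1</math>, <math>\mu_0 = 0</math> and the simulated data are assumptions chosen for the example, not part of the lecture.<br />
<br />
<pre><br />
mu0 = 0;  Gamma = 1;       % prior: mu ~ N(mu0, Gamma^2)  (illustrative values)<br />
sigma = 2;                 % known standard deviation of the data (illustrative)<br />
n = 50;<br />
x = 3 + sigma*randn(n,1);  % simulated observations whose true mean is 3<br />
<br />
w = (n/sigma^2) / (n/sigma^2 + 1/Gamma^2);   % weight on the sample mean<br />
mu_tilde = w*mean(x) + (1-w)*mu0             % posterior mean of mu<br />
</pre><br />
With a large n the weight w is close to 1, so mu_tilde is close to the sample mean, matching the remark above.<br />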
<br />
=='''Basic Monte Carlo Integration - October 6th, 2011'''==<br />
<br />
Three integration methods will be covered in this course:<br />
*Basic Monte Carlo Integration<br />
*Importance Sampling<br />
*Markov Chain Monte Carlo (MCMC)<br />
<br />
The first, and most basic, method of numerical integration we will see is Monte Carlo Integration. We use this to solve an integral of the form: <math> I = \int_{a}^{b} h(x) dx </math><br />
<br />
Note the following derivation: <br />
<br />
<math>\begin{align}<br />
\displaystyle I & = \int_{a}^{b} h(x)dx \\<br />
& = \int_{a}^{b} h(x)((b-a)/(b-a))dx \\<br />
& = \int_{a}^{b} (h(x)(b-a))(1/(b-a))dx \\<br />
& = \int_{a}^{b} w(x)f(x)dx \\<br />
& = E[w(x)] \\<br />
\end{align}<br />
</math><br />
<br />
<math>\approx \frac{1}{n} \sum_{i=1}^{n} w(x_i) </math><br />
<br />
where w(x) = h(x)(b-a) and f(x) is the probability density function of a uniform random variable on the interval [a,b]. The expectation of w with respect to the distribution f is approximated by averaging w over n samples of x.<br />
<br />
<br />
===='''General Procedure'''====<br />
<br />
i) Draw n samples <math> x_i \sim~ U[a,b] </math><br />
<br />
ii) Compute <math> \ w(x_i) </math> for every sample<br />
<br />
iii) Obtain an estimate of the integral, <math> \hat{I} </math>, as follows:<br />
<br />
<math> \hat{I} = \frac{1}{n} \sum_{i=1}^{n} w(x_i)</math>. Clearly, this is just the average of the simulation results.<br />
<br />
By the strong law of large numbers <math> \hat{I} </math> converges to <math> \ I </math> as <math> \ n \rightarrow \infty </math>. Because of this, we can compute all sorts of useful information, such as variance, standard error, and confidence intervals.<br />
<br />
Standard Error: <math> SE = \sqrt{V/n} </math><br />
<br />
Variance: <math> V = \frac{\sum_{i=1}^{n} (w(x_i)-\hat{I})^2}{n-1} </math><br />
<br />
Confidence Interval: <math> \hat{I} \pm t_{\alpha/2}\, SE </math><br />
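<br />
These quantities can be computed directly from the simulated values. The following is a minimal sketch, assuming w holds the simulated values <math> w(x_i) </math> (here taken from the <math> x^3 </math> example used below, where b - a = 1) and using the normal quantile 1.96 in place of <math> t_{\alpha/2} </math> for an approximate 95% interval.<br />
<br />
<pre><br />
w = rand(1,10000).^3;                    % w(x) = h(x)(b-a) = x^3 on [0,1]<br />
n = length(w);<br />
<br />
I_hat = mean(w);                         % Monte Carlo estimate<br />
V     = sum((w - I_hat).^2)/(n-1);       % sample variance (same as var(w))<br />
SE    = sqrt(V/n);                       % standard error<br />
CI    = [I_hat - 1.96*SE, I_hat + 1.96*SE]   % approximate 95% confidence interval<br />
</pre><br />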
<br />
==='''Example: Uniform Distribution'''===<br />
<br />
Consider the integral, <math> \int_{0}^{1} x^3dx </math>, which is easily solved through standard analytical integration methods, and is equal to .25. Now, let us check this answer with a numerical approximation using Monte Carlo Integration. <br />
<br />
We generate a 1 by 10000 vector of uniform (on the interval [0,1]) random variables and call that vector 'u'. We see that our 'w' in this case is <math> x^3 </math>, so we set <math> w = u^3 </math>. Our <math> \hat{I} </math> is equal to the mean of w.<br />
<br />
In Matlab, we can solve this integration problem with the following code:<br />
<br />
<pre><br />
u = rand(1,10000);<br />
w = u.^3;<br />
mean(w)<br />
ans = 0.2475<br />
</pre><br />
<br />
Note the '.' after 'u' in the second line of code, indicating that each entry in the matrix is cubed. Also, our approximation is close to the actual value of .25. Now let's try to get an even better approximation by generating more sample points. <br />
<br />
<pre><br />
u= rand(1,100000);<br />
w= u.^3;<br />
mean(w)<br />
ans = .2503<br />
</pre><br />
<br />
We see that when the number of sample points is increased, our approximation improves, as one would expect.<br />
<br />
==='''Generalization'''===<br />
<br />
Up to this point we have seen how to numerically approximate an integral when the distribution of f is uniform. Now we will see how to generalize this to other distributions.<br />
<br />
<math> I = \int h(x)f(x)dx </math> <br />
<br />
If f is a probability density function (pdf), then <math> I </math> is the expectation E<sub>f</sub>[h(x)], which we estimate by averaging h over samples drawn from f. Our previous example is the case where f is the uniform distribution on [a,b].<br />
<br />
'''Procedure for the General Case'''<br />
<br />
i) Draw n samples from f <br />
<br />
ii) Compute h(x<sub>i</sub>)<br />
<br />
iii) <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)</math><br />
<br />
==='''Example: Exponential Distribution'''===<br />
<br />
Find <math> E[\sqrt{x}] </math> for <math> \displaystyle f = e^{-x} </math>, which is the exponential distribution with mean 1.<br />
<br />
<math> I = \int_{0}^{\infty} \sqrt{x} e^{-x}dx </math><br />
<br />
We can see that we must draw samples from f, the exponential distribution.<br />
<br />
To find a numerical solution using Monte Carlo Integration we see that: <br />
<br />
u= rand(1,10000)<br />
X= -log(u)<br />
h= <math> \sqrt{X} </math> <br />
I= mean(h)<br />
<br />
To implement this procedure in Matlab, use the following code:<br />
<br />
<pre><br />
u = rand(1,10000);<br />
X = -log(u);<br />
h = X.^.5;<br />
mean(h)<br />
ans = .8841<br />
</pre><br />
<br />
An easy way to check whether your approximation is correct is to use the built in Matlab function 'quadl' which takes a function and bounds for the integral and returns a solution for the definite integral of that function. For this specific example, we can enter:<br />
<br />
<pre><br />
f = @(x) sqrt(x).*exp(-x);<br />
% quadl runs into computational problems when the upper bound is "inf" or an extremely large number, <br />
% so choose just a moderately large number.<br />
quadl(f,0,100)<br />
ans =<br />
0.8862<br />
</pre><br />
<br />
From the above result, we see that our approximation was quite close.<br />
<br />
==='''Example: Normal Distribution'''===<br />
<br />
Let <math> f(x) = (1/(2 \pi)^{1/2}) e^{(-x^2)/2} </math>. Compute the cumulative distribution function at some point x.<br />
<br />
<math> F(x)= \int_{-\infty}^{x} f(s)ds = \int_{-\infty}^{x}(1)(1/(2 \pi)^{1/2}) e^{(-s^2)/2}ds </math>. The (1) is inserted to illustrate that our h(s) acts as the indicator of the event s < x: any sampled values greater than x are set to zero, while values less than x count as one. <br />
<br />
This is the Matlab code for solving F(2):<br />
<br />
<pre><br />
<br />
u = randn(1,10000);<br />
h = u < 2;<br />
mean(h)<br />
ans = .9756<br />
<br />
</pre><br />
<br />
We generate a 1 by 10000 vector of standard normal random variables and we return a value of 1 if u is less than 2, and 0 otherwise.<br />
<br />
We can also build the function F(x) in matlab in the following way:<br />
<br />
<pre><br />
function y = F(x)<br />
u = randn(1,1000000);<br />
h = u < x;<br />
y = mean(h);<br />
</pre><br />
<br />
<br />
==='''Example: Binomial Distribution'''===<br />
<br />
In this example we will see the Bayesian Inference for 2 Binomial Distributions.<br />
<br />
Let <math> X \sim Bin(n,p) </math> and <math> Y \sim Bin(m,q) </math>, and let <math> \!\delta = p-q </math>.<br />
<br />
The frequentist estimate is <math> \displaystyle \hat{\delta} = x/n - y/m </math>.<br />
<br />
Bayesian wants <math> \displaystyle f(p,q|x,y) = f(x,y|p,q)f(p,q)/f(x,y) </math>, where <math> f(x,y)=\iint\limits_{\!\theta} f(x,y|p,q)f(p,q)\,dp\,dq</math> is a constant.<br />
<br />
Thus, <math> \displaystyle f(p,q|x,y)\propto f(x,y|p,q)f(p,q) </math>. Now we assume that <math>\displaystyle f(p,q) = f(p)f(q) = 1 </math> and f(p) and f(q) are uniform.<br />
<br />
Therefore, <math> \displaystyle f(p,q|x,y)\propto p^x(1-p)^{n-x}q^y(1-q)^{m-y} </math>.<br />
<br />
<math> E[\delta] = \int_{0}^{1} \int_{0}^{1} (p-q)f(p,q|x,y)\,dp\,dq </math>.<br />
<br />
As you can see this is much tougher than the frequentist approach.<br />
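<br />
With the flat prior above, the posterior factors into <math> p|x \sim Beta(x+1, n-x+1) </math> and <math> q|y \sim Beta(y+1, m-y+1) </math>, so <math> E[\delta] </math> can be estimated by averaging p - q over posterior draws. The following is a minimal MATLAB sketch, assuming the Statistics Toolbox function betarnd is available; the data values x = 50, y = 80, n = m = 100 are purely illustrative.<br />
<br />
<pre><br />
n = 100; x = 50;        % observed successes for X ~ Bin(n,p)  (illustrative)<br />
m = 100; y = 80;        % observed successes for Y ~ Bin(m,q)  (illustrative)<br />
<br />
p = betarnd(x+1, n-x+1, 10000, 1);   % posterior draws of p under the flat prior<br />
q = betarnd(y+1, m-y+1, 10000, 1);   % posterior draws of q under the flat prior<br />
<br />
delta = p - q;<br />
mean(delta)             % Monte Carlo estimate of E[delta | x, y]<br />
hist(delta)<br />
</pre><br />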
<br />
=='''Importance Sampling and Basic Monte Carlo Integration - October 11th, 2011'''==<br />
<br />
==='''Example: Binomial Distribution (Continued)'''===<br />
<br />
Suppose we are given two independent Binomial Distributions <math>\displaystyle X \sim Bin(n, p_1)</math>, <math>\displaystyle Y \sim Bin(m, p_2)</math>. We would like to give a Monte Carlo estimate of <math>\displaystyle \delta = p_1 - p_2</math><br><br />
<br />
Frequentist approach: <br><br><math>\displaystyle \hat{p_1} = \frac{X}{n}</math> ; <math>\displaystyle \hat{p_2} = \frac{Y}{m}</math><br><br><math>\displaystyle \hat{\delta} = \hat{p_1} - \hat{p_2} = \frac{X}{n} - \frac{Y}{m}</math><br><br><br />
Bayesian approach to compute the expected value of <math>\displaystyle \delta</math>:<br><br><br />
<math>\displaystyle E(\delta) = \int\int(p_1-p_2) f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Assume that <math>\displaystyle n = 100, m = 100, p_1 = 0.5, p_2 = 0.8</math> and the sample size is 1000.<br><br />
MATLAB code of the above example:<br />
<pre><br />
n = 100;<br />
m = 100;<br />
p_1 = 0.5;<br />
p_2 = 0.8;<br />
p1 = mean(rand(n,1000)<p_1);<br />
p2 = mean(rand(m,1000)<p_2);<br />
delta = p2 - p1;<br />
hist(delta)<br />
mean(delta)<br />
</pre><br />
<br />
In one execution of the code, the mean of delta was 0.3017. The histogram of delta generated was:<br />
[[File:Hist delta.jpg|center|]]<br />
<br />
Through Monte Carlo simulation, we can obtain an empirical distribution of delta and carry out inference on the data obtained, such as computing the mean, maximum, variance, standard deviation and the standard error of delta.<br />
<br />
==='''Importance Sampling'''===<br />
<br />
====Motivation====<br />
<br />
Consider the integral <math>\displaystyle I = \int h(x)f(x)\,dx</math><br><br><br />
According to basic Monte Carlo Integration, if we can sample from the probability density function <math>\displaystyle f(x)</math> and feed the samples of <math>\displaystyle f(x)</math> back to <math>\displaystyle h(x)</math>, <math>\displaystyle I</math> can be estimated as an average of <math>\displaystyle h(x)</math> ( i.e. <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)</math> )<br><br />
However, the Monte Carlo method works when we know how to sample from <math>\displaystyle f(x)</math>. In the case where it is difficult to sample from <math>\displaystyle f(x)</math>, importance sampling is a technique that we can apply. Importance Sampling relies on another function <math>\displaystyle g(x)</math> which we know how to sample from.<br />
<br />
The above integral can be rewritten as follow:<br><br />
<math>\begin{align}<br />
\displaystyle I & = \int h(x)f(x)\,dx \\<br />
& = \int h(x)f(x)\frac{g(x)}{g(x)}\,dx \\<br />
& = \int \frac{h(x)f(x)}{g(x)}g(x)\,dx \\<br />
& = \int y(x)g(x)\,dx \\<br />
& = E_g(y(x)) \\<br />
\end{align}<br />
</math><br><br />
where <math>y(x) = \frac{h(x)f(x)}{g(x)}</math><br><br />
<br />
The integral can thus be simulated as <math>\displaystyle \hat{I} = \frac{1}{n} \sum_{i=1}^{n} Y_i</math>, where <math>Y_i = \frac{h(x_i)f(x_i)}{g(x_i)}</math><br><br />
<br />
====Procedure====<br />
<br />
Suppose we know how to sample from <math>\displaystyle g(x)</math><br><br />
#Choose a suitable <math>\displaystyle g(x)</math> and draw n samples <math>x_1,x_2....,x_n \sim~ g(x)</math><br />
#Set <math>Y_i =\frac{h(x_i)f(x_i)}{g(x_i)}</math><br />
#Compute <math> \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i </math><br><br />
<br />
By the Law of large numbers, <math>\displaystyle \hat{I} \rightarrow I </math> provided that the sample size n is large enough.<br><br><br />
<br />
'''Remarks:''' One can think of <math>\frac{f(x)}{g(x)}</math> as a weight to <math>\displaystyle h(x)</math> in the computation of <math>\hat{I}</math><br><br><br />
<math>\displaystyle i.e. \ \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n}\sum_{i=1}^{n} (\frac{f(x_i)}{g(x_i)})h(x_i)</math><br><br><br />
Therefore, <math>\displaystyle \hat{I} </math> is a weighted average of <math>\displaystyle h(x_i)</math><br><br><br />
<br />
====Problem====<br />
<br />
If <math>\displaystyle g(x)</math> is not chosen appropriately, then the variance of the estimate <math>\hat{I}</math> may be very large. Here we face a similar problem as with the acceptance/rejection approach. Consider the second moment of <math>\displaystyle y(x)</math> under g:<br><br><br />
<math>\begin{align}<br />
\displaystyle E_g[(y(x))^2] & = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx \\<br />
& = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx \\<br />
& = \int \frac{h^2(x)f^2(x)}{g(x)} dx \\<br />
\end{align}<br />
</math><br><br><br />
<br />
When <math>\displaystyle g(x)</math> is very small, then the above integral could be very large, hence the variance can be very large when g is not chosen appropriately. This occurs when <math>\displaystyle g(x)</math> has a thinner tail than <math>\displaystyle f(x)</math> such that the quantity <math>\displaystyle \frac{h^2(x)f^2(x)}{g(x)}</math> is large.<br />
<br />
'''Remarks:''' <br />
<br />
1. We can actually compute the form of <math>\displaystyle g(x)</math> to have optimal variance. <br>Mathematically, it is to find <math>\displaystyle g(x)</math> subject to <math>\displaystyle \min_g [\ E_g([y(x)]^2) - (E_g[y(x)])^2\ ]</math><br><br />
It can be shown that the optimal <math>\displaystyle g(x)</math> is <math>\displaystyle \frac{|h(x)|f(x)}{\int_{-\infty}^{\infty}|h(s)|f(s)ds}</math>. Using the optimal <math>\displaystyle g(x)</math> will minimize the variance of the estimate in Importance Sampling. This is of theoretical interest but not useful in practice: to write down this optimal g(x) we would first need the value of the integral, which is what we wanted to compute in the first place.<br />
<br />
2. In practice, we shall choose <math>\displaystyle g(x)</math> which has similar shape as <math>\displaystyle f(x)</math> but with a thicker tail than <math>\displaystyle f(x)</math> in order to avoid the problem mentioned above.<br><br />
<br />
====Example====<br />
<br />
Estimate <math>\displaystyle I = Pr(Z>3),\ where\ Z \sim N(0,1)</math><br><br><br />
'''Method 1: Basic Monte Carlo'''<br />
<br />
<math>\begin{align} Pr(Z>3) & = \int^\infty_3 f(x)\,dx \\<br />
& = \int^\infty_{-\infty} h(x)f(x)\,dx \end{align}</math><br /><br />
<math> where \ <br />
h(x) = \begin{cases}<br />
0, & \text{if } x \le 3 \\<br />
1, & \text{if } x > 3<br />
\end{cases}</math><br />
<math>\ ,\ f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}</math><br />
<br />
MATLAB code to compute <math>\displaystyle I</math> from 100 samples of standard normal distribution:<br />
<pre><br />
h = randn(100,1) > 3;<br />
I = mean(h)<br />
</pre><br />
<br />
In one execution of the code, it returns a value of 0 for <math>\displaystyle I</math>, which differs significantly from the true value of <math>\displaystyle I \approx 0.0013 </math>. The problem with using Basic Monte Carlo in this example is that <math>\displaystyle Pr(Z>3)</math> is very small, so most of the points sampled from the standard normal distribution are wasted. Therefore, although Basic Monte Carlo is a feasible method to compute <math>\displaystyle I</math>, it gives a poor estimate here.<br />
<br />
'''Method 2: Importance Sampling'''<br />
<br />
<math>\displaystyle I = Pr(Z>3)= \int^\infty_3 f(x)\,dx </math><br><br />
<br />
To apply importance sampling, we have to choose a <math>\displaystyle g(x)</math> which we will sample from. In this example, we can choose <math>\displaystyle g(x)</math> to be the probability density function of exponential distribution, normal distribution with mean 0 and variance greater than 1 or normal distribution with mean greater than 0 and variance 1 etc.. For the following, we take <math>\displaystyle g(x)</math> to be the pdf of <math>\displaystyle N(4,1)</math>.<br><br />
<br />
Procedure:<br />
#Draw n samples <math>x_1,x_2....,x_n \sim~ g(x)</math><br />
#Calculate <math>\begin{align} \frac{f(x)}{g(x)} & = \frac{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}<br />
}{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x-4)^2} } \\<br />
& = e^{8-4x} \end{align} </math><br><br />
#Set <math> Y_i = h(x_i)e^{8-4x_i}\ with\ h(x) = \begin{cases}<br />
0, & \text{if } x \le 3 \\<br />
1, & \text{if } x > 3<br />
\end{cases}<br />
</math><br><br />
#Compute <math> \hat{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i </math><br><br />
<br />
The above procedure, with 100 samples of <math>\displaystyle g(x)</math>, can be implemented in MATLAB as follows:<br />
<pre><br />
for ii = 1:100<br />
x = randn + 4 ;<br />
h = x > 3 ;<br />
y(ii) = h * exp(8-4*x) ;<br />
end<br />
mean(y)<br />
</pre><br />
<br />
In one execution of the code, it returns a value of 0.001271 for <math> \hat{Y} </math>, which is much closer to the true value of <math>\displaystyle I \approx 0.0013 </math>. From many executions of the code, the variance of basic monte carlo is approximately 150 times that of importance sampling. This demonstrates that this method can provide a better estimate than the Basic Monte Carlo method.<br />
<br />
==''' Importance Sampling with Normalized Weight and Markov Chain Monte Carlo - October 13th, 2011'''==<br />
==='''Importance Sampling with Normalized Weight'''===<br />
<br />
Recall that we can think of <math>\displaystyle b(x) = \frac{f(x)}{g(x)}</math> as a weight applied to the samples <math>\displaystyle h(x)</math>. If the form of <math>\displaystyle f(x)</math> is known only up to a constant, we can use an alternate, normalized form of the weight, <math>\displaystyle b^*(x)</math>. (This situation arises in Bayesian inference.) Importance sampling with normalized or standard weight is also called indirect importance sampling.<br />
<br />
We derive the normalized weight as follows:<br><br />
<math>\begin{align}<br />
\displaystyle I & = \int h(x)f(x)\,dx \\<br />
&= \int h(x)\frac{f(x)}{g(x)}g(x)\,dx \\<br />
&= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int f(x) dx} \\<br />
&= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int \frac{f(x)}{g(x)}g(x) dx} \\<br />
&= \frac{\int h(x)b(x)g(x)\,dx}{\int\ b(x)g(x) dx} <br />
\end{align}</math><br />
<br />
<math>\hat{I}= \frac{\sum_{i=1}^{n} h(x_i)b(x_i)}{\sum_{i=1}^{n} b(x_i)} </math><br />
<br />
Then, the normalized weight is <math>b^*(x) = \displaystyle \frac{b(x_i)}{\sum_{i=1}^{n} b(x_i)}</math><br />
<br />
Note that <math> \int f(x) dx = \int b(x)g(x) dx = 1 </math><br />
<br />
We can also determine the associated Monte Carlo variance of this estimate by<br />
<br />
<math> Var(\hat{I})= \frac{\sum_{i=1}^{n} b(x_i)(h(x_i) - \hat{I})^2}{\sum_{i=1}^{n} b(x_i)} </math><br />
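<br />
A minimal MATLAB sketch of importance sampling with normalized weights, under illustrative assumptions: the target is <math> f(x) \propto e^{-x^2/2} </math> (a standard normal known only up to its normalizing constant), <math> h(x) = x^2 </math>, and <math> g(x) </math> is the N(0, 4) density. The estimate should be close to <math> E[X^2] = 1 </math>.<br />
<br />
<pre><br />
n = 10000;<br />
x = 2*randn(n,1);                     % draws from g = N(0, 2^2)<br />
f_un = exp(-x.^2/2);                  % target density, known only up to a constant<br />
g    = exp(-x.^2/8)/(2*sqrt(2*pi));   % N(0,4) density<br />
b    = f_un ./ g;                     % unnormalized weights b(x) = f(x)/g(x)<br />
h    = x.^2;<br />
<br />
I_hat = sum(h.*b)/sum(b)              % normalized-weight estimate of E[h(X)], approx. 1<br />
</pre><br />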
<br />
==='''Markov Chain Monte Carlo'''===<br />
We still want to solve <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
====Stochastic Process====<br />
A stochastic process <math> \{ x_t : t \in T \}</math> is a collection of random variables. The variables <math>\displaystyle x_t</math> take values in some set <math>\displaystyle X</math> called the '''state space.''' The set <math>\displaystyle T</math> is called the '''index set.'''<br />
<br />
====Markov Chain====<br />
A Markov Chain is a stochastic process for which the distribution of <math>\displaystyle x_t</math> depends only on <math>\displaystyle x_{t-1}</math>. It is a random process characterized as being memoryless; meaning that the next occurrence of a defined event is only dependent on the current event and not on the sequence of events that preceded it. <br />
Formal Definition: The process <math> \{ x_t : t \in T \}</math> is a Markov Chain if <math>\displaystyle Pr(x_t|x_0, x_1,..., x_{t-1})= Pr(x_t|x_{t-1})</math> for all <math> \{t \in T \}</math> and for all <math> \{x \in X \}</math><br />
For a Markov Chain, <math>\displaystyle f(x_1,...x_n)= f(x_1)f(x_2|x_1)f(x_3|x_2)...f(x_n|x_{n-1})</math><br />
<br><br>Real Life Example:<br />
<br>When going for an interview, the employer only looks at your highest education achieved. The employer does not look at the previous education received (elementary school, high school, etc.) because the employer believes that the highest education achieved summarizes the previous ones. Therefore, anything before your most recent education is irrelevant. In other words, <math> x_t </math> is regarded as a summary of <math>x_{t-1},...,x_2,x_1</math>, so to determine <math>x_{t+1}</math> we only need to look at <math>x_{t}</math>.<br />
<br />
====Transition Probabilities====<br />
A Transition Probability is the probability of jumping from one state to another state.<br />
Formal Definition: We call <math>\displaystyle P_{ij} = Pr(x_{t+1}=j|x_t=i)</math> the transition probability.<br />
That is, P(i,j) is the probability of going to state j from state i. The matrix P whose (i,j) element is <math>\displaystyle P_{ij}</math> is called the Transition Matrix.<br />
<br />
Properties of P: <br />
:1) <math>\displaystyle P_{ij} \geq 0</math> The probability of going to another state cannot be negative<br />
:2) <math>\displaystyle \sum_{j}P_{ij} = 1</math> The probability of going from state i to some state (including remaining in state i) is a certainty, so each row sums to one<br />
<br />
====Random Walk====<br />
Example: Start at one point and flip a coin where <math>\displaystyle Pr(H)=p</math> and <math>\displaystyle Pr(T)=1-p=q</math>. Take one step right if heads and one step left if tails. If at an endpoint, stay there.<br />
The transition matrix is<br />
<math>P=\left(\begin{matrix}1&0&0&\dots&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\<br />
\vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&\dots&\dots&1<br />
\end{matrix}\right)</math><br />
<br />
Let <math>\displaystyle P_n</math> be the matrix such that its (i,j) element is <math>\displaystyle P_{ij}(n)</math>. This is called n-step probability.<br />
<br />
:<math>\displaystyle P_n = P^n</math><br />
:<math>\displaystyle P_1 = P</math><br />
:<math>\displaystyle P_2 = P^2</math><br />
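As a quick illustration (a sketch with arbitrary choices of 6 states and <math>\,p=0.3</math>), the random-walk transition matrix can be built and checked in Matlab:<br />
<br />
<pre><br />
p = 0.3; q = 1 - p; N = 6;<br />
P = zeros(N);<br />
P(1,1) = 1; P(N,N) = 1;              % the two endpoints are absorbing<br />
for i = 2:N-1<br />
    P(i,i-1) = q;                    % step left with probability q<br />
    P(i,i+1) = p;                    % step right with probability p<br />
end<br />
P2 = P^2;                            % 2-step transition probabilities, P_2 = P^2<br />
disp(sum(P2,2)')                     % each row still sums to 1<br />
</pre><br />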
<br />
<br />
==''' Markov Chain Properties and Page Rank - October 18th, 2011'''==<br />
<br />
===Summary of Terminology===<br />
<br />
====Transition Matrix====<br />
<br />
A matrix <math>\!P</math> that defines a Markov Chain has the form:<br />
<br />
<math>P = \begin{bmatrix}<br />
P_{11} & \cdots & P_{1N} \\<br />
\vdots & \ddots & \vdots \\ <br />
P_{N1} & \cdots & P_{NN}<br />
\end{bmatrix}</math><br />
<br />
where <math>\!P(i,j) = P_{ij} = Pr(x_{t+1} = j | x_t = i) </math> is the probability of transitioning from state i to state j in the Markov Chain in a single step. Note that this implies that all rows add up to one.<br />
<br />
====n-step Transition matrix====<br />
<br />
A matrix <math>\!P_n</math> whose (i,j)<sup>th</sup> entry is the probability of moving from state i to state j after n transitions:<br />
<br />
<math>\!P_n(i,j) = Pr(x_{m+n}=j|x_m = i)</math><br />
<br />
This probability is called the n-step transition probability. A nice property of this matrix is that<br />
<br />
<math>\!P_n = P^n</math><br />
<br />
for all <math>n \ge 0</math>, where P is the transition matrix. Note that the rows of <math>P_n</math> should still add up to one.<br />
<br />
====Marginal distribution of a Markov Chain====<br />
<br />
We represent the state at time t as a vector.<br />
<br />
<math>\mu_t = (\mu_t(1) \; \mu_t(2) \; ... \; \mu_t(n))</math><br />
<br />
Consider this Markov Chain:<br />
<br />
[[File:MarkovSample.png|300px]]<br />
<br />
<math>\mu_t = (A \; B)</math>, where A is the probability of being in state a at time t, and B is the probability of being in state b at time t.<br />
<br />
For example if <math>\mu_t = (0.1 \; 0.9)</math>, we have a 10% chance of being in state a at time t, and a 90% chance of being in state b at time t.<br />
<br />
Suppose we run this Markov chain many times, and record the state at each step.<br />
<br />
In this example, we run 4 trials, up until t=5.<br />
<br />
{| class="wikitable"<br />
|-<br />
! t<br />
! Trial 1<br />
! Trial 2<br />
! Trial 3<br />
! Trial 4<br />
! Observed <math>\mu</math><br />
|-<br />
| 1<br />
| a<br />
| b<br />
| b<br />
| a<br />
| (0.5, 0.5)<br />
|-<br />
| 2<br />
| b<br />
| a<br />
| a<br />
| a<br />
| (0.75, 0.25)<br />
|-<br />
| 3<br />
| a<br />
| a<br />
| b<br />
| a<br />
| (0.75, 0.25)<br />
|-<br />
| 4<br />
| b<br />
| b<br />
| a<br />
| b<br />
| (0.25, 0.75)<br />
|-<br />
| 5<br />
| b<br />
| b<br />
| b<br />
| a<br />
| (0.25, 0.75)<br />
|}<br />
<br />
Imagine simulating the chain many times. If we collect all the outcomes at time t from all the chains, the histogram of this data would look like <math>\!\mu_t</math>.<br />
<br />
We can find the marginal probabilities as <math>\!\mu_n = \mu_0 P^n</math><br />
<br />
====Stationary Distribution====<br />
<br />
Let <math>\pi = (\pi_i \mid i \in \chi)</math> be a vector of non-negative numbers that sum to 1. (i.e. <math>\!\pi</math> is a pmf)<br />
<br />
If <math>\!\pi = \pi P</math>, then <math>\!\pi</math> is a stationary distribution, also known as an invariant distribution.<br />
<br />
====Limiting Distribution====<br />
<br />
A Markov chain has limiting distribution <math>\!\pi </math> if <math>\lim_{n \to \infty} P^n = \begin{bmatrix} \pi \\ \vdots \\ \pi \end{bmatrix}</math><br />
<br />
That is, <math>\!\pi_j = \lim_{n \to \infty}\left [ P^n \right ]_{ij}</math> exists and is independent of i.<br />
<br />
Here is an example:<br />
<br />
Suppose we want to find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/3&1/3&1/3\\<br />
1/4&3/4&0\\<br />
1/2&0&1/2<br />
\end{matrix}\right)</math><br />
<br />
We want to solve <math>\pi=\pi P</math> and we want <math>\displaystyle \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
<math>\displaystyle \pi_0 = 1/3\pi_0 + 1/4\pi_1 + 1/2\pi_2</math><br /><br />
<math>\displaystyle \pi_1 = 1/3\pi_0 + 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_2 = 1/3\pi_0 + 1/2\pi_2</math><br /><br />
<br />
Solving the system of equations, we get <br /> <br />
<math>\displaystyle \pi_1 = 4/3\pi_0</math><br /><br />
<math>\displaystyle \pi_2 = 2/3\pi_0</math><br /><br />
<br />
So using our condition above, we have <math>\displaystyle \pi_0 + 4/3\pi_0 + 2/3\pi_0 = 1</math> and by solving we get <math>\displaystyle \pi_0 = 1/3</math><br />
<br />
Using this in our system of equations, we obtain: <br /><br />
<math>\displaystyle \pi_1 = 4/9</math><br /><br />
<math>\displaystyle \pi_2 = 2/9</math><br />
<br />
Thus, the limiting distribution is <math>\displaystyle \pi = (1/3, 4/9, 2/9)</math><br />
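This can be checked numerically (a small sketch, not part of the original notes): raising P to a large power gives a matrix whose rows all approach <math>\displaystyle \pi</math>, and <math>\displaystyle \pi</math> is the left eigenvector of P with eigenvalue 1.<br />
<br />
<pre><br />
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];<br />
disp(P^50)                      % every row is approximately (1/3, 4/9, 2/9)<br />
[V, D] = eig(P');               % left eigenvectors of P<br />
[~, k] = max(real(diag(D)));    % pick the eigenvalue 1 (the largest one)<br />
pi_vec = V(:,k) / sum(V(:,k));  % normalize so the entries sum to 1<br />
disp(pi_vec')                   % 0.3333  0.4444  0.2222<br />
</pre><br />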
<br />
====Detailed Balance====<br />
<br />
<math>\!\pi</math> has the detailed balance property if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math><br />
<br />
'''Theorem'''<br />
<br />
If <math>\!\pi</math> satisfies detailed balance, then <math>\!\pi</math> is a stationary distribution.<br />
<br />
In other words, if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math>, then <math>\!\pi = \pi P</math><br />
<br />
'''Proof:''' <br />
<br />
<math>\!\pi P =<br />
\begin{bmatrix}\pi_1 & \pi_2 & \cdots & \pi_N\end{bmatrix} \begin{bmatrix}P_{11} & \cdots & P_{1N} \\ \vdots & \ddots & \vdots \\ P_{N1} & \cdots & P_{NN}\end{bmatrix}</math><br />
<br />
Observe that the j<sup>th</sup> element of <math>\!\pi P</math> is<br />
<br />
<math>\!\left [ \pi P \right ]_j = \pi_1 P_{1j} + \pi_2 P_{2j} + \dots + \pi_N P_{Nj}</math><br />
<br />
::<math>\! = \sum_{i=1}^N \pi_i P_{ij}</math><br />
<br />
::<math>\! = \sum_{i=1}^N P_{ji} \pi_j</math>, by the definition of detailed balance.<br />
<br />
::<math>\! = \pi_j \sum_{i=1}^N P_{ji}</math><br />
<br />
::<math>\! = \pi_j</math>, as the entries in each row of P sum to 1 (i.e. <math>\textstyle \sum_{i=1}^N P_{ji} = 1</math>).<br />
<br />
So <math>\!\pi = \pi P</math>.<br />
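As a small numerical check of the theorem (a hypothetical 3-state chain chosen only for illustration, not from the lecture):<br />
<br />
<pre><br />
P  = [0.5 0.25 0.25; 0.25 0.5 0.25; 0.25 0.25 0.5];   % a symmetric transition matrix<br />
pi_vec = [1/3 1/3 1/3];                                % candidate stationary distribution<br />
M = diag(pi_vec)*P;                                    % M(i,j) = pi_i * P_ij<br />
disp(M - M')                                           % all zeros, so detailed balance holds<br />
disp(pi_vec*P)                                         % equals pi_vec, so pi is stationary<br />
</pre><br />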
<br />
<br />
'''Example'''<br />
<br />
Find the marginal distribution of <br />
<br />
[[File:MarkovSample.png|300px]]<br />
<br />
Start by generating the matrix P.<br />
<br />
<math>\!P = \begin{pmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{pmatrix}</math><br />
<br />
We must assume some starting value for <math>\mu_0</math><br />
<br />
<math>\!\mu_0 = \begin{pmatrix} 0.1 & 0.9 \end{pmatrix}</math><br />
<br />
For t = 1, the marginal distribution is<br />
<br />
<math>\!\mu_1 = \mu_0 P</math><br />
<br />
Notice that this <math>\mu</math> converges. <br />
<br />
If you repeatedly run:<br />
<br />
<math>\!\mu_{i+1} = \mu_i P</math><br />
<br />
It converges to <math>\mu = \begin{pmatrix} 0.4286 & 0.5714 \end{pmatrix}</math><br />
<br />
This can be seen by running the following Matlab code:<br />
P = [0.2 0.8; 0.6 0.4];<br />
mu = [0.1 0.9]; <br />
while 1 <br />
mu_old = mu; <br />
mu = mu * P;<br />
if max(abs(mu_old - mu)) < 1e-9   % stop once the marginal distribution has converged<br />
disp(mu);<br />
break;<br />
end<br />
end<br />
<br />
Another way of looking at this question is to check whether the empirical pmf of the simulated chain converges:<br />
<br />
Let <math>\hat{p_n}(1)=\frac{1}{n}\sum_{k=1}^n I(X_k=1)</math> denote the estimator of the stationary probability of state 1 and <math>\hat{p_n}(2)=\frac{1}{n}\sum_{k=1}^n I(X_k=2)</math> denote the estimator of the stationary probability of state 2, where <math>\displaystyle I(X_k=1)</math> and <math>\displaystyle I(X_k=2)</math> are indicator variables that equal 1 if <math>X_k=1</math> (respectively <math>X_k=2</math>) and 0 otherwise.<br />
<br />
The Matlab code for this is<br />
<br />
n=1;<br />
if rand<0.1<br />
x(1)=1;<br />
else<br />
x(1)=0;<br />
end<br />
p1(1)=sum(x)/n;<br />
p2(1)=1-p1(1);<br />
for i=2:10000<br />
n=n+1;<br />
if (x(i-1)==1&rand<0.2)|(x(i-1)==0&rand<0.6)<br />
x(i)=1;<br />
else<br />
x(i)=0;<br />
end<br />
p1(i)=sum(x)/n;<br />
p2(i)=1-p1(i); <br />
end<br />
plot(p1,'red');<br />
hold on;<br />
plot(p2)<br />
<br />
The results can be easily seen from the graph below:<br />
<br />
[[File:Stationary distribution.png|300px]]<br />
<br />
Additionally, we can plot the marginal distribution as it converges without estimating it. The following Matlab code shows this:<br />
<br />
%transition matrix<br />
P=[0.2 0.8; 0.6 0.4];<br />
%mu at time 0<br />
mu=[0.1 0.9];<br />
%number of points for simulation<br />
n=20;<br />
for i=1:n<br />
mu_a(i)=mu(1);<br />
mu_b(i)=mu(2);<br />
mu=mu*P;<br />
end<br />
t=[1:n];<br />
plot(t, mu_a, t, mu_b);<br />
hleg1=legend('state a', 'state b');<br />
<br />
[[File:Marginal distribution convergence.png|300px]]<br />
<br />
Note that there are chains whose stationary distributions are not limiting distributions (the chain does not converge to the stationary distribution unless it is started there). An example of this is:<br />
<br />
<math>P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \mu_0 = \begin{pmatrix} 1/3 & 1/3 & 1/3 \end{pmatrix}</math><br />
<br />
<math>\!\mu_0</math> is a stationary distribution, so <math>\!\mu P</math> is the same for all iterations.<br />
<br />
But,<br />
<br />
<math>P^{1000} = P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \ne \begin{pmatrix} \mu \\ \mu \\ \mu \end{pmatrix}</math> (since <math>\!P^3 = I</math>, the powers of <math>\!P</math> cycle with period 3 and never converge).<br />
<br />
So <math>\!\mu</math> is not a limiting distribution. Also, if<br />
<br />
<math>\mu = \begin{pmatrix} 0.2 & 0.1 & 0.7 \end{pmatrix}</math><br />
<br />
Then <math>\!\mu = \mu P</math> does not converge.<br />
<br />
This can be observed through the following Matlab code.<br />
<br />
P = [0 1 0; 0 0 1; 1 0 0];   % the same P as above<br />
mu = [0.2 0.1 0.7]; <br />
for i= 1:4 <br />
mu = mu * P;<br />
disp(mu);<br />
end<br />
<br />
This outputs<br />
0.7000    0.2000    0.1000<br />
0.1000    0.7000    0.2000<br />
0.2000    0.1000    0.7000<br />
0.7000    0.2000    0.1000<br />
<br />
Note that <math>\!\mu_1 = \!\mu_4</math>, which indicates that <math>\!\mu</math> will cycle forever.<br />
<br />
This means that this chain has a stationary distribution, but is not limiting.<br />
<br />
===Page Rank===<br />
<br />
Page Rank was the original ranking algorithm used by Google's search engine to rank web pages.<ref><br />
http://ilpubs.stanford.edu:8090/422/<br />
</ref> The algorithm was created by the founders of Google, Larry Page and Sergey Brin as part of Page's PhD thesis. When a query is entered in a search engine, there are a set of web pages which are matched by this query, but this set of pages must be ordered by their "importance" in order to identify the most meaningful results first. Page Rank is an algorithm which assigns importance to every web page based on the links in each page.<br />
<br />
==== Intuition ====<br />
<br />
We can represent web pages by a set of nodes, where web links are represented as edges connecting these nodes. Based on our intuition, there are three main factors in deciding whether a web page is important or not.<br />
<br />
# A web page is important if many other pages point to it.<br />
# The more important a webpage is, the more weight is placed on its links.<br />
# The more outgoing links a webpage has, the less weight is placed on each of its links.<br />
<br />
====Modelling====<br />
<br />
We can model the set of links as an N-by-N matrix L, where N is the number of web pages we are interested in:<br />
<br />
<math>L_{ij} =<br />
\left\{<br />
\begin{array}{lr}<br />
1 : \text{if page j points to i}\\<br />
0 : \text{otherwise}<br />
\end{array}<br />
\right. <br />
</math><br />
<br />
<br />
<br />
The number of outgoing links from page j is<br />
<br />
<math>c_j = \sum_{i=1}^N L_{ij}</math><br />
<br />
For example, consider the following set of links between web pages:<br />
<br />
[[File:PageRank.png|250px]]<br />
<br />
According to the factors relating to importance of links, we can consider two possible rankings :<br />
<br />
<br />
<math>\displaystyle 3 > 2 > 1 > 4 </math> <br />
<br />
or<br />
<br />
<math>\displaystyle 3>1>2>4 </math> <br />
if we consider that the high importance of the link from page 3 to page 1 outweighs the fact that page 1 has two outgoing links while page 2 has only one.<br />
<br />
<br />
We have <math>L = \begin{bmatrix} <br />
0 & 0 & 1 & 0 \\ <br />
1 & 0 & 0 & 0 \\ <br />
1 & 1 & 0 & 1 \\<br />
0 & 0 & 0 & 0<br />
\end{bmatrix}</math>, and <math>c = \begin{pmatrix}2 & 1 & 1 & 1\end{pmatrix} </math><br />
<br />
We can represent the ranks of web pages as the vector P, where the i<sup>th</sup> element is the rank of page i:<br />
<br />
<math>P_i = (1-d) + d\sum_j \frac{L_{ij}}{c_j} P_j</math><br />
<br />
Here we take the sum of the weights of the incoming links, where links are reduced in weight if the linking page has a lot of outgoing links, and links are increased in weight if the linking page has a lot of incoming links. <br />
<br />
We don't want to completely ignore pages with no incoming links, which is why we add the constant (1 - d).<br />
<br />
If <br />
<br />
<math>L = \begin{bmatrix} L_{11} & \cdots & L_{1N} \\<br />
\vdots & \ddots & \vdots \\<br />
L_{N1} & \cdots & L_{NN} \end{bmatrix}</math><br />
<br />
<math>D = \begin{bmatrix} c_1 & \cdots & 0 \\<br />
\vdots & \ddots & \vdots \\<br />
0 & \cdots & c_N \end{bmatrix}</math><br />
<br />
Then <math>D^{-1} = \begin{bmatrix} c_1^{-1} & \cdots & 0 \\<br />
\vdots & \ddots & \vdots \\<br />
0 & \cdots & c_N^{-1} \end{bmatrix}</math><br />
<br />
<math>\!P = (1-d)e + dLD^{-1}P</math><br />
<br />
where <math>\!e = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}</math> is the vector with all 1's<br />
<br />
To simplify the problem, we let <math>\!e^T P = N \Rightarrow \frac{e^T P}{N} = 1</math>. This means that the average importance of all pages on the internet is 1.<br />
<br />
Then<br />
<math>\!P = (1-d)\frac{ee^TP}{N} + dLD^{-1}P</math><br />
::<math>\! = \left [ (1-d)\frac{ee^T}{N} + dLD^{-1} \right ] P</math><br />
::<math>\! = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ] P</math>, where <math> E </math> is an NxN matrix filled with ones.<br />
<br />
Let <math>\!A = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ]</math><br />
<br />
Then <math>\!P = AP</math>.<br />
<br />
<br />
Note that P is a stationary distribution and, more importantly, P is an eigenvector of A with eigenvalue 1. Therefore, we can find the ranks of all web pages by solving this equation for P. <br />
<br />
We can find the vector P for the example above, using the following Matlab code:<br />
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];<br />
D = [2 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1];<br />
d = 0.8 ;% pages with no links get a weight of 0.2<br />
N = 4 ;<br />
<br />
A = ((1-d)/N) * ones(N) + d * L * inv(D);<br />
[EigenVectors, EigenValues] = eigs(A)<br />
s=sum(EigenVectors(:,1));% we should note that the average entry of P should be 1 according to our assumption<br />
P=(EigenVectors(:,1))/s*N<br />
<br />
This outputs:<br />
<br />
EigenVectors =<br />
-0.6363 0.7071 0.7071 -0.0000 <br />
-0.3421 -0.3536 + 0.3536i -0.3536 - 0.3536i -0.7071 <br />
-0.6859 -0.3536 - 0.3536i -0.3536 + 0.3536i 0.0000 <br />
-0.0876 0.0000 + 0.0000i 0.0000 - 0.0000i 0.7071 <br />
<br />
<br />
EigenValues =<br />
1.0000 0 0 0 <br />
0 -0.4000 - 0.4000i 0 0 <br />
0 0 -0.4000 + 0.4000i 0 <br />
0 0 0 0.0000 <br />
<br />
P =<br />
<br />
1.4528<br />
0.7811<br />
1.5660<br />
0.2000<br />
<br />
Note that there is an eigenvector with eigenvalue 1. <br />
The reason why an eigenvalue of 1 always exists is that A is a stochastic matrix (each of its columns sums to 1, provided every page has at least one outgoing link). <br />
<br />
Thus our vector P is <math> <br />
\begin{bmatrix}1.4528 \\ 0.7811 \\ 1.5660\\ 0.2000 \end{bmatrix}</math><br />
<br />
However, this method is not practical, because there are simply too many web pages on the internet. So instead Google uses an iterative method (repeatedly applying A, i.e. power iteration) to approximate the eigenvector with eigenvalue 1.<br />
<br />
Note that page three has the rank with highest magnitude and page four has the rank with lowest magnitude, as expected.<br />
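A sketch of that iterative approach (power iteration) for the small example above: repeatedly apply A and rescale so that the average rank stays 1. Because the eigenvalue 1 is the largest in magnitude here, the iteration converges to the same vector P.<br />
<br />
<pre><br />
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];<br />
D = diag([2 1 1 1]);<br />
d = 0.8; N = 4;<br />
A = ((1-d)/N)*ones(N) + d*L/D;     % L/D is L*inv(D)<br />
P = ones(N,1);                     % start from uniform ranks<br />
for k = 1:100<br />
    P = A*P;<br />
    P = P/sum(P)*N;                % keep the average rank equal to 1<br />
end<br />
disp(P')                           % approximately (1.4528, 0.7811, 1.5660, 0.2000)<br />
</pre><br />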
<br />
==''' Markov Chain Monte Carlo - Metropolis-Hastings - October 25th, 2011'''==<br />
<br />
We want to find <math> \int h(x)f(x)\, \mathrm dx </math>, but we don't know how to sample from <math>\,f</math>.<br />
<br />
We have seen simple techniques before. This one is used in real life.<br />
It consists of finding a Markov Chain whose stationary distribution is <math>\,f</math>.<br />
<br />
==== Main procedure ====<br />
<br />
Let us suppose that <math>\,q(y|x)</math> is a friendly distribution: we can sample from this function.<br />
<br />
1. Initialize the chain with some starting value <math>\,x_{0}</math> and set <math>\,i=0</math>.<br />
<br />
2. Draw a point from <math>\,q(y|x)</math> i.e. <math>\,Y \backsim q(y|x_{i})</math>.<br />
<br />
3. Evaluate <math>\,r(x,y)=min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br />
<br />
<br />
4. Draw a point <math>\,U \backsim Unif[0,1]</math>.<br />
<br />
5. <math>\,x_{i+1}=\begin{cases}y & \text{ if } U<r \\x_{i} & \text{ otherwise } \end{cases} </math>.<br />
<br />
6. <math>\,i=i+1</math>. Go back to 2.<br />
<br />
==== Remark 1 ====<br />
<br />
A very common choice for <math>\,q(y|x)</math> is <math>\,N(y;x,b^{2})</math>, a normal distribution centered at the current point.<br />
<br />
Note : In this case <math>\,q(y|x)</math> is symmetric i.e. <math>\,q(y|x)=q(x|y)</math>.<br />
<br />
(Because <math>\,q(y|x)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math> and <math>\,(y-x)^{2}=(x-y)^{2}</math>).<br />
<br />
Thus we have <math>\,\frac{q(x|y)}{q(y|x)}=1</math>, which implies :<br />
<br />
<math>\,r(x,y)=min\left\{\frac{f(y)}{f(x)},1\right\}</math>.<br />
<br />
In general, if <math>\,q(x|y)</math> is symmetric then the algorithm is called Metropolis, in reference to the original algorithm (published in 1953)<ref>http://en.wikipedia.org/wiki/Equations_of_State_Calculations_by_Fast_Computing_Machines</ref>.<br />
<br />
<br />
<br />
====Remark 2====<br />
<br />
The value y is accepted if <math>\,u<min\left\{\frac{f(y)}{f(x)},1\right\}</math> so it is accepted with the probability <math>\,min\left\{\frac{f(y)}{f(x)},1\right\}</math>.<br />
<br />
Thus, if <math>\,f(y)>f(x)</math>, then <math>\,y</math> is always accepted.<br />
<br />
The higher that value of the pdf is in the vicinity of a point <math>\,y_1</math>, the more likely it is that a random variable will take on values around <math>\,y_1</math>. As a result it makes sense that we would want a high probability of acceptance for points generated near <math>\,y_1</math>.<br />
<br />
====Remark 3====<br />
<br />
One strength of the Metropolis-Hastings algorithm is that normalizing constants, which are often quite difficult to determine, can be cancelled out in the ratio <math> r </math>. For example, consider the case where we want to sample from the beta distribution, which has the pdf:<br />
<br />
<math><br />
\begin{align}<br />
f(x;\alpha,\beta)& = \frac{1}{\mathrm{B}(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\end{align}<br />
</math><br />
<br />
The beta function, ''B'', appears as a normalizing constant, but by construction of the method it cancels in the ratio <math> r </math> and never needs to be computed.<br />
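As a small sketch of this point (not from the lecture; Beta(2,5) and the Unif(0,1) independence proposal are arbitrary choices), we can sample from a Beta distribution using only the unnormalized density <math>x^{\alpha-1}(1-x)^{\beta-1}</math>; the constant <math>\mathrm{B}(\alpha,\beta)</math> cancels in <math> r </math>:<br />
<br />
<pre><br />
alpha = 2; beta = 5;<br />
fstar = @(x) x.^(alpha-1).*(1-x).^(beta-1);   % unnormalized Beta density on (0,1)<br />
x = zeros(1,10000);<br />
x(1) = rand;<br />
for i = 2:10000<br />
    y = rand;                                 % proposal q(y|x) = Unif(0,1); q(x|y) = q(y|x), so the q's cancel<br />
    r = min(fstar(y)/fstar(x(i-1)), 1);<br />
    if rand < r<br />
        x(i) = y;<br />
    else<br />
        x(i) = x(i-1);<br />
    end<br />
end<br />
hist(x(5000:end), 30)                         % histogram resembles the Beta(2,5) pdf<br />
</pre><br />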
<br />
====Example====<br />
<br />
<math>\,f(x)=\frac{1}{\pi}\frac{1}{1+x^{2}}</math> (the standard Cauchy density)<br />
<br />
Then, we have <math>\,f(x)\propto\frac{1}{1+x^{2}}</math>.<br />
<br />
And let us take <math>\,q(x|y)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math>.<br />
<br />
Then <math>\,q(x|y)</math> is symmetric.<br />
<br />
Therefore the ratio <math>\,r(x,y)</math> can be simplified.<br />
<br />
<br />
We get :<br />
<br />
<math>\,\begin{align}<br />
\displaystyle r(x,y) <br />
& =min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} \\<br />
& =min\left\{\frac{f(y)}{f(x)},1\right\} \\<br />
& =min\left\{ \frac{ \frac{1}{1+y^{2}} }{ \frac{1}{1+x^{2}} },1\right\}\\<br />
& =min\left\{ \frac{1+x^{2}}{1+y^{2}},1\right\}\\<br />
\end{align}<br />
</math>.<br />
<br />
<br />
<br />
The Matlab code of the algorithm is the following :<br />
<br />
<pre><br />
clear all<br />
close all<br />
clc<br />
b=2;<br />
x(1)=randn;<br />
for i=2:10000<br />
y=b*randn+x(i-1);<br />
r=min((1+x(i-1)^2)/(1+y^2),1);<br />
u=rand;<br />
if u<r<br />
x(i)=y;<br />
else<br />
x(i)=x(i-1);<br />
end<br />
<br />
end<br />
hist(x(5000:end));<br />
%The Markov Chain usually takes some time to converge; this is known as the "burn-in" period.<br />
%Therefore, we don't display the first 5000 points because they don't show the limiting behaviour of the Markov Chain.<br />
</pre><br />
<br />
As we can see, the choice of the value of b is made by us.<br />
<br />
Changing this value has a significant impact on the results we obtain. There is a pitfall when b is too big or too small.<br />
<br />
Example with <math>\,b=0.1</math> (the second graph shows the trace obtained by running j=5000:10000; plot(j,x(5000:10000))):<br />
<br />
[[File:redaccoursb01.JPG|300px]] [[File:001Metr.PNG|300px]]<br />
<br />
With <math>\,b=0.1</math>, the chain takes small steps so the chain doesn't explore enough of the sample space. It doesn't give an accurate report of the function we want to sample.<br />
<br />
<br />
<br />
Example with <math>\,b=10</math> :<br />
<br />
[[File:redaccoursb10.JPG|300px]] [[File:010metro.PNG|300px]]<br />
<br />
With <math>\,b=10</math>, proposed jumps tend to be far from the current point, so they are very unlikely to be accepted (i.e. <math>\,y</math> is rejected because <math>\ u>r </math>, and <math>\,x(i)=x(i-1)</math> most of the time); hence most sample points stay fairly close to the origin.<br />
A trace plot that resembles white noise (as in the case of <math>\,b=2</math>) indicates better sampling, since more of the sample space is explored and more proposals are accepted. For <math>\,b=0.1</math>, we have many small accepted jumps, but the chain moves so slowly that the stationary distribution is less obvious; whereas in the <math>\,b=10</math> case, many points remain around 0. Approximately 73% of the proposals were rejected, i.e. x(i) was set to x(i-1).<br />
<br />
<br />
Example with <math>\,b=2</math> :<br />
<br />
[[File:redaccoursb2.JPG|300px]] [[File:100metr.PNG|300px]]<br />
<br />
With <math>\,b=2</math>, we get a more accurate result as we avoid both extremes. Approximately 37% of the proposals were rejected, i.e. x(i) was set to x(i-1).<br />
<br />
<br />
If the sample from the Markov Chain starts to look like the target distribution quickly, we say the chain is mixing well.<br />
<br />
==''' Theory and Applications of Metropolis-Hastings - October 27th, 2011'''==<br />
<br />
As mentioned in the previous section, the idea of the Metropolis-Hastings (MH) algorithm is to produce a Markov chain that converges to a stationary distribution <math>f</math> which we are interested in sampling from.<br />
<br />
====Convergence====<br />
<br />
One important fact to check is that <math>\displaystyle f</math> is indeed a stationary distribution in the MH scheme. For this, we can appeal to the implications of the detailed balance property:<br />
<br />
Given a probability vector <math>\!\pi</math> and a transition matrix <math>\displaystyle P</math>, <math>\!\pi</math> has the detailed balance property if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math><br />
<br />
If <math>\!\pi</math> satisfies detailed balance, then it is a stationary distribution.<br />
<br />
The above definition applies to the case where the states are discrete. In the continuous case, <math>\displaystyle f</math> satisfies detailed balance if <math>\displaystyle f(x)p(x,y)=f(y)p(y,x)</math>, where <math>\displaystyle p(x,y)</math> and <math>\displaystyle p(y,x)</math> are the transition densities from x to y and from y to x respectively. If we can show that <math>\displaystyle f</math> has the detailed balance property, we can conclude that it is a stationary distribution, because <math>\int_y f(y)p(y,x)\,dy=\int_y f(x)p(x,y)\,dy=f(x)</math>.<br />
<br />
In the MH algorithm, we use a proposal distribution to generate y~<math>\displaystyle q(y|x)</math>, and accept y with probability <math>min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br />
<br />
Suppose, without loss of generality, that <math>\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)} \leq 1</math>. This implies that <math>\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)} \geq 1</math><br />
<br />
Let <math>\,r(x,y)</math> be the chance of accepting point y given that we are at point x.<br />
<br />
So <math>\,r(x,y) = min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} = \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)}</math><br />
<br />
Let <math>\,r(y,x)</math> be the chance of accepting point x given that we are at point y.<br />
<br />
So <math>\,r(y,x) = min\left\{\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)},1\right\} = 1</math><br />
<br />
<br />
<math>\,p(x,y)</math> is the probability of generating and accepting y, while at point x.<br />
<br />
So <math>\,p(x,y) = q(y|x)r(x,y) = q(y|x) \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)} = \frac{f(y)q(x|y)}{f(x)}</math><br />
<br />
<br />
<math>\,p(y,x)</math> is the probability of generating and accepting x, while at point y.<br />
<br />
So <math>\,p(y,x) = q(x|y)r(y,x) = q(x|y)</math><br />
<br />
<br />
<math>\,f(x)p(x,y) = f(x)\frac{f(y)q(x|y)}{f(x)} = f(y)q(x|y) = f(y)p(y,x)</math><br />
<br />
Thus, detailed balance holds.<br />
:i.e. <math>\,f(x)</math> is stationary distribution<br />
<br />
It can be shown (although not here) that <math>f</math> is a limiting distribution as well. Therefore, the MH algorithm generates a sequence whose distribution converges to <math>f</math>, the target.<br />
<br />
====Implementation====<br />
<br />
In the implementation of MH, the proposal distribution is commonly chosen to be symmetric, which simplifies the calculations and makes the algorithm more intuitively understandable. The MH algorithm can usually be regarded as a random walk along the distribution we want to sample from. Suppose we have a distribution <math>f</math>:<br />
<br />
[[File:Standard normal distribution.gif]]<br />
<br />
Suppose we start the walk at point <math>x</math>. The point <math>y_{1}</math> is in a denser region than <math>x</math>, therefore, the walk will always progress from <math>x</math> to <math>y_{1}</math>. On the other hand, <math>y_{2}</math> is in a less dense region, so it is not certain that the walk will progress from <math>x</math> to <math>y_{2}</math>. In terms of the MH algorithm:<br />
<br />
<math>r(x,y_{1})=min(\frac{f(y_{1})}{f(x)},1)=1</math> since <math>f(y_{1})>f(x)</math>. Thus, any generated value with a higher density will be accepted.<br />
<br />
<math>r(x,y_{2})=\frac{f(y_{2})}{f(x)}</math>. The lower the density of <math>y_{2}</math> is, the less chance it will have of being accepted.<br />
<br />
A certain class of proposal distributions can be written in the form:<br />
<br />
<math>\,y|x_i = x_i + \epsilon_i</math><br />
<br />
where the "step" <math>\,\epsilon_i</math> is drawn from a density <math>\,g</math> that depends only on the distance <math>\,|y-x_i|</math><br />
<br />
The proposal density therefore depends only on the distance between the current point and the next one (the size of the "step" being taken). These proposal distributions give the Markov chain its random walk nature. The normal distribution that we frequently use in our examples satisfies the above definition.<br />
<br />
In actual implementations of the MH algorithm, the proposal distribution needs to be chosen judiciously, because not all proposals will work well with all target distributions we want to sample from. Take a trimodal distribution for example:<br />
<br />
[[File:trimodal.jpg]]<br />
<br />
If we choose the proposal distribution to be a standard normal as we have done before, problems will arise. The low densities between the peaks mean that the MH algorithm will almost never move to points generated in these regions, so the chain gets stuck at one peak. One way to address this issue is to increase the variance, so that the steps will be large enough to cross the gaps. Of course, in this case, it would probably be beneficial to come up with a different proposal function. As a rule of thumb, such functions should result in an approximately 50% acceptance rate for generated points.<br />
<br />
====Simulated Annealing====<br />
<br />
Metropolis-Hastings is very useful in simulation methods for solving optimization problems. One such application is simulated annealing, which addresses the problem of minimizing a function <math>h(x)</math>. This method will not always produce the global solution, but it is intuitively simple and easy to implement.<br />
<br />
Consider <math>e^{\frac{-h(x)}{T}}</math>; maximizing this expression is equivalent to minimizing <math>h(x)</math>. Suppose <math>\mu</math> is the minimizer of <math>h</math> and <math>h(x)=(x-\mu)^2</math>; then the function to maximize is proportional to a Gaussian density <math>e^{-\frac{(x-\mu)^2}{T}}</math>. When many samples are taken from this distribution, their mean converges to the desired value <math>\mu</math>. The annealing comes into play by lowering T (the temperature) as the sampling progresses, making the distribution narrower. The steps of simulated annealing are outlined below:<br />
<br />
1. start with a random <math>x</math> and set T to a large number<br />
<br />
2. generate <math>y</math> from a proposal distribution <math>q(y|x)</math>, which should be symmetric<br />
<br />
3. accept <math>y</math> with probability <math>min(\frac{f(y)}{f(x)},1)</math><br />
<br />
4. decrease T, and then go to step 2<br />
<br />
The following plot and Matlab code illustrates the simulated annealing procedure as temperature ''T'', the variance, decreases for a Gaussian distribution with zero mean. Starting off with a large value for the temperature ''T'' allows the Metropolis-Hastings component of the procedure to capture the mean, before gradually decreasing the temperature ''T'' in order to converge to the mean. <br />
<br />
[[File:Simulated annealing illustration.png]]<br />
<br />
x=-10:0.1:10;<br />
mu=0;<br />
T=5;<br />
colour = ['b', 'g', 'm', 'r', 'k'];<br />
for i=1:5<br />
pdfNormal=normpdf(x, mu, T);<br />
plot(x, pdfNormal, colour(i));<br />
T=T-1;<br />
hold on<br />
end<br />
hleg1=legend('T=5', 'T=4', 'T=3', 'T=2', 'T=1');<br />
title('Simulated Annealing Illustration');<br />
<br />
=='''References'''==<br />
<br />
<references/><br />
<br />
=='''Simulated Annealing and Gibbs Sampling - November 1, 2011'''==<br />
<br />
continued from previous lecture...<br />
<br />
We will now look at a couple cases where <math> \displaystyle h(y) > h(x) </math> or <math> \displaystyle h(y) < h(x) </math>, and explore whether to accept or reject <math> y </math>.<br />
<br />
Recall <math>r(x,y)=min\left\{\frac{f(y)}{f(x)},1\right\}</math> where <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}}</math>, and r(x,y) represents the probability of accepting <math>y</math>.<br />
<br />
====Cases====<br />
<br />
Case a)<br />
Suppose <math> \displaystyle h(y) < h(x) </math>. Since we want to find the minimum value for <math>\displaystyle h(x) </math>, and the point <math>\displaystyle y </math> creates a lower value than our previous point, we accept the new point. Mathematically, <math>\displaystyle h(y) < h(x) </math> implies that:<br />
<br />
<math> \frac{f(y)}{f(x)} > 1 </math>. Therefore,<br />
<math> \displaystyle r = 1 </math>.<br />
So, we will always accept <math>\displaystyle y </math>.<br />
<br />
Case b)<br />
Suppose <math> \displaystyle h(y) > h(x) </math>. This is bad, since our goal is to minimize <math>\displaystyle h(x) </math>. However, we may still accept <math>\displaystyle y </math> with some chance:<br />
<br />
<math> \frac{f(y)}{f(x)} < 1 </math>. Therefore,<br />
<math>\displaystyle r < 1 </math>.<br />
So, we may accept <math>\displaystyle y </math> with probability <math>\displaystyle r </math>.<br />
<br />
<br />
Next, we will look at these cases as <math>\displaystyle T\to0 </math>.<br />
<br />
As <math>\displaystyle T\to0 </math> and case a) happens, <math> e^{\frac{h(x)-h(y)}{T}} </math> approaches infinity, so we will always accept <math>\displaystyle y </math>.<br />
<br />
As <math>\displaystyle T\to0 </math> and case b) happens, <math> e^{\frac{h(x)-h(y)}{T}} </math> approaches zero, so the probability that <math>\displaystyle y </math> will be accepted gets extremely small.<br />
<br />
It is worth noting that if we simply start with a small value of T, we may end up rejecting almost all of the generated points, and hence get stuck somewhere in the function (due to case b)); that point is typically only a local minimum of <math>h</math> (a local maximum of <math>e^{-h(x)/T}</math>), not the global minimum. It is therefore necessary to start with a large value of T in order to explore the whole function and to lower T gradually. A good starting value <math>x_0</math> also helps: it should not be too far from the global optimum. <br />
<br />
=====Example=====<br />
<br />
Let <math>\displaystyle h(x) = (x-2)^2 </math>.<br />
The graph of it is:<br />
[[File:PCh(x).jpg|center|500]]<br />
<br />
Then, <math> e^{\frac{-h(x)}{T}} = e^{\frac{-(x-2)^2}{T}} </math> . Take an initial value of T = 20. A graph of this is:<br />
[[File:PC-highT.jpg|center|500]]<br />
<br />
<br />
In comparison, we look a graph of T = 0.2:<br />
[[File:PC-lowT.jpg|center|500]]<br />
<br />
One can see that with a low T value, <math>e^{-h(x)/T}</math> is very sharply peaked, so the ratio <math>f(y)/f(x)</math> is close to 0 or very large almost everywhere and the acceptance probability r jumps between 0 and 1; a bigger T value gives a smoother surface and smoother transitions.<br />
<br />
The MATLAB code for the above graphs are:<br />
<pre><br />
ezplot('(x-2)^2',[-6,10])<br />
ezplot('exp((-(x-2)^2)/20)',[-6,10])<br />
ezplot('exp((-(x-2)^2)/0.2)',[-6,10])<br />
</pre><br />
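A rough Matlab sketch of simulated annealing for this example (the N(x,1) proposal, the geometric cooling schedule and the number of iterations are arbitrary choices, not from the lecture):<br />
<br />
<pre><br />
h = @(x) (x-2).^2;<br />
T = 20;                            % start with a large temperature, as discussed above<br />
x = -6 + 16*rand;                  % random starting point in [-6, 10]<br />
for i = 1:2000<br />
    y = x + randn;                 % symmetric proposal q(y|x) = N(x,1)<br />
    r = min(exp((h(x)-h(y))/T), 1);<br />
    if rand < r<br />
        x = y;<br />
    end<br />
    T = max(0.95*T, 1e-3);         % cool down gradually<br />
end<br />
disp(x)                            % close to the minimizer x = 2<br />
</pre><br />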
<br />
=====Travelling Salesman Problem=====<br />
<br />
The simulated annealing method can be applied to compute the solution to the travelling salesman problem. Suppose there are N cities and the salesman has to visit each city exactly once. The objective is to find the shortest path (i.e. shortest total length of journey) connecting the cities. An algorithm using simulated annealing on the problem can be found here ([http://www.cs.ubbcluj.ro/~csatol/mestint/pdfs/Numerical_Recipes_Simulated_Annealing.pdf Reference]).<br />
<br />
===Gibbs Sampling===<br />
<br />
Gibbs sampling is another Markov chain Monte Carlo method, similar to Metropolis-Hastings. There are two main differences between Metropolis-Hastings and Gibbs sampling. First, the candidate state is always accepted as the next state in Gibbs sampling. Second, it is assumed that the full conditional distributions are known, i.e. <math>P(X_i=x|X_j=x_j, \forall j\neq i)</math> for all <math>\displaystyle i</math>. The idea is that it is easier to sample from the conditional distributions, which are one dimensional, than from the joint distribution, which is higher dimensional. Gibbs sampling is a way to turn the joint distribution into multiple conditional distributions. <br />
<br />
<b>Advantages:</b><br /><br />
- sampling from conditional distributions may be easier than sampling from joint distributions<br />
<br />
<b>Disadvantages:</b><br /><br />
- we do not necessarily know the conditional distributions<br />
<br />
For example, if we want to sample from <math>\, f_{X,Y}(x,y)</math>, we need to know how to sample from <math>\, f_{X|Y}(x|y)</math> and <math>\, f_{Y|X}(y|x)</math>. Suppose the chain starts with <math>\,(X_0,Y_0)</math> and <math>(X_1,Y_1), \dots , (X_n,Y_n)</math> have been sampled. Then,<br />
<br />
<math>\, X_{n+1} \sim f_{X|Y}(x|Y_n), \qquad Y_{n+1} \sim f_{Y|X}(y|X_{n+1})</math><br />
<br />
Gibbs sampling turns a multi-dimensional distribution into a set of one-dimensional distributions. If we want to sample from <br />
<br />
<math>P_{X^1,\dots ,X^p}(x^1,\dots ,x^p)</math> <br />
<br />
and the full conditionals are known, then:<br />
<br />
<math>X^1_{n+1}=f(X^1|X^2_n,\dots ,X^p_n)</math><br />
<br />
<math>X^2_{n+1}=f(X^2|X^1_{n+1},X^3_n\dots ,X^p_n)</math><br />
<br />
<math>\vdots</math><br />
<br />
<math>X^{p-1}_{n+1}=f(X^{p-1}|X^1_{n+1},\dots ,X^{p-2}_{n+1},X^p_n)</math><br />
<br />
<math>X^p_{n+1}=f(X^p|X^1_{n+1},\dots ,X^{p-1}_{n+1})</math><br />
<br />
With Gibbs sampling, we can simulate <math>\displaystyle n</math> random variables sequentially from <math>\displaystyle n</math> univariate conditionals rather than generating one <math>n</math>-dimensional vector using the full joint distribution, which could be a lot more complicated.<br />
<br />
Computational inference deals with probabilistic graphical models. Gibbs sampling is useful here: graphical models show the dependence relations among random variables. For instance, Bayesian networks are graphical models represented using directed acyclic graphs. Looking at such a graphical model tells us on which random variable the distribution of a certain random variable depends (i.e. its parent). The model can be used to "factor" a joint distribution into conditional distributions.<br />
<br />
[[File:stat341_nov_1_graphical_model.png|200px|thumb|left|Sample graphical model of five RVs]]<br />
<br />
For example, consider the five random variables A, B, C, D, and E. Without making any assumptions about dependence relations among them, all we know is <br />
<br />
<math>\, P(A,B,C,D,E)=</math><math>\, P(A|B,C,D,E) P(B|C,D,E) P(C|D,E) P(D|E) P(E)</math><br />
<br />
However, if we know the relation between the random variables, e.g. given the graphical model on the left, we can simplify this expression:<br />
<br />
<math>\, P(A,B,C,D,E)=P(A) P(B|A) P(C|A) P(D|C) P(E|C)</math><br />
<br />
Although the joint distribution may be very complicated, the conditional distributions may not be.<br />
<br />
Check out the following notes on Gibbs sampling:<br />
<br />
* [http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf MCMC and Gibbs Sampling, MIT Lecture Notes]<br />
* chapter 7.4 in [http://stat.fsu.edu/~anuj/pdf/classes/CompStatI09/BOOK.pdf Notes on Computational Methods in Statistics]<br />
* chapter 4.9 in [http://www.ma.hw.ac.uk/~foss/StochMod/Ross_S.pdf Introduction to Probability Models] by Sheldon Ross<br />
<br />
====Example of Gibbs sampling: Multi-variate normal====<br />
<br />
We'd like to generate samples from a bivariate normal with parameters<br />
<br />
<math>\mu = \begin{bmatrix}1\\ 2 \end{bmatrix} = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}</math> <br />
and <math>\Sigma = \begin{bmatrix}1 && 0.9 \\ 0.9 && 1 \end{bmatrix}= \begin{bmatrix}1 && \rho \\ \rho && 1 \end{bmatrix}</math><br />
<br />
The conditional distributions of multi-variate normal random variables are also normal:<br />
<br />
<math>\, f(x_1|x_2)=N(\mu_1 + \rho(x_2-\mu_2), 1-\rho^2)</math><br />
<br />
<math>\, f(x_2|x_1)=N(\mu_2 + \rho(x_1-\mu_1), 1-\rho^2)</math><br />
<br />
(In general, if the joint distribution has parameters<br />
<br />
<math>\mu = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}</math> and <math>\Sigma = \begin{bmatrix} \Sigma _{1,1} && \Sigma _{1,2} \\ \Sigma _{2,1} && \Sigma _{2,2} \end{bmatrix}</math><br />
<br />
then the conditional distribution <math>\, f(x_1|x_2)</math> has mean <math>\, \mu_1 + \Sigma _{1,2}(\Sigma _{2,2})^{-1}(x_2-\mu_2)</math> and variance <math>\, \Sigma _{1,1}-\Sigma _{1,2}(\Sigma _{2,2})^{-1}\Sigma _{2,1}</math>.)<br />
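A minimal Matlab sketch of the Gibbs sampler for this bivariate normal, alternating draws from the two conditionals above (the starting value and chain length are arbitrary choices):<br />
<br />
<pre><br />
mu1 = 1; mu2 = 2; rho = 0.9;<br />
n = 5000;<br />
x1 = zeros(1,n); x2 = zeros(1,n);<br />
x2(1) = 0;                                                  % arbitrary starting value<br />
x1(1) = mu1 + rho*(x2(1)-mu2) + sqrt(1-rho^2)*randn;<br />
for i = 2:n<br />
    x1(i) = mu1 + rho*(x2(i-1)-mu2) + sqrt(1-rho^2)*randn;  % draw from f(x1|x2)<br />
    x2(i) = mu2 + rho*(x1(i)-mu1) + sqrt(1-rho^2)*randn;    % draw from f(x2|x1)<br />
end<br />
plot(x1(500:end), x2(500:end), '.')                         % discard burn-in, then scatter plot<br />
[mean(x1(500:end)) mean(x2(500:end))]                       % approximately 1 and 2<br />
</pre><br />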
<br />
=='''Principal Component Analysis (PCA) - November 8, 2011'''==<br />
<br />
Principal component analysis is a century-old technique used for the dimensionality reduction of data. As the number of dimensions increases, the number of data points needed to sample the space accurately grows exponentially.<br />
<br />
<math>\, x\in \mathbb{R}^D \rarr y\in \mathbb{R}^d</math><br />
<br />
<math>\ d \le D </math><br />
<br />
We want to transform <math>\, x</math> to <math>\, y</math> by reducing dimensionality yet losing little information.<br />
<br />
For example, consider dots in a three dimensional space. By unrolling the 2D manifold that they are on, we can reduce the data to 2D while losing little information. Note: This is not an application of PCA, but simply illustrates one way we can reduce dimensionality.<br />
<br />
Principal Component Analysis lets us reduce data to a linear subspace of its original space. It works best when the data lie in, or close to, a lower dimensional linear subspace of the original space.<br />
<br />
<br />
'''Probabilistic View'''<br />
<br />
We can see data set <math>\, x</math> as a high dimensional random variable governed by a low dimensional random variable <math>\, y</math>. Given <math>\, x</math>, we are trying to estimate <math>\, y</math>.<br />
<br />
We can see this in 2D linear regression, as the locations of data points in a scatter plot are governed by its approximate linear regression. The subspace that we have reduced the data to here is in the direction of variation in the data.<br />
<br />
'''Principal Component Analysis'''<br />
<br />
Principal component analysis is an orthogonal linear transform on a data set. It transforms the data coordinates to associate with a new set of orthogonal vectors, each representing the direction of maximum variance of the data. E.g. the first principal component is the direction of maximum variance, the second principal component is the direction of maximum variance orthogonal to the first vector, the third principal component is the direction of maximum variance orthogonal to the first and second vectors, and so on, until we have D vectors, where D is the dimension of the original data.<br />
<br />
Suppose we have data represented by <math>\, X = \begin{bmatrix}<br />
x^1\\<br />
x^2\\<br />
\vdots \\ <br />
x^D<br />
\end{bmatrix}<br />
\in \mathbb{R}^{D \times n} </math><br />
<br />
For some <math>\, W = \begin{bmatrix}<br />
w^1\\<br />
w^2\\<br />
\vdots \\ <br />
w^D<br />
\end{bmatrix}<br />
\in \mathbb{R}^{D} </math><br />
<br />
The projection of the data onto the direction <math>\, W</math> can be written as<br />
<br />
<math>\, w^1x^1 + w^2x^2 + \cdots + w^Dx^D = W^TX</math><br />
<br />
To find the first principal component, we want to maximize the variance of <math>\,W^TX</math>.<br />
<br />
The variance of <math>\,W^TX</math> is <math>\,W^TSW</math> where <math>\,S</math> is the covariance matrix of X.<br />
<br />
<math>\, S = E\left[(x-\mu)(x-\mu)^T\right]</math><br />
<br />
<br />
So we have to solve the problem<br />
<br />
<math>\, \text {Max } W^TSW</math><br />
<br />
<math>\, \text{such that } W^TW = 1</math>.<br />
<br />
<br />
We restrict W to unit vectors, as otherwise the maximum is unbounded. We are only looking for the direction of the vector; its actual scale is irrelevant.<br />
<br />
Using the method of Lagrange multipliers, we have<br />
<br />
<math>\,L(W, \lambda) = W^TSW - \lambda(W^TW - 1) </math><br />
<br />
We set<br />
<br />
<math>\, \frac{\partial L}{\partial W} = 0 </math><br />
<br />
<br />
<br />
Note that <math>\, W^TSW</math> is a quadratic form. So we have<br />
<br />
<br />
<br />
<math>\, \frac{\partial L}{\partial W} = 2SW - 2\lambda W = 0 </math><br />
<br />
<math>\, SW = \lambda W </math><br />
<br />
Since this says <math>\, SW</math> is a scalar multiple of <math>\, W</math>, W must be an eigenvector of S and <math>\lambda</math> its corresponding eigenvalue.<br />
<br />
Suppose that<br />
<br />
<math>\, \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d</math><br />
are eigenvalues of S and <math>\, u_1, u_2, \cdots u_d</math> are their corresponding eigenvectors.<br />
<br />
We want to choose some <math>\, W = u </math><br />
<br />
<math>\,u^TSu =u^T\lambda u = \lambda u^Tu = \lambda</math><br />
<br />
So to maximize <math>\, u^TSu</math>, choose the eigenvector corresponding to the maximum eigenvalue, i.e. <math>\, u_1</math>.<br />
<br />
So we let <math>\, W = u_1 </math> be the first principal component.<br />
<br />
The principal components decompose the total variance in the data:<br />
<br />
<math>\, \sum_{i=1}^D \text{Var}(u_i^T x) = \sum_{i=1}^D \lambda_i = \text{Tr}(S) = \sum_{i=1}^D \text{Var}(x_i)</math><br />
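A small numerical sketch of these facts (the 2-D toy data are made up for illustration): the first principal component is the eigenvector of S with the largest eigenvalue, and the eigenvalues add up to the trace of S.<br />
<br />
<pre><br />
n = 1000;<br />
X = randn(2,n);<br />
X(2,:) = 0.9*X(1,:) + 0.3*X(2,:);             % correlated toy data, D = 2<br />
Xc = X - repmat(mean(X,2),1,n);               % center the data<br />
S = Xc*Xc';                                   % covariance matrix (up to a factor of n)<br />
[U, Lambda] = eig(S);<br />
[lambda, idx] = sort(diag(Lambda),'descend');<br />
w1 = U(:,idx(1));                             % first principal component (direction of max variance)<br />
disp(w1')<br />
disp([sum(lambda) trace(S)])                  % the two numbers agree<br />
</pre><br />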
<br />
<br><br />
===Singular Value Decomposition===<br />
Singular value decomposition is a "generalization" of eigenvalue decomposition "to rectangular matrices of size ''mxn''."<ref name="Abdel_SVD">Abdel-Rahman, E. (2011). Singular Value Decomposition [Lecture notes]. Retrieved from http://uwace.uwaterloo.ca</ref> Singular value decomposition solves:<br><br><br />
:<math>\ A_{mxn}\ v_{nx1}=s\ u_{mx1}</math><br><br><br />
"for the right singular vector ''v'', the singular value ''s'', and the left singular vector ''u''. There are ''n'' singular values ''s''<sub>''i''</sub> and ''n'' right and left singular vectors that must satisfy the following conditions"<ref name="Abdel_SVD"/>:<br />
# "All singular values are non-negative"<ref name="Abdel_SVD"/>, <br> <math>\ s_i \ge 0.</math><br />
# All "right singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>, <br> <math>\ v_iv_j=\delta_{i,j}.</math><br />
# All "left singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>, <br> <math>\ u_iu_j=\delta_{i,j}.</math><br />
where<br />
:<math>\delta_{i,j}=\left\{\begin{matrix}1 & \mathrm{if}\ i=j \\ 0 & \mathrm{if}\ i\neq j\end{matrix}\right.</math><br><br><br />
<br />
'''Procedure to find the singular values and vectors'''<br><br />
Observe the following about the eigenvalue decomposition of a real square matrix ''A'' where ''v'' is the unit eigenvector:<br><br />
::<math><br />
\begin{align}<br />
& Av=\lambda v \\<br />
& (Av)^T=(\lambda v)^T \\<br />
& (Av)^TAv=(\lambda v)^T\lambda v \\<br />
& v^TA^TAv=\lambda^2v^Tv \\<br />
& vv^TA^TAv=v\lambda^2 \\<br />
& A^TAv=\lambda^2v<br />
\end{align}<br />
</math><br />
As a result:<br />
# "The matrices ''A'' and ''A''<sup>''T''</sup>''A'' have the same eigenvectors."<ref name="Abdel_SVD"/><br />
# "The eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are the square of the eigenvalues of matrix ''A''."<ref name="Abdel_SVD"/><br />
# Since matrix ''A''<sup>''T''</sup>''A'' is symmetric,<br />
## "all the eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are real and distinct."<ref name="Abdel_SVD"/><br />
## "the eigenvectors of matrix ''A''<sup>''T''</sup>''A'' are orthogonal and can be chosen to be orthonormal."<ref name="Abdel_SVD"/><br />
# "The eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are non-negative"<ref name="Abdel_SVD"/> since <math>\ \lambda^2_i \ge 0.</math><br />
Conclusions 3 and 4 are "true even for a rectangular matrix ''A'' since ''A''<sup>''T''</sup>''A'' is still a square symmetric matrix"<ref name="Abdel_SVD"/> and its eigenvalues and eigenvectors can be found.<br><br><br />
Therefore, for a rectangular matrix ''A'', assuming ''m>n'', the singular values and vectors can be found by:<br />
# "Form the ''nxn'' symmetric matrix ''A''<sup>''T''</sup>''A''."<ref name="Abdel_SVD"/><br />
# Perform an eigenvalue decomposition to get ''n'' eigenvalues and their "corresponding eigenvectors, ordered such that"<ref name="Abdel_SVD"/> <br><math>\lambda^2_1 \ge \lambda^2_2 \ge \dots \ge \lambda^2_n \ge 0</math> and <math>\{v_1, v_2, \dots, v_n\}.</math><br />
# "The singular values are"<ref name="Abdel_SVD"/>: <br><math>s_1=\sqrt{\lambda^2_1} \ge s_2=\sqrt{\lambda^2_2} \ge \dots \ge s_n=\sqrt{\lambda^2_n} \ge 0.</math><br>"The non-zero singular values are distinct; the equal sign applies only to the singular values that are equal to zero."<ref name="Abdel_SVD"/><br />
# "The ''n''-dimensional right singular vectors are"<ref name="Abdel_SVD"/><br><math>\{v_1, v_2, \dots, v_n\}.</math><br />
# "For the first <math>r \le n</math> singular values such that ''s''<sub>''i''</sub> ''> 0'', the left singular vectors are obtained as unit vectors"<ref name="Abdel_SVD"/> by <math>\tfrac{1}{s_i}Av_i=u_i.</math><br />
# Select "the <math>\ m-r</math> left singular vectors corresponding to the zero singular values such that they are unit vectors orthogonal to each other and to the first ''r'' left singular vectors"<ref name="Abdel_SVD"/> <math>\{u_1, u_2, \dots, u_r\}.</math><br><br><br />
<br />
'''Finding Singular value Decomposition Using MATLAB Code'''<br />
Please refer to the following link: http://www.mathworks.com/help/techdoc/ref/svd-singular-value-decomposition.html<br />
<br />
'''Formal definition'''<br><br />
"We can now decompose the rectangular matrix ''A'' in terms of singular values and vectors as follows"<ref name="Abdel_SVD"/>:<br><br><br />
<math>A_{mxn} \begin{bmatrix} v_1 & | & \cdots & | & v_n \end{bmatrix}_{nxn} = \begin{bmatrix} u_1 & | & \cdots & | & u_n & | & u_{n+1} & | & \cdots & | & u_m \end{bmatrix}_{mxm} \begin{bmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_n \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}_{mxn}</math><br><br />
:<math>\ AV=US</math><br><br><br />
Since "the matrices ''V'' and ''U'' are orthogonal"<ref name="Abdel_SVD"/>, ''V ''<sup>''-1''</sup>=''V''<sup>T</sup> and ''U ''<sup>''-1''</sup>=''U''<sup>T</sup>:<br><br><br />
:<math>\ A=USV^T</math><br><br><br />
"which is the formal definition of the singular value decomposition."<ref name="Abdel_SVD"/><br><br><br />
<br />
'''Relevance to PCA'''<br><br />
In order to perform PCA, one needs to do eigenvalue decomposition on the covariance matrix. By transforming the mean for all attributes to zero, the covariance matrix can be simplified to:<br><br><br />
<math>\ S=XX^T</math><br><br><br />
Since the eigenvalue decomposition of ''A''<sup>''T''</sup>''A'' gives the same eigenvectors as the singular value decomposition of ''A'', an additional and more robust method for performing PCA is through the singular value decomposition of ''X'' (an eigenvalue decomposition requires the matrix of eigenvectors to be invertible, whereas the singular value decomposition always exists).<br />
<br />
The following MATLAB code uses singular value decomposition for performing PCA; 20 principal components, and thus the top 20 maximum variation directions, are selected for reconstructing facial images that have had noise applied to them:<br />
<br />
load noisy.mat<br />
%first noisy image; each image has a resolution of 20x28<br />
imagesc(reshape(X(:,1),20,28)')<br />
%to grayscale<br />
colormap gray<br />
%singular value decomposition <br />
[u s v]=svd(X);<br />
%reduced feature space: 20 principal components<br />
Xh=u(:,1:20)*s(1:20,1:20)*v(:,1:20)';<br />
figure<br />
imagesc(reshape(Xh(:,1),20,28)')<br />
colormap gray<br />
<br />
The reconstruction from the reduced feature space is nearly noiseless because the added noise has less variation along the top 20 principal components than the facial features do.<br />
<br />
=='''References'''==<br />
<br />
<references/><br />
<br />
==''' PCA and Introduction to Kernel Function-November,10,2011'''==<br />
===Continue with the last lecture===<br />
Some notations:<br />
Let <math>\displaystyle X_{d\times n}</math> be a matrix. <br />
<br />
Let <math>\displaystyle X_j,\ j=1,2,...,n</math> be the j-th data point, where <math>\displaystyle X_j\in\R^d</math>.<br />
<br />
Let <math>\displaystyle Q=\sum_{j=1}^n(X_j-\bar{X})(X_j-\bar{X})^T</math>, where <math> \bar{X}=\frac{1}{n}\sum_{j=1}^n X_j</math>.<br />
<br />
But now, we are assuming that we have already centered the data, which means our <math>\displaystyle Q=\sum_{j=1}^n(X_j)(X_j)^T=X X^T </math>.<br />
<br />
*Find the principal components, which means finding the eigenvectors of Q; equivalently, do the singular value decomposition, [u s v]=svd(X), where the columns of u are eigenvectors of <math>\displaystyle Q=X X^T</math>.<br />
<br />
*Map the data in lower dimension space.<br />
We can choose the first p (p<d) eigenvectors, which means <math>\displaystyle u^T</math> is a <math>\displaystyle p\times d</math> matrix.<br />
Thus, we can project our original data points <math>\displaystyle x_j</math> to p dimensions.<br />
Mathematically, it is <math>\displaystyle Y_{p\times n}={u^T}_{p\times d} X_{d\times n}</math>. This means that we can reduce our original d variables to p principal components.<br />
<br />
*Reconstruct Points.<br />
We can also use those dimension-reduced data to project back to high dimension.<br />
However, we will lose some information because when we map those points into lower dimension, we throw away the last (d-p) eigenvectors which contain some of the original information.<br />
Since the columns of <math>\displaystyle u</math> are orthonormal, the reconstruction is <math> \hat{x}_{d\times n} = u_{d\times p} Y_{p\times n}=u_{d\times p}{u^T}_{p\times d} x_{d\times n} </math>, which approximates the original data.<br />
<br />
*Map a new data point to the lower dimensional space, <math>\displaystyle y_{p\times 1}={u^T}_{p\times d} x_{d\times 1}</math>, and reconstruct it in the high dimensional space, <math>\displaystyle \hat{x}_{d\times 1}=u_{d\times p} y_{p\times 1}</math><br />
<br />
===3 and 2 digits example===<br />
The data X is a 64 by 400 matrix. Every column can be displayed as an 8 by 8 image of either "3" or "2". The first 200 columns are "2" and the last 200 columns are "3".<br />
We first center the data, and then take the first p (p<d) columns of u from the singular value decomposition.<br />
<br />
MATLAB CODE:<br />
MU=repmat(mean(X,2),1,400);<br />
% mean(X,2) is the average of each row <br />
%In order to center the data, we should change mean(X,2) which is a 64 by 1 matrix into a 64 by 400 matrix<br />
Xt=X-MU;<br />
% modify the data to zero mean data<br />
[u s v]=svd(Xt);<br />
%note that size(u)=64*64, and the columns of u are eigenvectors of VCM<br />
Y=u(:,1:2)'*X;<br />
%using the first two PCs to transform the high dimensional points to lower onces<br />
One way to look at this case is to plot Principal Component #1 against Principal Component #2 in a two dimensional space.<br />
plot(Y(1,:)',Y(2,:)')<br />
The result is as follows; we can clearly see that there are two classes.<br />
<br />
[[file:pca2.png|350px|400px]]<br />
<br />
To examine the difference between these two classes more closely, we can separate the first 200 columns from the last 200 columns and check whether there is a significant difference due to the two types of digits.<br />
plot(Y(1,1:200)',Y(2,1:200)','d')<br />
% Note that the first 200 columns represent digit "2",and are in the form of "diamond"<br />
hold on<br />
% draw different graphs in one figure<br />
plot(Y(1,201:400)',Y(2,201:400)','ro')<br />
% Note that the first 200 columns represent digit "3",and are in the form of "o"<br />
<br />
[[file:pca3.png|350px|400px]]<br />
<br />
image=reshape(X,8,8,400);<br />
plotdigits(image,Y,.1,1);<br />
The result can be seen more clearly in the following picture.<br />
The digits "3" and "2" are clearly separated.<br />
<br />
[[file:Pca.png|350px|400px]]<br />
<br />
===Introduction to Kernel Function===<br />
PCA is useful when the data points lie in or close to a plane (a linear subspace), so it is powerful when dealing with linear problems. But when the data points lie on a nonlinear manifold, PCA works poorly. There is a solution to this problem---we can use a "trick" to change nonlinear classification problems into linear ones. This is called the "Kernel Trick".<br />
<br />
'''An intuitive example'''<br />
<br />
[[File:Kernel trick.png|400px|300px]]<br />
<br />
From the picture, we can see the red dots are in the middle of the blue ones. However, it is impossible to separate those two classes using a line (a linear boundary in the two dimensional space). But we can pull the red ones out of the two dimensional space into a third dimension, in which case we can easily tell the classes apart.<br />
<br />
For more details about this trick,please see http://omega.albany.edu:8008/machine-learning-dir/notes-dir/ker1/ker1.pdf<br />
<br />
More precisely, the significance of a kernel function is that it lets us map the data points into a higher dimensional space implicitly.<br />
Let's look at how this is possible:<br />
<br />
<math>Z_1=<br />
\begin{bmatrix}<br />
x_1\\<br />
y_1<br />
\end{bmatrix}\xrightarrow{\phi}<br />
</math><br />
<math>\phi(Z_1)=<br />
\begin{bmatrix}<br />
x_1^2\\<br />
y_1^2\\<br />
\sqrt2x_1y_1<br />
\end{bmatrix}.<br />
<br />
</math><br />
<math>Z_2=<br />
\begin{bmatrix}<br />
x_2\\<br />
y_2<br />
\end{bmatrix}\xrightarrow{\phi}<br />
</math><br />
<math>\phi(Z_2)=<br />
\begin{bmatrix}<br />
x_2^2\\<br />
y_2^2\\<br />
\sqrt2x_2y_2<br />
\end{bmatrix}<br />
</math><br />
<br />
The inner product of <math>\displaystyle \phi(Z_1)</math> and <math>\displaystyle\phi(Z_2)</math>, which is denoted as <math>\displaystyle\phi(Z_1)^T\phi(Z_2)</math>, is equal to:<br />
<math><br />
\begin{bmatrix}<br />
x_1^2&y_1^2&\sqrt2x_1y_1 <br />
\end{bmatrix}<br />
\begin{bmatrix}<br />
x_2^2\\<br />
y_2^2\\<br />
\sqrt2x_2y_2 <br />
\end{bmatrix}=</math> <math>\displaystyle (x_1x_2+y_1y_2)^2=K(Z_1,Z_2)</math>.<br />
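As a quick sanity check, this identity can be verified numerically. The following MATLAB sketch (the test points and variable names are our own, chosen only for illustration) compares the explicit feature-space inner product with the kernel value computed in the original space.<br />
<pre><br />
% Sketch (arbitrary test points): verify phi(Z1)'*phi(Z2) = (x1*x2 + y1*y2)^2<br />
Z1 = [1; 2];                                      % Z1 = (x1, y1)<br />
Z2 = [3; -1];                                     % Z2 = (x2, y2)<br />
phi = @(z) [z(1)^2; z(2)^2; sqrt(2)*z(1)*z(2)];   % explicit feature map<br />
lhs = phi(Z1)' * phi(Z2);                         % inner product in the feature space<br />
rhs = (Z1' * Z2)^2;                               % kernel evaluated in the original space<br />
disp([lhs rhs])                                   % both entries should be equal (here both are 1)<br />
</pre><br />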
<br />
'''The most common kernel functions are as follows:'''<br />
*Linear: <math>\displaystyle K_{ij}=<X_i,X_j></math><br />
*Polynomial: <math>\displaystyle K_{ij}=(1+<X_i,X_j>)^p</math><br />
*Gaussian: <math>\displaystyle K_{ij}=e^{-\frac{\left\Vert X_i-X_j\right\Vert^2}{2\sigma^2}}</math>,<br />
where <math>\displaystyle <X_i,X_j></math> denotes the inner product of <math>\displaystyle X_i</math> and <math>\displaystyle X_j</math>, and <math>\left\Vert X_i-X_j\right\Vert^2</math> denotes the squared distance between the vectors <math>\displaystyle X_i</math> and <math>\displaystyle X_j</math>.<br />
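Given a d-by-n data matrix X (one observation per column, as in the digits example above), the whole kernel matrix K can be computed at once. Below is a minimal MATLAB sketch for the Gaussian kernel; the toy data and the value of <math>\sigma</math> are arbitrary choices of ours.<br />
<pre><br />
% Minimal sketch: Gaussian kernel matrix for a d-by-n data matrix X (toy data, arbitrary sigma)<br />
X = randn(2, 50);                                       % 50 two-dimensional points<br />
sigma = 1;<br />
n = size(X, 2);<br />
sq = sum(X.^2, 1);                                      % 1-by-n vector of squared norms<br />
D2 = repmat(sq', 1, n) + repmat(sq, n, 1) - 2*(X'*X);   % squared distances ||X_i - X_j||^2<br />
K = exp(-D2 ./ (2*sigma^2));                            % K_ij = exp(-||X_i - X_j||^2 / (2*sigma^2))<br />
</pre><br />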
<br />
<br />
==''' Kernel PCA -November,15,2011'''==<br />
<br />
PCA doesn't work well when the directions of variation in our data are nonlinear. To deal with this problem, we apply kernels to PCA.<br />
<br />
First we look at the algorithm for PCA and see how we can kernelize PCA:<br />
<br />
== PCA ==<br />
<br />
Find the eigenvectors of <math>XX^T</math> and collect them as the columns of a matrix U.<br />
<br />
<math><br />
\begin{align}<br />
Y &= U^{T}X \\<br />
\hat{X} & = UY<br />
\end{align}<br />
</math><br />
<br />
== Modifying PCA ==<br />
<br />
<math><br />
\begin{align}<br />
\left[ U, \Sigma, V \right] & = \mathrm{svd}(X) \\<br />
X & = U\Sigma{V^T}<br />
\end{align}<br />
</math><br />
<br />
The columns of U are the eigenvectors of <math>XX^T</math><br />
<br />
The columns of V are the eigenvectors of <math>X^T{X}</math><br />
<br />
Now we want to kernelize this classical version of PCA.<br />
<br />
We would like to express everything in terms of V, the matrix of eigenvectors of <math>X^TX</math>, because <math>X^TX</math> is the quantity that can be kernelized. This is called Dual PCA.<br />
<br />
<math><br />
\begin{align}<br />
X&= U \Sigma V^T \\<br />
XV&=U \Sigma V^T V = U\Sigma \\<br />
U&=XV\Sigma^{-1}<br />
\end{align}<br />
</math><br />
<br />
Find the eigenvectors of <math>X^TX</math> and collect them as the columns of a matrix V.<br />
<br />
<math><br />
\begin{align}<br />
X&=U \Sigma V^T \\<br />
U^TX &= U^TU\Sigma V^T \\<br />
U^TX &= \Sigma V^T \\<br />
Y&=\Sigma V^T \\<br />
\end{align}<br />
</math><br />
<br />
Reconstruct Points<br />
<br />
<math><br />
\begin{align}<br />
\hat{X}&=UY \\<br />
\hat{X} &=XV\Sigma^{-1}\Sigma{V^T} \\<br />
\hat{X} &= XVV^T<br />
\end{align}<br />
</math><br />
<br />
Map an out of sample point x to low-dimensional space<br />
<br />
<math><br />
\begin{align}<br />
Y &=U^TX \\<br />
& = (XV\Sigma^{-1})^TX \\<br />
& = \Sigma^{-1}{V^T}{X^T}X<br />
\end{align}<br />
</math><br />
<br />
Reconstruct an out of sample point <br />
<br />
<br />
<math><br />
\begin{align}<br />
\hat{X} &= UY=XV\Sigma^{-1}\Sigma^{-1}V^T{X^T}X \\<br />
&= XV\Sigma^{-2}V^T{X^T}X<br />
\end{align}<br />
</math><br />
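The point of the dual form is that the data enter only through the matrix <math>X^TX</math> of inner products, which is exactly what a kernel matrix replaces. The following MATLAB sketch of dual PCA uses toy data and our own variable names; for kernel PCA one would substitute a (properly centred) kernel matrix for <math>X^TX</math>.<br />
<pre><br />
% Sketch of dual PCA on toy data (for kernel PCA, replace X'*X by a centred kernel matrix)<br />
X = randn(10, 100);                     % d = 10 dimensions, n = 100 points<br />
X = X - repmat(mean(X,2), 1, 100);      % centre the data<br />
[V, L] = eig(X'*X);                     % V holds the eigenvectors of X'X (n-by-n)<br />
[l, idx] = sort(diag(L), 'descend');<br />
V = V(:, idx(1:2));                     % keep the top two eigenvectors<br />
Sigma = diag(sqrt(l(1:2)));             % singular values: Sigma^2 = eigenvalues of X'X<br />
Y = Sigma * V';                         % low-dimensional representation, Y = Sigma V^T<br />
Xhat = X * V * V';                      % reconstruction, Xhat = X V V^T<br />
</pre><br />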
Facts about this algorithm:<br />
*In this example, the first 30 terms in the sequence are a permutation of integers from 1 to 30 and then the sequence repeats itself.<br />
*Values are between <b>0</b> and <b>m-1</b>, inclusive.<br />
*Dividing the numbers by <b> m-1 </b> yields numbers in the interval <b>[0,1]</b>.<br />
*MATLAB's <code>rand</code> function once used this algorithm with <b>a= 7<sup>5</sup></b>, <b>b= 0</b>, <b>m= 2<sup>31</sup>-1</b>,for reasons described in Park and Miller's 1988 paper "Random Number Generators: Good Ones are Hard to Find" (available [http://www.firstpr.com.au/dsp/rand31/p1192-park.pdf online]).<br />
*Visual Basic's <code>RND</code> function also used this algorithm with <b>a= 1140671485</b>, <b>b= 12820163</b>, <b>m= 2<sup>24</sup></b>. ([http://support.microsoft.com/kb/231847 Reference])<br />
<br />
===Inverse Transform Method===<br />
This is a basic method for sampling. Theoretically using this method we can generate sample numbers at random from any probability distribution once we know its cumulative distribution function (cdf).<br />
<br />
====Theorem====<br />
Take <math>U \sim~ \mathrm{Unif}[0, 1]</math> and let <math>X = F^{-1}(U) </math>. Then <math>X</math> has distribution function <math>F(\cdot)</math>, where <math>F(x)=P(X \leq x)</math> and <math>F^{-1}(\cdot)</math> is the inverse of <math>F(\cdot)</math>.<br />
<br />
Therefore <math>F(x)=u\implies x=F^{-1}(u)</math><br />
<br />
'''Proof'''<br />
<br />
Recall that<br />
<br />
:<math>P(a \leq X<b)=\int_a^{b} f(x) dx</math><br />
<br />
:<math>\,F(x)=P(X \leq x)=\int_{-\infty}^{x} f(s)\, ds</math> (the cdf)<br />
<br />
Note that if <math>U \sim~ \mathrm{Unif}[0, 1]</math>, we have <math>P(U \leq a)=a</math><br />
<br />
:<math>\begin{align}<br />
<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
====Continuous Case====<br />
Generally it takes two steps to get random numbers using this method.<br />
<br />
*Step 1. Draw <math>U \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <b><i>X=F <sup>&minus;1</sup>(U)</i></b><br />
<br />
'''Example'''<br />
<br />
Take the exponential distribution for example<br />
:<math>\,f(x)={\lambda}e^{-{\lambda}x}</math><br />
:<math>\,F(x)=\int_0^x {\lambda}e^{-{\lambda}u} du=[-e^{-{\lambda}u}]_0^x=1-e^{-{\lambda}x}</math><br />
<br />
Let: <math>\,F(x)=y</math><br />
:<math>\,y=1-e^{-{\lambda}x}</math><br />
:<math>\,ln(1-y)={-{\lambda}x}</math><br />
:<math>\,x=\frac{ln(1-y)}{-\lambda}</math><br />
:<math>\,F^{-1}(x)=\frac{-ln(1-x)}{\lambda}</math><br />
<br />
Therefore, generating an exponential random variable from a uniform one takes 2 steps.<br />
*Step 1. Draw <math>U \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <math>x=\frac{-ln(1-U)}{\lambda}</math><br />
<br />
Note: If U~Unif[0, 1], then (1 - U) and U have the same distribution. This allows us to slightly simplify step 2 into an alternate form:<br />
*Alternate Step 2. <math>x=\frac{-ln(U)}{\lambda}</math><br />
<br />
'''MATLAB code'''<br />
for the exponential distribution case, assuming <math>\lambda=0.5</math>:<br />
<br />
<pre><br />
for ii = 1:1000<br />
u = rand;<br />
x(ii) = -log(1-u)/0.5;<br />
end<br />
hist(x)<br />
</pre><br />
<br />
MATLAB result<br />
<br />
[[File:MATLAB_Exp.jpg|center|300px]]<br />
<br />
====Discrete Case - September 22, 2011====<br />
This same technique can be applied to the discrete case. Generate a discrete random variable <math>\,x</math> that has probability mass function <math>\,P(X=x_i)=P_i </math> where <math>\,x_0<x_1<x_2...</math> and <math>\,\sum_i P_i=1</math><br />
*Step 1. Draw <math>u \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <math>\,x=x_i</math> if <math>\,F(x_{i-1})<u \leq F(x_i)</math><br />
<br />
'''Example'''<br />
<br />
Let x be a discrete random variable with the following probability mass function:<br />
<br />
:<math>\begin{align}<br />
P(X=0) = 0.3 \\<br />
P(X=1) = 0.2 \\<br />
P(X=2) = 0.5<br />
\end{align}</math><br />
<br />
Given the pmf, we now need to find the cdf.<br />
<br />
We have:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0 & x < 0 \\<br />
0.3 & 0 \leq x < 1 \\<br />
0.5 & 1 \leq x < 2 \\<br />
1 & 2 \leq x<br />
\end{cases}</math><br />
<br />
We can apply the inverse transform method to obtain our random numbers from this distribution.<br />
<br />
'''Pseudo Code for generating the random numbers:'''<br />
<pre><br />
Draw U ~ Unif[0,1] <br />
if U <= 0.3 <br />
return 0 <br />
else if 0.3 < U <= 0.5 <br />
return 1<br />
else if 0.5 < U <= 1 <br />
return 2<br />
</pre><br />
<br />
'''MATLAB code for generating 1000 random numbers in the discrete case:'''<br />
<br />
<pre><br />
for ii = 1:1000<br />
u = rand;<br />
<br />
    if u <= 0.3<br />
        x(ii) = 0;<br />
    elseif u <= 0.5<br />
        x(ii) = 1;<br />
    else<br />
        x(ii) = 2;<br />
    end<br />
end<br />
</pre><br />
<br />
Matlab Output:<br />
<br />
[[File:Discreteinv.jpg]]<br />
<br />
'''Pseudo code for the Discrete Case:'''<br />
<br />
1. Draw U ~ Unif [0,1]<br />
<br />
2. If <math> U \leq P_0 </math>, deliver <b><i>X= x<sub>0</sub></i></b><br />
<br />
3. Else if <math> U \leq P_0 + P_1 </math>, deliver <b><i>X= x<sub>1</sub></i></b><br />
<br />
4. Else If <math> U \leq P_0 +....+ P_k </math>, deliver <b><i>X= x<sub>k</sub></i></b><br />
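The general pseudo code above amounts to comparing U against the cumulative sums of the probabilities. A compact MATLAB sketch (with an arbitrarily chosen example pmf) is:<br />
<pre><br />
% Sketch of the general discrete inverse transform (the pmf below is just an example)<br />
p = [0.3 0.2 0.5];                       % P(X = x_0), P(X = x_1), P(X = x_2)<br />
vals = [0 1 2];                          % the support x_0, x_1, x_2<br />
F = cumsum(p);                           % cdf evaluated at x_0, x_1, x_2<br />
x = zeros(1, 1000);<br />
for ii = 1:1000<br />
    u = rand;<br />
    x(ii) = vals(find(u <= F, 1));       % smallest k with u <= F(x_k)<br />
end<br />
hist(x)<br />
</pre><br />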
<br />
====Limitations====<br />
<br />
Although this method is useful, it isn't practical in many cases since we can't always obtain <math>F</math> or <math> F^{-1} </math> as some functions are not integrable or invertible, and sometimes even <math>f(x)</math> itself cannot be obtained in closed form. Let's look at some examples:<br />
*Continuous case<br />
If we want to use this method to draw samples from the '''normal distribution''', we get stuck at finding its ''cdf'' in closed form. <br />
The simplest case of '''normal distribution''' is <math>f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}</math>,<br />
whose ''cdf'' is <math>F(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}{e^{-\frac{u^2}{2}}}du</math>. This integral cannot be expressed in terms of elementary functions. So evaluating it and then finding the inverse is a very difficult task.<br />
*Discrete case <br />
It is easy for us to simulate when there are only a few values taken by the particular random variable, like the case above.<br />
And it is easy to simulate the '''binomial distribution''' <math>X \sim~ \mathrm{B}(n,p)</math> when the parameter n is not too large.<br />
But when n takes on values that are very large, say 50, it is hard to do so.<br />
<br />
===Acceptance/Rejection Method===<br />
<br />
<br />
The aforementioned difficulties of the inverse transform method motivates a sampling method that does not require analytically calculating cdf's and their inverses, which is the acceptance/rejection sampling method. Here, <math> \displaystyle f(x)</math> is approximated by another function, say <math>\displaystyle g(x)</math>, with the idea being that <math>\displaystyle g(x)</math> is a "nicer" function to work with than <math>\displaystyle f(x)</math>.<br />
<br />
Suppose we assume the following:<br />
<br />
1. There exists another distribution <math>\displaystyle g(x)</math> that is easier to work with and that you know how to sample from, and<br />
<br />
2. There exists a constant c such that <math>f(x) \leq c \cdot g(x)</math> for all x<br />
<br />
Under these assumptions, we can sample from <math>\displaystyle f(x)</math> by sampling from <math>\displaystyle g(x)</math><br />
<br />
====General Idea====<br />
<br />
Looking at the image below we have graphed <math> c \cdot g(x) </math> and <math>\displaystyle f(x)</math>.<br />
<br />
[[File:Graph_updated.jpg]]<br />
<br />
Using the acceptance/rejection method we will accept some of the points from <math>\displaystyle g(x)</math> and reject some of the points from <math>\displaystyle g(x)</math>. The points that are accepted from <math>\displaystyle g(x)</math> will have a distribution similar to <math>\displaystyle f(x)</math>. We can see from the image that values around <math>\displaystyle x_1</math> will be sampled more often under <math>c \cdot g(x)</math> than they are needed under <math>\displaystyle f(x)</math>, so we will have to reject a larger share of the samples taken near <math>\displaystyle x_1</math>. Around <math>\displaystyle x_2</math> the number of samples that are drawn and the number of samples we need are much closer, so we accept a larger proportion of the samples taken near <math>\displaystyle x_2</math>.<br />
<br />
====Procedure====<br />
<br />
1. Draw y ~ g<br />
<br />
2. Draw U ~ Unif [0,1]<br />
<br />
3. If <math> U \leq \frac{f(y)}{c \cdot g(y)}</math> then x=y; else return to 1<br />
<br />
Note that the choice of <math> c </math> plays an important role in the efficiency of the algorithm. We want <math> c \cdot g(x) </math> to be "tightly fit" over <math> f(x) </math> to increase the probability of accepting points, and therefore reducing the number of sampling attempts. Mathematically, we want to minimize <math> c </math> such that <math>f(x) \leq c \cdot g(x) \ \forall x</math>. We do this by setting<br />
<br />
<math> \frac{d}{dx}(\frac{f(x)}{g(x)}) = 0 </math>, solving for a maximum point <math> x_0 </math> and setting <math> c = \frac{f(x_0)}{g(x_0)}. </math><br />
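In simple cases <math> x_0 </math> can be found analytically, but it can also be located numerically. The sketch below uses MATLAB's fminbnd on the Beta(2,1) example treated later in this section, where the answer should come out close to c = 2; the function handles are our own illustration.<br />
<pre><br />
% Sketch: find c = max f(x)/g(x) numerically for f = Beta(2,1) pdf and g = Unif(0,1) pdf<br />
f = @(x) 2*x;                            % Beta(2,1) density on (0,1)<br />
g = @(x) 1;                              % Unif(0,1) density<br />
ratio = @(x) -f(x)./g(x);                % negate, because fminbnd minimizes<br />
[x0, negmax] = fminbnd(ratio, 0, 1);     % maximize f/g over (0,1)<br />
c = -negmax                              % should be close to 2 (the maximum is at the boundary)<br />
</pre><br />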
<br />
====Proof====<br />
<br />
Mathematically, we need to show that the sample points given that they are accepted have a distribution of f(x).<br />
<br />
<math>\begin{align} P(y|accepted) &= \frac{P(y, accepted)}{P(accepted)} \\<br />
<br />
&= \frac{P(accepted|y) P(y)}{P(accepted)}\end{align} </math> (Bayes' Rule)<br />
<br />
<br />
<br />
<math>\displaystyle P(y) = g(y)</math><br />
<br />
<math>P(accepted|y) =P(u\leq \frac{f(y)}{c \cdot g(y)}) =\frac{f(y)}{c \cdot g(y)} </math>,where u ~ Unif [0,1]<br />
<br />
<math>P(accepted) = \sum P(accepted|y)\cdot P(y)=\int^{}_y \frac{f(y)}{c \cdot g(y)}g(y) dy=\int^{}_y \frac{f(y)}{c} dy=\frac{1}{c} \cdot\int^{}_y f(y) dy=\frac{1}{c}</math><br />
<br />
So,<br />
<br />
<math> P(y|accepted) = \frac{ \frac {f(y)}{c \cdot g(y)} \cdot g(y)}{\frac{1}{c}} =f(y) </math><br />
<br />
====Continuous Case====<br />
<br />
'''Example'''<br />
<br />
Sample from Beta(2,1)<br />
<br />
In general:<br />
<br />
Beta(<math>\alpha, \beta) = \frac{\Gamma (\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}</math> <math>\displaystyle x^{\alpha-1}</math> <math>\displaystyle(1-x)^{\beta-1}</math>, <math>\displaystyle 0<x<1</math><br />
<br />
Note: <math>\!\Gamma(n) = (n-1)!</math> if n is a positive integer<br />
<br />
<math>\begin{align} f(x) &= Beta(2,1) \\<br />
&= \frac{\Gamma(3)}{\Gamma(2)\Gamma(1)} x^1(1-x)^0 \\<br />
&= \frac{2!}{1! 0!}\cdot (1) x \\<br />
&= 2x \end{align}</math><br />
<br />
We want to choose <math>\displaystyle g(x)</math> that is easy to sample from. So we choose <math>\displaystyle g(x)</math> to be the uniform distribution on (0,1).<br />
<br />
We now want a constant c such that <math>f(x) \leq c \cdot g(x) </math> for all x from Unif(0,1)<br />
<br />
<br />
So,<br /><br />
<br />
<math>c \geq \frac{f(x)}{g(x)}</math>, for all x from (0,1)<br />
<br />
<br />
<math>\begin{align}c &\geq max (\frac {f(x)}{g(x)}, 0<x<1) \\<br />
<br />
<br />
&= max (\frac {2x}{1},0<x<1) \\<br />
<br />
<br />
&= 2 \end{align}</math><br />
<br />
<br />
<br />
Now that we have c =2,<br />
<br />
1. Draw y ~ g(x) => Draw y ~ Unif [0,1] <br />
<br />
2. Draw u ~ Unif [0,1] <br />
<br />
3. if <math>u \leq \frac{2y}{2 \cdot 1}</math> then x=y; else return to 1<br />
<br />
<br />
'''MATLAB code for generating 1000 samples following Beta(2,1):'''<br />
<br />
<pre><br />
close all<br />
clear all<br />
ii=1;<br />
while ii <= 1000<br />
y = rand;<br />
u = rand;<br />
<br />
if u <= y<br />
x(ii)=y;<br />
ii=ii+1;<br />
end<br />
end<br />
hist(x)<br />
</pre><br />
<br />
'''MATLAB result'''<br />
<br />
[[File:MATLAB_Beta.jpg]]<br />
<br />
====Discrete Example====<br />
<br />
Generate random variables according to the p.m.f:<br />
<br />
:<math>\begin{align}<br />
P(Y=1) = 0.15 \\<br />
P(Y=2) = 0.22 \\<br />
P(Y=3) = 0.33 \\<br />
P(Y=4) = 0.10 \\<br />
P(Y=5) = 0.20 <br />
\end{align}</math><br />
<br />
We choose <math>\displaystyle g(y)</math> to be the discrete uniform distribution on {1, 2, 3, 4, 5}, so that <math>\displaystyle g(y) = 0.2</math> for each y.<br />
<br />
<math>c \geq \frac{P(y)}{g(y)} </math><br><br />
<math>c = \max \left(\frac{P(y)}{g(y)} \right)</math><br><br />
<math>c = \max \left(\frac{0.33}{0.2} \right) = 1.65</math> Since P(Y=3) is the max of P(Y) and g(y) = 0.2 for all y.<br><br />
<br />
1. Generate Y according to the discrete uniform between 1 - 5<br />
<br />
2. U ~ unif[0,1]<br />
<br />
3. If <math>U \leq \frac{P(y)}{1.65 \times 0.2} = \frac{P(y)}{0.33} </math>, then x = y; else return to 1.<br />
<br />
In MATLAB, the code would be:<br />
<br />
<pre><br />
py = [0.15 0.22 0.33 0.1 0.2];<br />
ii = 1;<br />
while ii <= 1000<br />
    y = unidrnd(5);<br />
    u = rand;<br />
    if u <= py(y)/0.33<br />
        x(ii) = y;<br />
        ii = ii+1;<br />
    end<br />
end<br />
hist(x);<br />
</pre><br />
<br />
MATLAB result<br />
<br />
[[File:MATLAB_Y.jpg]]<br />
<br />
====Limitations====<br />
<br />
Most of the time we have to sample many more points from g(x) before we can obtain an acceptable amount of samples from f(x), hence this method may not be computationally efficient. It depends on our choice of g(x). For example, in the example above to sample from Beta(2,1), we need roughly 2000 samples from g(X) to get 1000 acceptable samples of f(x).<br />
<br />
In addition, in situations where the chosen g(x) behaves very differently from f(x), this method becomes unreliable. For example, if g(x) is the normal density and f(x) has a "fat" mid-section and "thin" tails, most of the points drawn near the two ends of the range will be rejected, so the high rejection rate forces an overwhelming number of draws to obtain the desired sample.<br />
<br />
===Sampling From Gamma and Normal Distribution - September 27, 2011===<br />
<br />
====Sampling From Gamma====<br />
<br />
'''Gamma Distribution'''<br />
<br />
The Gamma distribution is written as <math>X \sim~ Gamma (t, \lambda) </math><br />
<br />
:<math> F(x) = \int_{0}^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If you have t samples of the exponential distribution,<br><br />
<br> <math> \begin{align} X_1 \sim~ Exp(\lambda)\\ \vdots \\ X_t \sim~ Exp(\lambda) \end{align}<br />
</math><br />
<br />
The sum of these t samples has a gamma distribution,<br />
<br />
:<math> X_1+X_2+ ... + X_t \sim~ Gamma (t, \lambda) </math><br><br />
:<math> \sum_{i=1}^{t} X_i \sim~ Gamma (t, \lambda) </math> where <math>X_i \sim~Exp(\lambda)</math><br><br />
<br />
'''Method'''<br />
<br />
We can sample the exponential distribution using the inverse transform method from previous class,<br><br />
:<math>\,f(x)={\lambda}e^{-{\lambda}x}</math><br><br />
:<math>\,F^{-1}(u)=\frac{-ln(1-u)}{\lambda}</math><br><br />
:<math>\,F^{-1}(u)=\frac{-ln(u)}{\lambda}</math> <br />
Since <math>U \sim~ unif [0,1] </math>, <math>\,1-u</math> has the same distribution as <math>\,u</math>, so the two expressions above are equivalent.<br><br />
:<math> \begin{align} \frac{-ln(u_1)}{\lambda} = x_1\\ \vdots \\ \frac{-ln(u_t)}{\lambda} = x_t \end{align}<br />
:</math><br><br />
:<math> x_1 + x_2 + \dots + x_t = \frac {-\sum_{i=1}^{t} ln(u_i)}{\lambda} = x</math><br />
<br />
'''MATLAB code''' for a Gamma(3,1) is<br />
<br />
<pre><br />
x = sum(-log(rand(1000,3)),2); <br />
hist(x)<br />
</pre><br />
<br />
And the Histogram of X follows a Gamma distribution with long tail: <br />
<br />
[[File:Hist.PNG|center|500px]]<br />
<br />
We can improve the quality of the histogram by specifying the number of bins we want, as in hist(x, number_of_bins):<br />
<br />
<pre><br />
x = sum(-log(rand(20000,3)),2); <br />
hist(x,40)<br />
</pre><br />
<br />
[[File:untitled.jpg|center|500px]]<br />
<br />
''' R code''' for a Gamma(3,1) is<br />
<pre><br />
a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);<br />
hist(a);<br />
</pre><br />
And the histogram is <br />
<br />
[[File:hist_gamma.png|center|500px]]<br />
<br />
Here is another way to plot the Gamma histogram in R, with a density estimate overlaid:<br />
<pre><br />
a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);<br />
hist(a,freq=F);<br />
lines(density(a),col="blue");<br />
rug(jitter(a));<br />
</pre><br />
[[File:hist_gamma_2.png|center|500px]]<br />
<br />
====Sampling from Normal Distribution using Box-Muller Transform - September 29, 2011====<br />
<br />
=====Procedure=====<br />
<br />
# Generate <math>\displaystyle u_1</math> and <math>\displaystyle u_2</math>, two values sampled from a uniform distribution between 0 and 1.<br />
# Set <math>\displaystyle R^2 = -2log(u_1)</math> so that <math>\displaystyle R^2</math> is exponentially distributed with rate 1/2 (mean 2) <br> Set <math>\!\theta = 2*\pi*u_2</math> so that <math>\!\theta</math> ~ Unif[0, 2<math>\displaystyle\pi</math>]<br />
# Set <math>\displaystyle X = R cos(\theta)</math> <br> Set <math>\displaystyle Y = R sin(\theta)</math><br />
<br />
=====Justification=====<br />
<br />
Suppose we have X ~ N(0, 1) and Y ~ N(0, 1) where X and Y are independent normal random variables. The relative probability density function of these two random variables using Cartesian coordinates is:<br />
<br />
<math> f(X, Y) dxdy= f(X) f(Y) dxdy= \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \frac{1}{\sqrt{2\pi}}e^{-y^2/2} dxdy= \frac{1}{2\pi}e^{-(x^2+y^2)/2}dxdy </math> <br><br />
<br />
In polar coordinates <math>\displaystyle R^2 = x^2 + y^2</math>, so the relative probability density function of these two random variables using polar coordinates is:<br />
<br />
<math> f(R, \theta) = \frac{1}{2\pi}e^{-R^2/2} </math> <br><br />
<br />
If we have <math>\displaystyle R^2 \sim exp(1/2)</math> and <math>\!\theta \sim unif[0, 2\pi]</math> we get an equivalent relative probability density function. Notice that this is a transformation of two variables into two variables, so by the change-of-variables rule we must multiply by the determinant of the Jacobian, where<br />
<br />
<math> |J|=\left|\frac{\partial(x,y)}{\partial(R,\theta)}\right|= \left|\begin{matrix}\frac{\partial x}{\partial R}&\frac{\partial x}{\partial \theta}\\\frac{\partial y}{\partial R}&\frac{\partial y}{\partial \theta}\end{matrix}\right|=R </math> <br><br />
<br />
<math> f(X, Y) dxdy = f(R, \theta)|J|dRd\theta = \frac{1}{2\pi}e^{-R^2/2}R dRd\theta= \frac{1}{4\pi}e^{-\frac{S}{2}} dSd\theta </math> <br>where <math> S=R^2. </math> <br><br />
<br />
Therefore we can generate a point in polar coordinates using the uniform and exponential distributions, then convert the point to Cartesian coordinates and the resulting X and Y values will be equivalent to samples generated from N(0, 1).<br />
<br />
'''MATLAB code'''<br />
<br />
In MatLab this algorithm can be implemented with the following code, which generates 20,000 samples from N(0, 1):<br />
<br />
<pre><br />
x = zeros(10000, 1);<br />
y = zeros(10000, 1);<br />
for ii = 1:10000<br />
u1 = rand;<br />
u2 = rand;<br />
R2 = -2 * log(u1);<br />
theta = 2 * pi * u2;<br />
x(ii) = sqrt(R2) * cos(theta);<br />
y(ii) = sqrt(R2) * sin(theta);<br />
end<br />
hist(x)<br />
</pre><br />
<br />
In one execution of this script, the following histogram for x was generated:<br />
<br />
[[File:Hist standard normal.jpg|center|500px]]<br />
<br />
=====Non-Standard Normal Distributions=====<br />
<br />
'''Example 1: Single-variate Normal'''<br />
<br />
If X ~ Norm(0, 1) then (a + bX) has a normal distribution with a mean of <math>\displaystyle a</math> and a standard deviation of <math>\displaystyle b</math> (which is equivalent to a variance of <math>\displaystyle b^2</math>). Using this information with the Box-Muller transform, we can generate values sampled from some random variable <math>\displaystyle Y\sim N(a,b^2) </math> for arbitrary values of <math>\displaystyle a,b</math>.<br />
<br />
# Generate a sample u from Norm(0, 1) using the Box-Muller transform.<br />
# Set v = a + bu.<br />
<br />
The values for v generated in this way will be equivalent to samples from a <math>\displaystyle N(a, b^2)</math> distribution. We can modify the MatLab code used in the last section to demonstrate this. We just need to add one line before we generate the histogram:<br />
<br />
<pre><br />
x = a + b * x;<br />
</pre><br />
<br />
For instance, this is the histogram generated when b = 15, a = 125:<br />
<br />
[[File:Hist normal.jpg|center|500px]]<br />
<br />
'''Example 2: Multi-variate Normal'''<br />
<br />
The Box-Muller method can be extended to higher dimensions to generate multivariate normals. The objects generated will be nx1 vectors, and their variance will be described by nxn covariance matrices.<br />
<br />
<math>\mathbf{z} = N(\mathbf{u}, \Sigma)</math> defines the n by 1 vector <math>\mathbf{z}</math> such that:<br />
<br />
* <math>\displaystyle u_i</math> is the average of <math>\displaystyle z_i</math><br />
* <math>\!\Sigma_{ii}</math> is the variance of <math>\displaystyle z_i</math><br />
* <math>\!\Sigma_{ij}</math> is the co-variance of <math>\displaystyle z_i</math> and <math>\displaystyle z_j</math><br />
<br />
If <math>\displaystyle z_1, z_2, ..., z_d</math> are normal variables with mean 0 and variance 1, then the vector <math>\displaystyle (z_1, z_2,..., z_d) </math> has mean 0 and variance <math>\!I</math>, where 0 is the zero vector and <math>\!I</math> is the identity matrix. This fact suggests that the method for generating a multivariate normal is to generate each component individually as single normal variables.<br />
<br />
The mean and the covariance matrix of a multivariate normal distribution can be adjusted in ways analogous to the single variable case. If <math>\mathbf{z} \sim N(0,I)</math>, then <math>\Sigma^{1/2}\mathbf{z}+\mu \sim N(\mu,\Sigma)</math>. Note here that the covariance matrix is symmetric and positive semidefinite, so such a square root always exists.<br />
<br />
We can compute <math>\mathbf{z}</math> in the following way:<br />
<br />
# Generate an n by 1 vector <math>\mathbf{x} = \begin{bmatrix}x_{1} & x_{2} & ... & x_{n}\end{bmatrix}</math> where <math>x_{i}</math> ~ Norm(0, 1) using the Box-Muller transform.<br />
# Calculate <math>\!\Sigma^{1/2}</math> using singular value decomposition.<br />
# Set <math>\mathbf{z} = \Sigma^{1/2} \mathbf{x} + \mathbf{u}</math>.<br />
<br />
The following MatLab code provides an example, where a scatter plot of 10000 random points is generated. In this case x and y have a co-variance of 0.9 - a very strong positive correlation.<br />
<br />
<pre><br />
x = zeros(10000, 1);<br />
y = zeros(10000, 1);<br />
for ii = 1:10000<br />
u1 = rand;<br />
u2 = rand;<br />
R2 = -2 * log(u1);<br />
theta = 2 * pi * u2;<br />
x(ii) = sqrt(R2) * cos(theta);<br />
y(ii) = sqrt(R2) * sin(theta);<br />
end<br />
<br />
E = [1, 0.9; 0.9, 1];<br />
[u s v] = svd(E);<br />
root_E = u * (s ^ (1 / 2));<br />
<br />
z = (root_E * [x y]')';   % transform the 2-by-n matrix of samples, then transpose back to n-by-2<br />
z(:,1) = z(:,1) + 5;<br />
z(:,2) = z(:,2) + -8;<br />
<br />
scatter(z(:,1), z(:,2))<br />
</pre><br />
<br />
This code generated the following scatter plot:<br />
<br />
[[File:scatter covar.jpg|center|500px]]<br />
<br />
In Matlab, we can also use the function "sqrtm()" or "chol()" (Cholesky Decomposition) to calculate the square root of a matrix directly. Note that the resulting root matrices may be different, but this does not materially affect the simulation.<br />
Here is an example:<br />
<br />
<pre><br />
E = [1, 0.9; 0.9, 1];<br />
r1 = sqrtm(E);<br />
r2 = chol(E);<br />
</pre><br />
<br />
R code for a multivariate normal distribution:<br />
<br />
<pre><br />
n=10000;<br />
r2<--2*log(runif(n));<br />
theta<-2*pi*(runif(n));<br />
x<-sqrt(r2)*cos(theta);<br />
<br />
y<-sqrt(r2)*sin(theta);<br />
a<-matrix(c(x,y),nrow=n,byrow=F);<br />
e<-matrix(c(1,.9,.9,1),nrow=2,byrow=T);<br />
svde<-svd(e);<br />
root_e<-svde$u %*% diag(sqrt(svde$d));<br />
z<-t(root_e %*%t(a));<br />
z[,1]=z[,1]+5;<br />
z[,2]=z[,2]+ -8;<br />
par(pch=19);<br />
plot(z,col=rgb(1,0,0,alpha=0.06))<br />
</pre><br />
<br />
[[File:m_normal.png|center|500px]]<br />
<br />
=====Remarks=====<br />
MATLAB's randn function uses the ziggurat method to generate normally distributed samples. It is an efficient rejection method based on covering the probability density function with a set of horizontal rectangles so as to obtain points within each rectangle. It is reported that a 800 MHz Pentium III laptop can generate over 10 million random numbers from the normal distribution in less than one second. ([http://www.mathworks.com/company/newsletters/news_notes/clevescorner/spring01_cleve.html Reference])<br />
<br />
===Sampling From Binomial Distributions===<br />
<br />
In order to generate a sample x from <math>\displaystyle X \sim Bin(n, p)</math>, we can follow the following procedure:<br />
<br />
1. Generate n uniform random numbers sampled from <math>\displaystyle Unif [0, 1] </math>: <math>\displaystyle u_1, u_2, ..., u_n</math>.<br />
<br />
2. Set x to be the number of indices <math>\displaystyle i</math>, <math>\displaystyle 1 \le i \le n</math>, for which <math>\displaystyle u_i \le p</math>.<br />
<br />
In MatLab this can be coded with a single line. The following generates a sample from <math>\displaystyle X \sim Bin(n, p)</math> <br />
<br />
>> sum(rand(n, 1) <= p, 1)<br />
<br />
==Bayesian Inference and Frequentist Inference - October 4, 2011==<br />
<br />
===Bayesian inference vs Frequentist inference===<br />
The Bayesian method has become popular in the last few decades as simulation and computer technology makes it more applicable. For more information about its history and application, please refer to http://en.wikipedia.org/wiki/Bayesian_inference.<br />
As for frequentists, please refer to http://en.wikipedia.org/wiki/Frequentist_inference.<br />
<br />
====Example====<br />
Consider: A person drinks a cup of coffee on a specific day.<br />
<br><br><br />
Frequentist: Such a statement has no frequency interpretation. The event occurred only once, so there is no long-run frequency to refer to, and hence no probability can be assigned to it.<br />
<br><br />
Bayesian: Probability is not only about long-run frequencies; it expresses the degree of belief you hold about the event.<br />
<br />
<br />
====Example of face identification====<br />
Take the face image as the input x and the person as the output y. The person can be either Ali or Tom. If it is Ali, y=1. Otherwise, y=0. We can divide the picture into 100*100 pixels and stack them into a 10,000*1 column vector, which is x.<br />
<br />
If you are a frequentist, you would compare Pr(X=x|y=1) with Pr(X=x|y=0) and see which one is higher. But if you are a Bayesian, you would compare Pr(y=1|X=x) with Pr(y=0|X=x).<br />
<br />
====Summary of differences between two schools====<br />
*Frequentist: Probability refers to limiting relative frequency. (objective)<br />
*Bayesian: Probability describes degree of belief not frequency. (subjective)<br />
e.g. The probability that you drank a cup of tea on May 20, 2001 is 0.62 does not refer to any frequency.<br />
----<br />
*Frequentist: Parameters are fixed, unknown constants.<br />
*Bayesian: Parameters are random variables and we can make probabilistic statement about them.<br />
----<br />
*Frequentist: Statistical procedures should have long run frequency probabilities.<br />
e.g. a 95% confidence interval should contain the true value of the parameter in at least 95% of repeated experiments, in the long run<br />
*Bayesian: It makes inferences about <math>\theta</math> by producing a probability distribution for <math>\theta</math>. Inference (e.g. point estimation) will be extracted from this distribution.<br />
<br />
====Bayesian inference====<br />
<br />
Bayesian inference is usually carried out in the following way:<br />
<br />
1. Choose a prior probability density function of <math>\!\theta</math> which is <math>f(\!\theta)</math>. This is our belief about <math>\theta</math> before we see any data.<br />
<br />
2. Choose a statistical model <math>\displaystyle f(x|\theta)</math> that reflects our beliefs about X.<br />
<br />
3. After observing data <math>\displaystyle x_1,...,x_n</math>, we update our beliefs and calculate the posterior probability.<br />
<br />
<math>f(\theta|x) = \frac{f(\theta,x)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{\int^{}_\theta f(x|\theta) \cdot f(\theta) d\theta}</math>, where <math>\displaystyle f(\theta|x)</math> is the posterior probability, <math>\displaystyle f(\theta)</math> is the prior probability, <math>\displaystyle f(x|\theta)</math> is the likelihood of observing X=x given <math>\!\theta</math> and f(x) is the marginal probability of X=x.<br />
<br />
If we have i.i.d. observations <math>\displaystyle x_1,...,x_n</math>, we can replace <math>\displaystyle f(x|\theta)</math> with <math>f({x_1,...,x_n}|\theta)=\prod_{i=1}^n f(x_i|\theta)</math> because of independence.<br />
<br />
We denote <math>\displaystyle f({x_1,...,x_n}|\theta)</math> as <math>\displaystyle L_n(\theta)</math> which is called likelihood. And we use <math>\displaystyle x^n</math> to denote <math>\displaystyle (x_1,...,x_n)</math>.<br />
<br />
<math>f(\theta|x^n) = \frac{f(x^n|\theta) \cdot f(\theta)}{f(x^n)}=\frac{f(x^n|\theta) \cdot f(\theta)}{\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta}</math> , where <math>\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta</math> is a constant <math>\displaystyle c_n</math>. So <math>f(\theta|x^n) \propto f(x^n|\theta) \cdot f(\theta)</math>. The posterior probability is proportional to the likelihood times prior probability.<br />
<br />
<math>E(\theta)=\int^{}_\theta \theta \cdot f(\theta|x^n) d\theta</math> which is the posterior mean of <math>\!\theta</math>.<br />
<br />
Let <math>\tilde{\theta}=(\theta_1,...,\theta_d)^T</math>, then <math>f(\theta_1|x^n) = \int^{} \int^{} \dots \int^{}f(\theta|X)d\theta_2d\theta_3 \dots d\theta_d </math> and <math>E(\theta_1)=\int^{}\theta_1 \cdot f(\theta_1|x^n) d\theta_1</math><br />
<br />
====Example 1: Estimating parameters of a univariate Gaussian distribution====<br />
<br />
Suppose X follows a univariate Gaussian distribution (i.e. a Normal distribution) with parameters <math>\!\mu</math> and <br />
<math>\displaystyle {\sigma^2}</math>.<br />
<br />
(a) For Frequentists:<br />
<br />
<math>f(x|\theta)= \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}</math><br />
<br />
<math>L_n(\theta)= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2}</math><br />
<br />
<br />
<math>\ln L_n(\theta) = l(\theta) = \sum_{i=1}^n -\frac{1}{2}\ln 2\pi-\ln \sigma-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2</math><br />
<br />
To get the maximum likelihood estimator of <math>\!\mu</math> (mle), we find the <math>\hat{\mu}</math> which maximizes <math>\displaystyle L_n(\theta)</math>:<br />
<br />
<math>\frac{\partial l(\theta)}{\partial \mu}= \sum_{i=1}^n \frac{1}{\sigma}(\frac{x_i-\mu}{\sigma})=0 \Rightarrow \sum_{i=1}^n x_i = n\mu \Rightarrow \hat{\mu}_{mle}=\bar{x}</math><br />
<br />
(b) For Bayesians:<br />
<br />
<math>f(\theta|x) \propto f(x|\theta) \cdot f(\theta)</math><br />
<br />
We assume that the mean of the above normal distribution is itself distributed normally with mean <math>\!\mu_0</math> and variance <math>\!\Gamma^2</math>.<br />
<br />
Suppose <math>\!\mu\sim N(\mu_0, \!\Gamma^2</math>),<br />
<br />
so <math>f(\mu) = \frac{1}{\sqrt{2\pi}\Gamma} \cdot e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\Gamma})^2}</math><br />
<br />
<math>f(\mu|x) = \frac{1}{\sqrt{2\pi}\tilde{\sigma}} \cdot e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
<br />
<math>\tilde{\mu} = \frac{\frac{n}{\sigma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\bar{x}+\frac{\frac{1}{\Gamma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\mu_0</math>, where <math>\tilde{\mu}</math> is the estimator of <math>\!\mu</math>.<br />
<br />
* If prior belief about <math>\!\mu_0</math> is strong, then <math>\!\Gamma</math> is small and <math>\frac{1}{\Gamma^2}</math> is large. <math>\tilde{\mu}</math> is close to <math>\!\mu_0</math> and the observations will not affect too much. On the contrary, if prior belief about <math>\!\mu_0</math> is weak, <math>\!\Gamma</math> is large and <math>\frac{1}{\Gamma^2}</math> is small. <math>\tilde{\mu}</math> depends more on observations.(This is intuitive, when our original belief is reliable, then the sample is not important in improving the result; when the belief is not reliable, then we depend a lot on the sample.)<br />
<br />
* When the sample is large (i.e. n <math>\to \infty</math>), <math>\tilde{\mu} \to \bar{x}</math> and the impact of prior belief about <math>\!\mu</math> is weakened.<br />
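The posterior mean formula is easy to evaluate numerically. The following MATLAB sketch uses made-up data and prior values (all numbers are ours, and the data variance <math>\displaystyle \sigma^2</math> is assumed known) and compares <math>\tilde{\mu}</math> with the sample mean.<br />
<pre><br />
% Sketch: posterior mean of mu for normal data with known variance (all numbers are made up)<br />
n = 20;<br />
sigma = 2;                                   % known data standard deviation<br />
mu0 = 0; Gamma = 1;                          % prior: mu ~ N(mu0, Gamma^2)<br />
x = sigma*randn(n,1) + 3;                    % simulated data with true mean 3<br />
xbar = mean(x);<br />
w = (n/sigma^2) / (n/sigma^2 + 1/Gamma^2);   % weight given to the sample mean<br />
mu_tilde = w*xbar + (1-w)*mu0                % posterior mean lies between xbar and mu0<br />
</pre><br />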
<br />
=='''Basic Monte Carlo Integration - October 6th, 2011'''==<br />
<br />
Three integration methods would be taught in this course:<br />
*Basic Monte Carlo Integration<br />
*Importance Sampling<br />
*Markov Chain Monte Carlo (MCMC)<br />
<br />
The first, and most basic, method of numerical integration we will see is Monte Carlo Integration. We use this to solve an integral of the form: <math> I = \int_{a}^{b} h(x) dx </math><br />
<br />
Note the following derivation: <br />
<br />
<math>\begin{align}<br />
\displaystyle I & = \int_{a}^{b} h(x)dx \\<br />
& = \int_{a}^{b} h(x)((b-a)/(b-a))dx \\<br />
& = \int_{a}^{b} (h(x)(b-a))(1/(b-a))dx \\<br />
& = \int_{a}^{b} w(x)f(x)dx \\<br />
& = E[w(x)] \\<br />
\end{align}<br />
</math><br />
<br />
which can be approximated by <math>\frac{1}{n} \sum_{i=1}^{n} w(x_i) </math>.<br />
<br />
where w(x) = h(x)(b-a) and f(x) is the probability density function of a uniform random variable on the interval [a,b]. The expectation of w with respect to f is approximated by averaging w over n samples of x drawn from f.<br />
<br />
<br />
===='''General Procedure'''====<br />
<br />
i) Draw n samples <math> x_i \sim~ U[a,b] </math><br />
<br />
ii) Compute <math> \ w(x_i) </math> for every sample<br />
<br />
iii) Obtain an estimate of the integral, <math> \hat{I} </math>, as follows:<br />
<br />
<math> \hat{I} = \frac{1}{n} \sum_{i=1}^{n} w(x_i)</math>. Clearly, this is just the average of the simulation results.<br />
<br />
By the strong law of large numbers <math> \hat{I} </math> converges to <math> \ I </math> as <math> \ n \rightarrow \infty </math>. Because of this, we can compute all sorts of useful information, such as variance, standard error, and confidence intervals.<br />
<br />
Standard Error: <math> SE = \sqrt{V} / \sqrt{n} </math><br />
<br />
Variance: <math> V = \frac{\sum_{i=1}^{n} (w(x_i)-\hat{I})^2}{n-1} </math><br />
<br />
Confidence Interval: <math> \hat{I} \pm t_{(\alpha/2)} SE </math><br />
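As a concrete illustration of these formulas, the sketch below computes the estimate, its standard error and an approximate 95% confidence interval for the toy integral treated in the next example; using 1.96 instead of the exact t quantile is our simplification for large n.<br />
<pre><br />
% Sketch: Monte Carlo estimate of int_0^1 x^3 dx, with standard error and an approximate 95% CI<br />
n = 10000;<br />
u = rand(1, n);                          % samples from Unif[0,1]<br />
w = u.^3;                                % w(x) = h(x)(b-a) = x^3 here, since b-a = 1<br />
I_hat = mean(w);                         % point estimate (true value is 0.25)<br />
SE = std(w)/sqrt(n);                     % standard error<br />
CI = [I_hat - 1.96*SE, I_hat + 1.96*SE]  % approximate 95% confidence interval<br />
</pre><br />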
<br />
==='''Example: Uniform Distribution'''===<br />
<br />
Consider the integral, <math> \int_{0}^{1} x^3dx </math>, which is easily solved through standard analytical integration methods, and is equal to .25. Now, let us check this answer with a numerical approximation using Monte Carlo Integration. <br />
<br />
We generate a 1 by 10000 vector of uniform (on the interval [0,1]) random variables and call that vector 'u'. We see that our 'w' in this case is <math> x^3 </math>, so we set <math> w = u^3 </math>. Our estimate <math>\hat{I}</math> is equal to the mean of w.<br />
<br />
In Matlab, we can solve this integration problem with the following code:<br />
<br />
<pre><br />
u = rand(1,10000);<br />
w = u.^3;<br />
mean(w)<br />
ans = 0.2475<br />
</pre><br />
<br />
Note the '.' after 'u' in the second line of code, indicating that each entry in the matrix is cubed. Also, our approximation is close to the actual value of .25. Now let's try to get an even better approximation by generating more sample points. <br />
<br />
<pre><br />
u= rand(1,100000);<br />
w= u.^3;<br />
mean(w)<br />
ans = .2503<br />
</pre><br />
<br />
We see that when the number of sample points is increased, our approximation improves, as one would expect.<br />
<br />
==='''Generalization'''===<br />
<br />
Up to this point we have seen how to numerically approximate an integral when the distribution of f is uniform. Now we will see how to generalize this to other distributions.<br />
<br />
<math> I = \int h(x)f(x)dx </math> <br />
<br />
If f is a distribution function (pdf), then <math> I </math> can be estimated as E<sub>f</sub>[h(x)]. This means taking the expectation of h with respect to the distribution of f. Our previous example is the case where f is the uniform distribution between [a,b].<br />
<br />
'''Procedure for the General Case'''<br />
<br />
i) Draw n samples from f <br />
<br />
ii) Compute h(x<sub>i</sub>)<br />
<br />
iii) <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)</math><br />
<br />
==='''Example: Exponential Distribution'''===<br />
<br />
Find <math> E[\sqrt{x}] </math> for <math> \displaystyle f = e^{-x} </math>, which is the exponential distribution with mean 1.<br />
<br />
<math> I = \int_{0}^{\infty} \sqrt{x} e^{-x}dx </math><br />
<br />
We can see that we must draw samples from f, the exponential distribution.<br />
<br />
To find a numerical solution using Monte Carlo Integration we see that: <br />
<br />
u= rand(1,10000)<br />
X= -log(u)<br />
h= <math> \sqrt{x} </math> <br />
I= mean(h)<br />
<br />
To implement this procedure in Matlab, use the following code:<br />
<br />
<pre><br />
u = rand(1,10000);<br />
X = -log(u);<br />
h = X.^.5;<br />
mean(h)<br />
ans = .8841<br />
</pre><br />
<br />
An easy way to check whether your approximation is correct is to use the built in Matlab function 'quadl' which takes a function and bounds for the integral and returns a solution for the definite integral of that function. For this specific example, we can enter:<br />
<br />
<pre><br />
f = @(x) sqrt(x).*exp(-x);<br />
% quadl runs into computational problems when the upper bound is "inf" or an extremely large number, <br />
% so choose just a moderately large number.<br />
quadl(f,0,100)<br />
ans =<br />
0.8862<br />
</pre><br />
<br />
From the above result, we see that our approximation was quite close.<br />
<br />
==='''Example: Normal Distribution'''===<br />
<br />
Let <math> f(x) = (1/(2 \pi)^{1/2}) e^{(-x^2)/2} </math>. Compute the cumulative distribution function at some point x.<br />
<br />
<math> F(x)= \int_{-\infty}^{x} f(s)ds = \int_{-\infty}^{x}(1)(1/(2 \pi)^{1/2}) e^{(-s^2)/2}ds </math>. The (1) is inserted to illustrate that our h(x) will be the constant function 1, and our f(x) is the normal distribution. To take into account the upper bound of integration, x, any values sampled that are greater than x will be set to zero. <br />
<br />
This is the Matlab code for solving F(2):<br />
<br />
<pre><br />
<br />
u = randn(1,10000)<br />
h = u < 2;<br />
mean(h)<br />
ans = .9756<br />
<br />
</pre><br />
<br />
We generate a 1 by 10000 vector of standard normal random variables and we return a value of 1 if u is less than 2, and 0 otherwise.<br />
<br />
We can also build the function F(x) in matlab in the following way:<br />
<br />
<pre><br />
function p = F(x)<br />
% estimate the standard normal cdf at x by simulation<br />
u = randn(1,1000000);<br />
h = u < x;<br />
p = mean(h);<br />
</pre><br />
<br />
<br />
==='''Example: Binomial Distribution'''===<br />
<br />
In this example we will see the Bayesian Inference for 2 Binomial Distributions.<br />
<br />
Let <math> X \sim Bin(n,p) </math> and <math> Y \sim Bin(m,q) </math>, and let <math> \!\delta = p-q </math>.<br />
<br />
The frequentist estimate is <math> \displaystyle \!\hat\delta = x/n - y/m </math>.<br />
<br />
Bayesian wants <math> \displaystyle f(p,q|x,y) = f(x,y|p,q)f(p,q)/f(x,y) </math>, where <math> f(x,y)=\iint\limits_{\!\theta} f(x,y|p,q)f(p,q)\,dp\,dq</math> is a constant.<br />
<br />
Thus, <math> \displaystyle f(p,q|x,y)\propto f(x,y|p,q)f(p,q) </math>. Now we assume that <math>\displaystyle f(p,q) = f(p)f(q) = 1 </math> and f(p) and f(q) are uniform.<br />
<br />
Therefore, <math> \displaystyle f(p,q|x,y)\propto p^x(1-p)^{n-x}q^y(1-q)^{m-y} </math>.<br />
<br />
<math> E[\delta] = \int_{0}^{1} \int_{0}^{1} (p-q)f(p,q|x,y)\,dp\,dq </math>.<br />
<br />
As you can see this is much tougher than the frequentist approach.<br />
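Still, with the uniform priors used above the posterior factorizes into two independent Beta distributions, <math>\displaystyle p|x \sim Beta(x+1, n-x+1)</math> and <math>\displaystyle q|y \sim Beta(y+1, m-y+1)</math>, so <math>\displaystyle E[\delta]</math> can be estimated by simple Monte Carlo. A sketch (with made-up counts; betarnd requires the Statistics Toolbox) is:<br />
<pre><br />
% Sketch: Monte Carlo estimate of E[delta] = E[p - q | x, y] under uniform priors (made-up counts)<br />
n = 100; x = 60;                         % x successes out of n in the first binomial<br />
m = 100; y = 45;                         % y successes out of m in the second binomial<br />
N = 100000;<br />
p = betarnd(x+1, n-x+1, N, 1);           % posterior draws of p<br />
q = betarnd(y+1, m-y+1, N, 1);           % posterior draws of q<br />
delta = p - q;<br />
mean(delta)                              % Monte Carlo estimate of E[delta]<br />
</pre><br />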
<br />
=='''Importance Sampling and Basic Monte Carlo Integration - October 11th, 2011'''==<br />
<br />
==='''Example: Binomial Distribution (Continued)'''===<br />
<br />
Suppose we are given two independent Binomial Distributions <math>\displaystyle X \sim Bin(n, p_1)</math>, <math>\displaystyle Y \sim Bin(m, p_2)</math>. We would like to give a Monte Carlo estimate of <math>\displaystyle \delta = p_1 - p_2</math><br><br />
<br />
Frequentist approach: <br><br><math>\displaystyle \hat{p_1} = \frac{X}{n}</math> ; <math>\displaystyle \hat{p_2} = \frac{Y}{m}</math><br><br><math>\displaystyle \hat{\delta} = \hat{p_1} - \hat{p_2} = \frac{X}{n} - \frac{Y}{m}</math><br><br><br />
Bayesian approach to compute the expected value of <math>\displaystyle \delta</math>:<br><br><br />
<math>\displaystyle E(\delta) = \int\int(p_1-p_2) f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Assume that <math>\displaystyle n = 100, m = 100, p_1 = 0.5, p_2 = 0.8</math> and the sample size is 1000.<br><br />
MATLAB code of the above example:<br />
<pre><br />
n = 100;<br />
m = 100;<br />
p_1 = 0.5;<br />
p_2 = 0.8;<br />
p1 = mean(rand(n,1000)<p_1);<br />
p2 = mean(rand(m,1000)<p_2);<br />
delta = p2 - p1;<br />
hist(delta)<br />
mean(delta)<br />
</pre><br />
<br />
In one execution of the code, the mean of delta was 0.3017. The histogram of delta generated was:<br />
[[File:Hist delta.jpg|center|]]<br />
<br />
Through Monte Carlo simulation, we can obtain an empirical distribution of delta and carry out inference on the data obtained, such as computing the mean, maximum, variance, standard deviation and the standard error of delta.<br />
<br />
==='''Importance Sampling'''===<br />
<br />
====Motivation====<br />
<br />
Consider the integral <math>\displaystyle I = \int h(x)f(x)\,dx</math><br><br><br />
According to basic Monte Carlo Integration, if we can sample from the probability density function <math>\displaystyle f(x)</math> and feed the samples of <math>\displaystyle f(x)</math> back to <math>\displaystyle h(x)</math>, <math>\displaystyle I</math> can be estimated as an average of <math>\displaystyle h(x)</math> ( i.e. <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)</math> )<br><br />
However, the Monte Carlo method works when we know how to sample from <math>\displaystyle f(x)</math>. In the case where it is difficult to sample from <math>\displaystyle f(x)</math>, importance sampling is a technique that we can apply. Importance Sampling relies on another function <math>\displaystyle g(x)</math> which we know how to sample from.<br />
<br />
The above integral can be rewritten as follow:<br><br />
<math>\begin{align}<br />
\displaystyle I & = \int h(x)f(x)\,dx \\<br />
& = \int h(x)f(x)\frac{g(x)}{g(x)}\,dx \\<br />
& = \int \frac{h(x)f(x)}{g(x)}g(x)\,dx \\<br />
& = \int y(x)g(x)\,dx \\<br />
& = E_g(y(x)) \\<br />
\end{align}<br />
</math><br><br />
<math>where \ y(x) = \frac{h(x)f(x)}{g(x)}</math><br><br />
<br />
The integral can thus be simulated as <math>\displaystyle \hat{I} = \frac{1}{n} \sum_{i=1}^{n} Y_i \ , \ where \ Y_i = \frac{h(x_i)f(x_i)}{g(x_i)}</math><br><br />
<br />
====Procedure====<br />
<br />
Suppose we know how to sample from <math>\displaystyle g(x)</math><br><br />
#Choose a suitable <math>\displaystyle g(x)</math> and draw n samples <math>x_1,x_2....,x_n \sim~ g(x)</math><br />
#Set <math>Y_i =\frac{h(x_i)f(x_i)}{g(x_i)}</math><br />
#Compute <math> \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i </math><br><br />
<br />
By the Law of large numbers, <math>\displaystyle \hat{I} \rightarrow I </math> provided that the sample size n is large enough.<br><br><br />
<br />
'''Remarks:''' One can think of <math>\frac{f(x)}{g(x)}</math> as a weight to <math>\displaystyle h(x)</math> in the computation of <math>\hat{I}</math><br><br><br />
<math>\displaystyle i.e. \ \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n}\sum_{i=1}^{n} (\frac{f(x_i)}{g(x_i)})h(x_i)</math><br><br><br />
Therefore, <math>\displaystyle \hat{I} </math> is a weighted average of <math>\displaystyle h(x_i)</math><br><br><br />
<br />
====Problem====<br />
<br />
If <math>\displaystyle g(x)</math> is not chosen appropriately, then the variance of the estimate <math>\hat{I}</math> may be very large. Here we face a problem similar to the one in the acceptance/rejection approach. Consider the second moment of <math>\displaystyle y(x)</math> under <math>\displaystyle g</math>:<br><br><br />
<math>\begin{align}<br />
\displaystyle E_g\left[(y(x))^2\right] & = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx \\<br />
& = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx \\<br />
& = \int \frac{h^2(x)f^2(x)}{g(x)} dx \\<br />
\end{align}<br />
</math><br><br><br />
<br />
When <math>\displaystyle g(x)</math> is very small, then the above integral could be very large, hence the variance can be very large when g is not chosen appropriately. This occurs when <math>\displaystyle g(x)</math> has a thinner tail than <math>\displaystyle f(x)</math> such that the quantity <math>\displaystyle \frac{h^2(x)f^2(x)}{g(x)}</math> is large.<br />
<br />
'''Remarks:''' <br />
<br />
1. We can actually compute the form of <math>\displaystyle g(x)</math> to have optimal variance. <br>Mathematically, it is to find <math>\displaystyle g(x)</math> subject to <math>\displaystyle \min_g [\ E_g([y(x)]^2) - (E_g[y(x)])^2\ ]</math><br><br />
It can be shown that the optimal <math>\displaystyle g(x)</math> is <math>\displaystyle \frac{|h(x)|f(x)}{\int_{-\infty}^{\infty}|h(s)|f(s)ds}</math>. Using the optimal <math>\displaystyle g(x)</math> will minimize the variance of the estimate in Importance Sampling. This is of theoretical interest but not useful in practice: to write down this optimal <math>\displaystyle g(x)</math> we would already need the value of the integral, which is exactly what we are trying to compute in the first place.<br />
<br />
2. In practice, we shall choose <math>\displaystyle g(x)</math> which has similar shape as <math>\displaystyle f(x)</math> but with a thicker tail than <math>\displaystyle f(x)</math> in order to avoid the problem mentioned above.<br><br />
<br />
====Example====<br />
<br />
Estimate <math>\displaystyle I = Pr(Z>3),\ where\ Z \sim N(0,1)</math><br><br><br />
'''Method 1: Basic Monte Carlo'''<br />
<br />
<math>\begin{align} Pr(Z>3) & = \int^\infty_3 f(x)\,dx \\<br />
& = \int^\infty_{-\infty} h(x)f(x)\,dx \end{align}</math><br /><br />
<math> where \ <br />
h(x) = \begin{cases}<br />
0, & \text{if } x \le 3 \\<br />
1, & \text{if } x > 3<br />
\end{cases}</math><br />
<math>\ ,\ f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}</math><br />
<br />
MATLAB code to compute <math>\displaystyle I</math> from 100 samples of standard normal distribution:<br />
<pre><br />
h = randn(100,1) > 3;<br />
I = mean(h)<br />
</pre><br />
<br />
In one execution of the code, it returns a value of 0 for <math>\displaystyle I</math>, which differs significantly from the true value of <math>\displaystyle I \approx 0.0013 </math>. The problem of using Basic Monte Carlo in this example is that <math>\displaystyle Pr(Z>3)</math> has a small value, and hence many points sampled from the standard normal distribution will be wasted. Therefore, although Basic Monte Carlo is a feasible method to compute <math>\displaystyle I</math>, it gives a poor estimation.<br />
<br />
'''Method 2: Importance Sampling'''<br />
<br />
<math>\displaystyle I = Pr(Z>3)= \int^\infty_3 f(x)\,dx </math><br><br />
<br />
To apply importance sampling, we have to choose a <math>\displaystyle g(x)</math> which we will sample from. In this example, we can choose <math>\displaystyle g(x)</math> to be the probability density function of exponential distribution, normal distribution with mean 0 and variance greater than 1 or normal distribution with mean greater than 0 and variance 1 etc.. For the following, we take <math>\displaystyle g(x)</math> to be the pdf of <math>\displaystyle N(4,1)</math>.<br><br />
<br />
Procedure:<br />
#Draw n samples <math>x_1,x_2....,x_n \sim~ g(x)</math><br />
#Calculate <math>\begin{align} \frac{f(x)}{g(x)} & = \frac{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}<br />
}{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x-4)^2} } \\<br />
& = e^{8-4x} \end{align} </math><br><br />
#Set <math> Y_i = h(x_i)e^{8-4x_i}\ with\ h(x) = \begin{cases}<br />
0, & \text{if } x \le 3 \\<br />
1, & \text{if } x > 3<br />
\end{cases}<br />
</math><br><br />
#Compute <math> \hat{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i </math><br><br />
<br />
The above procedure from 100 samples of <math>\displaystyle g(x)</math>can be implemented in MATLAB as follow:<br />
<pre><br />
for ii = 1:100<br />
x = randn + 4 ;<br />
h = x > 3 ;<br />
y(ii) = h * exp(8-4*x) ;<br />
end<br />
mean(y)<br />
</pre><br />
<br />
In one execution of the code, it returns a value of 0.001271 for <math> \hat{Y} </math>, which is much closer to the true value of <math>\displaystyle I \approx 0.0013 </math>. From many executions of the code, the variance of basic monte carlo is approximately 150 times that of importance sampling. This demonstrates that this method can provide a better estimate than the Basic Monte Carlo method.<br />
<br />
==''' Importance Sampling with Normalized Weight and Markov Chain Monte Carlo - October 13th, 2011'''==<br />
==='''Importance Sampling with Normalized Weight'''===<br />
<br />
Recall that we can think of <math>\displaystyle b(x) = \frac{f(x)}{g(x)}</math> as a weight applied to the samples <math>\displaystyle h(x)</math>. If the form of <math>\displaystyle f(x)</math> is known only up to a constant, we can use an alternate, normalized form of the weight, <math>\displaystyle b^*(x)</math>. (This situation arises in Bayesian inference.) Importance sampling with normalized or standard weight is also called indirect importance sampling.<br />
<br />
We derive the normalized weight as follows:<br><br />
<math>\begin{align}<br />
\displaystyle I & = \int h(x)f(x)\,dx \\<br />
&= \int h(x)\frac{f(x)}{g(x)}g(x)\,dx \\<br />
&= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int f(x) dx} \\<br />
&= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int \frac{f(x)}{g(x)}g(x) dx} \\<br />
&= \frac{\int h(x)b(x)g(x)\,dx}{\int\ b(x)g(x) dx} <br />
\end{align}</math><br />
<br />
<math>\hat{I}= \frac{\sum_{i=1}^{n} h(x_i)b(x_i)}{\sum_{i=1}^{n} b(x_i)} </math><br />
<br />
Then, the normalized weight is <math>b^*(x) = \displaystyle \frac{b(x_i)}{\sum_{i=1}^{n} b(x_i)}</math><br />
<br />
Note that <math> \int f(x) dx = \int b(x)g(x) dx = 1 </math><br />
<br />
We can also determine the associated Monte Carlo variance of this estimate by<br />
<br />
<math> Var(\hat{I})= \frac{\sum_{i=1}^{n} b(x_i)(h(x_i) - \hat{I})^2}{\sum_{i=1}^{n} b(x_i)} </math><br />
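<br />
A minimal sketch (not from the lecture) of this estimator, applied to the earlier <math>\displaystyle Pr(Z>3)</math> example with <math>\displaystyle g(x)</math> taken to be the pdf of <math>\displaystyle N(4,1)</math>, might look like:<br />
<br />
<pre><br />
% Sketch: importance sampling with normalized weights for Pr(Z>3),<br />
% using g = N(4,1); the weight is b(x) = f(x)/g(x) = exp(8-4x).<br />
n = 10000;<br />
x = randn(1,n) + 4;                        % draw n samples from g<br />
b = exp(8 - 4*x);                          % weights b(x_i)<br />
h = x > 3;                                 % h(x_i)<br />
I_hat = sum(h .* b) / sum(b)               % normalized-weight estimate of I<br />
v_hat = sum(b .* (h - I_hat).^2) / sum(b)  % associated Monte Carlo variance<br />
</pre><br />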
<br />
==='''Markov Chain Monte Carlo'''===<br />
We still want to solve <math> I = \displaystyle\int^\ h(x)f(x)\,dx </math><br />
<br />
====Stochastic Process====<br />
A stochastic process <math> \{ x_t : t \in T \}</math> is a collection of random variables. Variables <math>\displaystyle x_t</math> take values in some set <math>\displaystyle X</math> called the '''state space.''' The set <math>\displaystyle T</math> is called the '''index set.'''<br />
<br />
====Markov Chain====<br />
A Markov Chain is a stochastic process for which the distribution of <math>\displaystyle x_t</math> depends only on <math>\displaystyle x_{t-1}</math>. It is a random process characterized as being memoryless; meaning that the next occurrence of a defined event is only dependent on the current event and not on the sequence of events that preceded it. <br />
Formal Definition: The process <math> \{ x_t : t \in T \}</math> is a Markov Chain if <math>\displaystyle Pr(x_t|x_0, x_1,..., x_{t-1})= Pr(x_t|x_{t-1})</math> for all <math> \{t \in T \}</math> and for all <math> \{x \in X \}</math><br />
For a Markov Chain, <math>\displaystyle f(x_1,...x_n)= f(x_1)f(x_2|x_1)f(x_3|x_2)...f(x_n|x_{n-1})</math><br />
<br><br>Real Life Example:<br />
<br>When going for an interview, the employer only looks at your highest education achieved. The employer does not look at the past education received (elementary school, high school, etc.) because the employer believes that the highest education achieved summarizes your previous education. Therefore, anything before your most recent education is irrelevant. In other words, <math> x_t </math> is regarded as a summary of <math>x_{t-1},...,x_2,x_1</math>, so when we need to determine <math>x_{t+1}</math>, we only need to pay attention to <math>x_{t}</math>.<br />
<br />
====Transition Probabilities====<br />
A Transition Probability is the probability of jumping from one state to another state.<br />
Formal Definition: We call <math>\displaystyle P_{ij} = Pr(x_{t+1}=j|x_t=i)</math> the transition probability.<br />
That is, P(i,j) is the probability of going to state j from state i. The matrix P whose (i,j) element is <math>\displaystyle P_{ij}</math> is called the Transition Matrix.<br />
<br />
Properties of P: <br />
:1) <math>\displaystyle P_{ij} \geq 0</math> The probability of going to another state cannot be negative<br />
:2) <math>\displaystyle \sum_{\forall j}P_{ij} = 1</math> The probability of going from state i to some state (including remaining in state i) is a certainty<br />
<br />
====Random Walk====<br />
Example: Start at one point and flip a coin where <math>\displaystyle Pr(H)=p</math> and <math>\displaystyle Pr(T)=1-p=q</math>. Take one step right if heads and one step left if tails. If at an endpoint, stay there.<br />
The transition matrix is<br />
<math>P=\left(\begin{matrix}1&0&0&\dots&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\<br />
\vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&\dots&\dots&1<br />
\end{matrix}\right)</math><br />
<br />
Let <math>\displaystyle P_n</math> be the matrix such that its (i,j) element is <math>\displaystyle P_{ij}(n)</math>. This is called n-step probability.<br />
<br />
:<math>\displaystyle P_n = P^n</math><br />
:<math>\displaystyle P_1 = P</math><br />
:<math>\displaystyle P_2 = P^2</math><br />
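<br />
As a small illustrative sketch (not from the notes), assume a random walk on 5 states with <math>\displaystyle p=0.4</math> and absorbing endpoints; the transition matrix and its n-step powers can be computed in Matlab as:<br />
<br />
<pre><br />
% Sketch: random walk on {1,...,5} with absorbing endpoints, p = 0.4<br />
p = 0.4; q = 1 - p; N = 5;<br />
P = zeros(N);<br />
P(1,1) = 1; P(N,N) = 1;        % endpoints are absorbing<br />
for i = 2:N-1<br />
    P(i,i-1) = q;              % step left with probability q<br />
    P(i,i+1) = p;              % step right with probability p<br />
end<br />
P5 = P^5;                      % 5-step transition probabilities, P_5 = P^5<br />
disp(P5)                       % each row still sums to 1<br />
</pre><br />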
<br />
<br />
==''' Markov Chain Properties and Page Rank - October 18th, 2011'''==<br />
<br />
===Summary of Terminology===<br />
<br />
====Transition Matrix====<br />
<br />
A matrix <math>\!P</math> that defines a Markov Chain has the form:<br />
<br />
<math>P = \begin{bmatrix}<br />
P_{11} & \cdots & P_{1N} \\<br />
\vdots & \ddots & \vdots \\ <br />
P_{N1} & \cdots & P_{NN}<br />
\end{bmatrix}</math><br />
<br />
where <math>\!P(i,j) = P_{ij} = Pr(x_{t+1} = j | x_t = i) </math> is the probability of transitioning from state i to state j in the Markov Chain in a single step. Note that this implies that all rows add up to one.<br />
<br />
====n-step Transition matrix====<br />
<br />
A matrix <math>\!P_n</math> whose (i,j)<sup>th</sup> entry is the probability of moving from state i to state j after n transitions:<br />
<br />
<math>\!P_n(i,j) = Pr(x_{m+n}=j|x_m = i)</math><br />
<br />
This probability is called the n-step transition probability. A nice property of this matrix is that<br />
<br />
<math>\!P_n = P^n</math><br />
<br />
for all <math>n \ge 0</math>, where P is the transition matrix. Note that the rows of <math>P_n</math> should still add up to one.<br />
<br />
====Marginal distribution of a Markov Chain====<br />
<br />
We represent the state at time t as a vector.<br />
<br />
<math>\mu_t = (\mu_t(1) \; \mu_t(2) \; ... \; \mu_t(n))</math><br />
<br />
Consider this Markov Chain:<br />
<br />
[[File:MarkovSample.png|300px]]<br />
<br />
<math>\mu_t = (A \; B)</math>, where A is the probability of being in state a at time t, and B is the probability of being in state b at time t.<br />
<br />
For example if <math>\mu_t = (0.1 \; 0.9)</math>, we have a 10% chance of being in state a at time t, and a 90% chance of being in state b at time t.<br />
<br />
Suppose we run this Markov chain many times, and record the state at each step.<br />
<br />
In this example, we run 4 trials, up until t=5.<br />
<br />
{| class="wikitable"<br />
|-<br />
! t<br />
! Trial 1<br />
! Trial 2<br />
! Trial 3<br />
! Trial 4<br />
! Observed <math>\mu</math><br />
|-<br />
| 1<br />
| a<br />
| b<br />
| b<br />
| a<br />
| (0.5, 0.5)<br />
|-<br />
| 2<br />
| b<br />
| a<br />
| a<br />
| a<br />
| (0.75, 0.25)<br />
|-<br />
| 3<br />
| a<br />
| a<br />
| b<br />
| a<br />
| (0.75, 0.25)<br />
|-<br />
| 4<br />
| b<br />
| b<br />
| a<br />
| b<br />
| (0.25, 0.75)<br />
|-<br />
| 5<br />
| b<br />
| b<br />
| b<br />
| a<br />
| (0.25, 0.75)<br />
|}<br />
<br />
Imagine simulating the chain many times. If we collect all the outcomes at time t from all the chains, the histogram of this data would look like <math>\!\mu_t</math>.<br />
<br />
We can find the marginal probabilities as <math>\!\mu_n = \mu_0 P^n</math><br />
<br />
====Stationary Distribution====<br />
<br />
Let <math>\pi = (\pi_i \mid i \in \chi)</math> be a vector of non-negative numbers that sum to 1. (i.e. <math>\!\pi</math> is a pmf)<br />
<br />
If <math>\!\pi = \pi P</math>, then <math>\!\pi</math> is a stationary distribution, also known as an invariant distribution.<br />
<br />
====Limiting Distribution====<br />
<br />
A Markov chain has limiting distribution <math>\!\pi </math> if <math>\lim_{n \to \infty} P^n = \begin{bmatrix} \pi \\ \vdots \\ \pi \end{bmatrix}</math><br />
<br />
That is, <math>\!\pi_j = \lim_{n \to \infty}\left [ P^n \right ]_{ij}</math> exists and is independent of i.<br />
<br />
Here is an example:<br />
<br />
Suppose we want to find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/3&1/3&1/3\\<br />
1/4&3/4&0\\<br />
1/2&0&1/2<br />
\end{matrix}\right)</math><br />
<br />
We want to solve <math>\pi=\pi P</math> and we want <math>\displaystyle \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
<math>\displaystyle \pi_0 = 1/3\pi_0 + 1/4\pi_1 + 1/2\pi_2</math><br /><br />
<math>\displaystyle \pi_1 = 1/3\pi_0 + 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_2 = 1/3\pi_0 + 1/2\pi_2</math><br /><br />
<br />
Solving the system of equations, we get <br /> <br />
<math>\displaystyle \pi_1 = 4/3\pi_0</math><br /><br />
<math>\displaystyle \pi_2 = 2/3\pi_0</math><br /><br />
<br />
So using our condition above, we have <math>\displaystyle \pi_0 + 4/3\pi_0 + 2/3\pi_0 = 1</math> and by solving we get <math>\displaystyle \pi_0 = 1/3</math><br />
<br />
Using this in our system of equations, we obtain: <br /><br />
<math>\displaystyle \pi_1 = 4/9</math><br /><br />
<math>\displaystyle \pi_2 = 2/9</math><br />
<br />
Thus, the limiting distribution is <math>\displaystyle \pi = (1/3, 4/9, 2/9)</math><br />
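<br />
As a quick numerical check (a sketch, not part of the original notes), we can raise P to a large power in Matlab and confirm that every row approaches this <math>\displaystyle \pi</math>:<br />
<br />
<pre><br />
% Sketch: verify the limiting distribution numerically<br />
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];<br />
disp(P^100)              % every row is close to (1/3, 4/9, 2/9)<br />
pi_vec = [1/3 4/9 2/9];<br />
disp(pi_vec*P)           % equals pi_vec, so pi is also stationary<br />
</pre><br />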
<br />
====Detailed Balance====<br />
<br />
<math>\!\pi</math> has the detailed balance property if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math><br />
<br />
'''Theorem'''<br />
<br />
If <math>\!\pi</math> satisfies detailed balance, then <math>\!\pi</math> is a stationary distribution.<br />
<br />
In other words, if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math>, then <math>\!\pi = \pi P</math><br />
<br />
'''Proof:''' <br />
<br />
<math>\!\pi P =<br />
\begin{bmatrix}\pi_1 & \pi_2 & \cdots & \pi_N\end{bmatrix} \begin{bmatrix}P_{11} & \cdots & P_{1N} \\ \vdots & \ddots & \vdots \\ P_{N1} & \cdots & P_{NN}\end{bmatrix}</math><br />
<br />
Observe that the j<sup>th</sup> element of <math>\!\pi P</math> is<br />
<br />
<math>\!\left [ \pi P \right ]_j = \pi_1 P_{1j} + \pi_2 P_{2j} + \dots + \pi_N P_{Nj}</math><br />
<br />
::<math>\! = \sum_{i=1}^N \pi_i P_{ij}</math><br />
<br />
::<math>\! = \sum_{i=1}^N P_{ji} \pi_j</math>, by the definition of detailed balance.<br />
<br />
::<math>\! = \pi_j \sum_{i=1}^N P_{ji}</math><br />
<br />
::<math>\! = \pi_j</math>, as the entries in a row of P must sum to 1.<br />
<br />
So <math>\!\pi = \pi P</math>.<br />
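<br />
As an illustrative sketch (not from the notes), assume a small birth-death chain; detailed balance and its consequence <math>\!\pi = \pi P</math> can be checked numerically:<br />
<br />
<pre><br />
% Sketch: a 3-state birth-death chain that satisfies detailed balance<br />
P = [0.5 0.5 0; 0.25 0.5 0.25; 0 0.5 0.5];<br />
p = [0.25 0.5 0.25];             % candidate stationary distribution<br />
B = diag(p)*P;                   % B(i,j) = p(i)P(i,j)<br />
disp(B - B')                     % all zeros, so detailed balance holds<br />
disp(p*P - p)                    % therefore p = pP (p is stationary)<br />
</pre><br />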
<br />
<br />
'''Example'''<br />
<br />
Find the marginal distribution of <br />
<br />
[[File:MarkovSample.png|300px]]<br />
<br />
Start by generating the matrix P.<br />
<br />
<math>\!P = \begin{pmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{pmatrix}</math><br />
<br />
We must assume some starting value for <math>\mu_0</math><br />
<br />
<math>\!\mu_0 = \begin{pmatrix} 0.1 & 0.9 \end{pmatrix}</math><br />
<br />
For t = 1, the marginal distribution is<br />
<br />
<math>\!\mu_1 = \mu_0 P</math><br />
<br />
Notice that this <math>\mu</math> converges. <br />
<br />
If you repeatedly run:<br />
<br />
<math>\!\mu_{i+1} = \mu_i P</math><br />
<br />
It converges to <math>\mu = \begin{pmatrix} 0.4286 & 0.5714 \end{pmatrix}</math><br />
<br />
This can be seen by running the following Matlab code:<br />
P = [0.2 0.8; 0.6 0.4];<br />
mu = [0.1 0.9]; <br />
while 1 <br />
mu_old = mu; <br />
mu = mu * P;<br />
if mu_old == mu <br />
disp(mu);<br />
break;<br />
end<br />
end<br />
<br />
Another way of looking at this simple question is that we can see whether the ultimate pmf converges:<br />
<br />
Let <math>\hat{p_n}(1)=\frac{1}{n}\sum_{k=1}^n I(X_k=1)</math> denote the estimator of the stationary probability of state 1,<math>\hat{p_n}(2)=\frac{1}{n}\sum_{k=1}^n I(X_k=2)</math> denote the estimator of the stationary probability of state 2, where <math>\displaystyle I(X_k=1)</math> and <math>\displaystyle I(X_k=2)</math> are indicator variables which equal 1 if <math>X_k=1</math>(or <math>X_k=2</math> for the latter one).<br />
<br />
The Matlab code for this is:<br />
<br />
n=1;<br />
if rand<0.1<br />
x(1)=1;<br />
else<br />
x(1)=0;<br />
end<br />
p1(1)=sum(x)/n;<br />
p2(1)=1-p1(1);<br />
for i=2:10000<br />
n=n+1;<br />
if (x(i-1)==1&rand<0.2)|(x(i-1)==0&rand<0.6)<br />
x(i)=1;<br />
else<br />
x(i)=0;<br />
end<br />
p1(i)=sum(x)/n;<br />
p2(i)=1-p1(i); <br />
end<br />
plot(p1,'red');<br />
hold on;<br />
plot(p2)<br />
<br />
The results can be easily seen from the graph below:<br />
<br />
[[File:Stationary distribution.png|300px]]<br />
<br />
Additionally, we can plot the marginal distribution as it converges without estimating it. The following Matlab code shows this:<br />
<br />
%transition matrix<br />
P=[0.2 0.8; 0.6 0.4];<br />
%mu at time 0<br />
mu=[0.1 0.9];<br />
%number of points for simulation<br />
n=20;<br />
for i=1:n<br />
mu_a(i)=mu(1);<br />
mu_b(i)=mu(2);<br />
mu=mu*P;<br />
end<br />
t=[1:n];<br />
plot(t, mu_a, t, mu_b);<br />
hleg1=legend('state a', 'state b');<br />
<br />
[[File:Marginal distribution convergence.png|300px]]<br />
<br />
Note that there are chains whose stationary distribution is not a limiting distribution (the chain does not converge to it unless started exactly there). An example of this is:<br />
<br />
<math>P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \mu_0 = \begin{pmatrix} 1/3 & 1/3 & 1/3 \end{pmatrix}</math><br />
<br />
<math>\!\mu_0</math> is a stationary distribution, so <math>\!\mu_0 P^n = \mu_0</math> for all n.<br />
<br />
But,<br />
<br />
<math>P^{1000} = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \ne \begin{pmatrix} \mu \\ \mu \\ \mu \end{pmatrix}</math><br />
<br />
So <math>\!\mu</math> is not a limiting distribution. Also, if<br />
<br />
<math>\mu = \begin{pmatrix} 0.2 & 0.1 & 0.7 \end{pmatrix}</math><br />
<br />
then the sequence <math>\!\mu_{n+1} = \mu_n P</math> does not converge.<br />
<br />
This can be observed through the following Matlab code.<br />
<br />
P = [0 0 1; 1 0 0; 0 1 0];<br />
mu = [0.2 0.1 0.7]; <br />
for i= 1:4 <br />
mu = mu * P;<br />
disp(mu);<br />
end<br />
<br />
This outputs<br />
0.1000 0.7000 0.2000<br />
0.7000 0.2000 0.1000<br />
0.2000 0.1000 0.7000<br />
0.1000 0.7000 0.2000<br />
<br />
Note that <math>\!\mu_1 = \!\mu_4</math>, which indicates that <math>\!\mu</math> will cycle forever.<br />
<br />
This means that this chain has a stationary distribution, but is not limiting.<br />
<br />
===Page Rank===<br />
<br />
Page Rank was the original ranking algorithm used by Google's search engine to rank web pages.<ref><br />
http://ilpubs.stanford.edu:8090/422/<br />
</ref> The algorithm was created by the founders of Google, Larry Page and Sergey Brin as part of Page's PhD thesis. When a query is entered in a search engine, there are a set of web pages which are matched by this query, but this set of pages must be ordered by their "importance" in order to identify the most meaningful results first. Page Rank is an algorithm which assigns importance to every web page based on the links in each page.<br />
<br />
==== Intuition ====<br />
<br />
We can represent web pages by a set of nodes, where web links are represented as edges connecting these nodes. Based on our intuition, there are three main factors in deciding whether a web page is important or not.<br />
<br />
# A web page is important if many other pages point to it.<br />
# The more important a webpage is, the more weight is placed on its links.<br />
# The more links a webpage has, the less weight is placed on its links.<br />
<br />
====Modelling====<br />
<br />
We can model the set of links as a N-by-N matrix L, where N is the number of web pages we are interested in:<br />
<br />
<math>L_{ij} =<br />
\left\{<br />
\begin{array}{lr}<br />
1 : \text{if page j points to i}\\<br />
0 : \text{otherwise}<br />
\end{array}<br />
\right. <br />
</math><br />
<br />
<br />
<br />
The number of outgoing links from page j is<br />
<br />
<math>c_j = \sum_{i=1}^N L_{ij}</math><br />
<br />
For example, consider the following set of links between web pages:<br />
<br />
[[File:PageRank.png|250px]]<br />
<br />
According to the factors relating to importance of links, we can consider two possible rankings :<br />
<br />
<br />
<math>\displaystyle 3 > 2 > 1 > 4 </math> <br />
<br />
or<br />
<br />
<math>\displaystyle 3>1>2>4 </math> <br />
if we consider that the high importance of the link from page 3 to page 1 is more influential than the fact that there are two outgoing links from page 1 and only one from page 2.<br />
<br />
<br />
We have <math>L = \begin{bmatrix} <br />
0 & 0 & 1 & 0 \\ <br />
1 & 0 & 0 & 0 \\ <br />
1 & 1 & 0 & 1 \\<br />
0 & 0 & 0 & 0<br />
\end{bmatrix}</math>, and <math>c = \begin{pmatrix}2 & 1 & 1 & 1\end{pmatrix} </math><br />
<br />
We can represent the ranks of web pages as the vector P, where the i<sup>th</sup> element is the rank of page i:<br />
<br />
<math>P_i = (1-d) + d\sum_j \frac{L_{ij}}{c_j} P_j</math><br />
<br />
Here we take the sum of the weights of the incoming links, where links are reduced in weight if the linking page has a lot of outgoing links, and links are increased in weight if the linking page has a lot of incoming links. <br />
<br />
We don't want to completely ignore pages with no incoming links, which is why we add the constant (1 - d).<br />
<br />
If <br />
<br />
<math>L = \begin{bmatrix} L_{11} & \cdots & L_{1N} \\<br />
\vdots & \ddots & \vdots \\<br />
L_{N1} & \cdots & L_{NN} \end{bmatrix}</math><br />
<br />
<math>D = \begin{bmatrix} c_1 & \cdots & 0 \\<br />
\vdots & \ddots & \vdots \\<br />
0 & \cdots & c_N \end{bmatrix}</math><br />
<br />
Then <math>D^{-1} = \begin{bmatrix} c_1^{-1} & \cdots & 0 \\<br />
\vdots & \ddots & \vdots \\<br />
0 & \cdots & c_N^{-1} \end{bmatrix}</math><br />
<br />
<math>\!P = (1-d)e + dLD^{-1}P</math><br />
<br />
where <math>\!e = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}</math> is the vector with all 1's<br />
<br />
To simplify the problem, we let <math>\!e^T P = N \Rightarrow \frac{e^T P}{N} = 1</math>. This means that the average importance of all pages on the internet is 1.<br />
<br />
Then<br />
<math>\!P = (1-d)\frac{ee^TP}{N} + dLD^{-1}P</math><br />
::<math>\! = \left [ (1-d)\frac{ee^T}{N} + dLD^{-1} \right ] P</math><br />
::<math>\! = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ] P</math>, where <math> E </math> is an NxN matrix filled with ones.<br />
<br />
Let <math>\!A = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ]</math><br />
<br />
Then <math>\!P = AP</math>.<br />
<br />
<br />
Note that P is a stationary distribution and, more importantly, P is an eigenvector of A with eigenvalue 1. Therefore, we can find the ranks of all web pages by solving this equation for P. <br />
<br />
We can find the vector P for the example above, using the following Matlab code:<br />
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];<br />
D = [2 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1];<br />
d = 0.8 ;% pages with no links get a weight of 0.2<br />
N = 4 ;<br />
<br />
A = ((1-d)/N) * ones(N) + d * L * inv(D);<br />
[EigenVectors, EigenValues] = eigs(A)<br />
s=sum(EigenVectors(:,1));% we should note that the average entry of P should be 1 according to our assumption<br />
P=(EigenVectors(:,1))/s*N<br />
<br />
This outputs:<br />
<br />
EigenVectors =<br />
-0.6363 0.7071 0.7071 -0.0000 <br />
-0.3421 -0.3536 + 0.3536i -0.3536 - 0.3536i -0.7071 <br />
-0.6859 -0.3536 - 0.3536i -0.3536 + 0.3536i 0.0000 <br />
-0.0876 0.0000 + 0.0000i 0.0000 - 0.0000i 0.7071 <br />
<br />
<br />
EigenValues =<br />
1.0000 0 0 0 <br />
0 -0.4000 - 0.4000i 0 0 <br />
0 0 -0.4000 + 0.4000i 0 <br />
0 0 0 0.0000 <br />
<br />
P =<br />
<br />
1.4528<br />
0.7811<br />
1.5660<br />
0.2000<br />
<br />
Note that there is an eigenvector with eigenvalue 1. <br />
The reason an eigenvector with eigenvalue 1 always exists is that A is a (column) stochastic matrix: its columns sum to one, so 1 is always an eigenvalue. <br />
<br />
Thus our vector P is <math> <br />
\begin{bmatrix}1.4528 \\ 0.7811 \\ 1.5660\\ 0.2000 \end{bmatrix}</math><br />
<br />
However, this method is not practical, because there are simply too many web pages on the internet. So instead Google uses a method to approximate an eigenvector with eigenvalue 1.<br />
<br />
Note that page three has the rank with highest magnitude and page four has the rank with lowest magnitude, as expected.<br />
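<br />
One standard way to approximate an eigenvector with eigenvalue 1, without computing a full eigendecomposition, is power iteration: repeatedly multiply a vector by A and rescale. A minimal sketch (not from the lecture), reusing L, D, d and N from the example above:<br />
<br />
<pre><br />
% Sketch: approximate the rank vector by power iteration<br />
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];<br />
D = diag([2 1 1 1]);<br />
d = 0.8; N = 4;<br />
A = ((1-d)/N)*ones(N) + d*L*inv(D);<br />
P = ones(N,1);               % start from a uniform vector<br />
for k = 1:100<br />
    P = A*P;<br />
    P = P/sum(P)*N;          % rescale so the average entry is 1<br />
end<br />
disp(P)                      % approaches (1.4528, 0.7811, 1.5660, 0.2000)<br />
</pre><br />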
<br />
==''' Markov Chain Monte Carlo - Metropolis-Hastings - October 25th, 2011'''==<br />
<br />
We want to find <math> \int h(x)f(x)\, \mathrm dx </math>, but we don't know how to sample from <math>\,f</math>.<br />
<br />
We have seen simple sampling techniques before; this one is widely used in practice.<br />
The idea is to construct a Markov Chain whose stationary distribution is <math>\,f</math>.<br />
<br />
==== Main procedure ====<br />
<br />
Let us suppose that <math>\,q(y|x)</math> is a friendly distribution: we can sample from this function.<br />
<br />
1. Initialize the chain with <math>\,x_{0}</math> and set <math>\,i=0</math>.<br />
<br />
2. Draw a point from <math>\,q(y|x)</math> i.e. <math>\,Y \backsim q(y|x_{i})</math>.<br />
<br />
3. Evaluate <math>\,r(x,y)=min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br />
<br />
<br />
4. Draw a point <math>\,U \backsim Unif[0,1]</math>.<br />
<br />
5. <math>\,x_{i+1}=\begin{cases}y & \text{ if } U<r \\x_{i} & \text{ otherwise } \end{cases} </math>.<br />
<br />
6. <math>\,i=i+1</math>. Go back to 2.<br />
<br />
==== Remark 1 ====<br />
<br />
A very common choice for <math>\,q(y|x)</math> is <math>\,N(y;x,b^{2})</math>, a normal distribution centered at the current point.<br />
<br />
Note : In this case <math>\,q(y|x)</math> is symmetric i.e. <math>\,q(y|x)=q(x|y)</math>.<br />
<br />
(Because <math>\,q(y|x)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math> and <math>\,(y-x)^{2}=(x-y)^{2}</math>).<br />
<br />
Thus we have <math>\,\frac{q(x|y)}{q(y|x)}=1</math>, which implies :<br />
<br />
<math>\,r(x,y)=min\left\{\frac{f(y)}{f(x)},1\right\}</math>.<br />
<br />
In general, if <math>\,q(x|y)</math> is symmetric then the algorithm is called Metropolis, in reference to the original algorithm (published in 1953)<ref>http://en.wikipedia.org/wiki/Equations_of_State_Calculations_by_Fast_Computing_Machines</ref>.<br />
<br />
<br />
<br />
====Remark 2====<br />
<br />
The value y is accepted if <math>\,u<min\left\{\frac{f(y)}{f(x)},1\right\}</math> so it is accepted with the probability <math>\,min\left\{\frac{f(y)}{f(x)},1\right\}</math>.<br />
<br />
Thus, if <math>\,f(y)>f(x)</math>, then <math>\,y</math> is always accepted.<br />
<br />
The higher that value of the pdf is in the vicinity of a point <math>\,y_1</math>, the more likely it is that a random variable will take on values around <math>\,y_1</math>. As a result it makes sense that we would want a high probability of acceptance for points generated near <math>\,y_1</math>.<br />
<br />
====Remark 3====<br />
<br />
One strength of the Metropolis-Hastings algorithm is that normalizing constants, which are often quite difficult to determine, can be cancelled out in the ratio <math> r </math>. For example, consider the case where we want to sample from the beta distribution, which has the pdf:<br />
<br />
<math><br />
\begin{align}<br />
f(x;\alpha,\beta)& = \frac{1}{\mathrm{B}(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\end{align}<br />
</math><br />
<br />
The beta function, ''B'', appears as a normalizing constant, but it cancels in the ratio <math> r </math> by construction of the method.<br />
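<br />
As a rough sketch (not from the lecture), suppose we want to sample from a Beta(2,5) distribution using a symmetric normal proposal with standard deviation 0.2; the algorithm only ever evaluates the unnormalized density, so the beta function never appears:<br />
<br />
<pre><br />
% Sketch: MH sampling from Beta(2,5) using only the unnormalized density<br />
a = 2; b = 5; s = 0.2;<br />
f = @(x) (x>0 & x<1).*x.^(a-1).*(1-x).^(b-1);   % f(x) up to a constant<br />
x(1) = 0.5;<br />
for i = 2:10000<br />
    y = x(i-1) + s*randn;            % symmetric proposal<br />
    r = min(f(y)/f(x(i-1)), 1);      % the normalizing constant cancels here<br />
    if rand < r<br />
        x(i) = y;<br />
    else<br />
        x(i) = x(i-1);<br />
    end<br />
end<br />
hist(x(1000:end), 30)                % should resemble the Beta(2,5) pdf<br />
</pre><br />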
<br />
====Example====<br />
<br />
<math>\,f(x)=\frac{1}{\pi}\frac{1}{1+x^{2}}</math> (the standard Cauchy density)<br />
<br />
Then, we have <math>\,f(x)\propto\frac{1}{1+x^{2}}</math>.<br />
<br />
And let us take <math>\,q(x|y)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math>.<br />
<br />
Then <math>\,q(x|y)</math> is symmetric.<br />
<br />
Therefore <math>\,r(x,y)</math> can be simplified.<br />
<br />
<br />
We get :<br />
<br />
<math>\,\begin{align}<br />
\displaystyle r(x,y) <br />
& =min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} \\<br />
& =min\left\{\frac{f(y)}{f(x)},1\right\} \\<br />
& =min\left\{ \frac{ \frac{1}{1+y^{2}} }{ \frac{1}{1+x^{2}} },1\right\}\\<br />
& =min\left\{ \frac{1+x^{2}}{1+y^{2}},1\right\}\\<br />
\end{align}<br />
</math>.<br />
<br />
<br />
<br />
The Matlab code of the algorithm is the following :<br />
<br />
<pre><br />
clear all<br />
close all<br />
clc<br />
b=2;<br />
x(1)=randn;<br />
for i=2:10000<br />
y=b*randn+x(i-1);<br />
r=min((1+x(i-1)^2)/(1+y^2),1);<br />
u=rand;<br />
if u<r<br />
x(i)=y;<br />
else<br />
x(i)=x(i-1);<br />
end<br />
<br />
end<br />
hist(x(5000:end));<br />
%The Markov Chain usually takes some time to converge; this is known as the "burn-in" period.<br />
%Therefore, we don't display the first 5000 points because they don't show the limiting behaviour of the Markov Chain.<br />
</pre><br />
<br />
As we can see, the choice of the value of b is made by us.<br />
<br />
Changing this value has a significant impact on the results we obtain. There is a pitfall when b is too big or too small.<br />
<br />
Example with <math>\,b=0.1</math> (Also with graph after we run j=5000:10000; plot(j,x(5000:10000)):<br />
<br />
[[File:redaccoursb01.JPG|300px]] [[File:001Metr.PNG|300px]]<br />
<br />
With <math>\,b=0.1</math>, the chain takes small steps so the chain doesn't explore enough of the sample space. It doesn't give an accurate report of the function we want to sample.<br />
<br />
<br />
<br />
Example with <math>\,b=10</math> :<br />
<br />
[[File:redaccoursb10.JPG|300px]] [[File:010metro.PNG|300px]]<br />
<br />
With <math>\,b=10</math>, jumps are very unlikely to be accepted, as the proposed points tend to land far out in the tails of the target density (i.e. <math>\,y</math> is rejected when <math>\ u \geq r </math>, and <math>\,x(i)=x(i-1)</math> most of the time), hence most sample points stay fairly close to the origin.<br />
The third graph that resembles white noise (as in the case of <math>\,b=2</math>) indicates better sampling, as more points are covered and accepted. For <math>\,b=0.1</math>, we have lots of small jumps but most values are not repeated, so the stationary distribution is less obvious; whereas in the <math>\,b=10</math> case, many points remain around 0. Approximately 73% of the iterations kept x(i-1).<br />
<br />
<br />
Example with <math>\,b=2</math> :<br />
<br />
[[File:redaccoursb2.JPG|300px]] [[File:100metr.PNG|300px]]<br />
<br />
With <math>\,b=2</math>, we get a more accurate result as we avoid these extremes. Approximately 37% were selected as x(i-1).<br />
<br />
<br />
If the sample from the Markov Chain starts to look like the target distribution quickly, we say the chain is mixing well.<br />
<br />
==''' Theory and Applications of Metropolis-Hastings - October 27th, 2011'''==<br />
<br />
As mentioned in the previous section, the idea of the Metropolis-Hastings (MH) algorithm is to produce a Markov chain that converges to a stationary distribution <math>f</math> which we are interested in sampling from.<br />
<br />
====Convergence====<br />
<br />
One important fact to check is that <math>\displaystyle f</math> is indeed a stationary distribution in the MH scheme. For this, we can appeal to the implications of the detailed balance property:<br />
<br />
Given a probability vector <math>\!\pi</math> and a transition matrix <math>\displaystyle P</math>, <math>\!\pi</math> has the detailed balance property if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math><br />
<br />
If <math>\!\pi</math> satisfies detailed balance, then it is a stationary distribution.<br />
<br />
The above definition applies to the case where the states are discrete. In the continuous case, <math>\displaystyle f</math> satisfies detailed balance if <math>\displaystyle f(x)p(x,y)=f(y)p(y,x)</math>, where <math>\displaystyle p(x,y)</math> and <math>\displaystyle p(y,x)</math> are the transition densities from x to y and from y to x respectively. If we can show that <math>\displaystyle f</math> has the detailed balance property, we can conclude that it is a stationary distribution, because <math>\int_y f(y)p(y,x)\,dy=\int_y f(x)p(x,y)\,dy=f(x)</math>.<br />
<br />
In the MH algorithm, we use a proposal distribution to generate y~<math>\displaystyle q(y|x)</math>, and accept y with probability <math>min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br />
<br />
Suppose, without loss of generality, that <math>\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)} \le 1</math>. This implies that <math>\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)} \ge 1</math><br />
<br />
Let <math>\,r(x,y)</math> be the chance of accepting point y given that we are at point x.<br />
<br />
So <math>\,r(x,y) = min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} = \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)}</math><br />
<br />
Let <math>\,r(y,x)</math> be the chance of accepting point x given that we are at point y.<br />
<br />
So <math>\,r(y,x) = min\left\{\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)},1\right\} = 1</math><br />
<br />
<br />
<math>\,p(x,y)</math> is the probability of generating and accepting y, while at point x.<br />
<br />
So <math>\,p(x,y) = q(y|x)r(x,y) = q(y|x) \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)} = \frac{f(y)q(x|y)}{f(x)}</math><br />
<br />
<br />
<math>\,p(y,x)</math> is the probability of generating and accepting x, while at point y.<br />
<br />
So <math>\,p(y,x) = q(x|y)r(y,x) = q(x|y)</math><br />
<br />
<br />
<math>\,f(x)p(x,y) = f(x)\frac{f(y)q(x|y)}{f(x)} = f(y)q(x|y) = f(y)p(y,x)</math><br />
<br />
Thus, detailed balance holds.<br />
:i.e. <math>\,f(x)</math> is stationary distribution<br />
<br />
It can be shown (although not here) that <math>f</math> is a limiting distribution as well. Therefore, the MH algorithm generates a sequence whose distribution converges to <math>f</math>, the target.<br />
<br />
====Implementation====<br />
<br />
In the implementation of MH, the proposal distribution is commonly chosen to be symmetric, which simplifies the calculations and makes the algorithm more intuitively understandable. The MH algorithm can usually be regarded as a random walk along the distribution we want to sample from. Suppose we have a distribution <math>f</math>:<br />
<br />
[[File:Standard normal distribution.gif]]<br />
<br />
Suppose we start the walk at point <math>x</math>. The point <math>y_{1}</math> is in a denser region than <math>x</math>, therefore, the walk will always progress from <math>x</math> to <math>y_{1}</math>. On the other hand, <math>y_{2}</math> is in a less dense region, so it is not certain that the walk will progress from <math>x</math> to <math>y_{2}</math>. In terms of the MH algorithm:<br />
<br />
<math>r(x,y_{1})=min(\frac{f(y_{1})}{f(x)},1)=1</math> since <math>f(y_{1})>f(x)</math>. Thus, any generated value with a higher density will be accepted.<br />
<br />
<math>r(x,y_{2})=\frac{f(y_{2})}{f(x)}</math>. The lower the density of <math>y_{2}</math> is, the less chance it will have of being accepted.<br />
<br />
A certain class of proposal distributions can be written in the form:<br />
<br />
<math>\,y|x_i = x_i + \epsilon_i</math><br />
<br />
where <math>\,\epsilon_i</math> is drawn from a density <math>\,g</math> that depends only on <math>\,|x-y|</math><br />
<br />
The density depends only on the distance between the current point and the next one (which can be seen as the "step" being taken). These proposal distributions give the Markov chain the random walk nature. The normal distribution that we frequently use in our examples satisfies the above definition.<br />
<br />
In actual implementations of the MH algorithm, the proposal distribution needs to be chosen judiciously, because not all proposals will work well with all target distributions we want to sample from. Take a trimodal distribution for example:<br />
<br />
[[File:trimodal.jpg]]<br />
<br />
If we choose the proposal distribution to be a standard normal as we have done before, problems will arise. The low densities between the peaks means that the MH algorithm will almost never walk to any points generated in these regions and get stuck at one peak. One way to address this issue is to increase the variance, so that the steps will be large enough to cross the gaps. Of course, in this case, it would probably be beneficial to come up with a different proposal function. As a rule of thumb, such functions should result in an approximately 50% acceptance rate for generated points.<br />
<br />
====Simulated Annealing====<br />
<br />
Metropolis-Hastings is very useful in simulation methods for solving optimization problems. One such application is simulated annealing, which addresses the problems of minimizing a function <math>h(x)</math>. This method will not always produce the global solution, but it is intuitively simple and easy to implement.<br />
<br />
Consider <math>e^{\frac{-h(x)}{T}}</math>, maximizing this expression is equivalent to minimizing <math>h(x)</math>. Suppose <math>\mu</math> is the maximizing value and <math>h(x)=(x-\mu)^2</math>, then the maximization function is a gaussian distribution <math>e^{-\frac{(x-\mu)^2}{T}}</math>. When many samples are taken from this distribution, the mean will converge to the desired maximizing value. The annealing comes into play by lowering T (the temperature) as the sampling progresses, making the distribution narrower. The steps of simulated annealing are outlined below:<br />
<br />
1. start with a random <math>x</math> and set T to a large number<br />
<br />
2. generate <math>y</math> from a proposal distribution <math>q(y|x)</math>, which should be symmetric<br />
<br />
3. accept <math>y</math> with probability <math>min(\frac{f(y)}{f(x)},1)</math><br />
<br />
4. decrease T, and then go to step 2<br />
<br />
The following plot and Matlab code illustrates the simulated annealing procedure as temperature ''T'', the variance, decreases for a Gaussian distribution with zero mean. Starting off with a large value for the temperature ''T'' allows the Metropolis-Hastings component of the procedure to capture the mean, before gradually decreasing the temperature ''T'' in order to converge to the mean. <br />
<br />
[[File:Simulated annealing illustration.png]]<br />
<br />
x=-10:0.1:10;<br />
mu=0;<br />
T=5;<br />
colour = ['b', 'g', 'm', 'r', 'k'];<br />
for i=1:5<br />
pdfNormal=normpdf(x, mu, T);<br />
plot(x, pdfNormal, colour(i));<br />
T=T-1;<br />
hold on<br />
end<br />
hleg1=legend('T=5', 'T=4', 'T=3', 'T=2', 'T=1');<br />
title('Simulated Annealing Illustration');<br />
<br />
=='''References'''==<br />
<br />
<references/><br />
<br />
=='''Simulated Annealing and Gibbs Sampling - November 1, 2011'''==<br />
<br />
continued from previous lecture...<br />
<br />
We will now look at a couple cases where <math> \displaystyle h(y) > h(x) </math> or <math> \displaystyle h(y) < h(x) </math>, and explore whether to accept or reject <math> y </math>.<br />
<br />
Recall r(x,y)=min{<math>\frac{f(y)}{f(x)}</math>,1} where <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}}</math>. And r(x,y) represents the probability of accepting <math>y</math>.<br />
<br />
====Cases====<br />
<br />
Case a)<br />
Suppose <math> \displaystyle h(y) < h(x) </math>. Since we want to find the minimum value for <math>\displaystyle h(x) </math>, and the point <math>\displaystyle y </math> creates a lower value than our previous point, we accept the new point. Mathematically, <math>\displaystyle h(y) < h(x) </math> implies that:<br />
<br />
<math> \frac{f(y)}{f(x)} > 1 </math>. Therefore,<br />
<math> \displaystyle r = 1 </math>.<br />
So, we will always accept <math>\displaystyle y </math>.<br />
<br />
Case b)<br />
Suppose <math> \displaystyle h(y) > h(x) </math>. This is bad, since our goal is to minimize <math>\displaystyle h(x) </math>. However, we may still accept <math>\displaystyle y </math> with some chance:<br />
<br />
<math> \frac{f(y)}{f(x)} < 1 </math>. Therefore,<br />
<math>\displaystyle r < 1 </math>.<br />
So, we may accept <math>\displaystyle y </math> with probability <math>\displaystyle r </math>.<br />
<br />
<br />
Next, we will look at these cases as <math>\displaystyle T\to0 </math>.<br />
<br />
As <math>\displaystyle T\to0 </math> and case a) happens, <math> e^{\frac{h(x)-h(y)}{T}} </math> approaches infinity, so we will always accept <math>\displaystyle y </math>.<br />
<br />
As <math>\displaystyle T\to0 </math> and case b) happens, <math> e^{\frac{h(x)-h(y)}{T}} </math> approaches zero, so the probability that <math>\displaystyle y </math> will be accepted gets extremely small.<br />
<br />
It is worth noting that if we simply start with a small value of T, we may end up rejecting almost all generated points and hence get stuck somewhere in the function (due to case b)). That point might be a minimum of <math>\displaystyle h(x)</math> only over some interval (a local minimum), but not over the whole domain (the global minimum). It is therefore necessary to start with a large value of T in order to explore the whole function. At the same time, a good initial estimate of <math>\displaystyle x_0</math> is helpful (it should not be too far from the optimum). <br />
<br />
=====Example=====<br />
<br />
Let <math>\displaystyle h(x) = (x-2)^2 </math>.<br />
The graph of it is:<br />
[[File:PCh(x).jpg|center|500]]<br />
<br />
Then, <math> e^{\frac{-h(x)}{T}} = e^{\frac{-(x-2)^2}{T}} </math> . Take an initial value of T = 20. A graph of this is:<br />
[[File:PC-highT.jpg|center|500]]<br />
<br />
<br />
In comparison, we look a graph of T = 0.2:<br />
[[File:PC-lowT.jpg|center|500]]<br />
<br />
One can see that with a low T value the function is very sharply peaked, so the ratio <math>\frac{f(y)}{f(x)}</math> is essentially 0 (or very large) for most proposed moves, while a bigger T value gives smoother transitions and more proposals are accepted.<br />
<br />
The MATLAB code for the above graphs are:<br />
<pre><br />
ezplot('(x-2)^2',[-6,10])<br />
ezplot('exp((-(x-2)^2)/20)',[-6,10])<br />
ezplot('exp((-(x-2)^2)/0.2)',[-6,10])<br />
</pre><br />
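<br />
A minimal sketch (not from the lecture) of the annealing procedure itself for this <math>\displaystyle h(x)=(x-2)^2 </math>, assuming a standard normal proposal, 100 Metropolis steps per temperature and a cooling factor of 0.9:<br />
<br />
<pre><br />
% Sketch: simulated annealing to minimize h(x) = (x-2)^2<br />
h = @(x) (x-2).^2;<br />
T = 20;                     % start with a large temperature<br />
x = 10*randn;               % random starting point<br />
while T > 0.01<br />
    for i = 1:100<br />
        y = x + randn;                        % symmetric proposal<br />
        r = min(exp((h(x)-h(y))/T), 1);       % acceptance probability<br />
        if rand < r<br />
            x = y;<br />
        end<br />
    end<br />
    T = 0.9*T;              % cool down<br />
end<br />
disp(x)                     % close to the minimizer x = 2<br />
</pre><br />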
<br />
=====Travelling Salesman Problem=====<br />
<br />
The simulated annealing method can be applied to compute a solution to the travelling salesman problem. Suppose there are N cities and the salesman has to visit each city exactly once. The objective is to find the shortest route (i.e. the shortest total length of the journey) connecting the cities. An algorithm using simulated annealing on this problem can be found here ([http://www.cs.ubbcluj.ro/~csatol/mestint/pdfs/Numerical_Recipes_Simulated_Annealing.pdf Reference]).<br />
<br />
===Gibbs Sampling===<br />
<br />
Gibbs sampling is another Markov chain Monte Carlo method, similar to Metropolis-Hastings. There are two main differences between Metropolis-Hastings and Gibbs sampling. First, the candidate state is always accepted as the next state in Gibbs sampling. Second, it is assumed that the full conditional distributions are known, i.e. <math>P(X_i=x|X_j=x_j, \forall j\neq i)</math> for all <math>\displaystyle i</math>. The idea is that it is easier to sample from a set of one-dimensional conditional distributions than from the higher dimensional joint distribution. Gibbs sampling is a way to turn the joint distribution into multiple conditional distributions. <br />
<br />
<b>Advantages:</b><br /><br />
- sampling from conditional distributions may be easier than sampling from joint distributions<br />
<br />
<b>Disadvantages:</b><br /><br />
- we do not necessarily know the conditional distributions<br />
<br />
For example, if we want to sample from <math>\, f_{X,Y}(x,y)</math>, we need to know how to sample from <math>\, f_{X|Y}(x|y)</math> and <math>\, f_{Y|X}(y|x)</math>. Suppose the chain starts with <math>\,(X_0,Y_0)</math> and <math>(X_1,Y_1), \dots , (X_n,Y_n)</math> have been sampled. Then,<br />
<br />
<math>\, (X_{n+1},Y_{n+1})=(f_{X|Y}(x|Y_n),f_{Y|X}(y|X_{n+1}))</math><br />
<br />
Gibbs sampling turns a multi-dimensional distribution into a set of one-dimensional distributions. If we want to sample from <br />
<br />
<math>P_{X^1,\dots ,X^p}(x^1,\dots ,x^p)</math> <br />
<br />
and the full conditionals are known, then:<br />
<br />
<math>X^1_{n+1}=f(X^1|X^2_n,\dots ,X^p_n)</math><br />
<br />
<math>X^2_{n+1}=f(X^2|X^1_{n+1},X^3_n\dots ,X^p_n)</math><br />
<br />
<math>\vdots</math><br />
<br />
<math>X^{p-1}_{n+1}=f(X^{p-1}|X^1_{n+1},\dots ,X^{p-2}_{n+1},X^p_n)</math><br />
<br />
<math>X^p_{n+1}=f(X^p|X^1_{n+1},\dots ,X^{p-1}_{n+1})</math><br />
<br />
With Gibbs sampling, we can simulate <math>\displaystyle n</math> random variables sequentially from <math>\displaystyle n</math> univariate conditionals rather than generating one <math>n</math>-dimensional vector using the full joint distribution, which could be a lot more complicated.<br />
<br />
Computational inference deals with probabilistic graphical models. Gibbs sampling is useful here: graphical models show the dependence relations among random variables. For instance, Bayesian networks are graphical models represented using directed acyclic graphs. Looking at such a graphical model tells us on which random variable the distribution of a certain random variable depends (i.e. its parent). The model can be used to "factor" a joint distribution into conditional distributions.<br />
<br />
[[File:stat341_nov_1_graphical_model.png|200px|thumb|left|Sample graphical model of five RVs]]<br />
<br />
For example, consider the five random variables A, B, C, D, and E. Without making any assumptions about dependence relations among them, all we know is <br />
<br />
<math>\, P(A,B,C,D,E)=</math><math>\, P(A|B,C,D,E) P(B|C,D,E) P(C|D,E) P(D|E) P(E)</math><br />
<br />
However, if we know the relation between the random variables, e.g. given the graphical model on the left, we can simplify this expression:<br />
<br />
<math>\, P(A,B,C,D,E)=P(A) P(B|A) P(C|A) P(D|C) P(E|C)</math><br />
<br />
Although the joint distribution may be very complicated, the conditional distributions may not be.<br />
<br />
Check out the following notes on Gibbs sampling:<br />
<br />
* [http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf MCMC and Gibbs Sampling, MIT Lecture Notes]<br />
* chapter 7.4 in [http://stat.fsu.edu/~anuj/pdf/classes/CompStatI09/BOOK.pdf Notes on Computational Methods in Statistics]<br />
* chapter 4.9 in [http://www.ma.hw.ac.uk/~foss/StochMod/Ross_S.pdf Introduction to Probability Models] by Sheldon Ross<br />
<br />
====Example of Gibbs sampling: Multi-variate normal====<br />
<br />
We'd like to generate samples from a bivariate normal with parameters<br />
<br />
<math>\mu = \begin{bmatrix}1\\ 2 \end{bmatrix} = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}</math> <br />
and <math>\sigma = \begin{bmatrix}1 && 0.9 \\ 0.9 && 1 \end{bmatrix}= \begin{bmatrix}1 && \rho \\ \rho && 1 \end{bmatrix}</math><br />
<br />
The conditional distributions of multi-variate normal random variables are also normal:<br />
<br />
<math>\, f(x_1|x_2)=N(\mu_1 + \rho(x_2-\mu_2), 1-\rho^2)</math><br />
<br />
<math>\, f(x_2|x_1)=N(\mu_2 + \rho(x_1-\mu_1), 1-\rho^2)</math><br />
<br />
(In general, if the joint distribution has parameters<br />
<br />
<math>\mu = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}</math> and <math>\Sigma = \begin{bmatrix} \Sigma _{1,1} && \Sigma _{1,2} \\ \Sigma _{2,1} && \Sigma _{2,2} \end{bmatrix}</math><br />
<br />
then the conditional distribution <math>\, f(x_1|x_2)</math> has mean <math>\, \mu_1 + \Sigma _{1,2}(\Sigma _{2,2})^{-1}(x_2-\mu_2)</math> and variance <math>\, \Sigma _{1,1}-\Sigma _{1,2}(\Sigma _{2,2})^{-1}\Sigma _{2,1}</math>.)<br />
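<br />
A rough Matlab sketch (not from the lecture) of the Gibbs sampler for this bivariate normal, alternating between the two conditionals above and starting the chain at (0,0):<br />
<br />
<pre><br />
% Sketch: Gibbs sampling from a bivariate normal, mu = (1,2), rho = 0.9<br />
mu1 = 1; mu2 = 2; rho = 0.9;<br />
sd = sqrt(1 - rho^2);               % conditional standard deviation<br />
n = 5000;<br />
x1 = zeros(1,n); x2 = zeros(1,n);   % chain starts at (0,0)<br />
for i = 2:n<br />
    x1(i) = mu1 + rho*(x2(i-1)-mu2) + sd*randn;   % draw x1 | x2<br />
    x2(i) = mu2 + rho*(x1(i)-mu1) + sd*randn;     % draw x2 | x1<br />
end<br />
plot(x1(500:end), x2(500:end), '.') % elongated cloud along the correlation<br />
</pre><br />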
<br />
=='''Principal Component Analysis (PCA) - November 8, 2011'''==<br />
<br />
Principal component analysis is a century-old technique used for the dimensionality reduction of data. As the dimension increases, the number of data points needed to sample the space accurately grows exponentially.<br />
<br />
<math>\, x\in \mathbb{R}^D \rightarrow y\in \mathbb{R}^d</math><br />
<br />
<math>\ d \le D </math><br />
<br />
We want to transform <math>\, x</math> to <math>\, y</math> by reducing dimensionality yet losing little information.<br />
<br />
For example, consider dots in a three dimensional space. By unrolling the 2D manifold that they are on, we can reduce the data to 2D while losing little information. Note: this is not an application of PCA, but it simply illustrates one way we can reduce dimensionality.<br />
<br />
Principal Component Analysis lets us reduce data to a linear subspace of its original space. It works best when the data lie in, or close to, a lower dimensional linear subspace of the original space.<br />
<br />
<br />
'''Probabilistic View'''<br />
<br />
We can see data set <math>\, x</math> as a high dimensional random variable governed by a low dimensional random variable <math>\, y</math>. Given <math>\, x</math>, we are trying to estimate <math>\, y</math>.<br />
<br />
We can see this in 2D linear regression, as the locations of data points in a scatter plot are governed by its approximate linear regression. The subspace that we have reduced the data to here is in the direction of variation in the data.<br />
<br />
'''Principal Component Analysis'''<br />
<br />
Principal component analysis is an orthogonal linear transform of a data set. It transforms the data coordinates onto a new set of orthogonal vectors, each representing a direction of maximum variance of the data. E.g. the first principal component is the direction of maximum variance, the second principal component is the direction of maximum variance orthogonal to the first vector, the third principal component is the direction of maximum variance orthogonal to the first and second vectors, and so on, until we have D vectors, where D is the dimension of the original data.<br />
<br />
Suppose we have data represented by <math>\, X = \begin{bmatrix}<br />
x^1\\<br />
x^2\\<br />
\vdots \\ <br />
x^D<br />
\end{bmatrix}<br />
\in \mathbb{R}^{D \times n} </math><br />
<br />
For some <math>\, W = \begin{bmatrix}<br />
w^1\\<br />
w^2\\<br />
\vdots \\ <br />
w^D<br />
\end{bmatrix}<br />
\in \mathbb{R}^{D} </math><br />
<br />
We can write any vector in <math>\, \mathbb{R}^D </math> as<br />
<br />
<math>\, w^1x^1 + w^2x^2 + \cdots + w^Dx^D = W^TX</math><br />
<br />
To find the first principal component, we want to maximize the variance of <math>\,W^TX</math>.<br />
<br />
The variance of <math>\,W^TX</math> is <math>\,W^TSW</math> where <math>\,S</math> is the covariance matrix of X.<br />
<br />
<math>\, S = \sum_{i=1}^n (x_i-\mu)(x_i-\mu)^T</math><br />
<br />
<br />
So we have to solve the problem<br />
<br />
<math>\, \text {Max } W^TSW</math><br />
<br />
<math>\, \text{such that } W^TW = 1</math>.<br />
<br />
<br />
We restrict W to unit vectors, as otherwise the maximum is unbounded. We are only looking for the direction of the vector; its actual scale is unnecessary.<br />
<br />
Using the method of Lagrange multipliers, we have<br />
<br />
<math>\,L(W, \lambda) = W^TSW - \lambda(W^TW - 1) </math><br />
<br />
We set<br />
<br />
<math>\, \frac{\partial L}{\partial W} = 0 </math><br />
<br />
<br />
<br />
Note that <math>\, W^TSW</math> is a quadratic form. So we have<br />
<br />
<br />
<br />
<math>\, \frac{\partial L}{\partial W} = 2SW - 2\lambda W = 0 </math><br />
<br />
<math>\, SW = \lambda W </math><br />
<br />
Since S is a matrix and <math>\lambda</math> is a scalar, W is an eigenvector of S and <math>\lambda</math> is its corresponding eigenvalue.<br />
<br />
Suppose that<br />
<br />
<math>\, \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D</math><br />
are the eigenvalues of S and <math>\, u_1, u_2, \cdots, u_D</math> are their corresponding eigenvectors.<br />
<br />
We want to choose some <math>\, W = u </math><br />
<br />
<math>\,u^TSu =u^T\lambda u = \lambda u^Tu = \lambda</math><br />
<br />
So to maximize <math>\, u^TSu</math>, choose the eigenvector corresponding to the largest eigenvalue, i.e. <math>\, u_1</math>.<br />
<br />
So we let <math>\, W = u_1 </math> be the first principal component.<br />
<br />
The principal components decompose the total variance in the data.<br />
<br />
<math>\, \sum_{i=1}^D \text{Var}(u_i^T X) = \sum_{i=1}^D \lambda_i = \text{Tr}(S) = \sum_{i=1}^D \text{Var}(x^i)</math><br />
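<br />
As an illustrative sketch (not from the lecture), assume a small artificial 2-dimensional data set; the principal components can be found by an eigendecomposition of the covariance matrix, and the eigenvalues can be checked against the total variance:<br />
<br />
<pre><br />
% Sketch: PCA of a toy 2-D data set via eigendecomposition<br />
n = 500;<br />
X = [randn(1,n); 0.3*randn(1,n)];     % most variance along the first axis<br />
X = [1 1; -1 1]/sqrt(2) * X;          % rotate the data by 45 degrees<br />
Xc = X - repmat(mean(X,2), 1, n);     % center the data<br />
S = Xc*Xc';                           % (unnormalized) covariance matrix<br />
[U, Lambda] = eig(S);<br />
[vals, idx] = sort(diag(Lambda), 'descend');<br />
U = U(:, idx);                        % columns ordered by decreasing eigenvalue<br />
W = U(:,1)                            % first principal component direction<br />
disp(sum(vals) - trace(S))            % eigenvalues decompose the total variance<br />
</pre><br />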
<br />
<br><br />
===Singular Value Decomposition===<br />
Singular value decomposition is a "generalization" of eigenvalue decomposition "to rectangular matrices of size ''mxn''."<ref name="Abdel_SVD">Abdel-Rahman, E. (2011). Singular Value Decomposition [Lecture notes]. Retrieved from http://uwace.uwaterloo.ca</ref> Singular value decomposition solves:<br><br><br />
:<math>\ A_{mxn}\ v_{nx1}=s\ u_{mx1}</math><br><br><br />
"for the right singular vector ''v'', the singular value ''s'', and the left singular vector ''u''. There are ''n'' singular values ''s''<sub>''i''</sub> and ''n'' right and left singular vectors that must satisfy the following conditions"<ref name="Abdel_SVD"/>:<br />
# "All singular values are non-negative"<ref name="Abdel_SVD"/>, <br> <math>\ s_i \ge 0.</math><br />
# All "right singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>, <br> <math>\ v_iv_j=\delta_{i,j}.</math><br />
# All "left singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>, <br> <math>\ u_iu_j=\delta_{i,j}.</math><br />
where<br />
:<math>\delta_{i,j}=\left\{\begin{matrix}1 & \mathrm{if}\ i=j \\ 0 & \mathrm{if}\ i\neq j\end{matrix}\right.</math><br><br><br />
<br />
'''Procedure to find the singular values and vectors'''<br><br />
Observe the following about the eigenvalue decomposition of a real square matrix ''A'' where ''v'' is the unit eigenvector:<br><br />
::<math><br />
\begin{align}<br />
& Av=\lambda v \\<br />
& (Av)^T=(\lambda v)^T \\<br />
& (Av)^TAv=(\lambda v)^T\lambda v \\<br />
& v^TA^TAv=\lambda^2v^Tv \\<br />
& vv^TA^TAv=v\lambda^2 \\<br />
& A^TAv=\lambda^2v<br />
\end{align}<br />
</math><br />
As a result:<br />
# "The matrices ''A'' and ''A''<sup>''T''</sup>''A'' have the same eigenvectors."<ref name="Abdel_SVD"/><br />
# "The eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are the square of the eigenvalues of matrix ''A''."<ref name="Abdel_SVD"/><br />
# Since matrix ''A''<sup>''T''</sup>''A'' is symmetric,<br />
## "all the eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are real and distinct."<ref name="Abdel_SVD"/><br />
## "the eigenvectors of matrix ''A''<sup>''T''</sup>''A'' are orthogonal and can be chosen to be orthonormal."<ref name="Abdel_SVD"/><br />
# "The eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are non-negative"<ref name="Abdel_SVD"/> since <math>\ \lambda^2_i \ge 0.</math><br />
Conclusions 3 and 4 are "true even for a rectangular matrix ''A'' since ''A''<sup>''T''</sup>''A'' is still a square symmetric matrix"<ref name="Abdel_SVD"/> and its eigenvalues and eigenvectors can be found.<br><br><br />
Therefore, for a rectangular matrix ''A'', assuming ''m>n'', the singular values and vectors can be found by:<br />
# "Form the ''nxn'' symmetric matrix ''A''<sup>''T''</sup>''A''."<ref name="Abdel_SVD"/><br />
# Perform an eigenvalue decomposition to get ''n'' eigenvalues and their "corresponding eigenvectors, ordered such that"<ref name="Abdel_SVD"/> <br><math>\lambda^2_1 \ge \lambda^2_2 \ge \dots \ge \lambda^2_n \ge 0</math> and <math>\{v_1, v_2, \dots, v_n\}.</math><br />
# "The singular values are"<ref name="Abdel_SVD"/>: <br><math>s_1=\sqrt{\lambda^2_1} \ge s_2=\sqrt{\lambda^2_2} \ge \dots \ge s_n=\sqrt{\lambda^2_n} \ge 0.</math><br>"The non-zero singular values are distinct; the equal sign applies only to the singular values that are equal to zero."<ref name="Abdel_SVD"/><br />
# "The ''n''-dimensional right singular vectors are"<ref name="Abdel_SVD"/><br><math>\{v_1, v_2, \dots, v_n\}.</math><br />
# "For the first <math>r \le n</math> singular values such that ''s''<sub>''i''</sub> ''> 0'', the left singular vectors are obtained as unit vectors"<ref name="Abdel_SVD"/> by <math>\tfrac{1}{s_i}Av_i=u_i.</math><br />
# Select "the <math>\ m-r</math> left singular vectors corresponding to the zero singular values such that they are unit vectors orthogonal to each other and to the first ''r'' left singular vectors"<ref name="Abdel_SVD"/> <math>\{u_1, u_2, \dots, u_r\}.</math><br><br><br />
<br />
'''Finding Singular value Decomposition Using MATLAB Code'''<br />
Please refer to the following link: http://www.mathworks.com/help/techdoc/ref/svd-singular-value-decomposition.html<br />
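<br />
As an illustration (a sketch, assuming a small random 5-by-3 matrix), the relation between the singular values of ''A'' and the eigenvalues of ''A''<sup>''T''</sup>''A'' can be checked directly with Matlab's svd:<br />
<br />
<pre><br />
% Sketch: singular values of A vs. eigenvalues of A'*A<br />
A = randn(5,3);                       % a random 5x3 matrix (m > n)<br />
[U, S, V] = svd(A);<br />
s = diag(S)                           % singular values, in decreasing order<br />
lambda = sort(eig(A'*A), 'descend');<br />
disp(sqrt(lambda) - s)                % singular values are the square roots of the eigenvalues of A'*A<br />
disp(norm(A - U*S*V'))                % A = U*S*V' (close to 0)<br />
</pre><br />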
<br />
'''Formal definition'''<br><br />
"We can now decompose the rectangular matrix ''A'' in terms of singular values and vectors as follows"<ref name="Abdel_SVD"/>:<br><br><br />
<math>A_{mxn} \begin{bmatrix} v_1 & | & \cdots & | & v_n \end{bmatrix}_{nxn} = \begin{bmatrix} u_1 & | & \cdots & | & u_n & | & u_{n+1} & | & \cdots & | & u_m \end{bmatrix}_{mxm} \begin{bmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_n \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}_{mxn}</math><br><br />
:<math>\ AV=US</math><br><br><br />
Since "the matrices ''V'' and ''U'' are orthogonal"<ref name="Abdel_SVD"/>, ''V ''<sup>''-1''</sup>=''V''<sup>T</sup> and ''U ''<sup>''-1''</sup>=''U''<sup>T</sup>:<br><br><br />
:<math>\ A=USV^T</math><br><br><br />
"which is the formal definition of the singular value decomposition."<ref name="Abdel_SVD"/><br><br><br />
<br />
'''Relevance to PCA'''<br><br />
In order to perform PCA, one needs to do eigenvalue decomposition on the covariance matrix. By transforming the mean for all attributes to zero, the covariance matrix can be simplified to:<br><br><br />
<math>\ S=XX^T</math><br><br><br />
Since the eigenvalue decomposition of ''A''<sup>''T''</sup>''A'' gives the same eigenvectors as the right singular vectors of ''A'', an additional and more consistent method for performing PCA (an eigenvalue decomposition does not exist when the matrix of eigenvectors is not invertible, i.e. when the matrix is not diagonalizable) is through the singular value decomposition of ''X''.<br />
<br />
The following MATLAB code uses singular value decomposition for performing PCA; 20 principal components, and thus the top 20 maximum variation directions, are selected for reconstructing facial images that have had noise applied to them:<br />
<br />
load noisy.mat<br />
%first noisy image; each image has a resolution of 20x28<br />
imagesc(reshape(X(:,1),20,28)')<br />
%to grayscale<br />
colormap gray<br />
%singular value decomposition <br />
[u s v]=svd(X);<br />
%reduced feature space: 20 principal components<br />
Xh=u(:,1:20)*s(1:20,1:20)*v(:,1:20)';<br />
figure<br />
imagesc(reshape(Xh(:,1),20,28)')<br />
colormap gray<br />
<br />
Since the image reconstructed from the reduced feature space is essentially noiseless, the added noise must account for less variation than the top 20 principal components.<br />
<br />
=='''References'''==<br />
<br />
<references/><br />
<br />
==''' PCA and Introduction to Kernel Function-November,10,2011'''==<br />
===Continue with the last lecture===<br />
Some notations:<br />
Let <math>\displaystyle X_{d\times n}</math> be a matrix. <br />
<br />
Let <math>\displaystyle X_j,\ j=1,2,...,n</math> be the j-th data point, with <math>\displaystyle X_j\in\R^d</math>.<br />
<br />
Let <math>\displaystyle Q=\sum_{j=1}^n(X_j-\bar{X})(X_j-\bar{X})^T</math>, where <math> \bar{X}=\frac{1}{n}\sum_{j=1}^n X_j</math>.<br />
<br />
But now, we are assuming that we have already centered the data, which means our <math>\displaystyle Q=\sum_{j=1}^n(X_j)(X_j)^T=X X^T </math>.<br />
<br />
*Find the PCs, which means finding the eigenvectors of Q, or equivalently doing the singular value decomposition [u s v]=svd(X), where the columns of u are eigenvectors of <math>\displaystyle Q=X X^T</math>.<br />
<br />
*Map the data in lower dimension space.<br />
We can choose the first p (p<d) eigenvectors, which means <math>\displaystyle u^T</math> is a <math>\displaystyle p\times d</math> matrix.<br />
Thus, we can project our original data points <math>\displaystyle x_j</math> into p dimensions.<br />
Mathematically, this is <math>\displaystyle Y_{p\times n}={u^T}_{p\times d} X_{d\times n}</math>. Also, this means that we can reduce our original d variables to p principal components.<br />
<br />
*Reconstruct Points.<br />
We can also use those dimension-reduced data to project back to high dimension.<br />
However, we will lose some information, because when we map the points into the lower dimension we throw away the last (d-p) eigenvectors, which contain some of the original information.<br />
Since the columns of <math>\displaystyle u</math> are orthonormal, the reconstruction is <math> \hat{x}_{d\times n}=u_{d\times p} Y_{p\times n}=u_{d\times p}{u^T}_{p\times d}x_{d\times n} \approx x_{d\times n} </math>.<br />
<br />
*Map a new data point to the lower dimensional space and reconstruct it in the high dimensional space: <math>\displaystyle y_{p\times 1}={u^T}_{p\times d} x_{d\times 1}</math> and <math>\displaystyle \hat{x}_{d\times 1}=u_{d\times p} y_{p\times 1}</math><br />
<br />
===3 and 2 digits example===<br />
The data X is a 64 by 400 matrix. Every column can be displayed as an 8-by-8 image of either a "3" or a "2". The first 200 columns are "2"s and the last 200 columns are "3"s.<br />
We first center the data and then take the first p (p<d) columns of u from the singular value decomposition.<br />
<br />
MATLAB CODE:<br />
MU=repmat(mean(X,2),1,400);<br />
% mean(X,2) is the average of each row <br />
%In order to center the data, we should change mean(X,2), which is a 64 by 1 matrix, into a 64 by 400 matrix<br />
Xt=X-MU;<br />
% modify the data to zero mean data<br />
[u s v]=svd(Xt);<br />
%note that size(u)=64*64, and the columns of u are eigenvectors of VCM<br />
Y=u(:,1:2)'*X;<br />
%using the first two PCs to transform the high dimensional points to lower onces<br />
One way to look at this is to plot Principal Component #1 against Principal Component #2 in a two dimensional space.<br />
 plot(Y(1,:)',Y(2,:)')<br />
The result is as follows; we can clearly see that there are two classes.<br />
<br />
[[file:pca2.png|350px|400px]]<br />
<br />
To look more closely at the difference between these two classes, we can plot the first 200 columns and the last 200 columns separately, to see whether there is a clear difference between the two types of digits.<br />
plot(Y(1,1:200)',Y(2,1:200)','d')<br />
% Note that the first 200 columns represent the digit "2", and are plotted as diamonds<br />
hold on<br />
% draw different graphs in one figure<br />
plot(Y(1,201:400)',Y(2,201:400)','ro')<br />
% Note that the last 200 columns represent the digit "3", and are plotted as circles ("o")<br />
<br />
[[file:pca3.png|350px|400px]]<br />
<br />
image=reshape(X,8,8,400);<br />
plotdigits(image,Y,.1,1);<br />
The result can be seen more clearly in the following picture.<br />
The digits "3" and "2" are clearly separated.<br />
<br />
[[file:Pca.png|350px|400px]]<br />
<br />
===Introduction to Kernel Function===<br />
PCA is useful when the data points lie in, or close to, a linear subspace such as a plane; in other words, PCA is powerful for linear problems. When the data points lie on a nonlinear manifold, however, PCA performs poorly. There is a solution to this problem: we can use a "trick" to turn a nonlinear classification problem into a linear one, and this is called the "Kernel Trick".<br />
<br />
'''An intuitive example'''<br />
<br />
[[File:Kernel trick.png|400px|300px]]<br />
<br />
From the picture, we can see that the red dots lie in the middle of the blue ones. However, the two classes cannot be separated by any line (a linear boundary in the two dimensional space). If we lift the points out of the two dimensional space into a three dimensional space, we can easily tell them apart.<br />
<br />
For more details about this trick,please see http://omega.albany.edu:8008/machine-learning-dir/notes-dir/ker1/ker1.pdf<br />
<br />
More precisely, the significance of a kernel function is that it lets us map the data points into a high dimensional space implicitly.<br />
Let's look at how this is possible:<br />
<br />
<math>Z_1=<br />
\begin{bmatrix}<br />
x_1\\<br />
y_1<br />
\end{bmatrix}\xrightarrow{\phi}<br />
</math><br />
<math>\phi(Z_1)=<br />
\begin{bmatrix}<br />
x_1^2\\<br />
y_1^2\\<br />
\sqrt2x_1y_1<br />
\end{bmatrix}.<br />
<br />
</math><br />
<math>Z_2=<br />
\begin{bmatrix}<br />
x_2\\<br />
y_2<br />
\end{bmatrix}\xrightarrow{\phi}<br />
</math><br />
<math>\phi(Z_2)=<br />
\begin{bmatrix}<br />
x_2^2\\<br />
y_2^2\\<br />
\sqrt2x_2y_2<br />
\end{bmatrix}<br />
</math><br />
<br />
The inner product of <math>\displaystyle \phi(Z_1)</math> and <math>\displaystyle\phi(Z_2)</math>, which is denoted <math>\displaystyle\phi(Z_1)^T\phi(Z_2)</math>, is equal to:<br />
<math><br />
\begin{bmatrix}<br />
x_1^2&y_1^2&\sqrt2x_1y_1 <br />
\end{bmatrix}<br />
\begin{bmatrix}<br />
x_2^2\\<br />
y_2^2\\<br />
\sqrt2x_2y_2 <br />
\end{bmatrix}=</math> <math>\displaystyle (x_1x_2+y_1y_2)^2=K(Z_1,Z_2)</math>.<br />
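<br />
As a quick numerical check of this identity (an illustrative sketch, not part of the original notes), we can compare the explicit feature-space inner product with the kernel value for two arbitrary points:<br />
<br />
<pre><br />
% Verify that the inner product of phi(Z1) and phi(Z2) equals (x1*x2 + y1*y2)^2<br />
Z1 = [1; 2];  Z2 = [3; -1];                       % two arbitrary 2-D points<br />
phi = @(z) [z(1)^2; z(2)^2; sqrt(2)*z(1)*z(2)];   % explicit feature map<br />
lhs = phi(Z1)'*phi(Z2)                            % inner product in feature space<br />
rhs = (Z1'*Z2)^2                                  % kernel evaluated in the original space<br />
</pre><br />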
<br />
'''The most common Kernel functions are as follows:'''<br />
*Linear: <math>\displaystyle K_{ij}=<X_i,X_j></math><br />
*Polynomial: <math>\displaystyle K_{ij}=(1+<X_i,X_j>)^p</math><br />
*Gaussian: <math>\displaystyle K_{ij}=e^\frac{-{\left\Vert X_i-X_j\right\Vert}^2}{2\sigma^2}</math>,<br />
where <math>\displaystyle <X_i,X_j></math> denotes the inner product of <math>\displaystyle X_i</math> and <math>\displaystyle X_j</math>, and <math>{\left\Vert X_i-X_j\right\Vert}^2</math> denotes the squared Euclidean distance between the vectors <math>\displaystyle X_i</math> and <math>\displaystyle X_j</math>.<br />
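<br />
For instance, a Gaussian kernel matrix for a d by n data matrix X can be computed as follows (a minimal sketch; the data and the bandwidth sigma are illustrative assumptions):<br />
<br />
<pre><br />
% Gaussian (RBF) kernel matrix for the columns of X (d-by-n), bandwidth sigma<br />
X = randn(5,100);                       % stand-in data for illustration<br />
sigma = 1;<br />
sq = sum(X.^2,1);                       % squared norm of each column, 1-by-n<br />
n = size(X,2);<br />
D2 = repmat(sq',1,n) + repmat(sq,n,1) - 2*(X'*X);   % pairwise squared distances<br />
K = exp(-D2/(2*sigma^2));               % n-by-n kernel matrix, K(i,j) = k(x_i,x_j)<br />
</pre><br />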
<br />
<br />
==''' Kernel PCA -November,15,2011'''==<br />
<br />
First we look at the algorithm for PCA and see how we can kernelize PCA:<br />
<br />
=== PCA ===<br />
<br />
Find eigenvectors of <math>XX^T</math>, call it U<br />
<br />
<math><br />
\begin{align}<br />
Y &= U^{T}X \\<br />
\hat{X} & = UY<br />
\end{align}<br />
</math><br />
<br />
=== Modifying PCA ===<br />
<br />
<math><br />
\begin{align}<br />
\left[ U \Sigma V \right] & = svd(X) \\<br />
X & = U\Sigma{V^T}<br />
\end{align}<br />
</math><br />
<br />
The columns of U are the eigenvectors of <math>XX^T</math>.<br />
<br />
The columns of V are the eigenvectors of <math>X^T{X}</math>.<br />
<br />
Now we want to kernelize this classical version of PCA.<br />
<br />
We would like to express everything in terms of V, the eigenvectors of <math>X^T{X}</math>, because <math>X^T{X}</math> is the matrix of inner products and can therefore be kernelized. This is called Dual PCA.<br />
<br />
<math><br />
\begin{align}<br />
X&= U \Sigma V^T \\<br />
XV&=U \Sigma V^T V = U\Sigma \\<br />
U&=XV\Sigma^{-1}<br />
\end{align}<br />
</math><br />
<br />
Find eigenvectors of <math>X^TX</math>, call it V.<br />
<br />
<math><br />
\begin{align}<br />
X&=U \Sigma V^T \\<br />
U^TX &= U^TU\Sigma V^T \\<br />
U^TX &= \Sigma V^T \\<br />
Y&=\Sigma V^T \\<br />
\end{align}<br />
</math><br />
<br />
Reconstruct Points<br />
<br />
<math><br />
\begin{align}<br />
\hat{X}&=UY \\<br />
&=XV\Sigma^{-1}\Sigma{V^T} \\<br />
&= XVV^T<br />
\end{align}<br />
</math><br />
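<br />
A small MATLAB sketch can confirm these identities numerically (illustrative only; the data matrix is a stand-in and is assumed to be centered and of full rank):<br />
<br />
<pre><br />
% Dual PCA sketch: everything expressed through V, the eigenvectors of X'*X<br />
d = 5; n = 50;<br />
X = randn(d,n);  X = X - repmat(mean(X,2),1,n);   % centered stand-in data<br />
[U,S,V] = svd(X,'econ');                          % X = U*S*V'<br />
U2 = X*V/S;                                       % U recovered as X*V*inv(S)<br />
Y  = S*V';                                        % projection written via S and V<br />
Xhat = X*V*V';                                    % reconstruction using only V<br />
norm(U - U2), norm(Y - U'*X), norm(Xhat - X)      % all close to zero<br />
</pre><br />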
<br />
Map an out-of-sample point x to the low-dimensional space:<br />
<br />
<math><br />
\begin{align}<br />
y &=U^Tx \\<br />
& = (XV\Sigma^{-1})^Tx \\<br />
& = \Sigma^{-1}{V^T}{X^T}x<br />
\end{align}<br />
</math><br />
<br />
Reconstruct an out-of-sample point:<br />
<br />
<br />
<math><br />
\begin{align}<br />
\hat{x} &= Uy=XV\Sigma^{-1}\Sigma^{-1}V^T{X^T}x \\<br />
&= XV\Sigma^{-2}V^T{X^T}x<br />
\end{align}<br />
</math></div>S9huhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341f11&diff=14859stat341f112011-11-15T19:42:34Z<p>S9hu: /* To solve PCA */</p>
<hr />
<div>Please contribute to the discussion of splitting up this page into multiple pages on the [[{{TALKPAGENAME}}|talk page]].<br />
<br />
==[[signupformStat341F11| Editor Sign Up]]==<br />
<br />
==Notation==<br />
<br />
The following guidelines on notation were posted on the Wiki Course Note page for [[Stat946f11|STAT 946]]. Add to them as necessary for consistent notation on this page.<br />
<br />
Capital letters will be used to denote random variables and lower case letters denote observations for those random variables:<br />
<br />
* <math>\{X_1,\ X_2,\ \dots,\ X_n\}</math> random variables<br />
* <math>\{x_1,\ x_2,\ \dots,\ x_n\}</math> observations of the random variables<br />
<br />
The joint ''probability mass function'' can be written as:<br />
<center><math> P( X_1 = x_1, X_2 = x_2, \dots, X_n = x_n )</math></center><br />
or as shorthand, we can write this as <math>p( x_1, x_2, \dots, x_n )</math>. In these notes both types of notation will be used.<br />
We can also define a set of random variables <math>X_Q</math> where <math>Q</math> represents a set of subscripts.<br />
<br />
<br />
==Sampling - September 20, 2011==<br />
<br />
The meaning of sampling is to generate data points or numbers such that these data follow a certain distribution.<br /><br />
i.e. From <math>x \sim~f(x)</math> sample <math>\,x_{1}, x_{2}, ..., x_{1000}</math><br />
<br />
In practice, it maybe difficult to find the joint distribution of random variables. Through simulating the random variables, we can make an inference from the data.<br />
<br />
===Sampling from Uniform Distribution===<br />
Computers cannot generate random numbers as they are deterministic; however they can produce pseudo random numbers using algorithms. Generated numbers mimic the properties of random numbers but they are never truly random. One famous algorithm that is considered highly reliable is the Mersenne twister[http://en.wikipedia.org/wiki/Mersenne_twister], which generates random numbers in an almost uniform distribution. <br />
<br />
<br />
====Multiplicative Congruential====<br />
*involves four parameters: integers <math>\,a, b, m</math>, and an initial value <math>\,x_0</math> which we call the seed<br />
*a sequence of integers is defined as<br />
:<math>x_{k+1} \equiv (ax_{k} + b) \mod{m}</math><br />
<br />
'''Example:''' <math>\,a=13, b=0, m=31, x_0=1</math> creates a uniform histogram.<br />
<br />
MATLAB code for generating 1000 random numbers using the multiplicative congruential method:<br />
<br />
<pre><br />
a = 13;<br />
b = 0;<br />
m = 31;<br />
x(1) = 1;<br />
<br />
for ii = 2:1000<br />
x(ii) = mod(a*x(ii-1)+b, m);<br />
end<br />
</pre><br />
<br />
MATLAB code for displaying the values of x generated:<br />
<br />
<pre><br />
x<br />
</pre><br />
<br />
MATLAB code for plotting the histogram of x:<br />
<br />
<pre><br />
hist(x)<br />
</pre><br />
<br />
Histogram Output:<br />
<br />
[[File:uniform.jpg]]<br />
<br />
Facts about this algorithm:<br />
*In this example, the first 30 terms in the sequence are a permutation of integers from 1 to 30 and then the sequence repeats itself.<br />
*Values are between <b>0</b> and <b>m-1</b>, inclusive.<br />
*Dividing the numbers by <b> m-1 </b> yields numbers in the interval <b>[0,1]</b>.<br />
*MATLAB's <code>rand</code> function once used this algorithm with <b>a= 7<sup>5</sup></b>, <b>b= 0</b>, <b>m= 2<sup>31</sup>-1</b>,for reasons described in Park and Miller's 1988 paper "Random Number Generators: Good Ones are Hard to Find" (available [http://www.firstpr.com.au/dsp/rand31/p1192-park.pdf online]).<br />
*Visual Basic's <code>RND</code> function also used this algorithm with <b>a= 1140671485</b>, <b>b= 12820163</b>, <b>m= 2<sup>24</sup></b>. ([http://support.microsoft.com/kb/231847 Reference])<br />
<br />
===Inverse Transform Method===<br />
This is a basic method for sampling. Theoretically using this method we can generate sample numbers at random from any probability distribution once we know its cumulative distribution function (cdf).<br />
<br />
====Theorem====<br />
Take <math>U \sim~ \mathrm{Unif}[0, 1]</math> and let <math>X = F^{-1}(U) </math>. Then <math>X</math> has distribution function <math>F(\cdot)</math>, where <math>F(x)=P(X \leq x)</math> and <math>F^{-1}(\cdot)</math> is the inverse of <math>F(\cdot)</math>.<br />
<br />
Therefore <math>F(x)=u\implies x=F^{-1}(u)</math><br />
<br />
'''Proof'''<br />
<br />
Recall that<br />
<br />
:<math>P(a \leq X<b)=\int_a^{b} f(x) dx</math><br />
<br />
:<math>cdf=F(x)=P(X \leq x)=\int_{-\infty}^{x} f(s) ds</math><br />
<br />
Note that if <math>U \sim~ \mathrm{Unif}[0, 1]</math>, we have <math>P(U \leq a)=a</math><br />
<br />
:<math>\begin{align}<br />
<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
====Continuous Case====<br />
Generally it takes two steps to get random numbers using this method.<br />
<br />
*Step 1. Draw <math>U \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <b><i>X=F <sup>&minus;1</sup>(U)</i></b><br />
<br />
'''Example'''<br />
<br />
Take the exponential distribution for example<br />
:<math>\,f(x)={\lambda}e^{-{\lambda}x}</math><br />
:<math>\,F(x)=\int_0^x {\lambda}e^{-{\lambda}u} du=[-e^{-{\lambda}u}]_0^x=1-e^{-{\lambda}x}</math><br />
<br />
Let: <math>\,F(x)=y</math><br />
:<math>\,y=1-e^{-{\lambda}x}</math><br />
:<math>\,ln(1-y)={-{\lambda}x}</math><br />
:<math>\,x=\frac{ln(1-y)}{-\lambda}</math><br />
:<math>\,F^{-1}(x)=\frac{-ln(1-x)}{\lambda}</math><br />
<br />
Therefore, to get an exponential distribution from a uniform distribution takes 2 steps.<br />
*Step 1. Draw <math>U \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <math>x=\frac{-ln(1-U)}{\lambda}</math><br />
<br />
Note: If U~Unif[0, 1], then (1 - U) and U have the same distribution. This allows us to slightly simplify step 2 into an alternate form:<br />
*Alternate Step 2. <math>x=\frac{-ln(U)}{\lambda}</math><br />
<br />
'''MATLAB code'''<br />
for exponential distribution case,assuming <math>\lambda=0.5</math><br />
<br />
<pre><br />
for ii = 1:1000<br />
u = rand;<br />
x(ii) = -log(1-u)/0.5;<br />
end<br />
hist(x)<br />
</pre><br />
<br />
MATLAB result<br />
<br />
[[File:MATLAB_Exp.jpg|center|300px]]<br />
<br />
====Discrete Case - September 22, 2011====<br />
This same technique can be applied to the discrete case. Generate a discrete random variable <math>\,x</math> that has probability mass function <math>\,P(X=x_i)=P_i </math> where <math>\,x_0<x_1<x_2...</math> and <math>\,\sum_i P_i=1</math><br />
*Step 1. Draw <math>u \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <math>\,x=x_i</math> if <math>\,F(x_{i-1})<u \leq F(x_i)</math><br />
<br />
'''Example'''<br />
<br />
Let x be a discrete random variable with the following probability mass function:<br />
<br />
:<math>\begin{align}<br />
P(X=0) = 0.3 \\<br />
P(X=1) = 0.2 \\<br />
P(X=2) = 0.5<br />
\end{align}</math><br />
<br />
Given the pmf, we now need to find the cdf.<br />
<br />
We have:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0 & x < 0 \\<br />
0.3 & 0 \leq x < 1 \\<br />
0.5 & 1 \leq x < 2 \\<br />
1 & 2 \leq x<br />
\end{cases}</math><br />
<br />
We can apply the inverse transform method to obtain our random numbers from this distribution.<br />
<br />
'''Pseudo Code for generating the random numbers:'''<br />
<pre><br />
Draw U ~ Unif[0,1] <br />
if U <= 0.3 <br />
return 0 <br />
else if 0.3 < U <= 0.5 <br />
return 1<br />
else if 0.5 < U <= 1 <br />
return 2<br />
</pre><br />
<br />
'''MATLAB code for generating 1000 random numbers in the discrete case:'''<br />
<br />
<pre><br />
for ii = 1:1000<br />
u = rand;<br />
<br />
if u <= 0.3<br />
x(ii) = 0;<br />
elseif u <= 0.5<br />
x(ii) = 1;<br />
else<br />
x(ii) = 2;<br />
end<br />
end<br />
</pre><br />
<br />
Matlab Output:<br />
<br />
[[File:Discreteinv.jpg]]<br />
<br />
'''Pseudo code for the Discrete Case:'''<br />
<br />
1. Draw U ~ Unif [0,1]<br />
<br />
2. If <math> U \leq P_0 </math>, deliver <b><i>X= x<sub>0</sub></i></b><br />
<br />
3. Else if <math> U \leq P_0 + P_1 </math>, deliver <b><i>X= x<sub>1</sub></i></b><br />
<br />
4. Else If <math> U \leq P_0 +....+ P_k </math>, deliver <b><i>X= x<sub>k</sub></i></b><br />
<br />
====Limitations====<br />
<br />
Although this method is useful, it isn't practical in many cases since we can't always obtain <math>F</math> or <math> F^{-1} </math> as some functions are not integrable or invertible, and sometimes even <math>f(x)</math> itself cannot be obtained in closed form. Let's look at some examples:<br />
*Continuous case<br />
If we want to use this method to sample from the '''normal distribution''', we may find ourselves stuck when trying to find its ''cdf''. <br />
The simplest case of '''normal distribution''' is <math>f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}</math>,<br />
whose ''cdf'' is <math>F(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}{e^{-\frac{u^2}{2}}}du</math>. This integral cannot be expressed in terms of elementary functions. So evaluating it and then finding the inverse is a very difficult task.<br />
*Discrete case <br />
It is easy for us to simulate when there are only a few values taken by the particular random variable, like the case above.<br />
And it is easy to simulate the '''binomial distribution''' <math>X \sim~ \mathrm{B}(n,p)</math> when the parameter n is not too large.<br />
But when n takes on values that are very large, say 50, it is hard to do so.<br />
<br />
===Acceptance/Rejection Method===<br />
<br />
<br />
The aforementioned difficulties of the inverse transform method motivate a sampling method that does not require analytically calculating cdfs and their inverses: the acceptance/rejection sampling method. Here, <math> \displaystyle f(x)</math> is approximated by another function, say <math>\displaystyle g(x)</math>, with the idea being that <math>\displaystyle g(x)</math> is a "nicer" function to work with than <math>\displaystyle f(x)</math>.<br />
<br />
Suppose we assume the following:<br />
<br />
1. There exists another distribution <math>\displaystyle g(x)</math> that is easier to work with and that you know how to sample from, and<br />
<br />
2. There exists a constant c such that <math>f(x) \leq c \cdot g(x)</math> for all x<br />
<br />
Under these assumptions, we can sample from <math>\displaystyle f(x)</math> by sampling from <math>\displaystyle g(x)</math><br />
<br />
====General Idea====<br />
<br />
Looking at the image below we have graphed <math> c \cdot g(x) </math> and <math>\displaystyle f(x)</math>.<br />
<br />
[[File:Graph_updated.jpg]]<br />
<br />
Using the acceptance/rejection method we will accept some of the points from <math>\displaystyle g(x)</math> and reject some of the points from <math>\displaystyle g(x)</math>. The points that will be accepted from <math>\displaystyle g(x)</math> will have a distribution similar to <math>\displaystyle f(x)</math>. We can see from the image that the values around <math>\displaystyle x_1</math> will be sampled more often under <math>c \cdot g(x)</math> than under <math>\displaystyle f(x)</math>, so we will have to reject more samples taken at x<sub>1</sub>. Around <math>\displaystyle x_2</math> the number of samples that are drawn and the number of samples we need are much closer, so we accept more samples that we get at <math>\displaystyle x_2</math><br />
<br />
====Procedure====<br />
<br />
1. Draw y ~ g<br />
<br />
2. Draw U ~ Unif [0,1]<br />
<br />
3. If <math> U \leq \frac{f(y)}{c \cdot g(y)}</math> then x=y; else return to 1<br />
<br />
Note that the choice of <math> c </math> plays an important role in the efficiency of the algorithm. We want <math> c \cdot g(x) </math> to be "tightly fit" over <math> f(x) </math> to increase the probability of accepting points, and therefore reducing the number of sampling attempts. Mathematically, we want to minimize <math> c </math> such that <math>f(x) \leq c \cdot g(x) \ \forall x</math>. We do this by setting<br />
<br />
<math> \frac{d}{dx}(\frac{f(x)}{g(x)}) = 0 </math>, solving for a maximum point <math> x_0 </math> and setting <math> c = \frac{f(x_0)}{g(x_0)}. </math><br />
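<br />
As a quick numerical illustration (not from the lecture), c can also be approximated on a grid; here f is taken, for example, to be the Beta(2,2) density and g the Unif(0,1) density:<br />
<br />
<pre><br />
% Approximate c = max f(x)/g(x) on a grid (illustrative choice of f and g)<br />
f = @(x) 6.*x.*(1-x);          % Beta(2,2) density<br />
g = @(x) ones(size(x));        % Unif(0,1) density<br />
xx = linspace(0,1,10001);<br />
[c, idx] = max(f(xx)./g(xx));  % c is about 1.5, attained near x = 0.5<br />
c, xx(idx)<br />
</pre><br />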
<br />
====Proof====<br />
<br />
Mathematically, we need to show that the sample points given that they are accepted have a distribution of f(x).<br />
<br />
<math>\begin{align} P(y|accepted) &= \frac{P(y, accepted)}{P(accepted)} \\<br />
<br />
&= \frac{P(accepted|y) P(y)}{P(accepted)}\end{align} </math> (Bayes' Rule)<br />
<br />
<br />
<br />
<math>\displaystyle P(y) = g(y)</math><br />
<br />
<math>P(accepted|y) =P(u\leq \frac{f(y)}{c \cdot g(y)}) =\frac{f(y)}{c \cdot g(y)} </math>,where u ~ Unif [0,1]<br />
<br />
<math>P(accepted) = \sum P(accepted|y)\cdot P(y)=\int^{}_y \frac{f(y)}{c \cdot g(y)}g(y) dy=\int^{}_y \frac{f(y)}{c} dy=\frac{1}{c} \cdot\int^{}_y f(y) dy=\frac{1}{c}</math><br />
<br />
So,<br />
<br />
<math> P(y|accepted) = \frac{ \frac {f(y)}{c \cdot g(y)} \cdot g(y)}{\frac{1}{c}} =f(y) </math><br />
<br />
====Continuous Case====<br />
<br />
'''Example'''<br />
<br />
Sample from Beta(2,1)<br />
<br />
In general:<br />
<br />
Beta(<math>\alpha, \beta) = \frac{\Gamma (\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}</math> <math>\displaystyle x^{\alpha-1}</math> <math>\displaystyle(1-x)^{\beta-1}</math>, <math>\displaystyle 0<x<1</math><br />
<br />
Note: <math>\!\Gamma(n) = (n-1)!</math> if n is a positive integer<br />
<br />
<math>\begin{align} f(x) &= Beta(2,1) \\<br />
&= \frac{\Gamma(3)}{\Gamma(2)\Gamma(1)} x^1(1-x)^0 \\<br />
&= \frac{2!}{1! 0!}\cdot (1) x \\<br />
&= 2x \end{align}</math><br />
<br />
We want to choose <math>\displaystyle g(x)</math> that is easy to sample from. So we choose <math>\displaystyle g(x)</math> to be uniform distribution.<br />
<br />
We now want a constant c such that <math>f(x) \leq c \cdot g(x) </math> for all x from Unif(0,1)<br />
<br />
<br />
So,<br /><br />
<br />
<math>c \geq \frac{f(x)}{g(x)}</math>, for all x from (0,1)<br />
<br />
<br />
<math>\begin{align}c &\geq max (\frac {f(x)}{g(x)}, 0<x<1) \\<br />
<br />
<br />
&= max (\frac {2x}{1},0<x<1) \\<br />
<br />
<br />
&= 2 \end{align}</math><br />
<br />
<br />
<br />
Now that we have c =2,<br />
<br />
1. Draw y ~ g(x) => Draw y ~ Unif [0,1] <br />
<br />
2. Draw u ~ Unif [0,1] <br />
<br />
3. if <math>u \leq \frac{2y}{2 \cdot 1}</math> then x=y; else return to 1<br />
<br />
<br />
'''MATLAB code for generating 1000 samples following Beta(2,1):'''<br />
<br />
<pre><br />
close all<br />
clear all<br />
ii=1;<br />
while ii <= 1000<br />
y = rand;<br />
u = rand;<br />
<br />
if u <= y<br />
x(ii)=y;<br />
ii=ii+1;<br />
end<br />
end<br />
hist(x)<br />
</pre><br />
<br />
'''MATLAB result'''<br />
<br />
[[File:MATLAB_Beta.jpg]]<br />
<br />
====Discrete Example====<br />
<br />
Generate random variables according to the p.m.f:<br />
<br />
:<math>\begin{align}<br />
P(Y=1) = 0.15 \\<br />
P(Y=2) = 0.22 \\<br />
P(Y=3) = 0.33 \\<br />
P(Y=4) = 0.10 \\<br />
P(Y=5) = 0.20 <br />
\end{align}</math><br />
<br />
find a g(y) discrete uniform distribution from 1 to 5<br />
<br />
<math>c \geq \frac{P(y)}{g(y)} </math><br><br />
<math>c = \max \left(\frac{P(y)}{g(y)} \right)</math><br><br />
<math>c = \max \left(\frac{0.33}{0.2} \right) = 1.65</math> Since P(Y=3) is the max of P(Y) and g(y) = 0.2 for all y.<br><br />
<br />
1. Generate Y according to the discrete uniform between 1 - 5<br />
<br />
2. U ~ unif[0,1]<br />
<br />
3. If <math>U \leq \frac{P(y)}{1.65 \times 0.2} = \frac{P(y)}{0.33} </math>, then x = y; else return to 1.<br />
<br />
In MATLAB, the code would be:<br />
<br />
py = [0.15 0.22 0.33 0.1 0.2];<br />
ii =1;<br />
while ii <= 1000<br />
y = unidrnd(5);<br />
u = rand;<br />
if u <= py(y)/0.33<br />
x(ii) = y;<br />
ii = ii+1;<br />
end<br />
end<br />
hist(x);<br />
<br />
MATLAB result<br />
<br />
[[File:MATLAB_Y.jpg]]<br />
<br />
====Limitations====<br />
<br />
Most of the time we have to sample many more points from g(x) before we can obtain an acceptable amount of samples from f(x), hence this method may not be computationally efficient. It depends on our choice of g(x). For example, in the example above to sample from Beta(2,1), we need roughly 2000 samples from g(X) to get 1000 acceptable samples of f(x).<br />
<br />
In addition, a mismatch between the shapes of f(x) and g(x) can render this method unreliable. For example, if g(x) is a normal density and f(x) has a "fat" mid-section and "thin" tails, then most points drawn near the two ends of g(x) will be rejected, so an overwhelming number of draws is needed because of the high rejection rate.<br />
<br />
===Sampling From Gamma and Normal Distribution - September 27, 2011===<br />
<br />
====Sampling From Gamma====<br />
<br />
'''Gamma Distribution'''<br />
<br />
The Gamma distribution is written as <math>X \sim~ Gamma (t, \lambda) </math><br />
<br />
:<math> F(x) = \int_{0}^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If you have t samples of the exponential distribution,<br><br />
<br> <math> \begin{align} X_1 \sim~ Exp(\lambda)\\ \vdots \\ X_t \sim~ Exp(\lambda) \end{align}<br />
</math><br />
<br />
The sum of these t samples has a gamma distribution,<br />
<br />
:<math> X_1+X_2+ ... + X_t \sim~ Gamma (t, \lambda) </math><br><br />
:<math> \sum_{i=1}^{t} X_i \sim~ Gamma (t, \lambda) </math> where <math>X_i \sim~Exp(\lambda)</math><br><br />
<br />
'''Method'''<br />
<br />
We can sample the exponential distribution using the inverse transform method from previous class,<br><br />
:<math>\,f(x)={\lambda}e^{-{\lambda}x}</math><br><br />
:<math>\,F^{-1}(u)=\frac{-ln(1-u)}{\lambda}</math><br><br />
:<math>\,F^{-1}(u)=\frac{-ln(u)}{\lambda}</math> <br />
since 1 - u has the same distribution as u when <math>U \sim~ unif [0,1] </math><br><br />
:<math> x_1 = \frac{-ln(u_1)}{\lambda},\ x_2 = \frac{-ln(u_2)}{\lambda},\ ...,\ x_t = \frac{-ln(u_t)}{\lambda} </math><br><br />
:<math> x = x_1 + x_2 + ... + x_t = \frac {-\sum_{i=1}^{t} ln(u_i)}{\lambda}</math><br />
<br />
'''MATLAB code''' for a Gamma(3,1) is<br />
<br />
<pre><br />
x = sum(-log(rand(1000,3)),2); <br />
hist(x)<br />
</pre><br />
<br />
And the Histogram of X follows a Gamma distribution with long tail: <br />
<br />
[[File:Hist.PNG|center|500px]]<br />
<br />
We can improve the quality of histogram by adding the number of bins we want, like hist(x, number_of_bins)<br />
<br />
<pre><br />
x = sum(-log(rand(20000,3)),2); <br />
hist(x,40)<br />
</pre><br />
<br />
[[File:untitled.jpg|center|500px]]<br />
<br />
''' R code''' for a Gamma(3,1) is<br />
<pre><br />
a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);<br />
hist(a);<br />
</pre><br />
And the histogram is <br />
<br />
[[File:hist_gamma.png|center|500px]]<br />
<br />
Here is another histogram of Gamma coding with R<br />
<pre><br />
a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);<br />
hist(a,freq=F);<br />
lines(density(a),col="blue");<br />
rug(jitter(a));<br />
</pre><br />
[[File:hist_gamma_2.png|center|500px]]<br />
<br />
====Sampling from Normal Distribution using Box-Muller Transform - September 29, 2011====<br />
<br />
=====Procedure=====<br />
<br />
# Generate <math>\displaystyle u_1</math> and <math>\displaystyle u_2</math>, two values sampled from a uniform distribution between 0 and 1.<br />
# Set <math>\displaystyle R^2 = -2log(u_1)</math> so that <math>\displaystyle R^2</math> is exponential with rate 1/2 (mean 2) <br> Set <math>\!\theta = 2*\pi*u_2</math> so that <math>\!\theta</math> ~ Unif[0, 2<math>\displaystyle\pi</math>]<br />
# Set <math>\displaystyle X = R cos(\theta)</math> <br> Set <math>\displaystyle Y = R sin(\theta)</math><br />
<br />
=====Justification=====<br />
<br />
Suppose we have X ~ N(0, 1) and Y ~ N(0, 1) where X and Y are independent normal random variables. The relative probability density function of these two random variables using Cartesian coordinates is:<br />
<br />
<math> f(X, Y) dxdy= f(X) f(Y) dxdy= \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \frac{1}{\sqrt{2\pi}}e^{-y^2/2} dxdy= \frac{1}{2\pi}e^{-(x^2+y^2)/2}dxdy </math> <br><br />
<br />
In polar coordinates <math>\displaystyle R^2 = x^2 + y^2</math>, so the relative probability density function of these two random variables using polar coordinates is:<br />
<br />
<math> f(R, \theta) = \frac{1}{2\pi}e^{-R^2/2} </math> <br><br />
<br />
If we have <math>\displaystyle R^2 \sim exp(1/2)</math> and <math>\!\theta \sim unif[0, 2\pi]</math> we get an equivalent relative probability density function. Notice that under this two-dimensional change of variables, the determinant of the Jacobian must be included, where<br />
<br />
<math> |J|=\left|\frac{\partial(x,y)}{\partial(R,\theta)}\right|= \left|\begin{matrix}\frac{\partial x}{\partial R}&\frac{\partial x}{\partial \theta}\\\frac{\partial y}{\partial R}&\frac{\partial y}{\partial \theta}\end{matrix}\right|=R </math> <br><br />
<br />
<math> f(X, Y) dxdy = f(R, \theta)|J|dRd\theta = \frac{1}{2\pi}e^{-R^2/2}R dRd\theta= \frac{1}{4\pi}e^{-\frac{s}{2}} dSd\theta </math> <br>where <math> S=R^2. </math> <br><br />
<br />
Therefore we can generate a point in polar coordinates using the uniform and exponential distributions, then convert the point to Cartesian coordinates and the resulting X and Y values will be equivalent to samples generated from N(0, 1).<br />
<br />
'''MATLAB code'''<br />
<br />
In MatLab this algorithm can be implemented with the following code, which generates 20,000 samples from N(0, 1):<br />
<br />
<pre><br />
x = zeros(10000, 1);<br />
y = zeros(10000, 1);<br />
for ii = 1:10000<br />
u1 = rand;<br />
u2 = rand;<br />
R2 = -2 * log(u1);<br />
theta = 2 * pi * u2;<br />
x(ii) = sqrt(R2) * cos(theta);<br />
y(ii) = sqrt(R2) * sin(theta);<br />
end<br />
hist(x)<br />
</pre><br />
<br />
In one execution of this script, the following histogram for x was generated:<br />
<br />
[[File:Hist standard normal.jpg|center|500px]]<br />
<br />
=====Non-Standard Normal Distributions=====<br />
<br />
'''Example 1: Single-variate Normal'''<br />
<br />
If X ~ Norm(0, 1) then (a + bX) has a normal distribution with a mean of <math>\displaystyle a</math> and a standard deviation of <math>\displaystyle b</math> (which is equivalent to a variance of <math>\displaystyle b^2</math>). Using this information with the Box-Muller transform, we can generate values sampled from some random variable <math>\displaystyle Y\sim N(a,b^2) </math> for arbitrary values of <math>\displaystyle a,b</math>.<br />
<br />
# Generate a sample u from Norm(0, 1) using the Box-Muller transform.<br />
# Set v = a + bu.<br />
<br />
The values of v generated in this way will be equivalent to samples from a <math>\displaystyle N(a, b^2)</math> distribution. We can modify the MatLab code used in the last section to demonstrate this. We just need to add one line before we generate the histogram:<br />
<br />
<pre><br />
x = a + b * x;<br />
</pre><br />
<br />
For instance, this is the histogram generated when b = 15, a = 125:<br />
<br />
[[File:Hist normal.jpg|center|500px]]<br />
<br />
'''Example 2: Multi-variate Normal'''<br />
<br />
The Box-Muller method can be extended to higher dimensions to generate multivariate normals. The objects generated will be nx1 vectors, and their variance will be described by nxn covariance matrices.<br />
<br />
<math>\mathbf{z} \sim N(\mathbf{u}, \Sigma)</math> defines the n by 1 vector <math>\mathbf{z}</math> such that:<br />
<br />
* <math>\displaystyle u_i</math> is the average of <math>\displaystyle z_i</math><br />
* <math>\!\Sigma_{ii}</math> is the variance of <math>\displaystyle z_i</math><br />
* <math>\!\Sigma_{ij}</math> is the co-variance of <math>\displaystyle z_i</math> and <math>\displaystyle z_j</math><br />
<br />
If <math>\displaystyle z_1, z_2, ..., z_d</math> are normal variables with mean 0 and variance 1, then the vector <math>\displaystyle (z_1, z_2,..., z_d) </math> has mean 0 and variance <math>\!I</math>, where 0 is the zero vector and <math>\!I</math> is the identity matrix. This fact suggests that the method for generating a multivariate normal is to generate each component individually as single normal variables.<br />
<br />
The mean and the covariance matrix of a multivariate normal distribution can be adjusted in ways analogous to the single variable case. If <math>\mathbf{z} \sim N(0,I)</math>, then <math>\Sigma^{1/2}\mathbf{z}+\mu \sim N(\mu,\Sigma)</math>. Note here that the covariance matrix is symmetric and positive semidefinite, so its square root always exists.<br />
<br />
We can compute <math>\mathbf{z}</math> in the following way:<br />
<br />
# Generate an n by 1 vector <math>\mathbf{x} = \begin{bmatrix}x_{1} & x_{2} & ... & x_{n}\end{bmatrix}</math> where <math>x_{i}</math> ~ Norm(0, 1) using the Box-Muller transform.<br />
# Calculate <math>\!\Sigma^{1/2}</math> using singular value decomposition.<br />
# Set <math>\mathbf{z} = \Sigma^{1/2} \mathbf{x} + \mathbf{u}</math>.<br />
<br />
The following MatLab code provides an example, where a scatter plot of 10000 random points is generated. In this case x and y have a co-variance of 0.9 - a very strong positive correlation.<br />
<br />
<pre><br />
x = zeros(10000, 1);<br />
y = zeros(10000, 1);<br />
for ii = 1:10000<br />
u1 = rand;<br />
u2 = rand;<br />
R2 = -2 * log(u1);<br />
theta = 2 * pi * u2;<br />
x(ii) = sqrt(R2) * cos(theta);<br />
y(ii) = sqrt(R2) * sin(theta);<br />
end<br />
<br />
E = [1, 0.9; 0.9, 1];<br />
[u s v] = svd(E);<br />
root_E = u * (s ^ (1 / 2));<br />
<br />
z = (root_E * [x y]);<br />
z(:,1) = z(:,1) + 5;<br />
z(:,2) = z(:,2) + -8;<br />
<br />
scatter(z(:,1), z(:,2))<br />
</pre><br />
<br />
This code generated the following scatter plot:<br />
<br />
[[File:scatter covar.jpg|center|500px]]<br />
<br />
In Matlab, we can also use the function "sqrtm()" or "chol()" (Cholesky Decomposition) to calculate the square root of a matrix directly. Note that the resulting root matrices may be different, but this does not materially affect the simulation.<br />
Here is an example:<br />
<br />
<pre><br />
E = [1, 0.9; 0.9, 1];<br />
r1 = sqrtm(E);<br />
r2 = chol(E);<br />
</pre><br />
<br />
R code for a multivariate normal distribution:<br />
<br />
<pre><br />
n=10000;<br />
r2<--2*log(runif(n));<br />
theta<-2*pi*(runif(n));<br />
x<-sqrt(r2)*cos(theta);<br />
<br />
y<-sqrt(r2)*sin(theta);<br />
a<-matrix(c(x,y),nrow=n,byrow=F);<br />
e<-matrix(c(1,.9,.9,1),nrow=2,byrow=T);<br />
svde<-svd(e);<br />
root_e<-svde$u %*% diag(sqrt(svde$d));<br />
z<-t(root_e %*%t(a));<br />
z[,1]=z[,1]+5;<br />
z[,2]=z[,2]+ -8;<br />
par(pch=19);<br />
plot(z,col=rgb(1,0,0,alpha=0.06))<br />
</pre><br />
<br />
[[File:m_normal.png|center|500px]]<br />
<br />
=====Remarks=====<br />
MATLAB's randn function uses the ziggurat method to generate normally distributed samples. It is an efficient rejection method based on covering the probability density function with a set of horizontal rectangles so as to obtain points within each rectangle. It is reported that an 800 MHz Pentium III laptop can generate over 10 million random numbers from the normal distribution in less than one second. ([http://www.mathworks.com/company/newsletters/news_notes/clevescorner/spring01_cleve.html Reference])<br />
<br />
===Sampling From Binomial Distributions===<br />
<br />
In order to generate a sample x from <math>\displaystyle X \sim Bin(n, p)</math>, we can follow the following procedure:<br />
<br />
1. Generate n uniform random numbers sampled from <math>\displaystyle Unif [0, 1] </math>: <math>\displaystyle u_1, u_2, ..., u_n</math>.<br />
<br />
2. Set x to be the number of indices i with <math>\displaystyle 1 <= i <= n</math> for which <math>\displaystyle u_i <= p</math>.<br />
<br />
In MatLab this can be coded with a single line. The following generates a sample from <math>\displaystyle X \sim Bin(n, p)</math> <br />
<br />
>> sum(rand(n, 1) <= p, 1)<br />
<br />
==Bayesian Inference and Frequentist Inference - October 4, 2011==<br />
<br />
===Bayesian inference vs Frequentist inference===<br />
The Bayesian method has become popular in the last few decades as simulation and computer technology makes it more applicable. For more information about its history and application, please refer to http://en.wikipedia.org/wiki/Bayesian_inference.<br />
As for frequentists, please refer to http://en.wikipedia.org/wiki/Frequentist_inference.<br />
<br />
====Example====<br />
Consider: A person drinks a cup of coffee on a specific day.<br />
<br><br><br />
Frequentist: There is no probability to attach to this situation; since the event has occurred only once, a relative frequency is essentially meaningless.<br />
<br><br />
Bayesian: Probability is not just about the frequent occurrences but it is what you believe about this probability.<br />
<br />
<br />
====Example of face identification====<br />
Take the face as the input x and the person as the output y. The person can be either Ali or Tom: if it is Ali, y=1; otherwise, y=0. We can divide the picture into 100*100 pixels and then stack them into a 10,000*1 column vector, which is x.<br />
<br />
If you are a frequentist, you would compare Pr(X=x|y=1) with Pr(X=x|y=0) and see which one is higher. But if you are a Bayesian, you would compare Pr(y=1|X=x) with Pr(y=0|X=x).<br />
<br />
====Summary of differences between two schools====<br />
*Frequentist: Probability refers to limiting relative frequency. (objective)<br />
*Bayesian: Probability describes degree of belief not frequency. (subjective)<br />
e.g. The probability that you drank a cup of tea on May 20, 2001 is 0.62 does not refer to any frequency.<br />
----<br />
*Frequentist: Parameters are fixed, unknown constants.<br />
*Bayesian: Parameters are random variables and we can make probabilistic statement about them.<br />
----<br />
*Frequentist: Statistical procedures should have long run frequency probabilities.<br />
e.g. a 95% confidence interval should trap the true value of the parameter in at least 95% of repeated samples, in the long run (limiting frequency)<br />
*Bayesian: It makes inferences about <math>\theta</math> by producing a probability distribution for <math>\theta</math>. Inference (e.g. point estimation) will be extracted from this distribution.<br />
<br />
====Bayesian inference====<br />
<br />
Bayesian inference is usually carried out in the following way:<br />
<br />
1. Choose a prior probability density function of <math>\!\theta</math> which is <math>f(\!\theta)</math>. This is our belief about <math>\theta</math> before we see any data.<br />
<br />
2. Choose a statistical model <math>\displaystyle f(x|\theta)</math> that reflects our beliefs about X.<br />
<br />
3. After observing data <math>\displaystyle x_1,...,x_n</math>, we update our beliefs and calculate the posterior probability.<br />
<br />
<math>f(\theta|x) = \frac{f(\theta,x)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{\int^{}_\theta f(x|\theta) \cdot f(\theta) d\theta}</math>, where <math>\displaystyle f(\theta|x)</math> is the posterior probability, <math>\displaystyle f(\theta)</math> is the prior probability, <math>\displaystyle f(x|\theta)</math> is the likelihood of observing X=x given <math>\!\theta</math> and f(x) is the marginal probability of X=x.<br />
<br />
If we have i.i.d. observations <math>\displaystyle x_1,...,x_n</math>, we can replace <math>\displaystyle f(x|\theta)</math> with <math>f({x_1,...,x_n}|\theta)=\prod_{i=1}^n f(x_i|\theta)</math> by independence.<br />
<br />
We denote <math>\displaystyle f({x_1,...,x_n}|\theta)</math> as <math>\displaystyle L_n(\theta)</math> which is called likelihood. And we use <math>\displaystyle x^n</math> to denote <math>\displaystyle (x_1,...,x_n)</math>.<br />
<br />
<math>f(\theta|x^n) = \frac{f(x^n|\theta) \cdot f(\theta)}{f(x^n)}=\frac{f(x^n|\theta) \cdot f(\theta)}{\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta}</math> , where <math>\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta</math> is a constant <math>\displaystyle c_n</math>. So <math>f(\theta|x^n) \propto f(x^n|\theta) \cdot f(\theta)</math>. The posterior probability is proportional to the likelihood times prior probability.<br />
<br />
<math>E(\theta)=\int^{}_\theta \theta \cdot f(\theta|x^n) d\theta</math> which is the posterior mean of <math>\!\theta</math>.<br />
<br />
Let <math>\tilde{\theta}=(\theta_1,...,\theta_d)^T</math>, then <math>f(\theta_1|x^n) = \int^{} \int^{} \dots \int^{}f(\theta|x^n)\,d\theta_2d\theta_3 \dots d\theta_d </math> and <math>E(\theta_1)=\int^{}\theta_1 \cdot f(\theta_1|x^n) d\theta_1</math><br />
<br />
====Example 1: Estimating parameters of a univariate Gaussian distribution====<br />
<br />
Suppose X follows a univariate Gaussian distribution (i.e. a Normal distribution) with parameters <math>\!\mu</math> and <br />
<math>\displaystyle {\sigma^2}</math>.<br />
<br />
(a) For Frequentists:<br />
<br />
<math>f(x|\theta)= \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}</math><br />
<br />
<math>L_n(\theta)= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2}</math><br />
<br />
<br />
<math>\ln L_n(\theta) = l(\theta) = \sum_{i=1}^n -\frac{1}{2}\ln 2\pi-\ln \sigma-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2</math><br />
<br />
To get the maximum likelihood estimator of <math>\!\mu</math> (mle), we find the <math>\hat{\mu}</math> which maximizes <math>\displaystyle L_n(\theta)</math>:<br />
<br />
<math>\frac{\partial l(\theta)}{\partial \mu}= \sum_{i=1}^n \frac{1}{\sigma}(\frac{x_i-\mu}{\sigma})=0 \Rightarrow \sum_{i=1}^n x_i = n\mu \Rightarrow \hat{\mu}_{mle}=\bar{x}</math><br />
<br />
(b) For Bayesians:<br />
<br />
<math>f(\theta|x) \propto f(x|\theta) \cdot f(\theta)</math><br />
<br />
We assume that the mean of the above normal distribution is itself distributed normally with mean <math>\!\mu_0</math> and variance <math>\!\Gamma^2</math>.<br />
<br />
Suppose <math>\!\mu\sim N(\mu_0, \Gamma^2)</math>,<br />
<br />
so <math>f(\mu) = \frac{1}{\sqrt{2\pi}\Gamma} \cdot e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\Gamma})^2}</math><br />
<br />
<math>f(\mu|x) = \frac{1}{\sqrt{2\pi}\tilde{\sigma}} \cdot e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
<br />
<math>\tilde{\mu} = \frac{\frac{n}{\sigma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\bar{x}+\frac{\frac{1}{\Gamma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\mu_0</math>, where <math>\tilde{\mu}</math> is the estimator of <math>\!\mu</math>.<br />
<br />
* If the prior belief about <math>\!\mu_0</math> is strong, then <math>\!\Gamma</math> is small and <math>\frac{1}{\Gamma^2}</math> is large, so <math>\tilde{\mu}</math> is close to <math>\!\mu_0</math> and the observations do not affect the estimate much. On the contrary, if the prior belief about <math>\!\mu_0</math> is weak, <math>\!\Gamma</math> is large and <math>\frac{1}{\Gamma^2}</math> is small, so <math>\tilde{\mu}</math> depends more on the observations. (This is intuitive: when our prior belief is reliable, the sample does little to change the result; when the prior belief is unreliable, we rely more on the sample.)<br />
<br />
* When the sample is large (i.e. n <math>\to \infty</math>), <math>\tilde{\mu} \to \bar{x}</math> and the impact of prior belief about <math>\!\mu</math> is weakened.<br />
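<br />
A short MATLAB sketch illustrates this shrinkage behaviour (the values of n, sigma, Gamma and mu_0 are illustrative assumptions, not from the lecture):<br />
<br />
<pre><br />
% Posterior mean of mu for normal data with a normal prior (illustrative values)<br />
n = 20; sigma = 2; mu_true = 5;<br />
x = mu_true + sigma*randn(n,1);            % simulated observations<br />
mu0 = 0; Gamma = 1;                        % prior: mu ~ N(mu0, Gamma^2)<br />
w = (n/sigma^2) / (n/sigma^2 + 1/Gamma^2); % weight on the sample mean<br />
mu_tilde = w*mean(x) + (1-w)*mu0           % posterior mean: between mean(x) and mu0<br />
mean(x)                                    % the MLE (sample mean), for comparison<br />
</pre><br />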
<br />
=='''Basic Monte Carlo Integration - October 6th, 2011'''==<br />
<br />
Three integration methods would be taught in this course:<br />
*Basic Monte Carlo Integration<br />
*Importance Sampling<br />
*Markov Chain Monte Carlo (MCMC)<br />
<br />
The first, and most basic, method of numerical integration we will see is Monte Carlo Integration. We use this to solve an integral of the form: <math> I = \int_{a}^{b} h(x) dx </math><br />
<br />
Note the following derivation: <br />
<br />
<math>\begin{align}<br />
\displaystyle I & = \int_{a}^{b} h(x)dx \\<br />
& = \int_{a}^{b} h(x)((b-a)/(b-a))dx \\<br />
& = \int_{a}^{b} (h(x)(b-a))(1/(b-a))dx \\<br />
& = \int_{a}^{b} w(x)f(x)dx \\<br />
& = E[w(x)] \\<br />
\end{align}<br />
</math><br />
<br />
which we approximate by <math>\frac{1}{n} \sum_{i=1}^{n} w(x_i) </math><br />
<br />
where w(x) = h(x)(b-a) and f(x) is the probability density function of a uniform random variable on the interval [a,b]. The expectation of w with respect to f is approximated by the average of w over n samples x<sub>i</sub> drawn from f.<br />
<br />
<br />
===='''General Procedure'''====<br />
<br />
i) Draw n samples <math> x_i \sim~ U[a,b] </math><br />
<br />
ii) Compute <math> \ w(x_i) </math> for every sample<br />
<br />
iii) Obtain an estimate of the integral, <math> \hat{I} </math>, as follows:<br />
<br />
<math> \hat{I} = \frac{1}{n} \sum_{i=1}^{n} w(x_i)</math>. Clearly, this is just the average of the simulation results.<br />
<br />
By the strong law of large numbers <math> \hat{I} </math> converges to <math> \ I </math> as <math> \ n \rightarrow \infty </math>. Because of this, we can compute all sorts of useful information, such as variance, standard error, and confidence intervals.<br />
<br />
Standard Error: <math> SE = \frac{\sqrt{V}}{\sqrt{n}} </math><br />
<br />
Variance: <math> V = \frac{\sum_{i=1}^{n} (w(x_i)-\hat{I})^2}{n-1} </math><br />
<br />
Confidence Interval: <math> \hat{I} \pm t_{(\alpha/2)} SE </math><br />
<br />
==='''Example: Uniform Distribution'''===<br />
<br />
Consider the integral, <math> \int_{0}^{1} x^3dx </math>, which is easily solved through standard analytical integration methods, and is equal to .25. Now, let us check this answer with a numerical approximation using Monte Carlo Integration. <br />
<br />
We generate a 1 by 10000 vector of uniform (on the interval [0,1]) random variables and call that vector 'u'. We see that our 'w' in this case is <math> x^3 </math>, so we set <math> w = u^3 </math>. Our <math>\hat{I}</math> is equal to the mean of w.<br />
<br />
In Matlab, we can solve this integration problem with the following code:<br />
<br />
<pre><br />
u = rand(1,10000);<br />
w = u.^3;<br />
mean(w)<br />
ans = 0.2475<br />
</pre><br />
<br />
Note the '.' after 'u' in the second line of code, indicating that each entry in the matrix is cubed. Also, our approximation is close to the actual value of .25. Now let's try to get an even better approximation by generating more sample points. <br />
<br />
<pre><br />
u= rand(1,100000);<br />
w= u.^3;<br />
mean(w)<br />
ans = .2503<br />
</pre><br />
<br />
We see that when the number of sample points is increased, our approximation improves, as one would expect.<br />
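<br />
We can also estimate the error of the approximation from the same samples, using the standard error and confidence interval formulas above (a minimal sketch):<br />
<br />
<pre><br />
u = rand(1,100000);<br />
w = u.^3;<br />
I_hat = mean(w);                          % Monte Carlo estimate of the integral<br />
SE = std(w)/sqrt(length(w));              % standard error of the estimate<br />
CI = [I_hat - 1.96*SE, I_hat + 1.96*SE]   % approximate 95% confidence interval<br />
</pre><br />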
<br />
==='''Generalization'''===<br />
<br />
Up to this point we have seen how to numerically approximate an integral when the distribution of f is uniform. Now we will see how to generalize this to other distributions.<br />
<br />
<math> I = \int h(x)f(x)dx </math> <br />
<br />
If f is a probability density function (pdf), then <math> I </math> equals E<sub>f</sub>[h(x)], the expectation of h with respect to the distribution f, which we estimate by a sample average. Our previous example is the case where f is the uniform distribution on [a,b].<br />
<br />
'''Procedure for the General Case'''<br />
<br />
i) Draw n samples from f <br />
<br />
ii) Compute h(x<sub>i</sub>)<br />
<br />
iii) <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)</math><br />
<br />
==='''Example: Exponential Distribution'''===<br />
<br />
Find <math> E[\sqrt{x}] </math> for <math> \displaystyle f = e^{-x} </math>, which is the exponential distribution with mean 1.<br />
<br />
<math> I = \int_{0}^{\infty} \sqrt{x} e^{-x}dx </math><br />
<br />
We can see that we must draw samples from f, the exponential distribution.<br />
<br />
To find a numerical solution using Monte Carlo Integration we see that: <br />
<br />
u= rand(1,10000)<br />
X= -log(u)<br />
h= <math> \sqrt{x} </math> <br />
I= mean(h)<br />
<br />
To implement this procedure in Matlab, use the following code:<br />
<br />
<pre><br />
u = rand(1,10000);<br />
X = -log(u);<br />
h = X.^.5;<br />
mean(h)<br />
ans = .8841<br />
</pre><br />
<br />
An easy way to check whether your approximation is correct is to use the built in Matlab function 'quadl' which takes a function and bounds for the integral and returns a solution for the definite integral of that function. For this specific example, we can enter:<br />
<br />
<pre><br />
f = @(x) sqrt(x).*exp(-x);<br />
% quadl runs into computational problems when the upper bound is "inf" or an extremely large number, <br />
% so choose just a moderately large number.<br />
quadl(f,0,100)<br />
ans =<br />
0.8862<br />
</pre><br />
<br />
From the above result, we see that our approximation was quite close.<br />
<br />
==='''Example: Normal Distribution'''===<br />
<br />
Let <math> f(x) = (1/(2 \pi)^{1/2}) e^{(-x^2)/2} </math>. Compute the cumulative distribution function at some point x.<br />
<br />
<math> F(x)= \int_{-\infty}^{x} f(s)ds = \int_{-\infty}^{x}(1)(1/(2 \pi)^{1/2}) e^{(-s^2)/2}ds </math>. The (1) is inserted to illustrate that our h(x) will be the constant function 1, and our f(x) is the normal distribution. To take into account the upper bound of integration, x, any values sampled that are greater than x will be set to zero. <br />
<br />
This is the Matlab code for solving F(2):<br />
<br />
<pre><br />
<br />
u = randn(1,10000);<br />
h = u < 2;<br />
mean(h)<br />
ans = .9756<br />
<br />
</pre><br />
<br />
We generate a 1 by 10000 vector of standard normal random variables and we return a value of 1 if u is less than 2, and 0 otherwise.<br />
<br />
We can also build the function F(x) in matlab in the following way:<br />
<br />
<pre><br />
function p = F(x)<br />
u = randn(1,1000000);   % samples from the standard normal<br />
h = u < x;<br />
p = mean(h);            % Monte Carlo estimate of P(X <= x)<br />
end<br />
</pre><br />
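<br />
Hypothetical usage of this helper, assuming it is saved as <code>F.m</code>:<br />
<br />
<pre><br />
F(2)                     % returns roughly 0.977<br />
0.5*erfc(-2/sqrt(2))     % closed form via the built-in erfc, about 0.9772, for comparison<br />
</pre><br />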
<br />
<br />
==='''Example: Binomial Distribution'''===<br />
<br />
In this example we will see the Bayesian Inference for 2 Binomial Distributions.<br />
<br />
Let <math> X \sim~ Bin(n,p) </math> and <math> Y \sim~ Bin(m,q) </math>, and let <math> \!\delta = p-q </math>.<br />
<br />
Therefore, <math> \displaystyle \!\delta = x/n - y/m </math> which is the frequentist approach.<br />
<br />
Bayesian wants <math> \displaystyle f(p,q|x,y) = f(x,y|p,q)f(p,q)/f(x,y) </math>, where <math> f(x,y)=\iint\limits_{\!\theta} f(x,y|p,q)f(p,q)\,dp\,dq</math> is a constant.<br />
<br />
Thus, <math> \displaystyle f(p,q|x,y)\propto f(x,y|p,q)f(p,q) </math>. Now we assume that <math>\displaystyle f(p,q) = f(p)f(q) = 1 </math> and f(p) and f(q) are uniform.<br />
<br />
Therefore, <math> \displaystyle f(p,q|x,y)\propto p^x(1-p)^{n-x}q^y(1-q)^{m-y} </math>.<br />
<br />
<math> E[\delta] = \int_{0}^{1} \int_{0}^{1} (p-q)f(p,q|x,y)\,dp\,dq </math>.<br />
<br />
As you can see this is much tougher than the frequentist approach.<br />
<br />
=='''Importance Sampling and Basic Monte Carlo Integration - October 11th, 2011'''==<br />
<br />
==='''Example: Binomial Distribution (Continued)'''===<br />
<br />
Suppose we are given two independent Binomial Distributions <math>\displaystyle X \sim Bin(n, p_1)</math>, <math>\displaystyle Y \sim Bin(m, p_2)</math>. We would like to give a Monte Carlo estimate of <math>\displaystyle \delta = p_1 - p_2</math><br><br />
<br />
Frequentist approach: <br><br><math>\displaystyle \hat{p_1} = \frac{X}{n}</math> ; <math>\displaystyle \hat{p_2} = \frac{Y}{m}</math><br><br><math>\displaystyle \hat{\delta} = \hat{p_1} - \hat{p_2} = \frac{X}{n} - \frac{Y}{m}</math><br><br><br />
Bayesian approach to compute the expected value of <math>\displaystyle \delta</math>:<br><br><br />
<math>\displaystyle E(\delta) = \int\int(p_1-p_2) f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Assume that <math>\displaystyle n = 100, m = 100, p_1 = 0.5, p_2 = 0.8</math> and the sample size is 1000.<br><br />
MATLAB code of the above example:<br />
<pre><br />
n = 100;<br />
m = 100;<br />
p_1 = 0.5;<br />
p_2 = 0.8;<br />
p1 = mean(rand(n,1000)<p_1);<br />
p2 = mean(rand(m,1000)<p_2);<br />
delta = p2 - p1;<br />
hist(delta)<br />
mean(delta)<br />
</pre><br />
<br />
In one execution of the code, the mean of delta was 0.3017. The histogram of delta generated was:<br />
[[File:Hist delta.jpg|center|]]<br />
<br />
Through Monte Carlo simulation, we can obtain an empirical distribution of delta and carry out inference on the data obtained, such as computing the mean, maximum, variance, standard deviation and the standard error of delta.<br />
<br />
==='''Importance Sampling'''===<br />
<br />
====Motivation====<br />
<br />
Consider the integral <math>\displaystyle I = \int h(x)f(x)\,dx</math><br><br><br />
According to basic Monte Carlo Integration, if we can sample from the probability density function <math>\displaystyle f(x)</math> and feed the samples of <math>\displaystyle f(x)</math> back to <math>\displaystyle h(x)</math>, <math>\displaystyle I</math> can be estimated as an average of <math>\displaystyle h(x)</math> ( i.e. <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)</math> )<br><br />
However, the Monte Carlo method works when we know how to sample from <math>\displaystyle f(x)</math>. In the case where it is difficult to sample from <math>\displaystyle f(x)</math>, importance sampling is a technique that we can apply. Importance Sampling relies on another function <math>\displaystyle g(x)</math> which we know how to sample from.<br />
<br />
The above integral can be rewritten as follow:<br><br />
<math>\begin{align}<br />
\displaystyle I & = \int h(x)f(x)\,dx \\<br />
& = \int h(x)f(x)\frac{g(x)}{g(x)}\,dx \\<br />
& = \int \frac{h(x)f(x)}{g(x)}g(x)\,dx \\<br />
& = \int y(x)g(x)\,dx \\<br />
& = E_g(y(x)) \\<br />
\end{align}<br />
</math><br><br />
<math>where \ y(x) = \frac{h(x)f(x)}{g(x)}</math><br><br />
<br />
The integral can thus be simulated as <math>\displaystyle \hat{I} = \frac{1}{n} \sum_{i=1}^{n} Y_i \ , \ where \ Y_i = \frac{h(x_i)f(x_i)}{g(x_i)}</math><br><br />
<br />
====Procedure====<br />
<br />
Suppose we know how to sample from <math>\displaystyle g(x)</math><br><br />
#Choose a suitable <math>\displaystyle g(x)</math> and draw n samples <math>x_1,x_2....,x_n \sim~ g(x)</math><br />
#Set <math>Y_i =\frac{h(x_i)f(x_i)}{g(x_i)}</math><br />
#Compute <math> \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i </math><br><br />
<br />
By the Law of large numbers, <math>\displaystyle \hat{I} \rightarrow I </math> provided that the sample size n is large enough.<br><br><br />
<br />
'''Remarks:''' One can think of <math>\frac{f(x)}{g(x)}</math> as a weight to <math>\displaystyle h(x)</math> in the computation of <math>\hat{I}</math><br><br><br />
<math>\displaystyle i.e. \ \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n}\sum_{i=1}^{n} (\frac{f(x_i)}{g(x_i)})h(x_i)</math><br><br><br />
Therefore, <math>\displaystyle \hat{I} </math> is a weighted average of <math>\displaystyle h(x_i)</math><br><br><br />
<br />
====Problem====<br />
<br />
If <math>\displaystyle g(x)</math> is not chosen appropriately, then the variance of the estimate <math>\hat{I}</math> may be very large. Here we face a problem similar to the one in the acceptance/rejection approach. Consider the second moment <math>\displaystyle E_g[(y(x))^2]</math>:<br><br><br />
<math>\begin{align}<br />
\displaystyle E_g[(y(x))^2] & = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx \\<br />
& = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx \\<br />
& = \int \frac{h^2(x)f^2(x)}{g(x)} dx \\<br />
\end{align}<br />
</math><br><br><br />
<br />
When <math>\displaystyle g(x)</math> is very small, then the above integral could be very large, hence the variance can be very large when g is not chosen appropriately. This occurs when <math>\displaystyle g(x)</math> has a thinner tail than <math>\displaystyle f(x)</math> such that the quantity <math>\displaystyle \frac{h^2(x)f^2(x)}{g(x)}</math> is large.<br />
<br />
'''Remarks:''' <br />
<br />
1. We can actually compute the form of <math>\displaystyle g(x)</math> to have optimal variance. <br>Mathematically, it is to find <math>\displaystyle g(x)</math> subject to <math>\displaystyle \min_g [\ E_g([y(x)]^2) - (E_g[y(x)])^2\ ]</math><br><br />
It can be shown that the optimal <math>\displaystyle g(x)</math> is <math>\displaystyle \frac{|h(x)|f(x)}{\int_{-\infty}^{\infty}|h(s)|f(s)ds}</math>. Using the optimal <math>\displaystyle g(x)</math> will minimize the variance of the estimate in Importance Sampling. This is of theoretical interest but not useful in practice: to write down this optimal g(x) we would need the value of the normalizing integral, which is essentially the quantity we set out to estimate in the first place.<br />
<br />
2. In practice, we shall choose <math>\displaystyle g(x)</math> which has similar shape as <math>\displaystyle f(x)</math> but with a thicker tail than <math>\displaystyle f(x)</math> in order to avoid the problem mentioned above.<br><br />
<br />
====Example====<br />
<br />
Estimate <math>\displaystyle I = Pr(Z>3),\ where\ Z \sim N(0,1)</math><br><br><br />
'''Method 1: Basic Monte Carlo'''<br />
<br />
<math>\begin{align} Pr(Z>3) & = \int^\infty_3 f(x)\,dx \\<br />
& = \int^\infty_{-\infty} h(x)f(x)\,dx \end{align}</math><br /><br />
<math> where \ <br />
h(x) = \begin{cases}<br />
0, & \text{if } x \le 3 \\<br />
1, & \text{if } x > 3<br />
\end{cases}</math><br />
<math>\ ,\ f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}</math><br />
<br />
MATLAB code to compute <math>\displaystyle I</math> from 100 samples of standard normal distribution:<br />
<pre><br />
h = randn(100,1) > 3;<br />
I = mean(h)<br />
</pre><br />
<br />
In one execution of the code, it returns a value of 0 for <math>\displaystyle I</math>, which differs significantly from the true value of <math>\displaystyle I \approx 0.0013 </math>. The problem of using Basic Monte Carlo in this example is that <math>\displaystyle Pr(Z>3)</math> has a small value, and hence many points sampled from the standard normal distribution will be wasted. Therefore, although Basic Monte Carlo is a feasible method to compute <math>\displaystyle I</math>, it gives a poor estimation.<br />
<br />
'''Method 2: Importance Sampling'''<br />
<br />
<math>\displaystyle I = Pr(Z>3)= \int^\infty_3 f(x)\,dx </math><br><br />
<br />
To apply importance sampling, we have to choose a <math>\displaystyle g(x)</math> which we will sample from. In this example, we can choose <math>\displaystyle g(x)</math> to be the probability density function of exponential distribution, normal distribution with mean 0 and variance greater than 1 or normal distribution with mean greater than 0 and variance 1 etc.. For the following, we take <math>\displaystyle g(x)</math> to be the pdf of <math>\displaystyle N(4,1)</math>.<br><br />
<br />
Procedure:<br />
#Draw n samples <math>x_1,x_2....,x_n \sim~ g(x)</math><br />
#Calculate <math>\begin{align} \frac{f(x)}{g(x)} & = \frac{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}<br />
}{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x-4)^2} } \\<br />
& = e^{8-4x} \end{align} </math><br><br />
#Set <math> Y_i = h(x_i)e^{8-4x_i}\ with\ h(x) = \begin{cases}<br />
0, & \text{if } x \le 3 \\<br />
1, & \text{if } x > 3<br />
\end{cases}<br />
</math><br><br />
#Compute <math> \hat{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i </math><br><br />
<br />
The above procedure, using 100 samples from <math>\displaystyle g(x)</math>, can be implemented in MATLAB as follows:<br />
<pre><br />
for ii = 1:100<br />
x = randn + 4 ;<br />
h = x > 3 ;<br />
y(ii) = h * exp(8-4*x) ;<br />
end<br />
mean(y)<br />
</pre><br />
<br />
In one execution of the code, it returns a value of 0.001271 for <math> \hat{Y} </math>, which is much closer to the true value of <math>\displaystyle I \approx 0.0013 </math>. From many executions of the code, the variance of basic monte carlo is approximately 150 times that of importance sampling. This demonstrates that this method can provide a better estimate than the Basic Monte Carlo method.<br />
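<br />
The variance comparison can be checked empirically. The following is a minimal sketch (the number of repetitions, 1000, is an arbitrary choice) that repeats both estimators many times and compares the sample variances of the resulting estimates; the exact ratio will vary from run to run.<br />
<pre><br />
% repeat both estimators of Pr(Z>3) many times and compare their variances<br />
M = 1000;<br />
I_mc = zeros(M,1);<br />
I_is = zeros(M,1);<br />
for k = 1:M<br />
    % basic Monte Carlo with 100 standard normal samples<br />
    I_mc(k) = mean(randn(100,1) > 3);<br />
    % importance sampling with 100 samples from N(4,1)<br />
    x = randn(100,1) + 4;<br />
    I_is(k) = mean((x > 3) .* exp(8 - 4*x));<br />
end<br />
var(I_mc) / var(I_is)   % ratio of the two Monte Carlo variances<br />
</pre><br />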
<br />
==''' Importance Sampling with Normalized Weight and Markov Chain Monte Carlo - October 13th, 2011'''==<br />
==='''Importance Sampling with Normalized Weight'''===<br />
<br />
Recall that we can think of <math>\displaystyle b(x) = \frac{f(x)}{g(x)}</math> as a weight applied to the samples <math>\displaystyle h(x)</math>. If the form of <math>\displaystyle f(x)</math> is known only up to a constant, we can use an alternate, normalized form of the weight, <math>\displaystyle b^*(x)</math>. (This situation arises in Bayesian inference.) Importance sampling with normalized or standard weight is also called indirect importance sampling.<br />
<br />
We derive the normalized weight as follows:<br><br />
<math>\begin{align}<br />
\displaystyle I & = \int h(x)f(x)\,dx \\<br />
&= \int h(x)\frac{f(x)}{g(x)}g(x)\,dx \\<br />
&= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int f(x) dx} \\<br />
&= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int \frac{f(x)}{g(x)}g(x) dx} \\<br />
&= \frac{\int h(x)b(x)g(x)\,dx}{\int\ b(x)g(x) dx} <br />
\end{align}</math><br />
<br />
<math>\hat{I}= \frac{\sum_{i=1}^{n} h(x_i)b(x_i)}{\sum_{i=1}^{n} b(x_i)} </math><br />
<br />
Then, the normalized weight is <math>b^*(x_i) = \displaystyle \frac{b(x_i)}{\sum_{i=1}^{n} b(x_i)}</math>, so that <math>\hat{I}=\sum_{i=1}^{n} h(x_i)b^*(x_i)</math>. Any unknown constant factor in <math>\displaystyle f(x)</math> cancels in this ratio.<br />
<br />
Note that <math> \int f(x)\, dx = \int b(x)g(x)\, dx = 1 </math>, which justifies replacing the denominator in the derivation above.<br />
<br />
We can also determine the associated Monte Carlo variance of this estimate by<br />
<br />
<math> Var(\hat{I})= \frac{\sum_{i=1}^{n} b(x_i)(h(x_i) - \hat{I})^2}{\sum_{i=1}^{n} b(x_i)} </math><br />
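<br />
As an illustration, the following minimal sketch re-estimates <math>\displaystyle Pr(Z>3)</math> from the earlier example using normalized weights, pretending that the normal density is known only up to its constant; the sample size is an arbitrary choice.<br />
<pre><br />
% normalized-weight importance sampling for Pr(Z>3), using the<br />
% unnormalized target ftilde(x) = exp(-x^2/2) and the N(4,1) proposal<br />
n = 10000;<br />
x = randn(n,1) + 4;                      % samples from g<br />
ftilde = exp(-x.^2/2);                   % target known only up to a constant<br />
g = exp(-(x-4).^2/2) / sqrt(2*pi);       % proposal density<br />
b = ftilde ./ g;                         % unnormalized weights<br />
bstar = b / sum(b);                      % normalized weights<br />
h = x > 3;<br />
I_hat = sum(h .* bstar)                  % should be close to 0.0013<br />
</pre><br />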
<br />
==='''Markov Chain Monte Carlo'''===<br />
We still want to solve <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
====Stochastic Process====<br />
A stochastic process <math> \{ x_t : t \in T \}</math> is a collection of random variables. The variables <math>\displaystyle x_t</math> take values in some set <math>\displaystyle X</math> called the '''state space.''' The set <math>\displaystyle T</math> is called the '''index set.'''<br />
<br />
====Markov Chain====<br />
A Markov Chain is a stochastic process for which the distribution of <math>\displaystyle x_t</math> depends only on <math>\displaystyle x_{t-1}</math>. It is a random process characterized as being memoryless; meaning that the next occurrence of a defined event is only dependent on the current event and not on the sequence of events that preceded it. <br />
Formal Definition: The process <math> \{ x_t : t \in T \}</math> is a Markov Chain if <math>\displaystyle Pr(x_t|x_0, x_1,..., x_{t-1})= Pr(x_t|x_{t-1})</math> for all <math> \{t \in T \}</math> and for all <math> \{x \in X \}</math><br />
For a Markov Chain, <math>\displaystyle f(x_1,...x_n)= f(x_1)f(x_2|x_1)f(x_3|x_2)...f(x_n|x_{n-1})</math><br />
<br><br>Real Life Example:<br />
<br>When going for an interview, the employer only looks at your highest education achieved. The employer does not look at the earlier education you received (elementary school, high school, etc.), because the highest education achieved is assumed to summarize everything that came before it. In other words, <math> x_t </math> is regarded as a summary of <math>x_{t-1},...,x_2,x_1</math>, so to determine <math>x_{t+1}</math> we only need to look at <math>x_{t}</math>.<br />
<br />
====Transition Probabilities====<br />
A Transition Probability is the probability of jumping from one state to another state.<br />
Formal Definition: We call <math>\displaystyle P_{ij} = Pr(x_{t+1}=j|x_t=i)</math> the transition probability.<br />
That is, P(i,j) is the probability of going to state j from state i. The matrix P whose (i,j) element is <math>\displaystyle P_{ij}</math> is called the Transition Matrix.<br />
<br />
Properties of P: <br />
:1) <math>\displaystyle P_{ij} \ge 0</math>: the probability of going from one state to another cannot be negative<br />
:2) <math>\displaystyle \sum_{j}P_{ij} = 1</math>: starting from state i, the chain must go to some state (possibly remaining in state i), so each row of P sums to 1<br />
<br />
====Random Walk====<br />
Example: Start at one point and flip a coin where <math>\displaystyle Pr(H)=p</math> and <math>\displaystyle Pr(T)=1-p=q</math>. Take one step right if heads and one step left if tails. If at an endpoint, stay there.<br />
The transition matrix is<br />
<math>P=\left(\begin{matrix}1&0&0&\dots&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\<br />
\vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&\dots&\dots&1<br />
\end{matrix}\right)</math><br />
<br />
Let <math>\displaystyle P_n</math> be the matrix such that its (i,j) element is <math>\displaystyle P_{ij}(n)</math>. This is called n-step probability.<br />
<br />
:<math>\displaystyle P_n = P^n</math><br />
:<math>\displaystyle P_1 = P</math><br />
:<math>\displaystyle P_2 = P^2</math><br />
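<br />
As a minimal sketch (the number of states, 5, and p = 0.5 are arbitrary choices made only for illustration), we can build the random-walk transition matrix above and check the n-step property numerically.<br />
<pre><br />
% random walk on 5 states with absorbing endpoints<br />
p = 0.5; q = 1 - p; N = 5;<br />
P = zeros(N);<br />
P(1,1) = 1;  P(N,N) = 1;                 % stay put at the endpoints<br />
for i = 2:N-1<br />
    P(i,i-1) = q;  P(i,i+1) = p;         % step left or right<br />
end<br />
P2 = P^2                                  % 2-step transition probabilities<br />
sum(P2,2)                                 % each row still sums to 1<br />
</pre><br />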
<br />
<br />
==''' Markov Chain Properties and Page Rank - October 18th, 2011'''==<br />
<br />
===Summary of Terminology===<br />
<br />
====Transition Matrix====<br />
<br />
A matrix <math>\!P</math> that defines a Markov Chain has the form:<br />
<br />
<math>P = \begin{bmatrix}<br />
P_{11} & \cdots & P_{1N} \\<br />
\vdots & \ddots & \vdots \\ <br />
P_{N1} & \cdots & P_{NN}<br />
\end{bmatrix}</math><br />
<br />
where <math>\!P(i,j) = P_{ij} = Pr(x_{t+1} = j | x_t = i) </math> is the probability of transitioning from state i to state j in the Markov Chain in a single step. Note that this implies that all rows add up to one.<br />
<br />
====n-step Transition matrix====<br />
<br />
A matrix <math>\!P_n</math> whose (i,j)<sup>th</sup> entry is the probability of moving from state i to state j after n transitions:<br />
<br />
<math>\!P_n(i,j) = Pr(x_{m+n}=j|x_m = i)</math><br />
<br />
This probability is called the n-step transition probability. A nice property of this matrix is that<br />
<br />
<math>\!P_n = P^n</math><br />
<br />
for all <math>n \ge 0</math>, where <math>P</math> is the transition matrix. Note that the rows of <math>P_n</math> still add up to one.<br />
<br />
====Marginal distribution of a Markov Chain====<br />
<br />
We represent the distribution over states at time t as a vector.<br />
<br />
<math>\mu_t = (\mu_t(1) \; \mu_t(2) \; ... \; \mu_t(n))</math><br />
<br />
Consider this Markov Chain:<br />
<br />
[[File:MarkovSample.png|300px]]<br />
<br />
<math>\mu_t = (A \; B)</math>, where A is the probability of being in state a at time t, and B is the probability of being in state b at time t.<br />
<br />
For example if <math>\mu_t = (0.1 \; 0.9)</math>, we have a 10% chance of being in state a at time t, and a 90% chance of being in state b at time t.<br />
<br />
Suppose we run this Markov chain many times, and record the state at each step.<br />
<br />
In this example, we run 4 trials, up until t=5.<br />
<br />
{| class="wikitable"<br />
|-<br />
! t<br />
! Trial 1<br />
! Trial 2<br />
! Trial 3<br />
! Trial 4<br />
! Observed <math>\mu</math><br />
|-<br />
| 1<br />
| a<br />
| b<br />
| b<br />
| a<br />
| (0.5, 0.5)<br />
|-<br />
| 2<br />
| b<br />
| a<br />
| a<br />
| a<br />
| (0.75, 0.25)<br />
|-<br />
| 3<br />
| a<br />
| a<br />
| b<br />
| a<br />
| (0.75, 0.25)<br />
|-<br />
| 4<br />
| b<br />
| b<br />
| a<br />
| b<br />
| (0.25, 0.75)<br />
|-<br />
| 5<br />
| b<br />
| b<br />
| b<br />
| a<br />
| (0.25, 0.75)<br />
|}<br />
<br />
Imagine simulating the chain many times. If we collect all the outcomes at time t from all the chains, the histogram of this data would look like <math>\!\mu_t</math>.<br />
<br />
We can find the marginal probabilities as <math>\!\mu_n = \mu_0 P^n</math><br />
<br />
====Stationary Distribution====<br />
<br />
Let <math>\pi = (\pi_i \mid i \in \chi)</math> be a vector of non-negative numbers that sum to 1. (i.e. <math>\!\pi</math> is a pmf)<br />
<br />
If <math>\!\pi = \pi P</math>, then <math>\!\pi</math> is a stationary distribution, also known as an invariant distribution.<br />
<br />
====Limiting Distribution====<br />
<br />
A Markov chain has limiting distribution <math>\!\pi </math> if <math>\lim_{n \to \infty} P^n = \begin{bmatrix} \pi \\ \vdots \\ \pi \end{bmatrix}</math><br />
<br />
That is, <math>\!\pi_j = \lim_{n \to \infty}\left [ P^n \right ]_{ij}</math> exists and is independent of i.<br />
<br />
Here is an example:<br />
<br />
Suppose we want to find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/3&1/3&1/3\\<br />
1/4&3/4&0\\<br />
1/2&0&1/2<br />
\end{matrix}\right)</math><br />
<br />
We want to solve <math>\pi=\pi P</math> and we want <math>\displaystyle \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
<math>\displaystyle \pi_0 = 1/3\pi_0 + 1/4\pi_1 + 1/2\pi_2</math><br /><br />
<math>\displaystyle \pi_1 = 1/3\pi_0 + 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_2 = 1/3\pi_0 + 1/2\pi_2</math><br /><br />
<br />
Solving the system of equations, we get <br /> <br />
<math>\displaystyle \pi_1 = 4/3\pi_0</math><br /><br />
<math>\displaystyle \pi_2 = 2/3\pi_0</math><br /><br />
<br />
So using our condition above, we have <math>\displaystyle \pi_0 + 4/3\pi_0 + 2/3\pi_0 = 1</math> and by solving we get <math>\displaystyle \pi_0 = 1/3</math><br />
<br />
Using this in our system of equations, we obtain: <br /><br />
<math>\displaystyle \pi_1 = 4/9</math><br /><br />
<math>\displaystyle \pi_2 = 2/9</math><br />
<br />
Thus, the limiting distribution is <math>\displaystyle \pi = (1/3, 4/9, 2/9)</math><br />
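<br />
The hand calculation above can be verified numerically: the stationary distribution is a left eigenvector of P with eigenvalue 1, normalized so that its entries sum to one. A minimal sketch:<br />
<pre><br />
% numerical check of the stationary distribution of P<br />
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];<br />
[V, D] = eig(P');                 % right eigenvectors of P' = left eigenvectors of P<br />
[~, k] = min(abs(diag(D) - 1));   % locate the eigenvalue closest to 1<br />
pi_hat = V(:,k)' / sum(V(:,k))    % returns approximately (1/3, 4/9, 2/9)<br />
</pre><br />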
<br />
====Detailed Balance====<br />
<br />
<math>\!\pi</math> has the detailed balance property if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math><br />
<br />
'''Theorem'''<br />
<br />
If <math>\!\pi</math> satisfies detailed balance, then <math>\!\pi</math> is a stationary distribution.<br />
<br />
In other words, if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math>, then <math>\!\pi = \pi P</math><br />
<br />
'''Proof:''' <br />
<br />
<math>\!\pi P =<br />
\begin{bmatrix}\pi_1 & \pi_2 & \cdots & \pi_N\end{bmatrix} \begin{bmatrix}P_{11} & \cdots & P_{1N} \\ \vdots & \ddots & \vdots \\ P_{N1} & \cdots & P_{NN}\end{bmatrix}</math><br />
<br />
Observe that the j<sup>th</sup> element of <math>\!\pi P</math> is<br />
<br />
<math>\!\left [ \pi P \right ]_j = \pi_1 P_{1j} + \pi_2 P_{2j} + \dots + \pi_N P_{Nj}</math><br />
<br />
::<math>\! = \sum_{i=1}^N \pi_i P_{ij}</math><br />
<br />
::<math>\! = \sum_{i=1}^N P_{ji} \pi_j</math>, by the definition of detailed balance.<br />
<br />
::<math>\! = \pi_j \sum_{i=1}^N P_{ji}</math><br />
<br />
::<math>\! = \pi_j</math>, since each row of P sums to 1 (i.e. <math>\sum_{i=1}^N P_{ji}=1</math>).<br />
<br />
So <math>\!\pi = \pi P</math>.<br />
<br />
<br />
'''Example'''<br />
<br />
Find the marginal distribution of <br />
<br />
[[File:MarkovSample.png|300px]]<br />
<br />
Start by generating the matrix P.<br />
<br />
<math>\!P = \begin{pmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{pmatrix}</math><br />
<br />
We must assume some starting value for <math>\mu_0</math><br />
<br />
<math>\!\mu_0 = \begin{pmatrix} 0.1 & 0.9 \end{pmatrix}</math><br />
<br />
For t = 1, the marginal distribution is<br />
<br />
<math>\!\mu_1 = \mu_0 P</math><br />
<br />
Notice that this <math>\mu</math> converges. <br />
<br />
If you repeatedly run:<br />
<br />
<math>\!\mu_{i+1} = \mu_i P</math><br />
<br />
It converges to <math>\mu = \begin{pmatrix} 0.4286 & 0.5714 \end{pmatrix}</math><br />
<br />
This can be seen by running the following Matlab code:<br />
P = [0.2 0.8; 0.6 0.4];<br />
mu = [0.1 0.9]; <br />
while 1 <br />
mu_old = mu; <br />
mu = mu * P;<br />
if mu_old == mu <br />
disp(mu);<br />
break;<br />
end<br />
end<br />
<br />
Another way of looking at this simple question is to check whether the empirical pmf converges:<br />
<br />
Let <math>\hat{p_n}(1)=\frac{1}{n}\sum_{k=1}^n I(X_k=1)</math> denote the estimator of the stationary probability of state 1,<math>\hat{p_n}(2)=\frac{1}{n}\sum_{k=1}^n I(X_k=2)</math> denote the estimator of the stationary probability of state 2, where <math>\displaystyle I(X_k=1)</math> and <math>\displaystyle I(X_k=2)</math> are indicator variables which equal 1 if <math>X_k=1</math>(or <math>X_k=2</math> for the latter one).<br />
<br />
Matlab codes for this explanation is<br />
<br />
n=1;<br />
if rand<0.1<br />
x(1)=1;<br />
else<br />
x(1)=0;<br />
end<br />
p1(1)=sum(x)/n;<br />
p2(1)=1-p1(1);<br />
for i=2:10000<br />
n=n+1;<br />
if (x(i-1)==1&rand<0.2)|(x(i-1)==0&rand<0.6)<br />
x(i)=1;<br />
else<br />
x(i)=0;<br />
end<br />
p1(i)=sum(x)/n;<br />
p2(i)=1-p1(i); <br />
end<br />
plot(p1,'red');<br />
hold on;<br />
plot(p2)<br />
<br />
The results can be easily seen from the graph below:<br />
<br />
[[File:Stationary distribution.png|300px]]<br />
<br />
Additionally, we can plot the marginal distribution as it converges without estimating it. The following Matlab code shows this:<br />
<br />
%transition matrix<br />
P=[0.2 0.8; 0.6 0.4];<br />
%mu at time 0<br />
mu=[0.1 0.9];<br />
%number of points for simulation<br />
n=20;<br />
for i=1:n<br />
mu_a(i)=mu(1);<br />
mu_b(i)=mu(2);<br />
mu=mu*P;<br />
end<br />
t=[1:n];<br />
plot(t, mu_a, t, mu_b);<br />
hleg1=legend('state a', 'state b');<br />
<br />
[[File:Marginal distribution convergence.png|300px]]<br />
<br />
Note that there are chains with a stationary distribution but no limiting distribution: the marginal distribution need not converge to the stationary one. An example of this is:<br />
<br />
<math>P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \mu_0 = \begin{pmatrix} 1/3 & 1/3 & 1/3 \end{pmatrix}</math><br />
<br />
<math>\!\mu_0</math> is a stationary distribution, so <math>\!\mu P</math> is the same for all iterations.<br />
<br />
But,<br />
<br />
<math>P^{1000} = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \ne \begin{pmatrix} \mu \\ \mu \\ \mu \end{pmatrix}</math><br />
<br />
So <math>\!\mu</math> is not a limiting distribution. Also, if<br />
<br />
<math>\mu = \begin{pmatrix} 0.2 & 0.1 & 0.7 \end{pmatrix}</math><br />
<br />
Then <math>\!\mu = \mu P</math> does not converge.<br />
<br />
This can be observed through the following Matlab code.<br />
<br />
P = [0 0 1; 1 0 0; 0 1 0];<br />
mu = [0.2 0.1 0.7]; <br />
for i= 1:4 <br />
mu = mu * P;<br />
disp(mu);<br />
end<br />
<br />
This outputs<br />
0.1000 0.7000 0.2000<br />
0.7000 0.2000 0.1000<br />
0.2000 0.1000 0.7000<br />
0.1000 0.7000 0.2000<br />
<br />
Note that <math>\!\mu_1 = \!\mu_4</math>, which indicates that <math>\!\mu</math> will cycle forever.<br />
<br />
This means that this chain has a stationary distribution, but is not limiting.<br />
<br />
===Page Rank===<br />
<br />
Page Rank was the original ranking algorithm used by Google's search engine to rank web pages.<ref><br />
http://ilpubs.stanford.edu:8090/422/<br />
</ref> The algorithm was created by the founders of Google, Larry Page and Sergey Brin as part of Page's PhD thesis. When a query is entered in a search engine, there are a set of web pages which are matched by this query, but this set of pages must be ordered by their "importance" in order to identify the most meaningful results first. Page Rank is an algorithm which assigns importance to every web page based on the links in each page.<br />
<br />
==== Intuition ====<br />
<br />
We can represent web pages by a set of nodes, where web links are represented as edges connecting these nodes. Based on our intuition, there are three main factors in deciding whether a web page is important or not.<br />
<br />
# A web page is important if many other pages point to it.<br />
# The more important a webpage is, the more weight is placed on its links.<br />
# The more outgoing links a webpage has, the less weight is placed on each of its links.<br />
<br />
====Modelling====<br />
<br />
We can model the set of links as an N-by-N matrix L, where N is the number of web pages we are interested in:<br />
<br />
<math>L_{ij} =<br />
\left\{<br />
\begin{array}{lr}<br />
1 : \text{if page j points to i}\\<br />
0 : \text{otherwise}<br />
\end{array}<br />
\right. <br />
</math><br />
<br />
<br />
<br />
The number of outgoing links from page j is<br />
<br />
<math>c_j = \sum_{i=1}^N L_{ij}</math><br />
<br />
For example, consider the following set of links between web pages:<br />
<br />
[[File:PageRank.png|250px]]<br />
<br />
According to the factors relating to importance of links, we can consider two possible rankings :<br />
<br />
<br />
<math>\displaystyle 3 > 2 > 1 > 4 </math> <br />
<br />
or<br />
<br />
<math>\displaystyle 3>1>2>4 </math> <br />
if we consider that the high importance of the link from page 3 to page 1 outweighs the fact that there are two outgoing links from page 1 and only one from page 2.<br />
<br />
<br />
We have <math>L = \begin{bmatrix} <br />
0 & 0 & 1 & 0 \\ <br />
1 & 0 & 0 & 0 \\ <br />
1 & 1 & 0 & 1 \\<br />
0 & 0 & 0 & 0<br />
\end{bmatrix}</math>, and <math>c = \begin{pmatrix}2 & 1 & 1 & 1\end{pmatrix} </math><br />
<br />
We can represent the ranks of web pages as the vector P, where the i<sup>th</sup> element is the rank of page i:<br />
<br />
<math>P_i = (1-d) + d\sum_j \frac{L_{ij}}{c_j} P_j</math><br />
<br />
Here we take the sum of the weights of the incoming links: a link counts for less if the linking page has many outgoing links, and counts for more if the linking page is itself important (i.e. has a high rank <math>P_j</math>).<br />
<br />
We don't want to completely ignore pages with no incoming links, which is why we add the constant (1 - d).<br />
<br />
If <br />
<br />
<math>L = \begin{bmatrix} L_{11} & \cdots & L_{1N} \\<br />
\vdots & \ddots & \vdots \\<br />
L_{N1} & \cdots & L_{NN} \end{bmatrix}</math><br />
<br />
<math>D = \begin{bmatrix} c_1 & \cdots & 0 \\<br />
\vdots & \ddots & \vdots \\<br />
0 & \cdots & c_N \end{bmatrix}</math><br />
<br />
Then <math>D^{-1} = \begin{bmatrix} c_1^{-1} & \cdots & 0 \\<br />
\vdots & \ddots & \vdots \\<br />
0 & \cdots & c_N^{-1} \end{bmatrix}</math><br />
<br />
<math>\!P = (1-d)e + dLD^{-1}P</math><br />
<br />
where <math>\!e = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}</math> is the vector with all 1's<br />
<br />
To simplify the problem, we let <math>\!e^T P = N \Rightarrow \frac{e^T P}{N} = 1</math>. This means that the average importance of all pages on the internet is 1.<br />
<br />
Then<br />
<math>\!P = (1-d)\frac{ee^TP}{N} + dLD^{-1}P</math><br />
::<math>\! = \left [ (1-d)\frac{ee^T}{N} + dLD^{-1} \right ] P</math><br />
::<math>\! = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ] P</math>, where <math> E </math> is an NxN matrix filled with ones.<br />
<br />
Let <math>\!A = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ]</math><br />
<br />
Then <math>\!P = AP</math>.<br />
<br />
<br />
Note that P is a stationary distribution and, more importantly, P is an eigenvector of A with eigenvalue 1. Therefore, we can find the ranks of all web pages by solving this equation for P. <br />
<br />
We can find the vector P for the example above, using the following Matlab code:<br />
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];<br />
D = [2 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1];<br />
d = 0.8 ;% pages with no links get a weight of 0.2<br />
N = 4 ;<br />
<br />
A = ((1-d)/N) * ones(N) + d * L * inv(D);<br />
[EigenVectors, EigenValues] = eigs(A)<br />
s=sum(EigenVectors(:,1));% we should note that the average entry of P should be 1 according to our assumption<br />
P=(EigenVectors(:,1))/s*N<br />
<br />
This outputs:<br />
<br />
EigenVectors =<br />
-0.6363 0.7071 0.7071 -0.0000 <br />
-0.3421 -0.3536 + 0.3536i -0.3536 - 0.3536i -0.7071 <br />
-0.6859 -0.3536 - 0.3536i -0.3536 + 0.3536i 0.0000 <br />
-0.0876 0.0000 + 0.0000i 0.0000 - 0.0000i 0.7071 <br />
<br />
<br />
EigenValues =<br />
1.0000 0 0 0 <br />
0 -0.4000 - 0.4000i 0 0 <br />
0 0 -0.4000 + 0.4000i 0 <br />
0 0 0 0.0000 <br />
<br />
P =<br />
<br />
1.4528<br />
0.7811<br />
1.5660<br />
0.2000<br />
<br />
Note that there is an eigenvector with eigenvalue 1. <br />
The reason an eigenvector with eigenvalue 1 always exists is that A is a (column) stochastic matrix: its entries are non-negative and each of its columns sums to one.<br />
<br />
Thus our vector P is <math> <br />
\begin{bmatrix}1.4528 \\ 0.7811 \\ 1.5660\\ 0.2000 \end{bmatrix}</math><br />
<br />
However, this method is not practical, because there are simply too many web pages on the internet. So instead Google uses a method to approximate an eigenvector with eigenvalue 1.<br />
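<br />
A standard way to approximate such an eigenvector without a full eigen-decomposition is power iteration: repeatedly multiply a starting vector by A and rescale. The following is a minimal sketch for the four-page example above; it illustrates the idea only and is not a description of Google's actual implementation.<br />
<pre><br />
% power iteration for the 4-page example<br />
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];<br />
D = diag([2 1 1 1]);<br />
d = 0.8;  N = 4;<br />
A = ((1-d)/N) * ones(N) + d * L / D;     % L / D equals L * inv(D)<br />
P = ones(N,1);                           % start from a uniform rank vector<br />
for k = 1:100<br />
    P = A * P;<br />
    P = P / sum(P) * N;                  % keep the average rank equal to 1<br />
end<br />
P                                        % approaches the eigenvector computed above<br />
</pre><br />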
<br />
Note that page three has the rank with highest magnitude and page four has the rank with lowest magnitude, as expected.<br />
<br />
==''' Markov Chain Monte Carlo - Metropolis-Hastings - October 25th, 2011'''==<br />
<br />
We want to find <math> \int h(x)f(x)\, \mathrm dx </math>, but we don't know how to sample from <math>\,f</math>.<br />
<br />
We have seen simple techniques before; this one is widely used in practice.<br />
It consists of constructing a Markov Chain whose stationary distribution is <math>\,f</math>.<br />
<br />
==== Main procedure ====<br />
<br />
Let us suppose that <math>\,q(y|x)</math> is a friendly distribution: we can sample from this function.<br />
<br />
1. Initialize the chain with <math>\,x_{0}</math> and set <math>\,i=0</math>.<br />
<br />
2. Draw a point from <math>\,q(y|x)</math> i.e. <math>\,Y \backsim q(y|x_{i})</math>.<br />
<br />
3. Evaluate <math>\,r(x,y)=min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br />
<br />
<br />
4. Draw a point <math>\,U \backsim Unif[0,1]</math>.<br />
<br />
5. <math>\,x_{i+1}=\begin{cases}y & \text{ if } U<r \\x_{i} & \text{ otherwise } \end{cases} </math>.<br />
<br />
6. <math>\,i=i+1</math>. Go back to 2.<br />
<br />
==== Remark 1 ====<br />
<br />
A very common choice for <math>\,q(y|x)</math> is <math>\,N(y;x,b^{2})</math>, a normal distribution centered at the current point.<br />
<br />
Note : In this case <math>\,q(y|x)</math> is symmetric i.e. <math>\,q(y|x)=q(x|y)</math>.<br />
<br />
(Because <math>\,q(y|x)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math> and <math>\,(y-x)^{2}=(x-y)^{2}</math>).<br />
<br />
Thus we have <math>\,\frac{q(x|y)}{q(y|x)}=1</math>, which implies :<br />
<br />
<math>\,r(x,y)=min\left\{\frac{f(y)}{f(x)},1\right\}</math>.<br />
<br />
In general, if <math>\,q(y|x)</math> is symmetric, the algorithm is called the Metropolis algorithm, in reference to the original algorithm proposed in 1953<ref>http://en.wikipedia.org/wiki/Equations_of_State_Calculations_by_Fast_Computing_Machines</ref>.<br />
<br />
<br />
<br />
====Remark 2====<br />
<br />
The value y is accepted if <math>\,u<min\left\{\frac{f(y)}{f(x)},1\right\}</math> so it is accepted with the probability <math>\,min\left\{\frac{f(y)}{f(x)},1\right\}</math>.<br />
<br />
Thus, if <math>\,f(y)>f(x)</math>, then <math>\,y</math> is always accepted.<br />
<br />
The higher the value of the pdf in the vicinity of a point <math>\,y_1</math>, the more likely it is that a random variable will take on values around <math>\,y_1</math>. As a result, it makes sense to want a high probability of acceptance for points generated near <math>\,y_1</math>.<br />
<br />
====Remark 3====<br />
<br />
One strength of the Metropolis-Hastings algorithm is that normalizing constants, which are often quite difficult to determine, can be cancelled out in the ratio <math> r </math>. For example, consider the case where we want to sample from the beta distribution, which has the pdf:<br />
<br />
<math><br />
\begin{align}<br />
f(x;\alpha,\beta)& = \frac{1}{\mathrm{B}(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\end{align}<br />
</math><br />
<br />
The beta function, ''B'', appears as a normalizing constant, but by construction of the method it cancels in the ratio <math> r </math>, so it never needs to be evaluated.<br />
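<br />
As an illustration of this point, the following is a minimal sketch of a Metropolis sampler that uses only the unnormalized beta density; the choices <math>\alpha=2</math>, <math>\beta=5</math>, the proposal standard deviation 0.1, and the burn-in length are assumptions made for illustration, not values from the lecture.<br />
<pre><br />
% Metropolis sampling from Beta(2,5) using only the unnormalized density;<br />
% the beta function B(2,5) never needs to be evaluated<br />
a = 2; b = 5;<br />
ftilde = @(x) (x > 0 & x < 1) .* x.^(a-1) .* (1-x).^(b-1);<br />
x(1) = 0.5;<br />
for i = 2:10000<br />
    y = x(i-1) + 0.1*randn;              % symmetric proposal<br />
    r = min(ftilde(y)/ftilde(x(i-1)), 1);<br />
    if rand < r<br />
        x(i) = y;<br />
    else<br />
        x(i) = x(i-1);<br />
    end<br />
end<br />
hist(x(1000:end), 30)                    % should resemble the Beta(2,5) pdf<br />
</pre><br />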
<br />
====Example====<br />
<br />
<math>\,f(x)=\frac{1}{\pi}\frac{1}{1+x^{2}}</math> (the standard Cauchy density)<br />
<br />
Then, we have <math>\,f(x)\propto\frac{1}{1+x^{2}}</math>.<br />
<br />
And let us take <math>\,q(x|y)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math>.<br />
<br />
Then <math>\,q(x|y)</math> is symmetric.<br />
<br />
Therefore the ratio <math>\,r(x,y)</math> can be simplified.<br />
<br />
<br />
We get :<br />
<br />
<math>\,\begin{align}<br />
\displaystyle r(x,y) <br />
& =min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} \\<br />
& =min\left\{\frac{f(y)}{f(x)},1\right\} \\<br />
& =min\left\{ \frac{ \frac{1}{1+y^{2}} }{ \frac{1}{1+x^{2}} },1\right\}\\<br />
& =min\left\{ \frac{1+x^{2}}{1+y^{2}},1\right\}\\<br />
\end{align}<br />
</math>.<br />
<br />
<br />
<br />
The Matlab code of the algorithm is the following :<br />
<br />
<pre><br />
clear all<br />
close all<br />
clc<br />
b=2;<br />
x(1)=randn;<br />
for i=2:10000<br />
y=b*randn+x(i-1);<br />
r=min((1+x(i-1)^2)/(1+y^2),1);<br />
u=rand;<br />
if u<r<br />
x(i)=y;<br />
else<br />
x(i)=x(i-1);<br />
end<br />
<br />
end<br />
hist(x(5000:end));<br />
%The Markov Chain usually takes some time to converge; this is known as the "burn-in" period.<br />
%Therefore, we don't display the first 5000 points because they don't show the limiting behaviour of the Markov Chain.<br />
</pre><br />
<br />
As we can see, the choice of the value of b is made by us.<br />
<br />
Changing this value has a significant impact on the results we obtain. There is a pitfall when b is too big or too small.<br />
<br />
Example with <math>\,b=0.1</math> (the second graph is obtained by running j=5000:10000; plot(j,x(5000:10000))):<br />
<br />
[[File:redaccoursb01.JPG|300px]] [[File:001Metr.PNG|300px]]<br />
<br />
With <math>\,b=0.1</math>, the chain takes small steps so the chain doesn't explore enough of the sample space. It doesn't give an accurate report of the function we want to sample.<br />
<br />
<br />
<br />
Example with <math>\,b=10</math> :<br />
<br />
[[File:redaccoursb10.JPG|300px]] [[File:010metro.PNG|300px]]<br />
<br />
With <math>\,b=10</math>, proposed jumps are large and very unlikely to be accepted, since they usually land where the target density is much smaller (i.e. <math>\,y</math> is rejected because <math>\ u \ge r </math>, and <math>\,x(i)=x(i-1)</math> most of the time); hence most sample points stay fairly close to the origin.<br />
A trace plot that resembles white noise (as in the case of <math>\,b=2</math>) indicates better sampling, since more of the space is explored and more proposals are accepted. For <math>\,b=0.1</math>, the chain moves at almost every step but only in tiny increments, so the stationary distribution emerges very slowly; whereas in the <math>\,b=10</math> case many points remain around 0, and approximately 73% of the proposals were rejected (the chain kept x(i-1)).<br />
<br />
<br />
Example with <math>\,b=2</math> :<br />
<br />
[[File:redaccoursb2.JPG|300px]] [[File:100metr.PNG|300px]]<br />
<br />
With <math>\,b=2</math>, we get a more accurate result, as we avoid both extremes; approximately 37% of the proposals were rejected (the chain kept x(i-1)).<br />
<br />
<br />
If the sample from the Markov Chain starts to look like the target distribution quickly, we say the chain is mixing well.<br />
<br />
==''' Theory and Applications of Metropolis-Hastings - October 27th, 2011'''==<br />
<br />
As mentioned in the previous section, the idea of the Metropolis-Hastings (MH) algorithm is to produce a Markov chain that converges to a stationary distribution <math>f</math> which we are interested in sampling from.<br />
<br />
====Convergence====<br />
<br />
One important fact to check is that <math>\displaystyle f</math> is indeed a stationary distribution in the MH scheme. For this, we can appeal to the implications of the detailed balance property:<br />
<br />
Given a probability vector <math>\!\pi</math> and a transition matrix <math>\displaystyle P</math>, <math>\!\pi</math> has the detailed balance property if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math><br />
<br />
If <math>\!\pi</math> satisfies detailed balance, then it is a stationary distribution.<br />
<br />
The above definition applies to the case where the states are discrete. In the continuous case, <math>\displaystyle f</math> satisfies detailed balance if <math>\displaystyle f(x)p(x,y)=f(y)p(y,x)</math>, where <math>\displaystyle p(x,y)</math> and <math>\displaystyle p(y,x)</math> are the transition densities from x to y and from y to x respectively. If we can show that <math>\displaystyle f</math> has the detailed balance property, we can conclude that it is a stationary distribution, because <math>\int^{}_y f(y)p(y,x)dy=\int^{}_y f(x)p(x,y)dy=f(x)</math>.<br />
<br />
In the MH algorithm, we use a proposal distribution to generate y~<math>\displaystyle q(y|x)</math>, and accept y with probability <math>min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br />
<br />
Suppose, without loss of generality, that <math>\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)} \le 1</math>. This implies that <math>\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)} \ge 1</math><br />
<br />
Let <math>\,r(x,y)</math> be the chance of accepting point y given that we are at point x.<br />
<br />
So <math>\,r(x,y) = min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} = \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)}</math><br />
<br />
Let <math>\,r(y,x)</math> be the chance of accepting point x given that we are at point y.<br />
<br />
So <math>\,r(y,x) = min\left\{\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)},1\right\} = 1</math><br />
<br />
<br />
<math>\,p(x,y)</math> is the probability of generating and accepting y, while at point x.<br />
<br />
So <math>\,p(x,y) = q(y|x)r(x,y) = q(y|x) \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)} = \frac{f(y)q(x|y)}{f(x)}</math><br />
<br />
<br />
<math>\,p(y,x)</math> is the probability of generating and accepting x, while at point y.<br />
<br />
So <math>\,p(y,x) = q(x|y)r(y,x) = q(x|y)</math><br />
<br />
<br />
<math>\,f(x)p(x,y) = f(x)\frac{f(y)q(x|y)}{f(x)} = f(y)q(x|y) = f(y)p(y,x)</math><br />
<br />
Thus, detailed balance holds.<br />
:i.e. <math>\,f(x)</math> is stationary distribution<br />
<br />
It can be shown (although not here) that <math>f</math> is a limiting distribution as well. Therefore, the MH algorithm generates a sequence whose distribution converges to <math>f</math>, the target.<br />
<br />
====Implementation====<br />
<br />
In the implementation of MH, the proposal distribution is commonly chosen to be symmetric, which simplifies the calculations and makes the algorithm more intuitively understandable. The MH algorithm can usually be regarded as a random walk along the distribution we want to sample from. Suppose we have a distribution <math>f</math>:<br />
<br />
[[File:Standard normal distribution.gif]]<br />
<br />
Suppose we start the walk at point <math>x</math>. The point <math>y_{1}</math> is in a denser region than <math>x</math>, therefore, the walk will always progress from <math>x</math> to <math>y_{1}</math>. On the other hand, <math>y_{2}</math> is in a less dense region, so it is not certain that the walk will progress from <math>x</math> to <math>y_{2}</math>. In terms of the MH algorithm:<br />
<br />
<math>r(x,y_{1})=min(\frac{f(y_{1})}{f(x)},1)=1</math> since <math>f(y_{1})>f(x)</math>. Thus, any generated value with a higher density will be accepted.<br />
<br />
<math>r(x,y_{2})=\frac{f(y_{2})}{f(x)}</math>. The lower the density of <math>y_{2}</math> is, the less chance it will have of being accepted.<br />
<br />
A certain class of proposal distributions can be written in the form:<br />
<br />
<math>\,y|x_i = x_i + \epsilon_i</math><br />
<br />
where <math>\,\epsilon_i \sim g</math>, and the density <math>\,g</math> depends only on <math>\,|y-x_i|</math><br />
<br />
The density depends only on the distance between the current point and the next one (which can be seen as the "step" being taken). These proposal distributions give the Markov chain the random walk nature. The normal distribution that we frequently use in our examples satisfies the above definition.<br />
<br />
In actual implementations of the MH algorithm, the proposal distribution needs to be chosen judiciously, because not all proposals will work well with all target distributions we want to sample from. Take a trimodal distribution for example:<br />
<br />
[[File:trimodal.jpg]]<br />
<br />
If we choose the proposal distribution to be a standard normal as we have done before, problems will arise. The low densities between the peaks means that the MH algorithm will almost never walk to any points generated in these regions and get stuck at one peak. One way to address this issue is to increase the variance, so that the steps will be large enough to cross the gaps. Of course, in this case, it would probably be beneficial to come up with a different proposal function. As a rule of thumb, such functions should result in an approximately 50% acceptance rate for generated points.<br />
<br />
====Simulated Annealing====<br />
<br />
Metropolis-Hastings is very useful in simulation methods for solving optimization problems. One such application is simulated annealing, which addresses the problems of minimizing a function <math>h(x)</math>. This method will not always produce the global solution, but it is intuitively simple and easy to implement.<br />
<br />
Consider <math>e^{\frac{-h(x)}{T}}</math>: maximizing this expression is equivalent to minimizing <math>h(x)</math>. Suppose <math>\mu</math> is the minimizer of <math>h</math> and <math>h(x)=(x-\mu)^2</math>; then the function to maximize is proportional to a Gaussian density <math>e^{-\frac{(x-\mu)^2}{T}}</math> centered at <math>\mu</math>. When many samples are taken from this distribution, their mean converges to the desired value <math>\mu</math>. The annealing comes into play by lowering T (the temperature) as the sampling progresses, making the distribution narrower and more concentrated around the minimizer. The steps of simulated annealing are outlined below:<br />
<br />
1. start with a random <math>x</math> and set T to a large number<br />
<br />
2. generate <math>y</math> from a proposal distribution <math>q(y|x)</math>, which should be symmetric<br />
<br />
3. accept <math>y</math> with probability <math>min(\frac{f(y)}{f(x)},1)</math><br />
<br />
4. decrease T, and then go to step 2<br />
<br />
The following plot and Matlab code illustrates the simulated annealing procedure as temperature ''T'', the variance, decreases for a Gaussian distribution with zero mean. Starting off with a large value for the temperature ''T'' allows the Metropolis-Hastings component of the procedure to capture the mean, before gradually decreasing the temperature ''T'' in order to converge to the mean. <br />
<br />
[[File:Simulated annealing illustration.png]]<br />
<br />
x=-10:0.1:10;<br />
mu=0;<br />
T=5;<br />
colour = ['b', 'g', 'm', 'r', 'k'];<br />
for i=1:5<br />
pdfNormal=normpdf(x, mu, T);<br />
plot(x, pdfNormal, colour(i));<br />
T=T-1;<br />
hold on<br />
end<br />
hleg1=legend('T=5', 'T=4', 'T=3', 'T=2', 'T=1');<br />
title('Simulated Annealing Illustration');<br />
<br />
=='''References'''==<br />
<br />
<references/><br />
<br />
=='''Simulated Annealing and Gibbs Sampling - November 1, 2011'''==<br />
<br />
continued from previous lecture...<br />
<br />
We will now look at a couple cases where <math> \displaystyle h(y) > h(x) </math> or <math> \displaystyle h(y) < h(x) </math>, and explore whether to accept or reject <math> y </math>.<br />
<br />
Recall <math>r(x,y)=\min\left\{\frac{f(y)}{f(x)},1\right\}</math> where <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}}</math>. Here r(x,y) represents the probability of accepting <math>y</math>.<br />
<br />
====Cases====<br />
<br />
Case a)<br />
Suppose <math> \displaystyle h(y) < h(x) </math>. Since we want to find the minimum value for <math>\displaystyle h(x) </math>, and the point <math>\displaystyle y </math> creates a lower value than our previous point, we accept the new point. Mathematically, <math>\displaystyle h(y) < h(x) </math> implies that:<br />
<br />
<math> \frac{f(y)}{f(x)} > 1 </math>. Therefore,<br />
<math> \displaystyle r = 1 </math>.<br />
So, we will always accept <math>\displaystyle y </math>.<br />
<br />
Case b)<br />
Suppose <math> \displaystyle h(y) > h(x) </math>. This is bad, since our goal is to minimize <math>\displaystyle h(x) </math>. However, we may still accept <math>\displaystyle y </math> with some chance:<br />
<br />
<math> \frac{f(y)}{f(x)} < 1 </math>. Therefore,<br />
<math>\displaystyle r < 1 </math>.<br />
So, we may accept <math>\displaystyle y </math> with probability <math>\displaystyle r </math>.<br />
<br />
<br />
Next, we will look at these cases as <math>\displaystyle T\to0 </math>.<br />
<br />
As <math>\displaystyle T\to0 </math> and case a) happens, <math> e^{\frac{h(x)-h(y)}{T}} </math> approaches infinity, so we will always accept <math>\displaystyle y </math>.<br />
<br />
As <math>\displaystyle T\to0 </math> and case b) happens, <math> e^{\frac{h(x)-h(y)}{T}} </math> approaches zero, so the probability that <math>\displaystyle y </math> will be accepted gets extremely small.<br />
<br />
It is worth noting that if we start with a small value of T, we may end up rejecting almost all generated points and get stuck somewhere in the function (due to case b)). The point where we get stuck might only be a local minimum of <math>\displaystyle h</math> (a local maximum of <math>\displaystyle f</math>), not the global one. It is therefore necessary to start with a large value of T in order to explore the whole function. At the same time, a reasonable initial value <math>\displaystyle x_0</math> helps (it should not be too far from the minimizer). <br />
<br />
=====Example=====<br />
<br />
Let <math>\displaystyle h(x) = (x-2)^2 </math>.<br />
The graph of it is:<br />
[[File:PCh(x).jpg|center|500]]<br />
<br />
Then, <math> e^{\frac{-h(x)}{T}} = e^{\frac{-(x-2)^2}{T}} </math> . Take an initial value of T = 20. A graph of this is:<br />
[[File:PC-highT.jpg|center|500]]<br />
<br />
<br />
In comparison, we look a graph of T = 0.2:<br />
[[File:PC-lowT.jpg|center|500]]<br />
<br />
One can see that with a low T value the function is sharply peaked, so for most proposed moves the ratio <math>\frac{f(y)}{f(x)}</math> is either close to 0 or much larger than 1, while a larger T value gives smoother transitions in the graph.<br />
<br />
The MATLAB code for the above graphs are:<br />
<pre><br />
ezplot('(x-2)^2',[-6,10])<br />
ezplot('exp((-(x-2)^2)/20)',[-6,10])<br />
ezplot('exp((-(x-2)^2)/0.2)',[-6,10])<br />
</pre><br />
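<br />
The full annealing procedure can also be sketched for this example. The following is a minimal illustration; the proposal standard deviation (0.5), the cooling factor (0.999), and the number of iterations are arbitrary assumptions, not values from the lecture.<br />
<pre><br />
% simulated annealing for h(x) = (x-2)^2<br />
h = @(x) (x-2).^2;<br />
T = 20;                                  % start with a large temperature<br />
x = 10*randn;                            % random starting point<br />
for i = 1:5000<br />
    y = x + 0.5*randn;                   % symmetric proposal<br />
    r = min(exp((h(x) - h(y))/T), 1);    % acceptance probability<br />
    if rand < r<br />
        x = y;<br />
    end<br />
    T = 0.999*T;                         % cool down slowly<br />
end<br />
x                                        % should be close to the minimizer x = 2<br />
</pre><br />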
<br />
=====Travelling Salesman Problem=====<br />
<br />
The simulated annealing method can be applied to the travelling salesman problem. Suppose there are N cities and the salesman has to visit each city exactly once. The objective is to find the shortest path (i.e. the shortest total length of the journey) connecting the cities. An algorithm using simulated annealing on this problem can be found here ([http://www.cs.ubbcluj.ro/~csatol/mestint/pdfs/Numerical_Recipes_Simulated_Annealing.pdf Reference]).<br />
<br />
===Gibbs Sampling===<br />
<br />
Gibbs sampling is another Markov chain Monte Carlo method, similar to Metropolis-Hastings. There are two main differences between Metropolis-Hastings and Gibbs sampling. First, the candidate state is always accepted as the next state in Gibbs sampling. Second, it is assumed that the full conditional distributions are known, i.e. <math>P(X_i=x|X_j=x_j, \forall j\neq i)</math> for all <math>\displaystyle i</math>. The idea is that it is easier to sample from the one-dimensional conditional distributions than from the higher dimensional joint distribution. Gibbs sampling is a way to turn sampling from the joint distribution into sampling from a sequence of conditional distributions. <br />
<br />
<b>Advantages:</b><br /><br />
- sampling from conditional distributions may be easier than sampling from joint distributions<br />
<br />
<b>Disadvantages:</b><br /><br />
- we do not necessarily know the conditional distributions<br />
<br />
For example, if we want to sample from <math>\, f_{X,Y}(x,y)</math>, we need to know how to sample from <math>\, f_{X|Y}(x|y)</math> and <math>\, f_{Y|X}(y|x)</math>. Suppose the chain starts with <math>\,(X_0,Y_0)</math> and <math>(X_1,Y_1), \dots , (X_n,Y_n)</math> have been sampled. Then,<br />
<br />
<math>\, X_{n+1}\sim f_{X|Y}(x|Y_n), \qquad Y_{n+1}\sim f_{Y|X}(y|X_{n+1})</math><br />
<br />
Gibbs sampling turns a multi-dimensional distribution into a set of one-dimensional distributions. If we want to sample from <br />
<br />
<math>P_{X^1,\dots ,X^p}(x^1,\dots ,x^p)</math> <br />
<br />
and the full conditionals are known, then:<br />
<br />
<math>X^1_{n+1}\sim f(X^1|X^2_n,\dots ,X^p_n)</math><br />
<br />
<math>X^2_{n+1}\sim f(X^2|X^1_{n+1},X^3_n,\dots ,X^p_n)</math><br />
<br />
<math>\vdots</math><br />
<br />
<math>X^{p-1}_{n+1}\sim f(X^{p-1}|X^1_{n+1},\dots ,X^{p-2}_{n+1},X^p_n)</math><br />
<br />
<math>X^p_{n+1}\sim f(X^p|X^1_{n+1},\dots ,X^{p-1}_{n+1})</math><br />
<br />
With Gibbs sampling, we can simulate <math>\displaystyle n</math> random variables sequentially from <math>\displaystyle n</math> univariate conditionals rather than generating one <math>n</math>-dimensional vector using the full joint distribution, which could be a lot more complicated.<br />
<br />
Computational inference deals with probabilistic graphical models. Gibbs sampling is useful here: graphical models show the dependence relations among random variables. For instance, Bayesian networks are graphical models represented using directed acyclic graphs. Looking at such a graphical model tells us on which random variable the distribution of a certain random variable depends (i.e. its parent). The model can be used to "factor" a joint distribution into conditional distributions.<br />
<br />
[[File:stat341_nov_1_graphical_model.png|200px|thumb|left|Sample graphical model of five RVs]]<br />
<br />
For example, consider the five random variables A, B, C, D, and E. Without making any assumptions about dependence relations among them, all we know is <br />
<br />
<math>\, P(A,B,C,D,E)=</math><math>\, P(A|B,C,D,E) P(B|C,D,E) P(C|D,E) P(D|E) P(E)</math><br />
<br />
However, if we know the relation between the random variables, e.g. given the graphical model on the left, we can simplify this expression:<br />
<br />
<math>\, P(A,B,C,D,E)=P(A) P(B|A) P(C|A) P(D|C) P(E|C)</math><br />
<br />
Although the joint distribution may be very complicated, the conditional distributions may not be.<br />
<br />
Check out the following notes on Gibbs sampling:<br />
<br />
* [http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf MCMC and Gibbs Sampling, MIT Lecture Notes]<br />
* chapter 7.4 in [http://stat.fsu.edu/~anuj/pdf/classes/CompStatI09/BOOK.pdf Notes on Computational Methods in Statistics]<br />
* chapter 4.9 in [http://www.ma.hw.ac.uk/~foss/StochMod/Ross_S.pdf Introduction to Probability Models] by Sheldon Ross<br />
<br />
====Example of Gibbs sampling: Multi-variate normal====<br />
<br />
We'd like to generate samples from a bivariate normal with parameters<br />
<br />
<math>\mu = \begin{bmatrix}1\\ 2 \end{bmatrix} = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}</math> <br />
and <math>\Sigma = \begin{bmatrix}1 & 0.9 \\ 0.9 & 1 \end{bmatrix}= \begin{bmatrix}1 & \rho \\ \rho & 1 \end{bmatrix}</math><br />
<br />
The conditional distributions of multi-variate normal random variables are also normal:<br />
<br />
<math>\, f(x_1|x_2)=N(\mu_1 + \rho(x_2-\mu_2), 1-\rho^2)</math><br />
<br />
<math>\, f(x_2|x_1)=N(\mu_2 + \rho(x_1-\mu_1), 1-\rho^2)</math><br />
<br />
(In general, if the joint distribution has parameters<br />
<br />
<math>\mu = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}</math> and <math>\Sigma = \begin{bmatrix} \Sigma _{1,1} & \Sigma _{1,2} \\ \Sigma _{2,1} & \Sigma _{2,2} \end{bmatrix}</math><br />
<br />
then the conditional distribution <math>\, f(x_1|x_2)</math> has mean <math>\, \mu_1 + \Sigma _{1,2}(\Sigma _{2,2})^{-1}(x_2-\mu_2)</math> and variance <math>\, \Sigma _{1,1}-\Sigma _{1,2}(\Sigma _{2,2})^{-1}\Sigma _{2,1}</math>.)<br />
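<br />
A minimal MATLAB sketch of the Gibbs sampler for this bivariate normal follows; the chain length and the starting point (0,0) are arbitrary choices, not part of the lecture.<br />
<pre><br />
% Gibbs sampling from a bivariate normal with mu = (1,2) and rho = 0.9<br />
mu1 = 1; mu2 = 2; rho = 0.9;<br />
n = 5000;<br />
x1 = zeros(n,1); x2 = zeros(n,1);        % start the chain at (0,0)<br />
for i = 2:n<br />
    % sample x1 | x2 from N(mu1 + rho*(x2-mu2), 1-rho^2)<br />
    x1(i) = mu1 + rho*(x2(i-1)-mu2) + sqrt(1-rho^2)*randn;<br />
    % sample x2 | x1 from N(mu2 + rho*(x1-mu1), 1-rho^2)<br />
    x2(i) = mu2 + rho*(x1(i)-mu1) + sqrt(1-rho^2)*randn;<br />
end<br />
plot(x1(100:end), x2(100:end), '.')      % elongated cloud along the diagonal<br />
[mean(x1(100:end)) mean(x2(100:end))]    % should be close to (1, 2)<br />
</pre><br />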
<br />
=='''Principal Component Analysis (PCA) - November 8, 2011'''==<br />
<br />
Principal component analysis is a roughly 100-year-old algorithm used for dimensionality reduction of data. As the dimension increases, the number of data points needed to sample the space accurately grows exponentially.<br />
<br />
<math>\, x\in \mathbb{R}^D \rightarrow y\in \mathbb{R}^d</math><br />
<br />
<math>\ d \le D </math><br />
<br />
We want to transform <math>\, x</math> to <math>\, y</math> by reducing dimensionality yet losing little information.<br />
<br />
For example, consider points in a three dimensional space that lie on a 2D manifold. By unrolling the manifold, we can reduce the data to 2D while losing little information. Note: this is not an application of PCA, but it simply illustrates one way dimensionality can be reduced.<br />
<br />
Principal Component Analysis lets us reduce data to a linear subspace of its original space. It works best when the data lie in, or close to, a lower dimensional linear subspace of the original space.<br />
<br />
<br />
'''Probabilistic View'''<br />
<br />
We can see data set <math>\, x</math> as a high dimensional random variable governed by a low dimensional random variable <math>\, y</math>. Given <math>\, x</math>, we are trying to estimate <math>\, y</math>.<br />
<br />
We can see this in 2D linear regression, as the locations of data points in a scatter plot are governed by its approximate linear regression. The subspace that we have reduced the data to here is in the direction of variation in the data.<br />
<br />
'''Principal Component Analysis'''<br />
<br />
Principal component analysis is an orthogonal linear transform applied to a data set. It expresses the data in a new set of orthogonal coordinate directions, each representing a direction of maximum variance of the data: the first principal component is the direction of maximum variance, the second principal component is the direction of maximum variance orthogonal to the first, the third principal component is the direction of maximum variance orthogonal to the first two, and so on, until we have D vectors, where D is the dimension of the original data.<br />
<br />
Suppose we have data represented by <math>\, X = \begin{bmatrix}<br />
x^1\\<br />
x^2\\<br />
\vdots \\ <br />
x^D<br />
\end{bmatrix}<br />
\in \mathbb{R}^{D \times n} </math><br />
<br />
For some <math>\, W = \begin{bmatrix}<br />
w^1\\<br />
w^2\\<br />
\vdots \\ <br />
w^D<br />
\end{bmatrix}<br />
\in \mathbb{R}^{D} </math><br />
<br />
We can write the projection of the data onto the direction <math>\, W </math> as<br />
<br />
<math>\, w^1x^1 + w^2x^2 + \cdots + w^Dx^D = W^TX</math><br />
<br />
To find the first principal component, we want to maximize the variance of <math>\,W^TX</math>.<br />
<br />
The variance of <math>\,W^TX</math> is <math>\,W^TSW</math> where <math>\,S</math> is the covariance matrix of X.<br />
<br />
<math>\, S = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^T</math><br />
<br />
<br />
So we have to solve the problem<br />
<br />
<math>\, \text {Max } W^TSW</math><br />
<br />
<math>\, \text{such that } W^TW = 1</math>.<br />
<br />
<br />
We restrict W to unit vectors, as otherwise the maximum is unbounded. We are only looking for the direction of the vector; its actual scale is unnecessary.<br />
<br />
Using the method of Lagrange multipliers, we have<br />
<br />
<math>\,L(W, \lambda) = W^TSW - \lambda(W^TW - 1) </math><br />
<br />
We set<br />
<br />
<math>\, \frac{\partial L}{\partial W} = 0 </math><br />
<br />
<br />
<br />
Note that <math>\, W^TSW</math> is a quadratic form. So we have<br />
<br />
<br />
<br />
<math>\, \frac{\partial L}{\partial W} = 2SW - 2\lambda W = 0 </math><br />
<br />
<math>\, SW = \lambda W </math><br />
<br />
Since S is a matrix and <math>\, \lambda</math> is a scalar, W is an eigenvector of S and <math>\, \lambda</math> is its corresponding eigenvalue.<br />
<br />
Suppose that<br />
<br />
<math>\, \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d</math><br />
are eigenvalues of S and <math>\, u_1, u_2, \cdots u_d</math> are their corresponding eigenvectors.<br />
<br />
We want to choose some <math>\, W = u </math><br />
<br />
<math>\,u^TSu =u^T\lambda u = \lambda u^Tu = \lambda</math><br />
<br />
So to maximize <math>\, u^TSu</math>, choose the eigenvector corresponding to the largest eigenvalue, i.e. <math>\, u_1</math>.<br />
<br />
So we let <math>\, W = u_1 </math> be the first principal component.<br />
<br />
The principal components decompose the total variance in the data:<br />
<br />
<math>\, \sum_{i=1}^D \text{Var}(u_i^Tx) = \sum_{i=1}^D \lambda_i = \text{Tr}(S) = \sum_{i=1}^D \text{Var}(x_i)</math><br />
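<br />
A minimal sketch of this eigen-decomposition view of PCA on a synthetic 2D data set (the data and their correlation structure are invented purely for illustration):<br />
<pre><br />
% PCA via eigen-decomposition of the sample covariance matrix<br />
n = 500;<br />
X = randn(2, n);<br />
X(2,:) = 0.9*X(1,:) + 0.3*X(2,:);        % make the two coordinates correlated<br />
mu = mean(X, 2);<br />
Xc = X - repmat(mu, 1, n);               % center the data<br />
S = (Xc*Xc')/n;                          % sample covariance matrix<br />
[U, Lambda] = eig(S);<br />
[lambda, idx] = sort(diag(Lambda), 'descend');<br />
U = U(:, idx);                           % columns ordered by decreasing variance<br />
Y = U(:,1)'*Xc;                          % projection onto the first principal component<br />
lambda(1)/sum(lambda)                    % fraction of total variance explained<br />
</pre><br />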
<br />
<br><br />
===Singular Value Decomposition===<br />
Singular value decomposition is a "generalization" of eigenvalue decomposition "to rectangular matrices of size ''mxn''."<ref name="Abdel_SVD">Abdel-Rahman, E. (2011). Singular Value Decomposition [Lecture notes]. Retrieved from http://uwace.uwaterloo.ca</ref> Singular value decomposition solves:<br><br><br />
:<math>\ A_{mxn}\ v_{nx1}=s\ u_{mx1}</math><br><br><br />
"for the right singular vector ''v'', the singular value ''s'', and the left singular vector ''u''. There are ''n'' singular values ''s''<sub>''i''</sub> and ''n'' right and left singular vectors that must satisfy the following conditions"<ref name="Abdel_SVD"/>:<br />
# "All singular values are non-negative"<ref name="Abdel_SVD"/>, <br> <math>\ s_i \ge 0.</math><br />
# All "right singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>, <br> <math>\ v_iv_j=\delta_{i,j}.</math><br />
# All "left singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>, <br> <math>\ u_iu_j=\delta_{i,j}.</math><br />
where<br />
:<math>\delta_{i,j}=\left\{\begin{matrix}1 & \mathrm{if}\ i=j \\ 0 & \mathrm{if}\ i\neq j\end{matrix}\right.</math><br><br><br />
<br />
'''Procedure to find the singular values and vectors'''<br><br />
Observe the following about the eigenvalue decomposition of a real square matrix ''A'' where ''v'' is the unit eigenvector:<br><br />
::<math><br />
\begin{align}<br />
& Av=\lambda v \\<br />
& (Av)^T=(\lambda v)^T \\<br />
& (Av)^TAv=(\lambda v)^T\lambda v \\<br />
& v^TA^TAv=\lambda^2v^Tv \\<br />
& vv^TA^TAv=v\lambda^2 \\<br />
& A^TAv=\lambda^2v<br />
\end{align}<br />
</math><br />
As a result:<br />
# "The matrices ''A'' and ''A''<sup>''T''</sup>''A'' have the same eigenvectors."<ref name="Abdel_SVD"/><br />
# "The eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are the square of the eigenvalues of matrix ''A''."<ref name="Abdel_SVD"/><br />
# Since matrix ''A''<sup>''T''</sup>''A'' is symmetric,<br />
## "all the eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are real and distinct."<ref name="Abdel_SVD"/><br />
## "the eigenvectors of matrix ''A''<sup>''T''</sup>''A'' are orthogonal and can be chosen to be orthonormal."<ref name="Abdel_SVD"/><br />
# "The eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are non-negative"<ref name="Abdel_SVD"/> since <math>\ \lambda^2_i \ge 0.</math><br />
Conclusions 3 and 4 are "true even for a rectangular matrix ''A'' since ''A''<sup>''T''</sup>''A'' is still a square symmetric matrix"<ref name="Abdel_SVD"/> and its eigenvalues and eigenvectors can be found.<br><br><br />
Therefore, for a rectangular matrix ''A'', assuming ''m>n'', the singular values and vectors can be found by:<br />
# "Form the ''nxn'' symmetric matrix ''A''<sup>''T''</sup>''A''."<ref name="Abdel_SVD"/><br />
# Perform an eigenvalue decomposition to get ''n'' eigenvalues and their "corresponding eigenvectors, ordered such that"<ref name="Abdel_SVD"/> <br><math>\lambda^2_1 \ge \lambda^2_2 \ge \dots \ge \lambda^2_n \ge 0</math> and <math>\{v_1, v_2, \dots, v_n\}.</math><br />
# "The singular values are"<ref name="Abdel_SVD"/>: <br><math>s_1=\sqrt{\lambda^2_1} \ge s_2=\sqrt{\lambda^2_2} \ge \dots \ge s_n=\sqrt{\lambda^2_n} \ge 0.</math><br>"The non-zero singular values are distinct; the equal sign applies only to the singular values that are equal to zero."<ref name="Abdel_SVD"/><br />
# "The ''n''-dimensional right singular vectors are"<ref name="Abdel_SVD"/><br><math>\{v_1, v_2, \dots, v_n\}.</math><br />
# "For the first <math>r \le n</math> singular values such that ''s''<sub>''i''</sub> ''> 0'', the left singular vectors are obtained as unit vectors"<ref name="Abdel_SVD"/> by <math>\tfrac{1}{s_i}Av_i=u_i.</math><br />
# Select "the <math>\ m-r</math> left singular vectors corresponding to the zero singular values such that they are unit vectors orthogonal to each other and to the first ''r'' left singular vectors"<ref name="Abdel_SVD"/> <math>\{u_1, u_2, \dots, u_r\}.</math><br><br><br />
<br />
'''Finding Singular value Decomposition Using MATLAB Code'''<br />
Please refer to the following link: http://www.mathworks.com/help/techdoc/ref/svd-singular-value-decomposition.html<br />
<br />
'''Formal definition'''<br><br />
"We can now decompose the rectangular matrix ''A'' in terms of singular values and vectors as follows"<ref name="Abdel_SVD"/>:<br><br><br />
<math>A_{mxn} \begin{bmatrix} v_1 & | & \cdots & | & v_n \end{bmatrix}_{nxn} = \begin{bmatrix} u_1 & | & \cdots & | & u_n & | & u_{n+1} & | & \cdots & | & u_m \end{bmatrix}_{mxm} \begin{bmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_n \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}_{mxn}</math><br><br />
:<math>\ AV=US</math><br><br><br />
Since "the matrices ''V'' and ''U'' are orthogonal"<ref name="Abdel_SVD"/>, ''V ''<sup>''-1''</sup>=''V''<sup>T</sup> and ''U ''<sup>''-1''</sup>=''U''<sup>T</sup>:<br><br><br />
:<math>\ A=USV^T</math><br><br><br />
"which is the formal definition of the singular value decomposition."<ref name="Abdel_SVD"/><br><br><br />
<br />
'''Relevance to PCA'''<br><br />
In order to perform PCA, one needs to do eigenvalue decomposition on the covariance matrix. By transforming the mean for all attributes to zero, the covariance matrix can be simplified to:<br><br><br />
<math>\ S=XX^T</math><br><br><br />
Since the eigenvalue decomposition of ''A''<sup>''T''</sup>''A'' gives the same eigenvectors as the singular value decomposition of ''A'', an additional, and more reliable, method for performing PCA is through the singular value decomposition of ''X'' (if a matrix's eigenvector matrix is not invertible, its eigenvalue decomposition does not exist, whereas the singular value decomposition always exists).<br />
<br />
The following MATLAB code uses singular value decomposition for performing PCA; 20 principal components, and thus the top 20 maximum variation directions, are selected for reconstructing facial images that have had noise applied to them:<br />
<br />
load noisy.mat<br />
%first noisy image; each image has a resolution of 20x28<br />
imagesc(reshape(X(:,1),20,28)')<br />
%to grayscale<br />
colormap gray<br />
%singular value decomposition <br />
[u s v]=svd(X);<br />
%reduced feature space: 20 principal components<br />
Xh=u(:,1:20)*s(1:20,1:20)*v(:,1:20)';<br />
figure<br />
imagesc(reshape(Xh(:,1),20,28)')<br />
colormap gray<br />
<br />
The reconstructed images are nearly noiseless because the added noise has less variation than the top 20 principal components, so it is largely discarded when only those components are kept.<br />
<br />
=='''References'''==<br />
<br />
<references/><br />
<br />
==''' PCA and Introduction to Kernel Function-November,10,2011'''==<br />
===Continue with the last lecture===<br />
Some notations:<br />
Let <math>\displaystyle X_{d\times n}</math> be a matrix. <br />
<br />
Let <math>\displaystyle X_j,j=1,2,...,d</math> be the j th the data point,and <math>\displaystyle X_j\in\R^d</math>.<br />
<br />
Let <math>\displaystyle Q=\sum_{j=1}^n(X_j-\bar{X})(X_j-\bar{X})^T</math>, where <math> \bar{X}=\frac{1}{n}\sum_{j=1}^n X_j)</math>.<br />
<br />
Now assume that the data have already been centered, so that <math>\displaystyle Q=\sum_{j=1}^n(X_j)(X_j)^T=X X^T </math>.<br />
<br />
*Find the principal components, which means finding the eigenvectors of Q, or equivalently doing the singular value decomposition [u s v]=svd(X), where the columns of u are the eigenvectors of <math>\displaystyle Q=X X^T</math>.<br />
<br />
*Map the data in lower dimension space.<br />
We can choose the first p (p<d) eigenvectors (the first p columns of u), so that <math>\displaystyle u^T</math> is a <math>\displaystyle p\times d</math> matrix.<br />
Thus, we can project our original data points <math>\displaystyle x_j</math> into p dimensions.<br />
Mathematically, this is <math>\displaystyle Y_{p\times n}={u^T}_{p\times d} X_{d\times n}</math>. In other words, we reduce our original d variables to p principal components.<br />
<br />
*Reconstruct Points.<br />
We can also use those dimension-reduced data to project back to high dimension.<br />
However, we will lose some information because when we map those points into lower dimension, we throw away the last (d-p) eigenvectors which contain some of the original information.<br />
Since the columns of <math>\displaystyle u</math> are orthonormal, we can reconstruct <math> \hat{x}_{d\times n}=u_{d\times p} Y_{p\times n}=u_{d\times p}{u^T}_{p\times d}x_{d\times n}\approx x_{d\times n} </math>.<br />
<br />
*Map a new data point to the lower dimensional space with <math>\displaystyle y_{p\times 1}={u^T}_{p\times d} x_{d\times 1}</math>, and reconstruct it in the high dimension with <math>\displaystyle \hat{x}_{d\times 1}=u_{d\times p} y_{p\times 1}</math>, as in the sketch below.<br />
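<br />
The following is a minimal sketch of the projection and reconstruction steps above; the data matrix and the number of components p are arbitrary choices for illustration:<br />
<br />
<pre><br />
X = randn(10, 50);                % toy centered data, d = 10, n = 50<br />
[u, s, v] = svd(X);<br />
p = 3;<br />
Y = u(:, 1:p)' * X;               % project the data to p dimensions (p x n)<br />
Xhat = u(:, 1:p) * Y;             % reconstruct back in d dimensions<br />
x_new = randn(10, 1);             % an out-of-sample point<br />
y_new = u(:, 1:p)' * x_new;       % project the new point<br />
x_new_hat = u(:, 1:p) * y_new;    % reconstruct the new point<br />
</pre><br />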
<br />
===3 and 2 digits example===<br />
The data X is a 64 by 400 matrix. Every column can be displayed as an 8 by 8 image of either "3" or "2". The first 200 columns are "2"s and the last 200 columns are "3"s.<br />
We first center the data, and then take the first p (p<d) columns of u from the singular value decomposition.<br />
<br />
MATLAB CODE:<br />
<pre><br />
MU=repmat(mean(X,2),1,400);<br />
% mean(X,2) is the average of each row<br />
% to center the data, expand mean(X,2), a 64 by 1 vector, into a 64 by 400 matrix<br />
Xt=X-MU;<br />
% modify the data to zero mean data<br />
[u s v]=svd(Xt);<br />
% note that size(u)=64*64, and the columns of u are eigenvectors of the covariance matrix<br />
Y=u(:,1:2)'*X;<br />
% use the first two PCs to transform the high dimensional points to lower ones<br />
</pre><br />
One way to look at this case is to plot Principal Component #1 against Principal Component #2 in a two dimensional space.<br />
<pre><br />
plot(Y(1,:)',Y(2,:)')<br />
</pre><br />
The result is as follows; we can clearly see that there are two classes.<br />
<br />
[[file:pca2.png|350px|400px]]<br />
<br />
To dig more into the difference between these two classes, we can plot the first 200 columns and the last 200 columns separately, to see whether there is a visible difference due to the different types of digits.<br />
<pre><br />
plot(Y(1,1:200)',Y(2,1:200)','d')<br />
% the first 200 columns represent digit "2" and are plotted as diamonds<br />
hold on<br />
% draw different graphs in one figure<br />
plot(Y(1,201:400)',Y(2,201:400)','ro')<br />
% the last 200 columns represent digit "3" and are plotted as red circles<br />
</pre><br />
<br />
[[file:pca3.png|350px|400px]]<br />
<br />
<pre><br />
image=reshape(X,8,8,400);<br />
plotdigits(image,Y,.1,1);<br />
</pre><br />
The result can be seen more clearly in the following picture: the digits "3" and "2" are clearly separated.<br />
<br />
[[file:Pca.png|350px|400px]]<br />
<br />
===Introduction to Kernel Function===<br />
PCA is useful when the data points lie in or close to a plane, which means PCA is powerful when dealing with linear problems. But when the data points lie on a curved manifold, PCA is hard to apply directly. There is a solution to this problem: we can use a "trick" to change nonlinear classification problems into linear ones. This is called the "Kernel Trick".<br />
<br />
'''An intuitive example'''<br />
<br />
[[File:Kernel trick.png|400px|300px]]<br />
<br />
From the picture, we can see the red dots are in the middle of the blue ones. However, it is hard to separate those two classes using any line (linear in the two dimensional space). But we can pull the red ones out of the two dimensional space to form a three dimensional space, in which case we can easily tell them apart.<br />
<br />
For more details about this trick,please see http://omega.albany.edu:8008/machine-learning-dir/notes-dir/ker1/ker1.pdf<br />
<br />
More precisely, the significance of a kernel function is that it allows us to map the data points into a higher dimension implicitly, without ever computing the high-dimensional coordinates explicitly.<br />
Let's look at how this is possible:<br />
<br />
<math>Z_1=<br />
\begin{bmatrix}<br />
x_1\\<br />
y_1<br />
\end{bmatrix}\xrightarrow{\phi}<br />
</math><br />
<math>\phi(Z_1)=<br />
\begin{bmatrix}<br />
x_1^2\\<br />
y_1^2\\<br />
\sqrt2x_1y_1<br />
\end{bmatrix}.<br />
<br />
</math><br />
<math>Z_2=<br />
\begin{bmatrix}<br />
x_2\\<br />
y_2<br />
\end{bmatrix}\xrightarrow{\phi}<br />
</math><br />
<math>\phi(Z_2)=<br />
\begin{bmatrix}<br />
x_2^2\\<br />
y_2^2\\<br />
\sqrt2x_2y_2<br />
\end{bmatrix}<br />
</math><br />
<br />
The inner product of <math>\displaystyle \phi(Z_1)</math> and <math>\displaystyle\phi(Z_2)</math>, which can be written as <math>\displaystyle\phi(Z_1)^T\phi(Z_2)</math>, is equal to:<br />
<math><br />
\begin{bmatrix}<br />
x_1^2&y_1^2&\sqrt2x_1y_1 <br />
\end{bmatrix}<br />
\begin{bmatrix}<br />
x_2^2\\<br />
y_2^2\\<br />
\sqrt2x_2y_2 <br />
\end{bmatrix}=</math> <math>\displaystyle (x_1x_2+y_1y_2)^2=K(Z_1,Z_2)</math>.<br />
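<br />
This identity can be verified numerically; the two points below are arbitrary:<br />
<br />
<pre><br />
Z1 = [1; 2];  Z2 = [3; -1];                     % two arbitrary 2-d points<br />
phi = @(z) [z(1)^2; z(2)^2; sqrt(2)*z(1)*z(2)]; % explicit feature map<br />
lhs = phi(Z1)' * phi(Z2);                       % inner product in feature space<br />
rhs = (Z1' * Z2)^2;                             % kernel evaluated in input space<br />
[lhs rhs]                                       % the two numbers agree<br />
</pre><br />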
<br />
'''The most common Kernel functions are as follows:'''<br />
*Linear: <math>\displaystyle K_{ij}=<X_i,X_j></math><br />
*Polynomial: <math>\displaystyle K_{ij}=(1+<X_i,X_j>)^p</math><br />
*Gaussian: <math>\displaystyle K_{ij}=e^\frac{-{\left\Vert X_i-X_j\right\Vert}^2}{2\sigma^2}</math>,<br />
where <math>\displaystyle <X_i,X_j></math> denotes the inner product of <math>\displaystyle X_i</math> and <math>\displaystyle X_j</math>, and <math>{\left\Vert X_i-X_j\right\Vert}^2</math> denotes the squared Euclidean distance between vector <math>\displaystyle X_i</math> and vector <math>\displaystyle X_j</math>.<br />
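<br />
As a minimal sketch (the data and the value of <math>\sigma</math> are arbitrary), each of these kernel matrices can be computed directly from the inner products or distances of the original data points, without forming any high-dimensional features:<br />
<br />
<pre><br />
X = randn(2, 100);                       % d = 2, n = 100 data points (columns)<br />
K_lin  = X' * X;                         % linear kernel, n x n<br />
K_poly = (1 + X' * X).^2;                % polynomial kernel with p = 2<br />
sigma = 1;<br />
sq = sum(X.^2, 1);                       % squared norms of each column<br />
D = bsxfun(@plus, sq', sq) - 2*(X'*X);   % pairwise squared distances<br />
K_gauss = exp(-D / (2*sigma^2));         % Gaussian kernel<br />
</pre><br />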
<br />
<br />
==''' Kernel PCA -November,15,2011'''==<br />
<br />
First we look at the algorithm for PCA and see how we can kernelize PCA:<br />
<br />
== PCA ==<br />
<br />
Find eigenvectors of <math>XX^T</math>, call it U<br />
<br />
For the data matrix <math>X</math>:<br />
<math><br />
\begin{align}<br />
Y &= U^{T}X \\<br />
\hat{X} & = UY<br />
\end{align}<br />
</math><br />
<br />
For an out-of-sample point <math>x</math>:<br />
<math><br />
\begin{align}<br />
y & = U^{T}x \\<br />
\hat{x} & = Uy<br />
\end{align}<br />
</math><br />
<br />
== Modifying PCA ==<br />
<br />
<math><br />
\begin{align}<br />
[ U \Sigma V ] & = svd(X) \\<br />
X & = U\Sigma{V^T}<br />
\end{align}<br />
</math><br />
<br />
U is eigenvectors of <math>XX^T</math><br />
<br />
V is eigenvectors of <math>X^T{X}</math><br />
<br />
Now we want to kernelize this classical version of PCA.<br />
<br />
We would like to express everything in terms of <math>V</math>, the eigenvectors of <math>X^T{X}</math>, which can be kernelized. This is called Dual PCA.<br />
<br />
<math><br />
\begin{align}<br />
X&= U \Sigma V^T \\<br />
XV&=U \Sigma V^T V \\<br />
&= U\Sigma \\<br />
U&=XV\Sigma^{-1}<br />
\end{align}<br />
</math><br />
<br />
Find eigenvectors of <math>X^TX</math>, call it V.<br />
<br />
<math><br />
\begin{align}<br />
X&=U \Sigma V^T \\<br />
U^TX &= U^TU\Sigma V^T \\<br />
U^TX &= \Sigma V^T \\<br />
Y&=\Sigma V^T \\<br />
\end{align}<br />
</math><br />
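<br />
A minimal numerical check (on arbitrary random data) that the dual projection <math>Y=\Sigma V^T</math> agrees with the primal projection <math>U^TX</math>:<br />
<br />
<pre><br />
X = randn(5, 30);                 % toy data, d = 5, n = 30<br />
[U, S, V] = svd(X, 'econ');       % economy-size SVD<br />
Y_primal = U' * X;                % project using the left singular vectors<br />
Y_dual   = S * V';                % project using only Sigma and V<br />
norm(Y_primal - Y_dual)           % should be close to 0<br />
</pre><br />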
<br />
Reconstruct Points<br />
<br />
<math><br />
\begin{align}<br />
\hat{X}&=UY \\<br />
\hat{X} &=XV\Sigma^{-1}\Sigma{V^T} \\<br />
\hat{X} &= XVV^T<br />
\end{align}<br />
</math><br />
<br />
Map an out of sample point x to low-dimensional space<br />
<br />
<math><br />
\begin{align}<br />
Y &=U^TX \\<br />
& = (XV\Sigma^{-1})^TX \\<br />
& = \Sigma^{-1}{V^T}{X^T}X<br />
\end{align}<br />
</math><br />
<br />
Reconstruct an out of sample point <br />
<br />
<br />
<math><br />
\begin{align}<br />
\hat{X} &= UY=XV\Sigma^{-1}\Sigma^{-1}V^T{X^T}X \\<br />
&= XV\Sigma^{-2}V^T{X^T}X<br />
\end{align}<br />
</math></div>S9huhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341f11&diff=14856stat341f112011-11-15T19:17:13Z<p>S9hu: /* Kernel PCA -November,15,2011 */</p>
<hr />
<div>Please contribute to the discussion of splitting up this page into multiple pages on the [[{{TALKPAGENAME}}|talk page]].<br />
<br />
==[[signupformStat341F11| Editor Sign Up]]==<br />
<br />
==Notation==<br />
<br />
The following guidelines on notation were posted on the Wiki Course Note page for [[Stat946f11|STAT 946]]. Add to them as necessary for consistent notation on this page.<br />
<br />
Capital letters will be used to denote random variables and lower case letters denote observations for those random variables:<br />
<br />
* <math>\{X_1,\ X_2,\ \dots,\ X_n\}</math> random variables<br />
* <math>\{x_1,\ x_2,\ \dots,\ x_n\}</math> observations of the random variables<br />
<br />
The joint ''probability mass function'' can be written as:<br />
<center><math> P( X_1 = x_1, X_2 = x_2, \dots, X_n = x_n )</math></center><br />
or as shorthand, we can write this as <math>p( x_1, x_2, \dots, x_n )</math>. In these notes both types of notation will be used.<br />
We can also define a set of random variables <math>X_Q</math> where <math>Q</math> represents a set of subscripts.<br />
<br />
<br />
==Sampling - September 20, 2011==<br />
<br />
The meaning of sampling is to generate data points or numbers such that these data follow a certain distribution.<br /><br />
i.e. From <math>x \sim~f(x)</math> sample <math>\,x_{1}, x_{2}, ..., x_{1000}</math><br />
<br />
In practice, it maybe difficult to find the joint distribution of random variables. Through simulating the random variables, we can make an inference from the data.<br />
<br />
===Sampling from Uniform Distribution===<br />
Computers cannot generate random numbers as they are deterministic; however they can produce pseudo random numbers using algorithms. Generated numbers mimic the properties of random numbers but they are never truly random. One famous algorithm that is considered highly reliable is the Mersenne twister[http://en.wikipedia.org/wiki/Mersenne_twister], which generates random numbers in an almost uniform distribution. <br />
<br />
<br />
====Multiplicative Congruential====<br />
*involves four parameters: integers <math>\,a, b, m</math>, and an initial value <math>\,x_0</math> which we call the seed<br />
*a sequence of integers is defined as<br />
:<math>x_{k+1} \equiv (ax_{k} + b) \mod{m}</math><br />
<br />
'''Example:''' <math>\,a=13, b=0, m=31, x_0=1</math> creates a uniform histogram.<br />
<br />
MATLAB code for generating 1000 random numbers using the multiplicative congruential method:<br />
<br />
<pre><br />
a = 13;<br />
b = 0;<br />
m = 31;<br />
x(1) = 1;<br />
<br />
for ii = 2:1000<br />
x(ii) = mod(a*x(ii-1)+b, m);<br />
end<br />
</pre><br />
<br />
MATLAB code for displaying the values of x generated:<br />
<br />
<pre><br />
x<br />
</pre><br />
<br />
MATLAB code for plotting the histogram of x:<br />
<br />
<pre><br />
hist(x)<br />
</pre><br />
<br />
Histogram Output:<br />
<br />
[[File:uniform.jpg]]<br />
<br />
Facts about this algorithm:<br />
*In this example, the first 30 terms in the sequence are a permutation of integers from 1 to 30 and then the sequence repeats itself.<br />
*Values are between <b>0</b> and <b>m-1</b>, inclusive.<br />
*Dividing the numbers by <b> m-1 </b> yields numbers in the interval <b>[0,1]</b>.<br />
*MATLAB's <code>rand</code> function once used this algorithm with <b>a= 7<sup>5</sup></b>, <b>b= 0</b>, <b>m= 2<sup>31</sup>-1</b>,for reasons described in Park and Miller's 1988 paper "Random Number Generators: Good Ones are Hard to Find" (available [http://www.firstpr.com.au/dsp/rand31/p1192-park.pdf online]).<br />
*Visual Basic's <code>RND</code> function also used this algorithm with <b>a= 1140671485</b>, <b>b= 12820163</b>, <b>m= 2<sup>24</sup></b>. ([http://support.microsoft.com/kb/231847 Reference])<br />
<br />
===Inverse Transform Method===<br />
This is a basic method for sampling. Theoretically using this method we can generate sample numbers at random from any probability distribution once we know its cumulative distribution function (cdf).<br />
<br />
====Theorem====<br />
Take <math>U \sim~ \mathrm{Unif}[0, 1]</math> and let <math>X = F^{-1}(U) </math>. Then <math>X</math> has distribution function <math>F(\cdot)</math>, where <math>F(x)=P(X \leq x)</math> and <math>F^{-1}(\cdot)</math> is the inverse of <math>F(\cdot)</math>.<br />
<br />
Therefore <math>F(x)=u\implies x=F^{-1}(u)</math><br />
<br />
'''Proof'''<br />
<br />
Recall that<br />
<br />
:<math>P(a \leq X<b)=\int_a^{b} f(x) dx</math><br />
<br />
:<math>cdf=F(x)=P(X \leq x)=\int_{-\infty}^{x} f(s) ds</math><br />
<br />
Note that if <math>U \sim~ \mathrm{Unif}[0, 1]</math>, we have <math>P(U \leq a)=a</math><br />
<br />
:<math>\begin{align}<br />
<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
====Continuous Case====<br />
Generally it takes two steps to get random numbers using this method.<br />
<br />
*Step 1. Draw <math>U \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <b><i>X=F <sup>&minus;1</sup>(U)</i></b><br />
<br />
'''Example'''<br />
<br />
Take the exponential distribution for example<br />
:<math>\,f(x)={\lambda}e^{-{\lambda}x}</math><br />
:<math>\,F(x)=\int_0^x {\lambda}e^{-{\lambda}u} du=[-e^{-{\lambda}u}]_0^x=1-e^{-{\lambda}x}</math><br />
<br />
Let: <math>\,F(x)=y</math><br />
:<math>\,y=1-e^{-{\lambda}x}</math><br />
:<math>\,ln(1-y)={-{\lambda}x}</math><br />
:<math>\,x=\frac{ln(1-y)}{-\lambda}</math><br />
:<math>\,F^{-1}(x)=\frac{-ln(1-x)}{\lambda}</math><br />
<br />
Therefore, to get an exponential distribution from a uniform distribution takes 2 steps.<br />
*Step 1. Draw <math>U \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <math>x=\frac{-ln(1-U)}{\lambda}</math><br />
<br />
Note: If U~Unif[0, 1], then (1 - U) and U have the same distribution. This allows us to slightly simplify step 2 into an alternate form:<br />
*Alternate Step 2. <math>x=\frac{-ln(U)}{\lambda}</math><br />
<br />
'''MATLAB code'''<br />
for exponential distribution case,assuming <math>\lambda=0.5</math><br />
<br />
<pre><br />
for ii = 1:1000<br />
u = rand;<br />
x(ii) = -log(1-u)/0.5;<br />
end<br />
hist(x)<br />
</pre><br />
<br />
MATLAB result<br />
<br />
[[File:MATLAB_Exp.jpg|center|300px]]<br />
<br />
====Discrete Case - September 22, 2011====<br />
This same technique can be applied to the discrete case. Generate a discrete random variable <math>\,x</math> that has probability mass function <math>\,P(X=x_i)=P_i </math> where <math>\,x_0<x_1<x_2...</math> and <math>\,\sum_i P_i=1</math><br />
*Step 1. Draw <math>u \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <math>\,x=x_i</math> if <math>\,F(x_{i-1})<u \leq F(x_i)</math><br />
<br />
'''Example'''<br />
<br />
Let x be a discrete random variable with the following probability mass function:<br />
<br />
:<math>\begin{align}<br />
P(X=0) = 0.3 \\<br />
P(X=1) = 0.2 \\<br />
P(X=2) = 0.5<br />
\end{align}</math><br />
<br />
Given the pmf, we now need to find the cdf.<br />
<br />
We have:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0 & x < 0 \\<br />
0.3 & 0 \leq x < 1 \\<br />
0.5 & 1 \leq x < 2 \\<br />
1 & 2 \leq x<br />
\end{cases}</math><br />
<br />
We can apply the inverse transform method to obtain our random numbers from this distribution.<br />
<br />
'''Pseudo Code for generating the random numbers:'''<br />
<pre><br />
Draw U ~ Unif[0,1] <br />
if U <= 0.3 <br />
return 0 <br />
else if 0.3 < U <= 0.5 <br />
return 1<br />
else if 0.5 < U <= 1 <br />
return 2<br />
</pre><br />
<br />
'''MATLAB code for generating 1000 random numbers in the discrete case:'''<br />
<br />
<pre><br />
for ii = 1:1000<br />
u = rand;<br />
<br />
if u <= 0.3<br />
x(ii) = 0;<br />
else if u <= 0.5<br />
x(ii) = 1;<br />
else<br />
x(ii) = 2;<br />
end<br />
end<br />
</pre><br />
<br />
Matlab Output:<br />
<br />
[[File:Discreteinv.jpg]]<br />
<br />
'''Pseudo code for the Discrete Case:'''<br />
<br />
1. Draw U ~ Unif [0,1]<br />
<br />
2. If <math> U \leq P_0 </math>, deliver <b><i>X= x<sub>0</sub></i></b><br />
<br />
3. Else if <math> U \leq P_0 + P_1 </math>, deliver <b><i>X= x<sub>1</sub></i></b><br />
<br />
4. Else If <math> U \leq P_0 +....+ P_k </math>, deliver <b><i>X= x<sub>k</sub></i></b><br />
<br />
====Limitations====<br />
<br />
Although this method is useful, it isn't practical in many cases since we can't always obtain <math>F</math> or <math> F^{-1} </math> as some functions are not integrable or invertible, and sometimes even <math>f(x)</math> itself cannot be obtained in closed form. Let's look at some examples:<br />
*Continuous case<br />
If we want to use this method to sample from the '''normal distribution''', we may find ourselves stuck when trying to work with its ''cdf''. <br />
The simplest case of '''normal distribution''' is <math>f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}</math>,<br />
whose ''cdf'' is <math>F(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}{e^{-\frac{u^2}{2}}}du</math>. This integral cannot be expressed in terms of elementary functions. So evaluating it and then finding the inverse is a very difficult task.<br />
*Discrete case <br />
It is easy for us to simulate when there are only a few values taken by the particular random variable, like the case above.<br />
And it is easy to simulate the '''binomial distribution''' <math>X \sim~ \mathrm{B}(n,p)</math> when the parameter n is not too large.<br />
But when n takes on values that are very large, say 50, it is hard to do so.<br />
<br />
===Acceptance/Rejection Method===<br />
<br />
<br />
The aforementioned difficulties of the inverse transform method motivates a sampling method that does not require analytically calculating cdf's and their inverses, which is the acceptance/rejection sampling method. Here, <math> \displaystyle f(x)</math> is approximated by another function, say <math>\displaystyle g(x)</math>, with the idea being that <math>\displaystyle g(x)</math> is a "nicer" function to work with than <math>\displaystyle f(x)</math>.<br />
<br />
Suppose we assume the following:<br />
<br />
1. There exists another distribution <math>\displaystyle g(x)</math> that is easier to work with and that you know how to sample from, and<br />
<br />
2. There exists a constant c such that <math>f(x) \leq c \cdot g(x)</math> for all x<br />
<br />
Under these assumptions, we can sample from <math>\displaystyle f(x)</math> by sampling from <math>\displaystyle g(x)</math><br />
<br />
====General Idea====<br />
<br />
Looking at the image below we have graphed <math> c \cdot g(x) </math> and <math>\displaystyle f(x)</math>.<br />
<br />
[[File:Graph_updated.jpg]]<br />
<br />
Using the acceptance/rejection method we will accept some of the points from <math>\displaystyle g(x)</math> and reject some of the points from <math>\displaystyle g(x)</math>. The points that are accepted from <math>\displaystyle g(x)</math> will have a distribution similar to <math>\displaystyle f(x)</math>. We can see from the image that the values around <math>\displaystyle x_1</math> will be sampled more often under <math>c \cdot g(x)</math> than under <math>\displaystyle f(x)</math>, so we will have to reject more of the samples taken near x<sub>1</sub>. Around <math>\displaystyle x_2</math> the number of samples that are drawn and the number of samples we need are much closer, so we accept a larger proportion of the samples we get near <math>\displaystyle x_2</math>.<br />
<br />
====Procedure====<br />
<br />
1. Draw y ~ g<br />
<br />
2. Draw U ~ Unif [0,1]<br />
<br />
3. If <math> U \leq \frac{f(y)}{c \cdot g(y)}</math> then x=y; else return to 1<br />
<br />
Note that the choice of <math> c </math> plays an important role in the efficiency of the algorithm. We want <math> c \cdot g(x) </math> to be "tightly fit" over <math> f(x) </math> to increase the probability of accepting points, and therefore reducing the number of sampling attempts. Mathematically, we want to minimize <math> c </math> such that <math>f(x) \leq c \cdot g(x) \ \forall x</math>. We do this by setting<br />
<br />
<math> \frac{d}{dx}(\frac{f(x)}{g(x)}) = 0 </math>, solving for a maximum point <math> x_0 </math> and setting <math> c = \frac{f(x_0)}{g(x_0)}. </math><br />
<br />
====Proof====<br />
<br />
Mathematically, we need to show that the sample points given that they are accepted have a distribution of f(x).<br />
<br />
<math>\begin{align} P(y|accepted) &= \frac{P(y, accepted)}{P(accepted)} \\<br />
<br />
&= \frac{P(accepted|y) P(y)}{P(accepted)}\end{align} </math> (Bayes' Rule)<br />
<br />
<br />
<br />
<math>\displaystyle P(y) = g(y)</math><br />
<br />
<math>P(accepted|y) =P(u\leq \frac{f(y)}{c \cdot g(y)}) =\frac{f(y)}{c \cdot g(y)} </math>,where u ~ Unif [0,1]<br />
<br />
<math>P(accepted) = \sum P(accepted|y)\cdot P(y)=\int^{}_y \frac{f(y)}{c \cdot g(y)}g(y) dy=\int^{}_y \frac{f(y)}{c} dy=\frac{1}{c} \cdot\int^{}_y f(y) dy=\frac{1}{c}</math><br />
<br />
So,<br />
<br />
<math> P(y|accepted) = \frac{ \frac {f(y)}{c \cdot g(y)} \cdot g(y)}{\frac{1}{c}} =f(y) </math><br />
<br />
====Continuous Case====<br />
<br />
'''Example'''<br />
<br />
Sample from Beta(2,1)<br />
<br />
In general:<br />
<br />
Beta(<math>\alpha, \beta) = \frac{\Gamma (\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}</math> <math>\displaystyle x^{\alpha-1}</math> <math>\displaystyle(1-x)^{\beta-1}</math>, <math>\displaystyle 0<x<1</math><br />
<br />
Note: <math>\!\Gamma(n) = (n-1)!</math> if n is a positive integer<br />
<br />
<math>\begin{align} f(x) &= Beta(2,1) \\<br />
&= \frac{\Gamma(3)}{\Gamma(2)\Gamma(1)} x^1(1-x)^0 \\<br />
&= \frac{2!}{1! 0!}\cdot (1) x \\<br />
&= 2x \end{align}</math><br />
<br />
We want to choose <math>\displaystyle g(x)</math> that is easy to sample from. So we choose <math>\displaystyle g(x)</math> to be uniform distribution.<br />
<br />
We now want a constant c such that <math>f(x) \leq c \cdot g(x) </math> for all x from Unif(0,1)<br />
<br />
<br />
So,<br /><br />
<br />
<math>c \geq \frac{f(x)}{g(x)}</math>, for all x from (0,1)<br />
<br />
<br />
<math>\begin{align}c &\geq max (\frac {f(x)}{g(x)}, 0<x<1) \\<br />
<br />
<br />
&= max (\frac {2x}{1},0<x<1) \\<br />
<br />
<br />
&= 2 \end{align}</math><br />
<br />
<br />
<br />
Now that we have c =2,<br />
<br />
1. Draw y ~ g(x) => Draw y ~ Unif [0,1] <br />
<br />
2. Draw u ~ Unif [0,1] <br />
<br />
3. if <math>u \leq \frac{2y}{2 \cdot 1}</math> then x=y; else return to 1<br />
<br />
<br />
'''MATLAB code for generating 1000 samples following Beta(2,1):'''<br />
<br />
<pre><br />
close all<br />
clear all<br />
ii=1;<br />
while ii <= 1000<br />
y = rand;<br />
u = rand;<br />
<br />
if u <= y<br />
x(ii)=y;<br />
ii=ii+1;<br />
end<br />
end<br />
hist(x)<br />
</pre><br />
<br />
'''MATLAB result'''<br />
<br />
[[File:MATLAB_Beta.jpg]]<br />
<br />
====Discrete Example====<br />
<br />
Generate random variables according to the p.m.f:<br />
<br />
:<math>\begin{align}<br />
P(Y=1) = 0.15 \\<br />
P(Y=2) = 0.22 \\<br />
P(Y=3) = 0.33 \\<br />
P(Y=4) = 0.10 \\<br />
P(Y=5) = 0.20 <br />
\end{align}</math><br />
<br />
find a g(y) discrete uniform distribution from 1 to 5<br />
<br />
<math>c \geq \frac{P(y)}{g(y)} </math><br><br />
<math>c = \max \left(\frac{P(y)}{g(y)} \right)</math><br><br />
<math>c = \max \left(\frac{0.33}{0.2} \right) = 1.65</math> Since P(Y=3) is the max of P(Y) and g(y) = 0.2 for all y.<br><br />
<br />
1. Generate Y according to the discrete uniform between 1 - 5<br />
<br />
2. U ~ unif[0,1]<br />
<br />
3. If <math>U \leq \frac{P(y)}{1.65 \times 0.2} = \frac{P(y)}{0.33} </math>, then x = y; else return to 1.<br />
<br />
In MATLAB, the code would be:<br />
<br />
<pre><br />
py = [0.15 0.22 0.33 0.1 0.2];<br />
ii = 1;<br />
while ii <= 1000<br />
  y = unidrnd(5);<br />
  u = rand;<br />
  if u <= py(y)/0.33<br />
    x(ii) = y;<br />
    ii = ii+1;<br />
  end<br />
end<br />
hist(x);<br />
</pre><br />
<br />
MATLAB result<br />
<br />
[[File:MATLAB_Y.jpg]]<br />
<br />
====Limitations====<br />
<br />
Most of the time we have to sample many more points from g(x) before we can obtain an acceptable amount of samples from f(x), hence this method may not be computationally efficient. It depends on our choice of g(x). For example, in the example above to sample from Beta(2,1), we need roughly 2000 samples from g(X) to get 1000 acceptable samples of f(x).<br />
<br />
In addition, in situations where a g(x) function is chosen and used, there can be a discrepancy between the functional behaviors of f(x) and g(x) that render this method unreliable. For example, given the normal distribution function as g(x) and a function of f(x) with a "fat" mid-section and "thin tails", this method becomes useless as more points near the two ends of f(x) will be rejected, resulting in a tedious and overwhelming number of sampling points having to be sampled due to the high rejection rate of such a method.<br />
<br />
===Sampling From Gamma and Normal Distribution - September 27, 2011===<br />
<br />
====Sampling From Gamma====<br />
<br />
'''Gamma Distribution'''<br />
<br />
The Gamma distribution is written as <math>X \sim~ Gamma (t, \lambda) </math><br />
<br />
:<math> F(x) = \int_{0}^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If you have t samples of the exponential distribution,<br><br />
<br> <math> \begin{align} X_1 \sim~ Exp(\lambda)\\ \vdots \\ X_t \sim~ Exp(\lambda) \end{align}<br />
</math><br />
<br />
The sum of these t samples has a gamma distribution,<br />
<br />
:<math> X_1+X_2+ ... + X_t \sim~ Gamma (t, \lambda) </math><br><br />
:<math> \sum_{i=1}^{t} X_i \sim~ Gamma (t, \lambda) </math> where <math>X_i \sim~Exp(\lambda)</math><br><br />
<br />
'''Method'''<br />
<br />
We can sample the exponential distribution using the inverse transform method from previous class,<br><br />
:<math>\,f(x)={\lambda}e^{-{\lambda}x}</math><br><br />
:<math>\,F^{-1}(u)=\frac{-ln(1-u)}{\lambda}</math><br><br />
:<math>\,F^{-1}(u)=\frac{-ln(u)}{\lambda}</math> <br />
(1 - u) has the same distribution as u, since <math>U \sim~ unif [0,1] </math><br><br />
:<math> \begin{align} x_1 &= \frac{-ln(u_1)}{\lambda}\\ \vdots \\ x_t &= \frac{-ln(u_t)}{\lambda} \end{align}<br />
:</math><br><br />
:<math> x = x_1 + x_2 + \dots + x_t = \frac {-\sum_{i=1}^{t} ln(u_i)}{\lambda}</math><br />
<br />
'''MATLAB code''' for a Gamma(3,1) is<br />
<br />
<pre><br />
x = sum(-log(rand(1000,3)),2); <br />
hist(x)<br />
</pre><br />
<br />
And the Histogram of X follows a Gamma distribution with long tail: <br />
<br />
[[File:Hist.PNG|center|500px]]<br />
<br />
We can improve the quality of the histogram by specifying the number of bins we want, as in hist(x, number_of_bins)<br />
<br />
<pre><br />
x = sum(-log(rand(20000,3)),2); <br />
hist(x,40)<br />
</pre><br />
<br />
[[File:untitled.jpg|center|500px]]<br />
<br />
''' R code''' for a Gamma(3,1) is<br />
<pre><br />
a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);<br />
hist(a);<br />
</pre><br />
And the histogram is <br />
<br />
[[File:hist_gamma.png|center|500px]]<br />
<br />
Here is another histogram of Gamma coding with R<br />
<pre><br />
a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);<br />
hist(a,freq=F);<br />
lines(density(a),col="blue");<br />
rug(jitter(a));<br />
</pre><br />
[[File:hist_gamma_2.png|center|500px]]<br />
<br />
====Sampling from Normal Distribution using Box-Muller Transform - September 29, 2011====<br />
<br />
=====Procedure=====<br />
<br />
# Generate <math>\displaystyle u_1</math> and <math>\displaystyle u_2</math>, two values sampled from a uniform distribution between 0 and 1.<br />
# Set <math>\displaystyle R^2 = -2log(u_1)</math> so that <math>\displaystyle R^2</math> is exponential with rate 1/2 (i.e. mean 2) <br> Set <math>\!\theta = 2*\pi*u_2</math> so that <math>\!\theta</math> ~ Unif[0, 2<math>\displaystyle\pi</math>]<br />
# Set <math>\displaystyle X = R cos(\theta)</math> <br> Set <math>\displaystyle Y = R sin(\theta)</math><br />
<br />
=====Justification=====<br />
<br />
Suppose we have X ~ N(0, 1) and Y ~ N(0, 1) where X and Y are independent normal random variables. The relative probability density function of these two random variables using Cartesian coordinates is:<br />
<br />
<math> f(X, Y) dxdy= f(X) f(Y) dxdy= \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \frac{1}{\sqrt{2\pi}}e^{-y^2/2} dxdy= \frac{1}{2\pi}e^{-(x^2+y^2)/2}dxdy </math> <br><br />
<br />
In polar coordinates <math>\displaystyle R^2 = x^2 + y^2</math>, so the relative probability density function of these two random variables using polar coordinates is:<br />
<br />
<math> f(R, \theta) = \frac{1}{2\pi}e^{-R^2/2} </math> <br><br />
<br />
If we have <math>\displaystyle R^2 \sim exp(1/2)</math> (exponential with rate 1/2) and <math>\!\theta \sim unif[0, 2\pi]</math>, we get an equivalent relative probability density function. Notice that for this two-dimensional change of variables, the determinant of the Jacobian must be included, where<br />
<br />
<math> |J|=\left|\frac{\partial(x,y)}{\partial(R,\theta)}\right|= \left|\begin{matrix}\frac{\partial x}{\partial R}&\frac{\partial x}{\partial \theta}\\\frac{\partial y}{\partial R}&\frac{\partial y}{\partial \theta}\end{matrix}\right|=R </math> <br><br />
<br />
<math> f(X, Y) dxdy = f(R, \theta)|J|dRd\theta = \frac{1}{2\pi}e^{-R^2/2}R dRd\theta= \frac{1}{4\pi}e^{-\frac{s}{2}} dSd\theta </math> <br>where <math> S=R^2. </math> <br><br />
<br />
Therefore we can generate a point in polar coordinates using the uniform and exponential distributions, then convert the point to Cartesian coordinates and the resulting X and Y values will be equivalent to samples generated from N(0, 1).<br />
<br />
'''MATLAB code'''<br />
<br />
In MatLab this algorithm can be implemented with the following code, which generates 20,000 samples from N(0, 1):<br />
<br />
<pre><br />
x = zeros(10000, 1);<br />
y = zeros(10000, 1);<br />
for ii = 1:10000<br />
u1 = rand;<br />
u2 = rand;<br />
R2 = -2 * log(u1);<br />
theta = 2 * pi * u2;<br />
x(ii) = sqrt(R2) * cos(theta);<br />
y(ii) = sqrt(R2) * sin(theta);<br />
end<br />
hist(x)<br />
</pre><br />
<br />
In one execution of this script, the following histogram for x was generated:<br />
<br />
[[File:Hist standard normal.jpg|center|500px]]<br />
<br />
=====Non-Standard Normal Distributions=====<br />
<br />
'''Example 1: Single-variate Normal'''<br />
<br />
If X ~ Norm(0, 1) then (a + bX) has a normal distribution with a mean of <math>\displaystyle a</math> and a standard deviation of <math>\displaystyle b</math> (which is equivalent to a variance of <math>\displaystyle b^2</math>). Using this information with the Box-Muller transform, we can generate values sampled from some random variable <math>\displaystyle Y\sim N(a,b^2) </math> for arbitrary values of <math>\displaystyle a,b</math>.<br />
<br />
# Generate a sample u from Norm(0, 1) using the Box-Muller transform.<br />
# Set v = a + bu.<br />
<br />
The values for v generated in this way will be equivalent to sample from a <math>\displaystyle N(a, b^2)</math>distribution. We can modify the MatLab code used in the last section to demonstrate this. We just need to add one line before we generate the histogram:<br />
<br />
<pre><br />
x = a + b * x;<br />
</pre><br />
<br />
For instance, this is the histogram generated when b = 15, a = 125:<br />
<br />
[[File:Hist normal.jpg|center|500px]]<br />
<br />
'''Example 2: Multi-variate Normal'''<br />
<br />
The Box-Muller method can be extended to higher dimensions to generate multivariate normals. The objects generated will be nx1 vectors, and their variance will be described by nxn covariance matrices.<br />
<br />
<math>\mathbf{z} = N(\mathbf{u}, \Sigma)</math> defines the n by 1 vector <math>\mathbf{z}</math> such that:<br />
<br />
* <math>\displaystyle u_i</math> is the average of <math>\displaystyle z_i</math><br />
* <math>\!\Sigma_{ii}</math> is the variance of <math>\displaystyle z_i</math><br />
* <math>\!\Sigma_{ij}</math> is the co-variance of <math>\displaystyle z_i</math> and <math>\displaystyle z_j</math><br />
<br />
If <math>\displaystyle z_1, z_2, ..., z_d</math> are normal variables with mean 0 and variance 1, then the vector <math>\displaystyle (z_1, z_2,..., z_d) </math> has mean 0 and variance <math>\!I</math>, where 0 is the zero vector and <math>\!I</math> is the identity matrix. This fact suggests that the method for generating a multivariate normal is to generate each component individually as single normal variables.<br />
<br />
The mean and the covariance matrix of a multivariate normal distribution can be adjusted in ways analogous to the single variable case. If <math>\mathbf{z} \sim N(0,I)</math>, then <math>\Sigma^{1/2}\mathbf{z}+\mu \sim N(\mu,\Sigma)</math>. Note here that the covariance matrix is symmetric and positive semidefinite, so its square root always exists.<br />
<br />
We can compute <math>\mathbf{z}</math> in the following way:<br />
<br />
# Generate an n by 1 vector <math>\mathbf{x} = \begin{bmatrix}x_{1} & x_{2} & ... & x_{n}\end{bmatrix}</math> where <math>x_{i}</math> ~ Norm(0, 1) using the Box-Muller transform.<br />
# Calculate <math>\!\Sigma^{1/2}</math> using singular value decomposition.<br />
# Set <math>\mathbf{z} = \Sigma^{1/2} \mathbf{x} + \mathbf{u}</math>.<br />
<br />
The following MatLab code provides an example, where a scatter plot of 10000 random points is generated. In this case x and y have a co-variance of 0.9 - a very strong positive correlation.<br />
<br />
<pre><br />
x = zeros(10000, 1);<br />
y = zeros(10000, 1);<br />
for ii = 1:10000<br />
u1 = rand;<br />
u2 = rand;<br />
R2 = -2 * log(u1);<br />
theta = 2 * pi * u2;<br />
x(ii) = sqrt(R2) * cos(theta);<br />
y(ii) = sqrt(R2) * sin(theta);<br />
end<br />
<br />
E = [1, 0.9; 0.9, 1];<br />
[u s v] = svd(E);<br />
root_E = u * (s ^ (1 / 2));<br />
<br />
z = (root_E * [x y]')';<br />
z(:,1) = z(:,1) + 5;<br />
z(:,2) = z(:,2) + -8;<br />
<br />
scatter(z(:,1), z(:,2))<br />
</pre><br />
<br />
This code generated the following scatter plot:<br />
<br />
[[File:scatter covar.jpg|center|500px]]<br />
<br />
In Matlab, we can also use the function "sqrtm()" or "chol()" (Cholesky Decomposition) to calculate the square root of a matrix directly. Note that the resulting root matrices may be different, but this does not materially affect the simulation.<br />
Here is an example:<br />
<br />
<pre><br />
E = [1, 0.9; 0.9, 1];<br />
r1 = sqrtm(E);<br />
r2 = chol(E);<br />
</pre><br />
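<br />
A small check, continuing the example above, that both factorizations are valid square roots of E, which is why the choice between them does not affect the simulation:<br />
<br />
<pre><br />
E = [1, 0.9; 0.9, 1];<br />
r1 = sqrtm(E);                  % symmetric square root: r1*r1 = E<br />
r2 = chol(E);                   % Cholesky factor (upper triangular): r2'*r2 = E<br />
norm(r1*r1 - E)                 % close to 0<br />
norm(r2'*r2 - E)                % close to 0<br />
% for sampling, use r1*x with sqrtm, or r2'*x with chol, so cov comes out as E<br />
</pre><br />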
<br />
R code for a multivariate normal distribution:<br />
<br />
<pre><br />
n=10000;<br />
r2<--2*log(runif(n));<br />
theta<-2*pi*(runif(n));<br />
x<-sqrt(r2)*cos(theta);<br />
<br />
y<-sqrt(r2)*sin(theta);<br />
a<-matrix(c(x,y),nrow=n,byrow=F);<br />
e<-matrix(c(1,.9,.9,1),nrow=2,byrow=T);<br />
svde<-svd(e);<br />
root_e<-svde$u %*% diag(sqrt(svde$d));<br />
z<-t(root_e %*%t(a));<br />
z[,1]=z[,1]+5;<br />
z[,2]=z[,2]+ -8;<br />
par(pch=19);<br />
plot(z,col=rgb(1,0,0,alpha=0.06))<br />
</pre><br />
<br />
[[File:m_normal.png|center|500px]]<br />
<br />
=====Remarks=====<br />
MATLAB's randn function uses the ziggurat method to generate normally distributed samples. It is an efficient rejection method based on covering the probability density function with a set of horizontal rectangles so as to obtain points within each rectangle. It is reported that an 800 MHz Pentium III laptop can generate over 10 million random numbers from a normal distribution in less than one second. ([http://www.mathworks.com/company/newsletters/news_notes/clevescorner/spring01_cleve.html Reference])<br />
<br />
===Sampling From Binomial Distributions===<br />
<br />
In order to generate a sample x from <math>\displaystyle X \sim Bin(n, p)</math>, we can follow the following procedure:<br />
<br />
1. Generate n uniform random numbers sampled from <math>\displaystyle Unif [0, 1] </math>: <math>\displaystyle u_1, u_2, ..., u_n</math>.<br />
<br />
2. Set x to be the number of indices <math>\displaystyle i</math>, <math>\displaystyle 1 \le i \le n</math>, for which <math>\displaystyle u_i \le p</math>.<br />
<br />
In MatLab this can be coded with a single line. The following generates a sample from <math>\displaystyle X \sim Bin(n, p)</math> <br />
<br />
>> sum(rand(n, 1) <= p, 1)<br />
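<br />
A quick sanity check of this one-liner (the values of n, p and the number of replicates below are arbitrary):<br />
<br />
<pre><br />
n = 20; p = 0.3;<br />
x = zeros(1, 5000);<br />
for ii = 1:5000<br />
    x(ii) = sum(rand(n, 1) <= p, 1);   % one Bin(n,p) sample per iteration<br />
end<br />
[mean(x) n*p]                          % sample mean should be close to n*p<br />
hist(x, 0:n)                           % histogram over the support 0..n<br />
</pre><br />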
<br />
==Bayesian Inference and Frequentist Inference - October 4, 2011==<br />
<br />
===Bayesian inference vs Frequentist inference===<br />
The Bayesian method has become popular in the last few decades as simulation and computer technology makes it more applicable. For more information about its history and application, please refer to http://en.wikipedia.org/wiki/Bayesian_inference.<br />
As for frequentists, please refer to http://en.wikipedia.org/wiki/Frequentist_inference.<br />
<br />
====Example====<br />
Consider: A person drinks a cup of coffee on a specific day.<br />
<br><br><br />
Frequentist: There is no explanation to this situation. It is essentially meaningless since it has only occurred once. Therefore, it is not a probability.<br />
<br><br />
Bayesian: Probability is not just about the frequent occurrences but it is what you believe about this probability.<br />
<br />
<br />
====Example of face identification====<br />
Take the face as the input x and the person as the output y. The person can be either Ali or Tom. If it is Ali, y=1; otherwise, y=0. We can divide the picture into 100*100 pixels and then stack them into a 10,000*1 column vector, which is x.<br />
<br />
If you are a frequentist, you would compare Pr(X=x|y=1) with Pr(X=x|y=0) and see which one is higher. But if you are a Bayesian, you would compare Pr(y=1|X=x) with Pr(y=0|X=x).<br />
<br />
====Summary of differences between two schools====<br />
*Frequentist: Probability refers to limiting relative frequency. (objective)<br />
*Bayesian: Probability describes degree of belief not frequency. (subjective)<br />
e.g. The probability that you drank a cup of tea on May 20, 2001 is 0.62 does not refer to any frequency.<br />
----<br />
*Frequentist: Parameters are fixed, unknown constants.<br />
*Bayesian: Parameters are random variables and we can make probabilistic statement about them.<br />
----<br />
*Frequentist: Statistical procedures should have long run frequency probabilities.<br />
e.g. a 95% confidence interval should contain the true value of the parameter in at least 95% of repeated samples, in the long run<br />
*Bayesian: It makes inferences about <math>\theta</math> by producing a probability distribution for <math>\theta</math>. Inferences (e.g. point estimates) are extracted from this distribution.<br />
<br />
====Bayesian inference====<br />
<br />
Bayesian inference is usually carried out in the following way:<br />
<br />
1. Choose a prior probability density function of <math>\!\theta</math> which is <math>f(\!\theta)</math>. This is our belief about <math>\theta</math> before we see any data.<br />
<br />
2. Choose a statistical model <math>\displaystyle f(x|\theta)</math> that reflects our beliefs about X.<br />
<br />
3. After observing data <math>\displaystyle x_1,...,x_n</math>, we update our beliefs and calculate the posterior probability.<br />
<br />
<math>f(\theta|x) = \frac{f(\theta,x)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{\int^{}_\theta f(x|\theta) \cdot f(\theta) d\theta}</math>, where <math>\displaystyle f(\theta|x)</math> is the posterior probability, <math>\displaystyle f(\theta)</math> is the prior probability, <math>\displaystyle f(x|\theta)</math> is the likelihood of observing X=x given <math>\!\theta</math> and f(x) is the marginal probability of X=x.<br />
<br />
If we have i.i.d. observations <math>\displaystyle x_1,...,x_n</math>, we can replace <math>\displaystyle f(x|\theta)</math> with <math>f({x_1,...,x_n}|\theta)=\prod_{i=1}^n f(x_i|\theta)</math> by independence.<br />
<br />
We denote <math>\displaystyle f({x_1,...,x_n}|\theta)</math> as <math>\displaystyle L_n(\theta)</math> which is called likelihood. And we use <math>\displaystyle x^n</math> to denote <math>\displaystyle (x_1,...,x_n)</math>.<br />
<br />
<math>f(\theta|x^n) = \frac{f(x^n|\theta) \cdot f(\theta)}{f(x^n)}=\frac{f(x^n|\theta) \cdot f(\theta)}{\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta}</math> , where <math>\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta</math> is a constant <math>\displaystyle c_n</math>. So <math>f(\theta|x^n) \propto f(x^n|\theta) \cdot f(\theta)</math>. The posterior probability is proportional to the likelihood times prior probability.<br />
<br />
<math>E(\theta)=\int^{}_\theta \theta \cdot f(\theta|x^n) d\theta</math> which is the posterior mean of <math>\!\theta</math>.<br />
<br />
Let <math>\tilde{\theta}=(\theta_1,...,\theta_d)^T</math>, then <math>f(\theta_1|x^n) = \int^{} \int^{} \dots \int^{}f(\theta|X)d\theta_2d\theta_3 \dots d\theta_d </math> and <math>E(\theta_1)=\int^{}\theta_1 \cdot f(\theta_1|x^n) d\theta_1</math><br />
<br />
====Example 1: Estimating parameters of a univariate Gaussian distribution====<br />
<br />
Suppose X follows a univariate Gaussian distribution (i.e. a Normal distribution) with parameters <math>\!\mu</math> and <br />
<math>\displaystyle {\sigma^2}</math>.<br />
<br />
(a) For Frequentists:<br />
<br />
<math>f(x|\theta)= \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}</math><br />
<br />
<math>L_n(\theta)= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2}</math><br />
<br />
<br />
<math>\ln L_n(\theta) = l(\theta) = \sum_{i=1}^n -\frac{1}{2}\ln 2\pi-\ln \sigma-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2</math><br />
<br />
To get the maximum likelihood estimator of <math>\!\mu</math> (mle), we find the <math>\hat{\mu}</math> which maximizes <math>\displaystyle L_n(\theta)</math>:<br />
<br />
<math>\frac{\partial l(\theta)}{\partial \mu}= \sum_{i=1}^n \frac{1}{\sigma}(\frac{x_i-\mu}{\sigma})=0 \Rightarrow \sum_{i=1}^n x_i = n\mu \Rightarrow \hat{\mu}_{mle}=\bar{x}</math><br />
<br />
(b) For Bayesians:<br />
<br />
<math>f(\theta|x) \propto f(x|\theta) \cdot f(\theta)</math><br />
<br />
We assume that the mean of the above normal distribution is itself distributed normally with mean <math>\!\mu_0</math> and variance <math>\!\Gamma^2</math>.<br />
<br />
Suppose <math>\!\mu\sim N(\mu_0, \!\Gamma^2</math>),<br />
<br />
so <math>f(\mu) = \frac{1}{\sqrt{2\pi}\Gamma} \cdot e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\Gamma})^2}</math><br />
<br />
<math>f(\mu|x) = \frac{1}{\sqrt{2\pi}\tilde{\sigma}} \cdot e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
<br />
<math>\tilde{\mu} = \frac{\frac{n}{\sigma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\bar{x}+\frac{\frac{1}{\Gamma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\mu_0</math>, where <math>\tilde{\mu}</math> is the estimator of <math>\!\mu</math>.<br />
<br />
* If prior belief about <math>\!\mu_0</math> is strong, then <math>\!\Gamma</math> is small and <math>\frac{1}{\Gamma^2}</math> is large. <math>\tilde{\mu}</math> is close to <math>\!\mu_0</math> and the observations will not affect too much. On the contrary, if prior belief about <math>\!\mu_0</math> is weak, <math>\!\Gamma</math> is large and <math>\frac{1}{\Gamma^2}</math> is small. <math>\tilde{\mu}</math> depends more on observations.(This is intuitive, when our original belief is reliable, then the sample is not important in improving the result; when the belief is not reliable, then we depend a lot on the sample.)<br />
<br />
* When the sample is large (i.e. n <math>\to \infty</math>), <math>\tilde{\mu} \to \bar{x}</math> and the impact of prior belief about <math>\!\mu</math> is weakened.<br />
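<br />
A minimal numerical illustration of the posterior mean formula above; all of the prior settings and data below are made up for illustration, and <math>\!\sigma</math> is assumed known:<br />
<br />
<pre><br />
mu0 = 0; Gamma = 2;                 % prior: mu ~ N(mu0, Gamma^2)<br />
sigma = 1;                          % known data standard deviation<br />
mu_true = 3;<br />
n = 10;<br />
x = mu_true + sigma*randn(n, 1);    % observed sample<br />
xbar = mean(x);<br />
w = (n/sigma^2) / (n/sigma^2 + 1/Gamma^2);<br />
mu_tilde = w*xbar + (1-w)*mu0       % posterior mean: between xbar and mu0<br />
</pre><br />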
<br />
=='''Basic Monte Carlo Integration - October 6th, 2011'''==<br />
<br />
Three integration methods would be taught in this course:<br />
*Basic Monte Carlo Integration<br />
*Importance Sampling<br />
*Markov Chain Monte Carlo (MCMC)<br />
<br />
The first, and most basic, method of numerical integration we will see is Monte Carlo Integration. We use this to solve an integral of the form: <math> I = \int_{a}^{b} h(x) dx </math><br />
<br />
Note the following derivation: <br />
<br />
<math>\begin{align}<br />
\displaystyle I & = \int_{a}^{b} h(x)dx \\<br />
& = \int_{a}^{b} h(x)((b-a)/(b-a))dx \\<br />
& = \int_{a}^{b} (h(x)(b-a))(1/(b-a))dx \\<br />
& = \int_{a}^{b} w(x)f(x)dx \\<br />
& = E[w(x)] \\<br />
\end{align}<br />
</math><br />
<br />
which can be estimated from n samples as <math> \hat{I} \approx \frac{1}{n} \sum_{i=1}^{n} w(x_i) </math><br />
<br />
where w(x) = h(x)(b-a) and f(x) is the probability density function of a uniform random variable on the interval [a,b]. The expectation of w is taken with respect to the distribution f, and is approximated by the average of w over n samples of x drawn from f.<br />
<br />
<br />
===='''General Procedure'''====<br />
<br />
i) Draw n samples <math> x_i \sim~ U[a,b] </math><br />
<br />
ii) Compute <math> \ w(x_i) </math> for every sample<br />
<br />
iii) Obtain an estimate of the integral, <math> \hat{I} </math>, as follows:<br />
<br />
<math> \hat{I} = \frac{1}{n} \sum_{i=1}^{n} w(x_i)</math> . Clearly, this is just the average of the simulation results.<br />
<br />
By the strong law of large numbers <math> \hat{I} </math> converges to <math> \ I </math> as <math> \ n \rightarrow \infty </math>. Because of this, we can compute all sorts of useful information, such as variance, standard error, and confidence intervals.<br />
<br />
Standard Error: <math> SE = \frac{\sqrt{V}}{\sqrt{n}} </math><br />
<br />
Variance: <math> V = \frac{\sum_{i=1}^{n} (w(x_i)-\hat{I})^2}{n-1} </math><br />
<br />
Confidence Interval: <math> \hat{I} \pm t_{(\alpha/2)} SE </math><br />
<br />
==='''Example: Uniform Distribution'''===<br />
<br />
Consider the integral, <math> \int_{0}^{1} x^3dx </math>, which is easily solved through standard analytical integration methods, and is equal to .25. Now, let us check this answer with a numerical approximation using Monte Carlo Integration. <br />
<br />
We generate a 1 by 10000 vector of uniform (on the interval [0,1]) random variables and call that vector 'u'. We see that our 'w' in this case is <math> x^3 </math>, so we set <math> w = u^3 </math>. Our <math>\hat{I}</math> is equal to the mean of w.<br />
<br />
In Matlab, we can solve this integration problem with the following code:<br />
<br />
<pre><br />
u = rand(1,10000);<br />
w = u.^3;<br />
mean(w)<br />
ans = 0.2475<br />
</pre><br />
<br />
Note the '.' after 'u' in the second line of code, indicating that each entry in the matrix is cubed. Also, our approximation is close to the actual value of .25. Now let's try to get an even better approximation by generating more sample points. <br />
<br />
<pre><br />
u= rand(1,100000);<br />
w= u.^3;<br />
mean(w)<br />
ans = .2503<br />
</pre><br />
<br />
We see that when the number of sample points is increased, our approximation improves, as one would expect.<br />
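<br />
Before moving on, here is a minimal sketch of the standard error and confidence interval formulas from the general procedure, applied to this same example (1.96 is used as a large-sample approximation to <math>t_{(\alpha/2)}</math> for a 95% interval):<br />
<br />
<pre><br />
u = rand(1,10000);<br />
w = u.^3;<br />
I_hat = mean(w);<br />
SE = std(w) / sqrt(length(w));            % standard error of the estimate<br />
CI = [I_hat - 1.96*SE, I_hat + 1.96*SE]   % approximate 95% confidence interval<br />
</pre><br />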
<br />
==='''Generalization'''===<br />
<br />
Up to this point we have seen how to numerically approximate an integral when the distribution of f is uniform. Now we will see how to generalize this to other distributions.<br />
<br />
<math> I = \int h(x)f(x)dx </math> <br />
<br />
If f is a distribution function (pdf), then <math> I </math> can be estimated as E<sub>f</sub>[h(x)]. This means taking the expectation of h with respect to the distribution of f. Our previous example is the case where f is the uniform distribution between [a,b].<br />
<br />
'''Procedure for the General Case'''<br />
<br />
i) Draw n samples from f <br />
<br />
ii) Compute h(x<sub>i</sub>)<br />
<br />
iii) <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)</math><br />
<br />
==='''Example: Exponential Distribution'''===<br />
<br />
Find <math> E[\sqrt{x}] </math> for <math> \displaystyle f = e^{-x} </math>, which is the exponential distribution with mean 1.<br />
<br />
<math> I = \int_{0}^{\infty} \sqrt{x} e^{-x}dx </math><br />
<br />
We can see that we must draw samples from f, the exponential distribution.<br />
<br />
To find a numerical solution using Monte Carlo Integration we see that: <br />
<br />
u= rand(1,10000)<br />
X= -log(u)<br />
h= <math> \sqrt{x} </math> <br />
I= mean(h)<br />
<br />
To implement this procedure in Matlab, use the following code:<br />
<br />
<pre><br />
u = rand(1,10000);<br />
X = -log(u);<br />
h = X.^.5;<br />
mean(h)<br />
ans = .8841<br />
</pre><br />
<br />
An easy way to check whether your approximation is correct is to use the built in Matlab function 'quadl' which takes a function and bounds for the integral and returns a solution for the definite integral of that function. For this specific example, we can enter:<br />
<br />
<pre><br />
f = @(x) sqrt(x).*exp(-x);<br />
% quadl runs into computational problems when the upper bound is "inf" or an extremely large number, <br />
% so choose just a moderately large number.<br />
quadl(f,0,100)<br />
ans =<br />
0.8862<br />
</pre><br />
<br />
From the above result, we see that our approximation was quite close.<br />
<br />
==='''Example: Normal Distribution'''===<br />
<br />
Let <math> f(x) = (1/(2 \pi)^{1/2}) e^{(-x^2)/2} </math>. Compute the cumulative distribution function at some point x.<br />
<br />
<math> F(x)= \int_{-\infty}^{x} f(s)ds = \int_{-\infty}^{x}(1)(1/(2 \pi)^{1/2}) e^{(-s^2)/2}ds </math>. The (1) is inserted to illustrate that our h(x) will be the constant function 1, and our f(x) is the normal distribution. To take into account the upper bound of integration, x, any values sampled that are greater than x will be set to zero. <br />
<br />
This is the Matlab code for solving F(2):<br />
<br />
<pre><br />
<br />
u = randn(1,10000);<br />
h = u < 2;<br />
mean(h)<br />
ans = .9756<br />
<br />
</pre><br />
<br />
We generate a 1 by 10000 vector of standard normal random variables and we return a value of 1 if u is less than 2, and 0 otherwise.<br />
<br />
We can also build the function F(x) in matlab in the following way:<br />
<br />
<pre><br />
function F(x)<br />
u=randn(1,1000000);<br />
h=u<x;<br />
mean(h)<br />
</pre><br />
<br />
<br />
==='''Example: Binomial Distribution'''===<br />
<br />
In this example we will see the Bayesian Inference for 2 Binomial Distributions.<br />
<br />
Let <math> X \sim Bin(n,p) </math> and <math> Y \sim Bin(m,q) </math>, and let <math> \!\delta = p-q </math>.<br />
<br />
Therefore, <math> \displaystyle \!\hat{\delta} = x/n - y/m </math>, which is the frequentist estimate.<br />
<br />
Bayesian wants <math> \displaystyle f(p,q|x,y) = f(x,y|p,q)f(p,q)/f(x,y) </math>, where <math> f(x,y)=\iint\limits_{\!\theta} f(x,y|p,q)f(p,q)\,dp\,dq</math> is a constant.<br />
<br />
Thus, <math> \displaystyle f(p,q|x,y)\propto f(x,y|p,q)f(p,q) </math>. Now we assume that <math>\displaystyle f(p,q) = f(p)f(q) = 1 </math> and f(p) and f(q) are uniform.<br />
<br />
Therefore, <math> \displaystyle f(p,q|x,y)\propto p^x(1-p)^{n-x}q^y(1-q)^{m-y} </math>.<br />
<br />
<math> E[\delta] = \int_{0}^{1} \int_{0}^{1} (p-q)f(p,q|x,y)\,dp\,dq </math>.<br />
<br />
As you can see this is much tougher than the frequentist approach.<br />
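<br />
As a sketch of how this expectation could still be computed (this goes slightly beyond the lecture): with the uniform priors above, the posterior factors into independent Beta densities, <math>p|x \sim Beta(x+1, n-x+1)</math> and <math>q|y \sim Beta(y+1, m-y+1)</math>, so <math>E[\delta]</math> can be estimated by Monte Carlo using betarnd from MATLAB's Statistics Toolbox; the observed counts below are made up:<br />
<br />
<pre><br />
n = 100; m = 100; x = 50; y = 80;           % made-up observed counts<br />
p = betarnd(x+1, n-x+1, 10000, 1);          % posterior draws of p<br />
q = betarnd(y+1, m-y+1, 10000, 1);          % posterior draws of q<br />
delta = p - q;<br />
mean(delta)                                 % Monte Carlo estimate of E[delta]<br />
</pre><br />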
<br />
=='''Importance Sampling and Basic Monte Carlo Integration - October 11th, 2011'''==<br />
<br />
==='''Example: Binomial Distribution (Continued)'''===<br />
<br />
Suppose we are given two independent Binomial Distributions <math>\displaystyle X \sim Bin(n, p_1)</math>, <math>\displaystyle Y \sim Bin(m, p_2)</math>. We would like to give an Monte Carlo estimate of <math>\displaystyle \delta = p_1 - p_2</math><br><br />
<br />
Frequentist approach: <br><br><math>\displaystyle \hat{p_1} = \frac{X}{n}</math> ; <math>\displaystyle \hat{p_2} = \frac{Y}{m}</math><br><br><math>\displaystyle \hat{\delta} = \hat{p_1} - \hat{p_2} = \frac{X}{n} - \frac{Y}{m}</math><br><br><br />
Bayesian approach to compute the expected value of <math>\displaystyle \delta</math>:<br><br><br />
<math>\displaystyle E(\delta) = \int\int(p_1-p_2) f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Assume that <math>\displaystyle n = 100, m = 100, p_1 = 0.5, p_2 = 0.8</math> and the sample size is 1000.<br><br />
MATLAB code of the above example:<br />
<pre><br />
n = 100;<br />
m = 100;<br />
p_1 = 0.5;<br />
p_2 = 0.8;<br />
p1 = mean(rand(n,1000)<p_1);<br />
p2 = mean(rand(m,1000)<p_2);<br />
delta = p2 - p1;<br />
hist(delta)<br />
mean(delta)<br />
</pre><br />
<br />
In one execution of the code, the mean of delta was 0.3017. The histogram of delta generated was:<br />
[[File:Hist delta.jpg|center|]]<br />
<br />
Through Monte Carlo simulation, we can obtain an empirical distribution of delta and carry out inference on the data obtained, such as computing the mean, maximum, variance, standard deviation and the standard error of delta.<br />
<br />
==='''Importance Sampling'''===<br />
<br />
====Motivation====<br />
<br />
Consider the integral <math>\displaystyle I = \int h(x)f(x)\,dx</math><br><br><br />
According to basic Monte Carlo Integration, if we can sample from the probability density function <math>\displaystyle f(x)</math> and feed the samples of <math>\displaystyle f(x)</math> back to <math>\displaystyle h(x)</math>, <math>\displaystyle I</math> can be estimated as an average of <math>\displaystyle h(x)</math> ( i.e. <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)</math> )<br><br />
However, the Monte Carlo method works when we know how to sample from <math>\displaystyle f(x)</math>. In the case where it is difficult to sample from <math>\displaystyle f(x)</math>, importance sampling is a technique that we can apply. Importance Sampling relies on another function <math>\displaystyle g(x)</math> which we know how to sample from.<br />
<br />
The above integral can be rewritten as follow:<br><br />
<math>\begin{align}<br />
\displaystyle I & = \int h(x)f(x)\,dx \\<br />
& = \int h(x)f(x)\frac{g(x)}{g(x)}\,dx \\<br />
& = \int \frac{h(x)f(x)}{g(x)}g(x)\,dx \\<br />
& = \int y(x)g(x)\,dx \\<br />
& = E_g(y(x)) \\<br />
\end{align}<br />
</math><br><br />
<math>where \ y(x) = \frac{h(x)f(x)}{g(x)}</math><br><br />
<br />
The integral can thus be simulated as <math>\displaystyle \hat{I} = \frac{1}{n} \sum_{i=1}^{n} Y_i \ , \ where \ Y_i = \frac{h(x_i)f(x_i)}{g(x_i)}</math><br><br />
<br />
====Procedure====<br />
<br />
Suppose we know how to sample from <math>\displaystyle g(x)</math><br><br />
#Choose a suitable <math>\displaystyle g(x)</math> and draw n samples <math>x_1,x_2....,x_n \sim~ g(x)</math><br />
#Set <math>Y_i =\frac{h(x_i)f(x_i)}{g(x_i)}</math><br />
#Compute <math> \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i </math><br><br />
<br />
By the Law of large numbers, <math>\displaystyle \hat{I} \rightarrow I </math> provided that the sample size n is large enough.<br><br><br />
<br />
'''Remarks:''' One can think of <math>\frac{f(x)}{g(x)}</math> as a weight to <math>\displaystyle h(x)</math> in the computation of <math>\hat{I}</math><br><br><br />
<math>\displaystyle i.e. \ \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n}\sum_{i=1}^{n} (\frac{f(x_i)}{g(x_i)})h(x_i)</math><br><br><br />
Therefore, <math>\displaystyle \hat{I} </math> is a weighted average of <math>\displaystyle h(x_i)</math><br><br><br />
<br />
====Problem====<br />
<br />
If <math>\displaystyle g(x)</math> is not chosen appropriately, then the variance of the estimate <math>\hat{I}</math> may be very large. Here we face a problem similar to the one in the Acceptance/Rejection approach. Consider the second moment of the estimator:<br><br><br />
<math>\begin{align}<br />
\displaystyle E_g[(y(x))^2] & = \int (y(x))^2 g(x) dx \\<br />
& = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx \\<br />
& = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx \\<br />
& = \int \frac{h^2(x)f^2(x)}{g(x)} dx \\<br />
\end{align}<br />
</math><br><br><br />
<br />
When <math>\displaystyle g(x)</math> is very small, then the above integral could be very large, hence the variance can be very large when g is not chosen appropriately. This occurs when <math>\displaystyle g(x)</math> has a thinner tail than <math>\displaystyle f(x)</math> such that the quantity <math>\displaystyle \frac{h^2(x)f^2(x)}{g(x)}</math> is large.<br />
<br />
'''Remarks:''' <br />
<br />
1. We can actually compute the form of <math>\displaystyle g(x)</math> to have optimal variance. <br>Mathematically, it is to find <math>\displaystyle g(x)</math> subject to <math>\displaystyle \min_g [\ E_g([y(x)]^2) - (E_g[y(x)])^2\ ]</math><br><br />
It can be shown that the optimal <math>\displaystyle g(x)</math> is <math>\displaystyle \frac{|h(x)|f(x)}{\int_{-\infty}^{\infty}|h(s)|f(s)ds}</math>. Using this optimal <math>\displaystyle g(x)</math> minimizes the variance of the Importance Sampling estimate. However, this result is of theoretical interest only: to write down the optimal <math>\displaystyle g(x)</math> explicitly we would need the value of the integral in the denominator, which is essentially the quantity we set out to estimate in the first place.<br />
<br />
2. In practice, we shall choose a <math>\displaystyle g(x)</math> which has a similar shape to <math>\displaystyle f(x)</math> but a thicker tail than <math>\displaystyle f(x)</math>, in order to avoid the problem mentioned above.<br><br />
<br />
====Example====<br />
<br />
Estimate <math>\displaystyle I = Pr(Z>3),\ where\ Z \sim N(0,1)</math><br><br><br />
'''Method 1: Basic Monte Carlo'''<br />
<br />
<math>\begin{align} Pr(Z>3) & = \int^\infty_3 f(x)\,dx \\<br />
& = \int^\infty_{-\infty} h(x)f(x)\,dx \end{align}</math><br /><br />
<math> where \ <br />
h(x) = \begin{cases}<br />
0, & \text{if } x \le 3 \\<br />
1, & \text{if } x > 3<br />
\end{cases}</math><br />
<math>\ ,\ f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}</math><br />
<br />
MATLAB code to compute <math>\displaystyle I</math> from 100 samples of standard normal distribution:<br />
<pre><br />
h = randn(100,1) > 3;<br />
I = mean(h)<br />
</pre><br />
<br />
In one execution of the code, it returns a value of 0 for <math>\displaystyle I</math>, which differs significantly from the true value of <math>\displaystyle I \approx 0.0013 </math>. The problem of using Basic Monte Carlo in this example is that <math>\displaystyle Pr(Z>3)</math> has a small value, and hence many points sampled from the standard normal distribution will be wasted. Therefore, although Basic Monte Carlo is a feasible method to compute <math>\displaystyle I</math>, it gives a poor estimation.<br />
<br />
'''Method 2: Importance Sampling'''<br />
<br />
<math>\displaystyle I = Pr(Z>3)= \int^\infty_3 f(x)\,dx </math><br><br />
<br />
To apply importance sampling, we have to choose a <math>\displaystyle g(x)</math> from which we know how to sample. In this example, we could choose <math>\displaystyle g(x)</math> to be the probability density function of an exponential distribution, of a normal distribution with mean 0 and variance greater than 1, or of a normal distribution with mean greater than 0 and variance 1, etc. In the following, we take <math>\displaystyle g(x)</math> to be the pdf of <math>\displaystyle N(4,1)</math>.<br><br />
<br />
Procedure:<br />
#Draw n samples <math>x_1,x_2....,x_n \sim~ g(x)</math><br />
#Calculate <math>\begin{align} \frac{f(x)}{g(x)} & = \frac{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}<br />
}{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x-4)^2} } \\<br />
& = e^{8-4x} \end{align} </math><br><br />
#Set <math> Y_i = h(x_i)e^{8-4x_i}\ with\ h(x) = \begin{cases}<br />
0, & \text{if } x \le 3 \\<br />
1, & \text{if } x > 3<br />
\end{cases}<br />
</math><br><br />
#Compute <math> \hat{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i </math><br><br />
<br />
The above procedure, using 100 samples of <math>\displaystyle g(x)</math>, can be implemented in MATLAB as follows:<br />
<pre><br />
for ii = 1:100<br />
x = randn + 4 ;<br />
h = x > 3 ;<br />
y(ii) = h * exp(8-4*x) ;<br />
end<br />
mean(y)<br />
</pre><br />
<br />
In one execution of the code, it returns a value of 0.001271 for <math> \hat{Y} </math>, which is much closer to the true value of <math>\displaystyle I \approx 0.0013 </math>. Over many executions of the code, the variance of the Basic Monte Carlo estimate is approximately 150 times that of the Importance Sampling estimate. This demonstrates that Importance Sampling can provide a much better estimate than Basic Monte Carlo.<br />
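The variance comparison quoted above can be checked empirically. Below is a minimal sketch (the 500 repetitions and the sample size of 100 per estimate are arbitrary choices) that repeats both estimators many times and compares the variances of the resulting estimates; the ratio should come out at roughly 150.<br />
<pre><br />
m = 500;            % number of repetitions (arbitrary choice)<br />
n = 100;            % samples per estimate<br />
I_mc = zeros(m,1);  % Basic Monte Carlo estimates<br />
I_is = zeros(m,1);  % Importance Sampling estimates<br />
for jj = 1:m<br />
    % Basic Monte Carlo: sample from N(0,1) directly<br />
    z = randn(n,1);<br />
    I_mc(jj) = mean(z > 3);<br />
    % Importance Sampling: sample from g = N(4,1) and reweight<br />
    x = randn(n,1) + 4;<br />
    I_is(jj) = mean((x > 3) .* exp(8 - 4*x));<br />
end<br />
var(I_mc) / var(I_is)   % ratio of the two Monte Carlo variances<br />
</pre><br />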
<br />
==''' Importance Sampling with Normalized Weight and Markov Chain Monte Carlo - October 13th, 2011'''==<br />
==='''Importance Sampling with Normalized Weight'''===<br />
<br />
Recall that we can think of <math>\displaystyle b(x) = \frac{f(x)}{g(x)}</math> as a weight applied to the samples <math>\displaystyle h(x)</math>. If the form of <math>\displaystyle f(x)</math> is known only up to a constant, we can use an alternate, normalized form of the weight, <math>\displaystyle b^*(x)</math>. (This situation arises in Bayesian inference.) Importance sampling with normalized or standard weight is also called indirect importance sampling.<br />
<br />
We derive the normalized weight as follows:<br><br />
<math>\begin{align}<br />
\displaystyle I & = \int h(x)f(x)\,dx \\<br />
&= \int h(x)\frac{f(x)}{g(x)}g(x)\,dx \\<br />
&= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int f(x) dx} \\<br />
&= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int \frac{f(x)}{g(x)}g(x) dx} \\<br />
&= \frac{\int h(x)b(x)g(x)\,dx}{\int\ b(x)g(x) dx} <br />
\end{align}</math><br />
<br />
<math>\hat{I}= \frac{\sum_{i=1}^{n} h(x_i)b(x_i)}{\sum_{i=1}^{n} b(x_i)} </math><br />
<br />
Then, the normalized weight is <math>b^*(x_i) = \displaystyle \frac{b(x_i)}{\sum_{j=1}^{n} b(x_j)}</math><br />
<br />
Note that <math> \int f(x)\, dx = \int b(x)g(x)\, dx = 1 </math><br />
<br />
We can also determine the associated Monte Carlo variance of this estimate by<br />
<br />
<math> Var(\hat{I})= \frac{\sum_{i=1}^{n} b(x_i)(h(x_i) - \hat{I})^2}{\sum_{i=1}^{n} b(x_i)} </math><br />
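As a brief hedged sketch of how the normalized weights are used, suppose we again want <math>\displaystyle Pr(Z>3)</math> but only know <math>\displaystyle f(x)</math> up to a constant, say <math>\displaystyle f(x) \propto e^{-\frac{1}{2}x^2}</math> (the standard normal with its normalizing constant dropped), and we sample from <math>\displaystyle g(x)</math>, the pdf of <math>\displaystyle N(4,1)</math>. The unknown constant cancels in <math>b^*(x_i)</math>:<br />
<pre><br />
n = 10000;<br />
x = randn(n,1) + 4;                                 % samples from g = N(4,1)<br />
h = x > 3;<br />
% f is known only up to a constant: f(x) proportional to exp(-x^2/2)<br />
b = exp(-x.^2/2) ./ (exp(-(x-4).^2/2)/sqrt(2*pi));  % unnormalized weights b(x) = f(x)/g(x)<br />
I_hat = sum(h .* b) / sum(b)                        % normalized importance sampling estimate<br />
</pre><br />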
<br />
==='''Markov Chain Monte Carlo'''===<br />
We still want to solve <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
====Stochastic Process====<br />
A stochastic process <math> \{ x_t : t \in T \}</math> is a collection of random variables. The variables <math>\displaystyle x_t</math> take values in some set <math>\displaystyle X</math> called the '''state space'''. The set <math>\displaystyle T</math> is called the '''index set'''.<br />
<br />
====Markov Chain====<br />
A Markov Chain is a stochastic process for which the distribution of <math>\displaystyle x_t</math> depends only on <math>\displaystyle x_{t-1}</math>. It is a random process characterized as being memoryless: the next state depends only on the current state and not on the sequence of states that preceded it. <br />
Formal Definition: The process <math> \{ x_t : t \in T \}</math> is a Markov Chain if <math>\displaystyle Pr(x_t|x_0, x_1,..., x_{t-1})= Pr(x_t|x_{t-1})</math> for all <math> \{t \in T \}</math> and for all <math> \{x \in X \}</math><br />
For a Markov Chain, <math>\displaystyle f(x_1,...x_n)= f(x_1)f(x_2|x_1)f(x_3|x_2)...f(x_n|x_{n-1})</math><br />
<br><br>Real Life Example:<br />
<br>When going for an interview, the employer only looks at your highest education achieved. The employer does not look at your earlier education (elementary school, high school, etc.) because the highest level achieved is taken to summarize everything before it. In other words, <math> x_t </math> is regarded as a summary of <math>x_{t-1},...,x_2,x_1</math>, so when we need to determine <math>x_{t+1}</math>, we only need to pay attention to <math>x_{t}</math>.<br />
<br />
====Transition Probabilities====<br />
A Transition Probability is the probability of jumping from one state to another state.<br />
Formal Definition: We call <math>\displaystyle P_{ij} = Pr(x_{t+1}=j|x_t=i)</math> the transition probability.<br />
That is, P(i,j) is the probability of going to state j from state i. The matrix P whose (i,j) element is <math>\displaystyle P_{ij}</math> is called the Transition Matrix.<br />
<br />
Properties of P: <br />
:1) <math>\displaystyle P_{ij} \geq 0</math> The probability of going from one state to another cannot be negative<br />
:2) <math>\displaystyle \sum_{j}P_{ij} = 1</math> Starting from state i, the chain must go to some state (possibly remaining in state i), so each row of P sums to 1<br />
<br />
====Random Walk====<br />
Example: Start at one point and flip a coin where <math>\displaystyle Pr(H)=p</math> and <math>\displaystyle Pr(T)=1-p=q</math>. Take one step right if heads and one step left if tails. If at an endpoint, stay there.<br />
The transition matrix is<br />
<math>P=\left(\begin{matrix}1&0&0&\dots&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\<br />
\vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&\dots&\dots&1<br />
\end{matrix}\right)</math><br />
<br />
Let <math>\displaystyle P_n</math> be the matrix whose (i,j) element is <math>\displaystyle P_{ij}(n)</math>, the probability of moving from state i to state j in n steps. This is called the n-step transition probability.<br />
<br />
:<math>\displaystyle P_n = P^n</math><br />
:<math>\displaystyle P_1 = P</math><br />
:<math>\displaystyle P_2 = P^2</math><br />
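As a quick illustration, the n-step probabilities of the random walk above can be computed directly in Matlab; the 5-state chain and <math>\displaystyle p=0.5</math> below are arbitrary choices for the sketch.<br />
<pre><br />
p = 0.5; q = 1 - p;<br />
% random walk on 5 states with absorbing endpoints<br />
P = [1 0 0 0 0;<br />
     q 0 p 0 0;<br />
     0 q 0 p 0;<br />
     0 0 q 0 p;<br />
     0 0 0 0 1];<br />
P2  = P^2;    % two-step transition probabilities P_2<br />
P10 = P^10;   % ten-step transition probabilities P_10<br />
sum(P10, 2)   % each row still sums to 1<br />
</pre><br />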
<br />
<br />
==''' Markov Chain Properties and Page Rank - October 18th, 2011'''==<br />
<br />
===Summary of Terminology===<br />
<br />
====Transition Matrix====<br />
<br />
A matrix <math>\!P</math> that defines a Markov Chain has the form:<br />
<br />
<math>P = \begin{bmatrix}<br />
P_{11} & \cdots & P_{1N} \\<br />
\vdots & \ddots & \vdots \\ <br />
P_{N1} & \cdots & P_{NN}<br />
\end{bmatrix}</math><br />
<br />
where <math>\!P(i,j) = P_{ij} = Pr(x_{t+1} = j | x_t = i) </math> is the probability of transitioning from state i to state j in the Markov Chain in a single step. Note that this implies that all rows add up to one.<br />
<br />
====n-step Transition matrix====<br />
<br />
A matrix <math>\!P_n</math> whose (i,j)<sup>th</sup> entry is the probability of moving from state i to state j after n transitions:<br />
<br />
<math>\!P_n(i,j) = Pr(x_{m+n}=j|x_m = i)</math><br />
<br />
This probability is called the n-step transition probability. A nice property of this matrix is that<br />
<br />
<math>\!P_n = P^n</math><br />
<br />
for all <math>n \ge 0</math>, where P is the transition matrix. Note that the rows of <math>P_n</math> should still add up to one.<br />
<br />
====Marginal distribution of a Markov Chain====<br />
<br />
We represent the state at time t as a vector.<br />
<br />
<math>\mu_t = (\mu_t(1) \; \mu_t(2) \; ... \; \mu_t(n))</math><br />
<br />
Consider this Markov Chain:<br />
<br />
[[File:MarkovSample.png|300px]]<br />
<br />
<math>\mu_t = (A \; B)</math>, where A is the probability of being in state a at time t, and B is the probability of being in state b at time t.<br />
<br />
For example if <math>\mu_t = (0.1 \; 0.9)</math>, we have a 10% chance of being in state a at time t, and a 90% chance of being in state b at time t.<br />
<br />
Suppose we run this Markov chain many times, and record the state at each step.<br />
<br />
In this example, we run 4 trials, up until t=5.<br />
<br />
{| class="wikitable"<br />
|-<br />
! t<br />
! Trial 1<br />
! Trial 2<br />
! Trial 3<br />
! Trial 4<br />
! Observed <math>\mu</math><br />
|-<br />
| 1<br />
| a<br />
| b<br />
| b<br />
| a<br />
| (0.5, 0.5)<br />
|-<br />
| 2<br />
| b<br />
| a<br />
| a<br />
| a<br />
| (0.75, 0.25)<br />
|-<br />
| 3<br />
| a<br />
| a<br />
| b<br />
| a<br />
| (0.75, 0.25)<br />
|-<br />
| 4<br />
| b<br />
| b<br />
| a<br />
| b<br />
| (0.25, 0.75)<br />
|-<br />
| 5<br />
| b<br />
| b<br />
| b<br />
| a<br />
| (0.25, 0.75)<br />
|}<br />
<br />
Imagine simulating the chain many times. If we collect all the outcomes at time t from all the chains, the histogram of this data would look like <math>\!\mu_t</math>.<br />
<br />
We can find the marginal probabilities as <math>\!\mu_n = \mu_0 P^n</math><br />
<br />
====Stationary Distribution====<br />
<br />
Let <math>\pi = (\pi_i \mid i \in \chi)</math> be a vector of non-negative numbers that sum to 1. (i.e. <math>\!\pi</math> is a pmf)<br />
<br />
If <math>\!\pi = \pi P</math>, then <math>\!\pi</math> is a stationary distribution, also known as an invariant distribution.<br />
<br />
====Limiting Distribution====<br />
<br />
A Markov chain has limiting distribution <math>\!\pi </math> if <math>\lim_{n \to \infty} P^n = \begin{bmatrix} \pi \\ \vdots \\ \pi \end{bmatrix}</math><br />
<br />
That is, <math>\!\pi_j = \lim_{n \to \infty}\left [ P^n \right ]_{ij}</math> exists and is independent of i.<br />
<br />
Here is an example:<br />
<br />
Suppose we want to find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/3&1/3&1/3\\<br />
1/4&3/4&0\\<br />
1/2&0&1/2<br />
\end{matrix}\right)</math><br />
<br />
We want to solve <math>\pi=\pi P</math> and we want <math>\displaystyle \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
<math>\displaystyle \pi_0 = 1/3\pi_0 + 1/4\pi_1 + 1/2\pi_2</math><br /><br />
<math>\displaystyle \pi_1 = 1/3\pi_0 + 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_2 = 1/3\pi_0 + 1/2\pi_2</math><br /><br />
<br />
Solving the system of equations, we get <br /> <br />
<math>\displaystyle \pi_1 = 4/3\pi_0</math><br /><br />
<math>\displaystyle \pi_2 = 2/3\pi_0</math><br /><br />
<br />
So using our condition above, we have <math>\displaystyle \pi_0 + 4/3\pi_0 + 2/3\pi_0 = 1</math> and by solving we get <math>\displaystyle \pi_0 = 1/3</math><br />
<br />
Using this in our system of equations, we obtain: <br /><br />
<math>\displaystyle \pi_1 = 4/9</math><br /><br />
<math>\displaystyle \pi_2 = 2/9</math><br />
<br />
Thus, the limiting distribution is <math>\displaystyle \pi = (1/3, 4/9, 2/9)</math><br />
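We can check this numerically by raising <math>\displaystyle P</math> to a large power; every row of the result should be approximately <math>\displaystyle \pi</math>, and <math>\displaystyle \pi P</math> should return <math>\displaystyle \pi</math>:<br />
<pre><br />
P = [1/3 1/3 1/3;<br />
     1/4 3/4 0  ;<br />
     1/2 0   1/2];<br />
P^50               % every row is approximately (1/3, 4/9, 2/9)<br />
[1/3 4/9 2/9] * P  % pi*P gives back pi, confirming stationarity<br />
</pre><br />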
<br />
====Detailed Balance====<br />
<br />
<math>\!\pi</math> has the detailed balance property if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math><br />
<br />
'''Theorem'''<br />
<br />
If <math>\!\pi</math> satisfies detailed balance, then <math>\!\pi</math> is a stationary distribution.<br />
<br />
In other words, if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math>, then <math>\!\pi = \pi P</math><br />
<br />
'''Proof:''' <br />
<br />
<math>\!\pi P =<br />
\begin{bmatrix}\pi_1 & \pi_2 & \cdots & \pi_N\end{bmatrix} \begin{bmatrix}P_{11} & \cdots & P_{1N} \\ \vdots & \ddots & \vdots \\ P_{N1} & \cdots & P_{NN}\end{bmatrix}</math><br />
<br />
Observe that the j<sup>th</sup> element of <math>\!\pi P</math> is<br />
<br />
<math>\!\left [ \pi P \right ]_j = \pi_1 P_{1j} + \pi_2 P_{2j} + \dots + \pi_N P_{Nj}</math><br />
<br />
::<math>\! = \sum_{i=1}^N \pi_i P_{ij}</math><br />
<br />
::<math>\! = \sum_{i=1}^N P_{ji} \pi_j</math>, by the definition of detailed balance.<br />
<br />
::<math>\! = \pi_j \sum_{i=1}^N P_{ji}</math><br />
<br />
::<math>\! = \pi_j</math>, as <math>\sum_{i=1}^N P_{ji}</math> is the sum of the entries in row j of P, which must equal 1.<br />
<br />
So <math>\!\pi = \pi P</math>.<br />
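As a quick numerical illustration of this theorem (the two-state chain below is a made-up example, not one from the lectures), we can check detailed balance and then confirm stationarity:<br />
<pre><br />
P = [0.5 0.5; 0.25 0.75];   % hypothetical two-state transition matrix<br />
pi_vec = [1/3 2/3];         % candidate stationary distribution<br />
% detailed balance: pi_vec(1)*P(1,2) should equal pi_vec(2)*P(2,1)<br />
pi_vec(1)*P(1,2)<br />
pi_vec(2)*P(2,1)<br />
% therefore pi_vec is stationary: pi_vec*P returns pi_vec<br />
pi_vec * P<br />
</pre><br />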
<br />
<br />
'''Example'''<br />
<br />
Find the marginal distribution of <br />
<br />
[[File:MarkovSample.png|300px]]<br />
<br />
Start by generating the matrix P.<br />
<br />
<math>\!P = \begin{pmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{pmatrix}</math><br />
<br />
We must assume some starting value for <math>\mu_0</math><br />
<br />
<math>\!\mu_0 = \begin{pmatrix} 0.1 & 0.9 \end{pmatrix}</math><br />
<br />
For t = 1, the marginal distribution is<br />
<br />
<math>\!\mu_1 = \mu_0 P</math><br />
<br />
Notice that this <math>\mu</math> converges. <br />
<br />
If you repeatedly run:<br />
<br />
<math>\!\mu_{i+1} = \mu_i P</math><br />
<br />
It converges to <math>\mu = \begin{pmatrix} 0.4286 & 0.5714 \end{pmatrix}</math><br />
<br />
This can be seen by running the following Matlab code:<br />
P = [0.2 0.8; 0.6 0.4];<br />
mu = [0.1 0.9]; <br />
while 1 <br />
mu_old = mu; <br />
mu = mu * P;<br />
if mu_old == mu <br />
disp(mu);<br />
break;<br />
end<br />
end<br />
<br />
Another way of looking at this question is to simulate the chain and check whether the empirical pmf of the two states converges:<br />
<br />
Let <math>\hat{p_n}(1)=\frac{1}{n}\sum_{k=1}^n I(X_k=1)</math> denote the estimator of the stationary probability of state 1 and <math>\hat{p_n}(2)=\frac{1}{n}\sum_{k=1}^n I(X_k=2)</math> the estimator of the stationary probability of state 2, where <math>\displaystyle I(X_k=1)</math> and <math>\displaystyle I(X_k=2)</math> are indicator variables equal to 1 when <math>X_k=1</math> (respectively <math>X_k=2</math>) and 0 otherwise. (In the code below the two states are coded as 1 and 0.)<br />
<br />
The Matlab code for this is:<br />
<br />
n=1;<br />
if rand<0.1<br />
x(1)=1;<br />
else<br />
x(1)=0;<br />
end<br />
p1(1)=sum(x)/n;<br />
p2(1)=1-p1(1);<br />
for i=2:10000<br />
n=n+1;<br />
if (x(i-1)==1&rand<0.2)|(x(i-1)==0&rand<0.6)<br />
x(i)=1;<br />
else<br />
x(i)=0;<br />
end<br />
p1(i)=sum(x)/n;<br />
p2(i)=1-p1(i); <br />
end<br />
plot(p1,'red');<br />
hold on;<br />
plot(p2)<br />
<br />
The results can be easily seen from the graph below:<br />
<br />
[[File:Stationary distribution.png|300px]]<br />
<br />
Additionally, we can plot the marginal distribution as it converges without estimating it. The following Matlab code shows this:<br />
<br />
%transition matrix<br />
P=[0.2 0.8; 0.6 0.4];<br />
%mu at time 0<br />
mu=[0.1 0.9];<br />
%number of points for simulation<br />
n=20;<br />
for i=1:n<br />
mu_a(i)=mu(1);<br />
mu_b(i)=mu(2);<br />
mu=mu*P;<br />
end<br />
t=[1:n];<br />
plot(t, mu_a, t, mu_b);<br />
hleg1=legend('state a', 'state b');<br />
<br />
[[File:Marginal distribution convergence.png|300px]]<br />
<br />
Note that there are chains whose stationary distribution is not a limiting distribution (the marginal distribution of the chain need not converge to it). An example of this is:<br />
<br />
<math>P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \mu_0 = \begin{pmatrix} 1/3 & 1/3 & 1/3 \end{pmatrix}</math><br />
<br />
<math>\!\mu_0</math> is a stationary distribution, so <math>\!\mu P</math> is the same for all iterations.<br />
<br />
But,<br />
<br />
<math>P^{1000} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \ne \begin{pmatrix} \mu \\ \mu \\ \mu \end{pmatrix}</math> (since <math>\!P^3 = I</math> and <math>1000 = 3(333)+1</math>, we have <math>\!P^{1000}=P</math>)<br />
<br />
So <math>\!\mu</math> is not a limiting distribution. Also, if<br />
<br />
<math>\mu = \begin{pmatrix} 0.2 & 0.1 & 0.7 \end{pmatrix}</math><br />
<br />
Then <math>\!\mu = \mu P</math> does not converge.<br />
<br />
This can be observed through the following Matlab code.<br />
<br />
 P = [0 1 0; 0 0 1; 1 0 0];<br />
mu = [0.2 0.1 0.7]; <br />
for i= 1:4 <br />
mu = mu * P;<br />
disp(mu);<br />
end<br />
<br />
This outputs<br />
 0.7000    0.2000    0.1000<br />
 0.1000    0.7000    0.2000<br />
 0.2000    0.1000    0.7000<br />
 0.7000    0.2000    0.1000<br />
<br />
Note that <math>\!\mu_1 = \!\mu_4</math>, which indicates that <math>\!\mu</math> will cycle forever.<br />
<br />
This means that this chain has a stationary distribution, but is not limiting.<br />
<br />
===Page Rank===<br />
<br />
Page Rank was the original ranking algorithm used by Google's search engine to rank web pages.<ref><br />
http://ilpubs.stanford.edu:8090/422/<br />
</ref> The algorithm was created by the founders of Google, Larry Page and Sergey Brin as part of Page's PhD thesis. When a query is entered in a search engine, there are a set of web pages which are matched by this query, but this set of pages must be ordered by their "importance" in order to identify the most meaningful results first. Page Rank is an algorithm which assigns importance to every web page based on the links in each page.<br />
<br />
==== Intuition ====<br />
<br />
We can represent web pages by a set of nodes, where web links are represented as edges connecting these nodes. Based on our intuition, there are three main factors in deciding whether a web page is important or not.<br />
<br />
# A web page is important if many other pages point to it.<br />
# The more important a webpage is, the more weight is placed on its links.<br />
# The more links a webpage has, the less weight is placed on its links.<br />
<br />
====Modelling====<br />
<br />
We can model the set of links as a N-by-N matrix L, where N is the number of web pages we are interested in:<br />
<br />
<math>L_{ij} =<br />
\left\{<br />
\begin{array}{lr}<br />
1 : \text{if page j points to i}\\<br />
0 : \text{otherwise}<br />
\end{array}<br />
\right. <br />
</math><br />
<br />
<br />
<br />
The number of outgoing links from page j is<br />
<br />
<math>c_j = \sum_{i=1}^N L_{ij}</math><br />
<br />
For example, consider the following set of links between web pages:<br />
<br />
[[File:PageRank.png|250px]]<br />
<br />
According to the factors relating to importance of links, we can consider two possible rankings :<br />
<br />
<br />
<math>\displaystyle 3 > 2 > 1 > 4 </math> <br />
<br />
or<br />
<br />
<math>\displaystyle 3>1>2>4 </math> <br />
if we consider that the high importance of the link from page 3 to page 1 has more influence than the fact that there are two outgoing links from page 1 and only one from page 2.<br />
<br />
<br />
We have <math>L = \begin{bmatrix} <br />
0 & 0 & 1 & 0 \\ <br />
1 & 0 & 0 & 0 \\ <br />
1 & 1 & 0 & 1 \\<br />
0 & 0 & 0 & 0<br />
\end{bmatrix}</math>, and <math>c = \begin{pmatrix}2 & 1 & 1 & 1\end{pmatrix} </math><br />
<br />
We can represent the ranks of web pages as the vector P, where the i<sup>th</sup> element is the rank of page i:<br />
<br />
<math>P_i = (1-d) + d\sum_j \frac{L_{ij}}{c_j} P_j</math><br />
<br />
Here we take the sum of the weights of the incoming links, where links are reduced in weight if the linking page has a lot of outgoing links, and increased in weight if the linking page is itself important (i.e. has a high rank). <br />
<br />
We don't want to completely ignore pages with no incoming links, which is why we add the constant (1 - d).<br />
<br />
If <br />
<br />
<math>L = \begin{bmatrix} L_{11} & \cdots & L_{1N} \\<br />
\vdots & \ddots & \vdots \\<br />
L_{N1} & \cdots & L_{NN} \end{bmatrix}</math><br />
<br />
<math>D = \begin{bmatrix} c_1 & \cdots & 0 \\<br />
\vdots & \ddots & \vdots \\<br />
0 & \cdots & c_N \end{bmatrix}</math><br />
<br />
Then <math>D^{-1} = \begin{bmatrix} c_1^{-1} & \cdots & 0 \\<br />
\vdots & \ddots & \vdots \\<br />
0 & \cdots & c_N^{-1} \end{bmatrix}</math><br />
<br />
<math>\!P = (1-d)e + dLD^{-1}P</math><br />
<br />
where <math>\!e = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}</math> is the vector with all 1's<br />
<br />
To simplify the problem, we let <math>\!e^T P = N \Rightarrow \frac{e^T P}{N} = 1</math>. This means that the average importance of all pages on the internet is 1.<br />
<br />
Then<br />
<math>\!P = (1-d)\frac{ee^TP}{N} + dLD^{-1}P</math><br />
::<math>\! = \left [ (1-d)\frac{ee^T}{N} + dLD^{-1} \right ] P</math><br />
::<math>\! = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ] P</math>, where <math> E </math> is an NxN matrix filled with ones.<br />
<br />
Let <math>\!A = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ]</math><br />
<br />
Then <math>\!P = AP</math>.<br />
<br />
<br />
Note that P is a stationary distribution and, more importantly, P is an eigenvector of A with eigenvalue 1. Therefore, we can find the ranks of all web pages by solving this equation for P. <br />
<br />
We can find the vector P for the example above, using the following Matlab code:<br />
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];<br />
D = [2 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1];<br />
d = 0.8 ;% pages with no links get a weight of 0.2<br />
N = 4 ;<br />
<br />
A = ((1-d)/N) * ones(N) + d * L * inv(D);<br />
[EigenVectors, EigenValues] = eigs(A)<br />
s=sum(EigenVectors(:,1));% we should note that the average entry of P should be 1 according to our assumption<br />
P=(EigenVectors(:,1))/s*N<br />
<br />
This outputs:<br />
<br />
EigenVectors =<br />
-0.6363 0.7071 0.7071 -0.0000 <br />
-0.3421 -0.3536 + 0.3536i -0.3536 - 0.3536i -0.7071 <br />
-0.6859 -0.3536 - 0.3536i -0.3536 + 0.3536i 0.0000 <br />
-0.0876 0.0000 + 0.0000i 0.0000 - 0.0000i 0.7071 <br />
<br />
<br />
EigenValues =<br />
1.0000 0 0 0 <br />
0 -0.4000 - 0.4000i 0 0 <br />
0 0 -0.4000 + 0.4000i 0 <br />
0 0 0 0.0000 <br />
<br />
P =<br />
<br />
1.4528<br />
0.7811<br />
1.5660<br />
0.2000<br />
<br />
Note that there is an eigenvector with eigenvalue 1. <br />
The reason why there always exists an eigenvector with eigenvalue 1 is that A is a stochastic matrix: each of its columns sums to 1. <br />
<br />
Thus our vector P is <math> <br />
\begin{bmatrix}1.4528 \\ 0.7811 \\ 1.5660\\ 0.2000 \end{bmatrix}</math><br />
<br />
However, this method is not practical, because there are simply too many web pages on the internet. So instead Google uses a method to approximate an eigenvector with eigenvalue 1.<br />
<br />
Note that page three has the rank with highest magnitude and page four has the rank with lowest magnitude, as expected.<br />
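One such approximation is power iteration: starting from any positive vector, repeatedly multiply by A. Since 1 is the dominant eigenvalue of A here, the iterates converge to the desired eigenvector without an explicit eigendecomposition. Below is a minimal sketch on the same example (the 100 iterations are an arbitrary cap, and starting from a vector of ones is just one convenient choice):<br />
<pre><br />
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];<br />
D = diag([2 1 1 1]);<br />
d = 0.8; N = 4;<br />
A = ((1-d)/N) * ones(N) + d * L / D;   % same matrix A as above (L/D = L*inv(D))<br />
P = ones(N,1);                         % start with equal ranks, average 1<br />
for k = 1:100                          % 100 iterations is an arbitrary cap<br />
    P = A * P;                         % power iteration; the sum of P is preserved<br />
end<br />
P    % approximately (1.4528, 0.7811, 1.5660, 0.2000), as found above<br />
</pre><br />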
<br />
==''' Markov Chain Monte Carlo - Metropolis-Hastings - October 25th, 2011'''==<br />
<br />
We want to find <math> \int h(x)f(x)\, \mathrm dx </math>, but we don't know how to sample from <math>\,f</math>.<br />
<br />
We have seen simple techniques before. This one is used in real life.<br />
It consists of the search of a Markov Chain such that its stationary distribution is <math>\,f</math>.<br />
<br />
==== Main procedure ====<br />
<br />
Let us suppose that <math>\,q(y|x)</math> is a friendly distribution: we can sample from this function.<br />
<br />
1. Initialize the chain with some starting point <math>\,x_{0}</math> and set <math>\,i=0</math>.<br />
<br />
2. Draw a point from <math>\,q(y|x)</math> i.e. <math>\,Y \backsim q(y|x_{i})</math>.<br />
<br />
3. Evaluate <math>\,r(x,y)=min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br />
<br />
<br />
4. Draw a point <math>\,U \backsim Unif[0,1]</math>.<br />
<br />
5. <math>\,x_{i+1}=\begin{cases}y & \text{ if } U<r \\x_{i} & \text{ otherwise } \end{cases} </math>.<br />
<br />
6. <math>\,i=i+1</math>. Go back to 2.<br />
<br />
==== Remark 1 ====<br />
<br />
A very common choice for <math>\,q(y|x)</math> is <math>\,N(y;x,b^{2})</math>, a normal distribution centered at the current point.<br />
<br />
Note : In this case <math>\,q(y|x)</math> is symmetric i.e. <math>\,q(y|x)=q(x|y)</math>.<br />
<br />
(Because <math>\,q(y|x)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math> and <math>\,(y-x)^{2}=(x-y)^{2}</math>).<br />
<br />
Thus we have <math>\,\frac{q(x|y)}{q(y|x)}=1</math>, which implies :<br />
<br />
<math>\,r(x,y)=min\left\{\frac{f(y)}{f(x)},1\right\}</math>.<br />
<br />
In general, if <math>\,q(x|y)</math> is symmetric then the algorithm is called Metropolis, in reference to the original algorithm (proposed in 1953)<ref>http://en.wikipedia.org/wiki/Equations_of_State_Calculations_by_Fast_Computing_Machines</ref>.<br />
<br />
<br />
<br />
====Remark 2====<br />
<br />
The value y is accepted if <math>\,u<min\left\{\frac{f(y)}{f(x)},1\right\}</math> so it is accepted with the probability <math>\,min\left\{\frac{f(y)}{f(x)},1\right\}</math>.<br />
<br />
Thus, if <math>\,f(y)>f(x)</math>, then <math>\,y</math> is always accepted.<br />
<br />
The higher that value of the pdf is in the vicinity of a point <math>\,y_1</math>, the more likely it is that a random variable will take on values around <math>\,y_1</math>. As a result it makes sense that we would want a high probability of acceptance for points generated near <math>\,y_1</math>.<br />
<br />
====Remark 3====<br />
<br />
One strength of the Metropolis-Hastings algorithm is that normalizing constants, which are often quite difficult to determine, can be cancelled out in the ratio <math> r </math>. For example, consider the case where we want to sample from the beta distribution, which has the pdf:<br />
<br />
<math><br />
\begin{align}<br />
f(x;\alpha,\beta)& = \frac{1}{\mathrm{B}(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\end{align}<br />
</math><br />
<br />
The beta function, ''B'', appears as a normalizing constant, but it cancels in the ratio <math> r </math> by construction of the method and never needs to be evaluated.<br />
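As a brief hedged sketch of this point (the parameters <math>\alpha=2,\ \beta=5</math> and the Uniform(0,1) proposal are arbitrary illustrative choices), we can sample from a beta distribution using only the unnormalized density <math>x^{\alpha-1}(1-x)^{\beta-1}</math>; the beta function cancels in <math> r </math> and is never computed:<br />
<pre><br />
a = 2; b = 5;                       % arbitrary beta parameters<br />
f = @(x) x.^(a-1) .* (1-x).^(b-1);  % unnormalized beta density (no Beta function)<br />
x = zeros(10000,1);<br />
x(1) = rand;<br />
for i = 2:10000<br />
    y = rand;                       % proposal: Uniform(0,1), so q(y|x) = q(x|y) = 1<br />
    r = min(f(y)/f(x(i-1)), 1);<br />
    if rand < r<br />
        x(i) = y;<br />
    else<br />
        x(i) = x(i-1);<br />
    end<br />
end<br />
hist(x(1000:end), 30)               % should resemble the Beta(2,5) shape<br />
</pre><br />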
<br />
====Example====<br />
<br />
<math>\,f(x)=\frac{1}{\pi}\frac{1}{1+x^{2}}</math> (the standard Cauchy density)<br />
<br />
Then, we have <math>\,f(x)\propto\frac{1}{1+x^{2}}</math>.<br />
<br />
And let us take <math>\,q(x|y)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math>.<br />
<br />
Then <math>\,q(x|y)</math> is symmetric.<br />
<br />
Therefore the ratio <math>\,r(x,y)</math> can be simplified.<br />
<br />
<br />
We get :<br />
<br />
<math>\,\begin{align}<br />
\displaystyle r(x,y) <br />
& =min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} \\<br />
& =min\left\{\frac{f(y)}{f(x)},1\right\} \\<br />
& =min\left\{ \frac{ \frac{1}{1+y^{2}} }{ \frac{1}{1+x^{2}} },1\right\}\\<br />
& =min\left\{ \frac{1+x^{2}}{1+y^{2}},1\right\}\\<br />
\end{align}<br />
</math>.<br />
<br />
<br />
<br />
The Matlab code of the algorithm is the following :<br />
<br />
<pre><br />
clear all<br />
close all<br />
clc<br />
b=2;<br />
x(1)=randn;<br />
for i=2:10000<br />
y=b*randn+x(i-1);<br />
r=min((1+x(i-1)^2)/(1+y^2),1);<br />
u=rand;<br />
if u<r<br />
x(i)=y;<br />
else<br />
x(i)=x(i-1);<br />
end<br />
<br />
end<br />
hist(x(5000:end));<br />
%The Markov Chain usually takes some time to converge; this initial period is known as the burn-in time.<br />
%Therefore, we don't display the first 5000 points because they don't show the limiting behaviour of the Markov Chain.<br />
</pre><br />
<br />
As we can see, the choice of the value of b is made by us.<br />
<br />
Changing this value has a significant impact on the results we obtain. There is a pitfall when b is too big or too small.<br />
<br />
Example with <math>\,b=0.1</math> (the second graph is the trace plot obtained by running j=5000:10000; plot(j,x(5000:10000))):<br />
<br />
[[File:redaccoursb01.JPG|300px]] [[File:001Metr.PNG|300px]]<br />
<br />
With <math>\,b=0.1</math>, the chain takes small steps so the chain doesn't explore enough of the sample space. It doesn't give an accurate report of the function we want to sample.<br />
<br />
<br />
<br />
Example with <math>\,b=10</math> :<br />
<br />
[[File:redaccoursb10.JPG|300px]] [[File:010metro.PNG|300px]]<br />
<br />
With <math>\,b=10</math>, jumps are large and very unlikely to be accepted, since they tend to land far out in the tails of the target (i.e. <math>\,y</math> is rejected because <math>\ u \ge r </math> and <math>\,x(i)=x(i-1)</math> most of the time), hence most sample points stay fairly close to the origin.<br />
The third graph, which resembles white noise (as in the case of <math>\,b=2</math>), indicates better sampling, as more of the proposed points are accepted. For <math>\,b=0.1</math>, we have lots of small jumps and few repeated values, so the stationary distribution emerges only slowly; whereas in the <math>\,b=10</math> case, many points remain around 0. Approximately 73% of the points were repeats of x(i-1).<br />
<br />
<br />
Example with <math>\,b=2</math> :<br />
<br />
[[File:redaccoursb2.JPG|300px]] [[File:100metr.PNG|300px]]<br />
<br />
With <math>\,b=2</math>, we get a more accurate result as we avoid these extremes. Approximately 37% of the points were repeats of x(i-1).<br />
<br />
<br />
If the sample from the Markov Chain starts to look like the target distribution quickly, we say the chain is mixing well.<br />
<br />
==''' Theory and Applications of Metropolis-Hastings - October 27th, 2011'''==<br />
<br />
As mentioned in the previous section, the idea of the Metropolis-Hastings (MH) algorithm is to produce a Markov chain that converges to a stationary distribution <math>f</math> which we are interested in sampling from.<br />
<br />
====Convergence====<br />
<br />
One important fact to check is that <math>\displaystyle f</math> is indeed a stationary distribution in the MH scheme. For this, we can appeal to the implications of the detailed balance property:<br />
<br />
Given a probability vector <math>\!\pi</math> and a transition matrix <math>\displaystyle P</math>, <math>\!\pi</math> has the detailed balance property if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math><br />
<br />
If <math>\!\pi</math> satisfies detailed balance, then it is a stationary distribution.<br />
<br />
The above definition applies to the case where the states are discrete. In the continuous case, <math>\displaystyle f</math> satisfies detailed balance if <math>\displaystyle f(x)p(x,y)=f(y)p(y,x)</math>. Where <math>\displaystyle p(x,y)</math> and <math>\displaystyle p(y,x)</math> are the probabilities of transitioning from x to y and y to x respectively. If we can show that <math>\displaystyle f</math> has the detailed balance property, we can conclude that it is a stationary distribution. Because <math>\int^{}_y f(y)p(y,x)dy=\int^{}_y f(x)p(x,y)dy=f(x)</math>.<br />
<br />
In the MH algorithm, we use a proposal distribution to generate y~<math>\displaystyle q(y|x)</math>, and accept y with probability <math>min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br />
<br />
Suppose, without loss of generality, that <math>\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)} \le 1</math>. This implies that <math>\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)} \ge 1</math><br />
<br />
Let <math>\,r(x,y)</math> be the chance of accepting point y given that we are at point x.<br />
<br />
So <math>\,r(x,y) = min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} = \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)}</math><br />
<br />
Let <math>\,r(y,x)</math> be the chance of accepting point x given that we are at point y.<br />
<br />
So <math>\,r(y,x) = min\left\{\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)},1\right\} = 1</math><br />
<br />
<br />
<math>\,p(x,y)</math> is the probability of generating and accepting y, while at point x.<br />
<br />
So <math>\,p(x,y) = q(y|x)r(x,y) = q(y|x) \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)} = \frac{f(y)q(x|y)}{f(x)}</math><br />
<br />
<br />
<math>\,p(y,x)</math> is the probability of generating and accepting x, while at point y.<br />
<br />
So <math>\,p(y,x) = q(x|y)r(y,x) = q(x|y)</math><br />
<br />
<br />
<math>\,f(x)p(x,y) = f(x)\frac{f(y)q(x|y)}{f(x)} = f(y)q(x|y) = f(y)p(y,x)</math><br />
<br />
Thus, detailed balance holds.<br />
:i.e. <math>\,f(x)</math> is a stationary distribution<br />
<br />
It can be shown (although not here) that <math>f</math> is a limiting distribution as well. Therefore, the MH algorithm generates a sequence whose distribution converges to <math>f</math>, the target.<br />
<br />
====Implementation====<br />
<br />
In the implementation of MH, the proposal distribution is commonly chosen to be symmetric, which simplifies the calculations and makes the algorithm more intuitively understandable. The MH algorithm can usually be regarded as a random walk along the distribution we want to sample from. Suppose we have a distribution <math>f</math>:<br />
<br />
[[File:Standard normal distribution.gif]]<br />
<br />
Suppose we start the walk at point <math>x</math>. The point <math>y_{1}</math> is in a denser region than <math>x</math>, therefore, the walk will always progress from <math>x</math> to <math>y_{1}</math>. On the other hand, <math>y_{2}</math> is in a less dense region, so it is not certain that the walk will progress from <math>x</math> to <math>y_{2}</math>. In terms of the MH algorithm:<br />
<br />
<math>r(x,y_{1})=min(\frac{f(y_{1})}{f(x)},1)=1</math> since <math>f(y_{1})>f(x)</math>. Thus, any generated value with a higher density will be accepted.<br />
<br />
<math>r(x,y_{2})=\frac{f(y_{2})}{f(x)}</math>. The lower the density of <math>y_{2}</math> is, the less chance it will have of being accepted.<br />
<br />
A certain class of proposal distributions can be written in the form:<br />
<br />
<math>\,y|x_i = x_i + \epsilon_i</math><br />
<br />
where <math>\,\epsilon_i</math> is drawn from a density <math>\,g(|x-y|)</math> that depends only on the distance between <math>\,x</math> and <math>\,y</math><br />
<br />
The density depends only on the distance between the current point and the next one (which can be seen as the "step" being taken). These proposal distributions give the Markov chain the random walk nature. The normal distribution that we frequently use in our examples satisfies the above definition.<br />
<br />
In actual implementations of the MH algorithm, the proposal distribution needs to be chosen judiciously, because not all proposals will work well with all target distributions we want to sample from. Take a trimodal distribution for example:<br />
<br />
[[File:trimodal.jpg]]<br />
<br />
If we choose the proposal distribution to be a standard normal as we have done before, problems will arise. The low densities between the peaks means that the MH algorithm will almost never walk to any points generated in these regions and get stuck at one peak. One way to address this issue is to increase the variance, so that the steps will be large enough to cross the gaps. Of course, in this case, it would probably be beneficial to come up with a different proposal function. As a rule of thumb, such functions should result in an approximately 50% acceptance rate for generated points.<br />
<br />
====Simulated Annealing====<br />
<br />
Metropolis-Hastings is very useful in simulation methods for solving optimization problems. One such application is simulated annealing, which addresses the problem of minimizing a function <math>h(x)</math>. This method will not always produce the global solution, but it is intuitively simple and easy to implement.<br />
<br />
Consider <math>e^{\frac{-h(x)}{T}}</math>; maximizing this expression is equivalent to minimizing <math>h(x)</math>. Suppose <math>\mu</math> is the maximizing value and <math>h(x)=(x-\mu)^2</math>; then the function being maximized, <math>e^{-\frac{(x-\mu)^2}{T}}</math>, is proportional to a Gaussian density. When many samples are taken from this distribution, the mean will converge to the desired maximizing value. The annealing comes into play by lowering T (the temperature) as the sampling progresses, making the distribution narrower. The steps of simulated annealing are outlined below:<br />
<br />
1. start with a random <math>x</math> and set T to a large number<br />
<br />
2. generate <math>y</math> from a proposal distribution <math>q(y|x)</math>, which should be symmetric<br />
<br />
3. accept <math>y</math> with probability <math>min(\frac{f(y)}{f(x)},1)</math><br />
<br />
4. decrease T, and then go to step 2<br />
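A minimal Matlab sketch of these steps, assuming we minimize <math>h(x)=(x-2)^2</math> (the example used in the following lecture), a normal proposal with standard deviation 1, and a simple geometric cooling schedule (all of these are illustrative choices):<br />
<pre><br />
h = @(x) (x-2).^2;      % function to minimize (illustrative choice)<br />
T = 20;                 % start with a large temperature<br />
x = 10*randn;           % random starting point<br />
for i = 1:5000<br />
    y = x + randn;                      % symmetric proposal q(y|x) = N(x,1)<br />
    r = min(exp((h(x)-h(y))/T), 1);     % acceptance probability<br />
    if rand < r<br />
        x = y;<br />
    end<br />
    T = 0.999*T;        % geometric cooling schedule (illustrative choice)<br />
end<br />
x    % should be close to the minimizer, x = 2<br />
</pre><br />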
<br />
The following plot and Matlab code illustrate the simulated annealing procedure as the temperature ''T'' (used here as the spread of the Gaussian) decreases for a Gaussian distribution with zero mean. Starting off with a large value for the temperature ''T'' allows the Metropolis-Hastings component of the procedure to capture the mean, before gradually decreasing the temperature ''T'' in order to converge to the mean. <br />
<br />
[[File:Simulated annealing illustration.png]]<br />
<br />
x=-10:0.1:10;<br />
mu=0;<br />
T=5;<br />
colour = ['b', 'g', 'm', 'r', 'k'];<br />
for i=1:5<br />
pdfNormal=normpdf(x, mu, T);<br />
plot(x, pdfNormal, colour(i));<br />
T=T-1;<br />
hold on<br />
end<br />
hleg1=legend('T=5', 'T=4', 'T=3', 'T=2', 'T=1');<br />
title('Simulated Annealing Illustration');<br />
<br />
=='''References'''==<br />
<br />
<references/><br />
<br />
=='''Simulated Annealing and Gibbs Sampling - November 1, 2011'''==<br />
<br />
continued from previous lecture...<br />
<br />
We will now look at a couple cases where <math> \displaystyle h(y) > h(x) </math> or <math> \displaystyle h(y) < h(x) </math>, and explore whether to accept or reject <math> y </math>.<br />
<br />
Recall <math>r(x,y)=\min\left\{\frac{f(y)}{f(x)},1\right\}</math> where <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}}</math>. Here r(x,y) represents the probability of accepting <math>y</math>.<br />
<br />
====Cases====<br />
<br />
Case a)<br />
Suppose <math> \displaystyle h(y) < h(x) </math>. Since we want to find the minimum value for <math>\displaystyle h(x) </math>, and the point <math>\displaystyle y </math> creates a lower value than our previous point, we accept the new point. Mathematically, <math>\displaystyle h(y) < h(x) </math> implies that:<br />
<br />
<math> \frac{f(y)}{f(x)} > 1 </math>. Therefore,<br />
<math> \displaystyle r = 1 </math>.<br />
So, we will always accept <math>\displaystyle y </math>.<br />
<br />
Case b)<br />
Suppose <math> \displaystyle h(y) > h(x) </math>. This is bad, since our goal is to minimize <math>\displaystyle h(x) </math>. However, we may still accept <math>\displaystyle y </math> with some chance:<br />
<br />
<math> \frac{f(y)}{f(x)} < 1 </math>. Therefore,<br />
<math>\displaystyle r < 1 </math>.<br />
So, we may accept <math>\displaystyle y </math> with probability <math>\displaystyle r </math>.<br />
<br />
<br />
Next, we will look at these cases as <math>\displaystyle T\to0 </math>.<br />
<br />
As <math>\displaystyle T\to0 </math> and case a) happens, <math> e^{\frac{h(x)-h(y)}{T}} </math> approaches infinity, so we will always accept <math>\displaystyle y </math>.<br />
<br />
As <math>\displaystyle T\to0 </math> and case b) happens, <math> e^{\frac{h(x)-h(y)}{T}} </math> approaches zero, so the probability that <math>\displaystyle y </math> will be accepted gets extremely small.<br />
<br />
It is worth noting that if we simply start with a small value of T, we may end up rejecting almost all of the generated points (due to case b)) and get stuck near the starting point, which may correspond to a local minimum of <math>\displaystyle h(x)</math> (a local maximum of <math>e^{\frac{-h(x)}{T}}</math>) rather than the global one. It is therefore necessary to start with a large value of T in order to explore the whole function. At the same time, a good estimate of the starting point <math>x_0</math> is needed (at least it should not be too far from the minimizer). <br />
<br />
=====Example=====<br />
<br />
Let <math>\displaystyle h(x) = (x-2)^2 </math>.<br />
The graph of it is:<br />
[[File:PCh(x).jpg|center|500]]<br />
<br />
Then, <math> e^{\frac{-h(x)}{T}} = e^{\frac{-(x-2)^2}{T}} </math> . Take an initial value of T = 20. A graph of this is:<br />
[[File:PC-highT.jpg|center|500]]<br />
<br />
<br />
In comparison, we look a graph of T = 0.2:<br />
[[File:PC-lowT.jpg|center|500]]<br />
<br />
One can see that with a low T value the function is sharply peaked, so the ratio <math>\frac{f(y)}{f(x)}</math> is either close to 0 or very large and proposed moves are almost always rejected or accepted outright, while a bigger T value gives smoother transitions in the graph.<br />
<br />
The MATLAB code for the above graphs are:<br />
<pre><br />
ezplot('(x-2)^2',[-6,10])<br />
ezplot('exp((-(x-2)^2)/20)',[-6,10])<br />
ezplot('exp((-(x-2)^2)/0.2)',[-6,10])<br />
</pre><br />
<br />
=====Travelling Salesman Problem=====<br />
<br />
The simulated annealing method can be applied to compute a solution to the travelling salesman problem. Suppose there are N cities and the salesman has to visit each city exactly once. The objective is to find the shortest path (i.e. the shortest total length of the journey) connecting the cities. An algorithm using simulated annealing on the problem can be found here ([http://www.cs.ubbcluj.ro/~csatol/mestint/pdfs/Numerical_Recipes_Simulated_Annealing.pdf Reference]).<br />
<br />
===Gibbs Sampling===<br />
<br />
Gibbs sampling is another Markov chain Monte Carlo method, similar to Metropolis-Hastings. There are two main differences between Metropolis-Hastings and Gibbs sampling. First, in Gibbs sampling the candidate state is always accepted as the next state. Second, it is assumed that the full conditional distributions are known, i.e. <math>P(X_i=x|X_j=x_j, \forall j\neq i)</math> for all <math>\displaystyle i</math>. The idea is that it is easier to sample from the conditional distributions, which are one-dimensional, than from the joint distribution, which is higher-dimensional. Gibbs sampling is a way to turn the joint distribution into multiple conditional distributions. <br />
<br />
<b>Advantages:</b><br /><br />
- sampling from conditional distributions may be easier than sampling from joint distributions<br />
<br />
<b>Disadvantages:</b><br /><br />
- we do not necessarily know the conditional distributions<br />
<br />
For example, if we want to sample from <math>\, f_{X,Y}(x,y)</math>, we need to know how to sample from <math>\, f_{X|Y}(x|y)</math> and <math>\, f_{Y|X}(y|x)</math>. Suppose the chain starts with <math>\,(X_0,Y_0)</math> and <math>(X_1,Y_1), \dots , (X_n,Y_n)</math> have been sampled. Then,<br />
<br />
<math>\, X_{n+1}\sim f_{X|Y}(x|Y_n), \qquad Y_{n+1}\sim f_{Y|X}(y|X_{n+1})</math><br />
<br />
Gibbs sampling turns a multi-dimensional distribution into a set of one-dimensional distributions. If we want to sample from <br />
<br />
<math>P_{X^1,\dots ,X^p}(x^1,\dots ,x^p)</math> <br />
<br />
and the full conditionals are known, then:<br />
<br />
<math>X^1_{n+1}\sim f(X^1|X^2_n,\dots ,X^p_n)</math><br />
<br />
<math>X^2_{n+1}\sim f(X^2|X^1_{n+1},X^3_n,\dots ,X^p_n)</math><br />
<br />
<math>\vdots</math><br />
<br />
<math>X^{p-1}_{n+1}\sim f(X^{p-1}|X^1_{n+1},\dots ,X^{p-2}_{n+1},X^p_n)</math><br />
<br />
<math>X^p_{n+1}\sim f(X^p|X^1_{n+1},\dots ,X^{p-1}_{n+1})</math><br />
<br />
With Gibbs sampling, we can simulate <math>\displaystyle n</math> random variables sequentially from <math>\displaystyle n</math> univariate conditionals rather than generating one <math>n</math>-dimensional vector using the full joint distribution, which could be a lot more complicated.<br />
<br />
Computational inference deals with probabilistic graphical models. Gibbs sampling is useful here: graphical models show the dependence relations among random variables. For instance, Bayesian networks are graphical models represented using directed acyclic graphs. Looking at such a graphical model tells us on which random variable the distribution of a certain random variable depends (i.e. its parent). The model can be used to "factor" a joint distribution into conditional distributions.<br />
<br />
[[File:stat341_nov_1_graphical_model.png|200px|thumb|left|Sample graphical model of five RVs]]<br />
<br />
For example, consider the five random variables A, B, C, D, and E. Without making any assumptions about dependence relations among them, all we know is <br />
<br />
<math>\, P(A,B,C,D,E)=</math><math>\, P(A|B,C,D,E) P(B|C,D,E) P(C|D,E) P(D|E) P(E)</math><br />
<br />
However, if we know the relation between the random variables, e.g. given the graphical model on the left, we can simplify this expression:<br />
<br />
<math>\, P(A,B,C,D,E)=P(A) P(B|A) P(C|A) P(D|C) P(E|C)</math><br />
<br />
Although the joint distribution may be very complicated, the conditional distributions may not be.<br />
<br />
Check out the following notes on Gibbs sampling:<br />
<br />
* [http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf MCMC and Gibbs Sampling, MIT Lecture Notes]<br />
* chapter 7.4 in [http://stat.fsu.edu/~anuj/pdf/classes/CompStatI09/BOOK.pdf Notes on Computational Methods in Statistics]<br />
* chapter 4.9 in [http://www.ma.hw.ac.uk/~foss/StochMod/Ross_S.pdf Introduction to Probability Models] by Sheldon Ross<br />
<br />
====Example of Gibbs sampling: Multi-variate normal====<br />
<br />
We'd like to generate samples from a bivariate normal with parameters<br />
<br />
<math>\mu = \begin{bmatrix}1\\ 2 \end{bmatrix} = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}</math> <br />
and <math>\sigma = \begin{bmatrix}1 && 0.9 \\ 0.9 && 1 \end{bmatrix}= \begin{bmatrix}1 && \rho \\ \rho && 1 \end{bmatrix}</math><br />
<br />
The conditional distributions of multi-variate normal random variables are also normal:<br />
<br />
<math>\, f(x_1|x_2)=N(\mu_1 + \rho(x_2-\mu_2), 1-\rho^2)</math><br />
<br />
<math>\, f(x_2|x_1)=N(\mu_2 + \rho(x_1-\mu_1), 1-\rho^2)</math><br />
<br />
(In general, if the joint distribution has parameters<br />
<br />
<math>\mu = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}</math> and <math>\Sigma = \begin{bmatrix} \Sigma _{1,1} && \Sigma _{1,2} \\ \Sigma _{2,1} && \Sigma _{2,2} \end{bmatrix}</math><br />
<br />
then the conditional distribution <math>\, f(x_1|x_2)</math> has mean <math>\, \mu_1 + \Sigma _{1,2}(\Sigma _{2,2})^{-1}(x_2-\mu_2)</math> and variance <math>\, \Sigma _{1,1}-\Sigma _{1,2}(\Sigma _{2,2})^{-1}\Sigma _{2,1}</math>.)<br />
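A minimal Matlab sketch of the Gibbs sampler for this bivariate normal (the chain length of 5000 and the starting point are arbitrary choices):<br />
<pre><br />
mu = [1 2]; rho = 0.9;<br />
n = 5000;<br />
X = zeros(n,2);<br />
X(1,:) = [0 0];                     % arbitrary starting point<br />
s = sqrt(1 - rho^2);                % conditional standard deviation<br />
for i = 2:n<br />
    % sample x1 given the current x2, then x2 given the new x1<br />
    X(i,1) = mu(1) + rho*(X(i-1,2) - mu(2)) + s*randn;<br />
    X(i,2) = mu(2) + rho*(X(i,1)   - mu(1)) + s*randn;<br />
end<br />
mean(X)        % approximately (1, 2)<br />
corrcoef(X)    % off-diagonal entries approximately 0.9<br />
</pre><br />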
<br />
=='''Principal Component Analysis (PCA) - November 8, 2011'''==<br />
<br />
Principal component analysis is a century-old technique used for the dimensionality reduction of data. As the number of dimensions increases, the number of data points needed to sample the space accurately grows exponentially.<br />
<br />
<math>\, x\in \mathbb{R}^D \rightarrow y\in \mathbb{R}^d</math><br />
<br />
<math>\ d \le D </math><br />
<br />
We want to transform <math>\, x</math> to <math>\, y</math> by reducing dimensionality yet losing little information.<br />
<br />
For example, consider dots in a three dimensional space. By unrolling the 2D manifold that they lie on, we can reduce the data to 2D while losing little information. Note: this is not an application of PCA, but it simply illustrates one way we can reduce dimensionality.<br />
<br />
Principal Component Analysis lets us reduce data to a linear subspace of its original space. It works best when the data lies in, or close to, a lower-dimensional linear subspace of its original space.<br />
<br />
<br />
'''Probabilistic View'''<br />
<br />
We can see data set <math>\, x</math> as a high dimensional random variable governed by a low dimensional random variable <math>\, y</math>. Given <math>\, x</math>, we are trying to estimate <math>\, y</math>.<br />
<br />
We can see this in 2D linear regression, as the locations of data points in a scatter plot are governed by its approximate linear regression. The subspace that we have reduced the data to here is in the direction of variation in the data.<br />
<br />
'''Principal Component Analysis'''<br />
<br />
Principal component analysis is an orthogonal linear transform on a data set. It transforms the data coordinates to associate with a new set of orthogonal vectors, each representing the direction of maximum variance of the data. E.g. the first principal component is the direction of maximum variance, the second principal component is the direction of maximum variance orthogonal to the first vector, the third principal component is the direction of maximum variance orthogonal to the first and second vectors, and so on, until we have D vectors, where D is the dimension of the original data.<br />
<br />
Suppose we have data represented by <math>\, X = \begin{bmatrix}<br />
x^1\\<br />
x^2\\<br />
\vdots \\ <br />
x^D<br />
\end{bmatrix}<br />
\in \mathbb{R}^{D \times n} </math><br />
<br />
For some <math>\, W = \begin{bmatrix}<br />
w^1\\<br />
w^2\\<br />
\vdots \\ <br />
w^D<br />
\end{bmatrix}<br />
\in \mathbb{R}^{D} </math><br />
<br />
The projection of the data onto a direction <math>\, W </math> can be written as<br />
<br />
<math>\, w^1x^1 + w^2x^2 + \cdots + w^Dx^D = W^TX</math><br />
<br />
To find the first principal component, we want to maximize the variance of <math>\,W^TX</math>.<br />
<br />
The variance of <math>\,W^TX</math> is <math>\,W^TSW</math> where <math>\,S</math> is the covariance matrix of X.<br />
<br />
<math>\, S = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^T</math>, where <math>\, \mu</math> is the mean of the data points <math>\, x_i</math>.<br />
<br />
<br />
So we have to solve the problem<br />
<br />
<math>\, \text {Max } W^TSW</math><br />
<br />
<math>\, \text{such that } W^TW = 1</math>.<br />
<br />
<br />
We restrict W to unit vectors, as otherwise the maximum is unbounded. We are only looking for the direction of the vector; its actual scale is irrelevant.<br />
<br />
Using the method of Lagrange multipliers, we have<br />
<br />
<math>\,L(W, \lambda) = W^TSW - \lambda(W^TW - 1) </math><br />
<br />
We set<br />
<br />
<math>\, \frac{\partial L}{\partial W} = 0 </math><br />
<br />
<br />
<br />
Note that <math>\, W^TSW</math> is a quadratic form. So we have<br />
<br />
<br />
<br />
<math>\, \frac{\partial L}{\partial W} = 2SW - 2\lambda W = 0 </math><br />
<br />
<math>\, SW = \lambda W </math><br />
<br />
Since <math>\, SW = \lambda W</math> with <math>\, \lambda</math> a scalar, W is an eigenvector of S and <math>\, \lambda</math> is its corresponding eigenvalue.<br />
<br />
Suppose that<br />
<br />
<math>\, \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D</math><br />
are the eigenvalues of S and <math>\, u_1, u_2, \cdots, u_D</math> are their corresponding eigenvectors.<br />
<br />
We want to choose some <math>\, W = u </math><br />
<br />
<math>\,u^TSu =u^T\lambda u = \lambda u^Tu = \lambda</math><br />
<br />
So to maximize <math>\, u^TSu</math>, choose the eigenvector corresponding to the maximum eigenvalue, i.e. <math>\, u_1</math>.<br />
<br />
So we let <math>\, W = u_1 </math> be the first principal component.<br />
<br />
The principal components decompose the total variance in the data:<br />
<br />
<math>\, \sum_{i=1}^D \text{Var}(u_i^T x) = \sum_{i=1}^D \lambda_i = \text{Tr}(S) = \sum_{i=1}^D \text{Var}(x_i)</math><br />
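As a short hedged sketch (the two-dimensional correlated data below is made up purely for illustration), the first principal component can be found directly from the eigendecomposition of the sample covariance matrix:<br />
<pre><br />
n = 500;<br />
% toy data: strongly correlated 2-D points (illustrative only)<br />
x1 = randn(1,n);<br />
x2 = 2*x1 + 0.5*randn(1,n);<br />
X = [x1; x2];                    % data in R^(2 x n), one point per column<br />
Xc = X - mean(X,2)*ones(1,n);    % center the data<br />
S = (Xc*Xc')/n;                  % sample covariance matrix<br />
[U, Lambda] = eig(S);<br />
[mx, idx] = max(diag(Lambda));<br />
w = U(:,idx)                     % first principal component (max variance direction)<br />
y = w' * Xc;                     % one-dimensional projection of the data<br />
</pre><br />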
<br />
<br><br />
===Singular Value Decomposition===<br />
Singular value decomposition is a "generalization" of eigenvalue decomposition "to rectangular matrices of size ''mxn''."<ref name="Abdel_SVD">Abdel-Rahman, E. (2011). Singular Value Decomposition [Lecture notes]. Retrieved from http://uwace.uwaterloo.ca</ref> Singular value decomposition solves:<br><br><br />
:<math>\ A_{mxn}\ v_{nx1}=s\ u_{mx1}</math><br><br><br />
"for the right singular vector ''v'', the singular value ''s'', and the left singular vector ''u''. There are ''n'' singular values ''s''<sub>''i''</sub> and ''n'' right and left singular vectors that must satisfy the following conditions"<ref name="Abdel_SVD"/>:<br />
# "All singular values are non-negative"<ref name="Abdel_SVD"/>, <br> <math>\ s_i \ge 0.</math><br />
# All "right singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>, <br> <math>\ v_iv_j=\delta_{i,j}.</math><br />
# All "left singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>, <br> <math>\ u_iu_j=\delta_{i,j}.</math><br />
where<br />
:<math>\delta_{i,j}=\left\{\begin{matrix}1 & \mathrm{if}\ i=j \\ 0 & \mathrm{if}\ i\neq j\end{matrix}\right.</math><br><br><br />
<br />
'''Procedure to find the singular values and vectors'''<br><br />
Observe the following about the eigenvalue decomposition of a real symmetric matrix ''A'' (so that ''A''<sup>''T''</sup> = ''A''), where ''v'' is the unit eigenvector:<br><br />
::<math><br />
\begin{align}<br />
& Av=\lambda v \\<br />
& A^TAv=A^T(\lambda v)=\lambda A^Tv \\<br />
& A^TAv=\lambda Av \\<br />
& A^TAv=\lambda^2v<br />
\end{align}<br />
</math><br />
As a result:<br />
# "The matrices ''A'' and ''A''<sup>''T''</sup>''A'' have the same eigenvectors."<ref name="Abdel_SVD"/><br />
# "The eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are the square of the eigenvalues of matrix ''A''."<ref name="Abdel_SVD"/><br />
# Since matrix ''A''<sup>''T''</sup>''A'' is symmetric,<br />
## "all the eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are real and distinct."<ref name="Abdel_SVD"/><br />
## "the eigenvectors of matrix ''A''<sup>''T''</sup>''A'' are orthogonal and can be chosen to be orthonormal."<ref name="Abdel_SVD"/><br />
# "The eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are non-negative"<ref name="Abdel_SVD"/> since <math>\ \lambda^2_i \ge 0.</math><br />
Conclusions 3 and 4 are "true even for a rectangular matrix ''A'' since ''A''<sup>''T''</sup>''A'' is still a square symmetric matrix"<ref name="Abdel_SVD"/> and its eigenvalues and eigenvectors can be found.<br><br><br />
Therefore, for a rectangular matrix ''A'', assuming ''m>n'', the singular values and vectors can be found by:<br />
# "Form the ''nxn'' symmetric matrix ''A''<sup>''T''</sup>''A''."<ref name="Abdel_SVD"/><br />
# Perform an eigenvalue decomposition to get ''n'' eigenvalues and their "corresponding eigenvectors, ordered such that"<ref name="Abdel_SVD"/> <br><math>\lambda^2_1 \ge \lambda^2_2 \ge \dots \ge \lambda^2_n \ge 0</math> and <math>\{v_1, v_2, \dots, v_n\}.</math><br />
# "The singular values are"<ref name="Abdel_SVD"/>: <br><math>s_1=\sqrt{\lambda^2_1} \ge s_2=\sqrt{\lambda^2_2} \ge \dots \ge s_n=\sqrt{\lambda^2_n} \ge 0.</math><br>"The non-zero singular values are distinct; the equal sign applies only to the singular values that are equal to zero."<ref name="Abdel_SVD"/><br />
# "The ''n''-dimensional right singular vectors are"<ref name="Abdel_SVD"/><br><math>\{v_1, v_2, \dots, v_n\}.</math><br />
# "For the first <math>r \le n</math> singular values such that ''s''<sub>''i''</sub> ''> 0'', the left singular vectors are obtained as unit vectors"<ref name="Abdel_SVD"/> by <math>\tfrac{1}{s_i}Av_i=u_i.</math><br />
# Select "the <math>\ m-r</math> left singular vectors corresponding to the zero singular values such that they are unit vectors orthogonal to each other and to the first ''r'' left singular vectors"<ref name="Abdel_SVD"/> <math>\{u_1, u_2, \dots, u_r\}.</math><br><br><br />
<br />
'''Finding Singular value Decomposition Using MATLAB Code'''<br />
Please refer to the following link: http://www.mathworks.com/help/techdoc/ref/svd-singular-value-decomposition.html<br />
<br />
'''Formal definition'''<br><br />
"We can now decompose the rectangular matrix ''A'' in terms of singular values and vectors as follows"<ref name="Abdel_SVD"/>:<br><br><br />
<math>A_{mxn} \begin{bmatrix} v_1 & | & \cdots & | & v_n \end{bmatrix}_{nxn} = \begin{bmatrix} u_1 & | & \cdots & | & u_n & | & u_{n+1} & | & \cdots & | & u_m \end{bmatrix}_{mxm} \begin{bmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_n \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}_{mxn}</math><br><br />
:<math>\ AV=US</math><br><br><br />
Since "the matrices ''V'' and ''U'' are orthogonal"<ref name="Abdel_SVD"/>, ''V ''<sup>''-1''</sup>=''V''<sup>T</sup> and ''U ''<sup>''-1''</sup>=''U''<sup>T</sup>:<br><br><br />
:<math>\ A=USV^T</math><br><br><br />
"which is the formal definition of the singular value decomposition."<ref name="Abdel_SVD"/><br><br><br />
<br />
'''Relevance to PCA'''<br><br />
In order to perform PCA, one needs to perform an eigenvalue decomposition of the covariance matrix. If the data are first centered so that every attribute has mean zero, the covariance matrix simplifies to:<br><br><br />
<math>\ S=XX^T</math><br><br><br />
Since the eigenvectors of ''XX''<sup>''T''</sup> are exactly the left singular vectors of ''X'', an alternative and more consistent method for performing PCA (the eigenvalue decomposition does not exist when the matrix of eigenvectors is not invertible) is through the singular value decomposition of ''X''.<br />
<br />
The following MATLAB code uses singular value decomposition for performing PCA; 20 principal components, and thus the top 20 maximum variation directions, are selected for reconstructing facial images that have had noise applied to them:<br />
<br />
load noisy.mat<br />
%first noisy image; each image has a resolution of 20x28<br />
imagesc(reshape(X(:,1),20,28)')<br />
%to grayscale<br />
colormap gray<br />
%singular value decomposition <br />
[u s v]=svd(X);<br />
%reduced feature space: 20 principal components<br />
Xh=u(:,1:20)*s(1:20,1:20)*v(:,1:20)';<br />
figure<br />
imagesc(reshape(Xh(:,1),20,28)')<br />
colormap gray<br />
<br />
The reconstruction from the reduced feature space is essentially noise-free because the added noise has less variation than the directions captured by the top 20 principal components, and is therefore discarded along with the remaining components.<br />
<br />
=='''References'''==<br />
<br />
<references/><br />
<br />
==''' PCA and Introduction to Kernel Function-November,10,2011'''==<br />
===Continue with the last lecture===<br />
Some notation:<br />
Let <math>\displaystyle X_{d\times n}</math> be a matrix. <br />
<br />
Let <math>\displaystyle X_j,\ j=1,2,...,n</math> be the j-th data point, where <math>\displaystyle X_j\in\R^d</math>.<br />
<br />
Let <math>\displaystyle Q=\sum_{j=1}^n(X_j-\bar{X})(X_j-\bar{X})^T</math>, where <math> \bar{X}=\frac{1}{n}\sum_{j=1}^n X_j</math>.<br />
<br />
Now assume that we have already centered the data, which means that <math>\displaystyle Q=\sum_{j=1}^n(X_j)(X_j)^T=X X^T </math>.<br />
<br />
*Find the principal components, which means finding the eigenvectors of Q; equivalently, do the singular value decomposition, [u s v]=svd(X), where the columns of u are the eigenvectors of <math>\displaystyle Q=X X^T</math>.<br />
<br />
*Map the data to a lower dimensional space.<br />
We can choose the first p (p<d) eigenvectors, which means <math>\displaystyle u^T</math> is a <math>\displaystyle p\times d</math> matrix.<br />
Thus, we can project our original data points <math>\displaystyle x_j</math> into p dimensions.<br />
Mathematically, it is <math>\displaystyle Y_{p\times n}={u^T}_{p\times d} X_{d\times n}</math>. This means that we reduce our original d variables to p principal components.<br />
<br />
*Reconstruct Points.<br />
We can also project the dimension-reduced data back to the original high dimensional space.<br />
However, we will lose some information, because when we map the points to the lower dimension we throw away the last (d-p) eigenvectors, which contain some of the original information.<br />
Since the columns of <math>\displaystyle u</math> are orthonormal, we have <math> \hat{x}_{d\times n} = u_{d\times p} Y_{p\times n}=u_{d\times p}{u^T}_{p\times d} x_{d\times n} \approx x_{d\times n} </math>.<br />
<br />
*Map a new data point to the lower dimensional space and reconstruct it back in the high dimensional space: <math>\displaystyle y_{p\times 1}={u^T}_{p\times d}\, x_{d\times 1}</math> and <math>\displaystyle \hat{x}_{d\times 1}=u_{d\times p}\, y_{p\times 1}</math> (a short sketch follows).<br />
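<br />
A minimal MATLAB sketch of this last step, assuming u has already been computed from the data and x_new is a hypothetical new d-by-1 point:<br />
<br />
<pre><br />
p = 2;                           % number of principal components kept<br />
y_new = u(:,1:p)' * x_new;       % project the new point to p dimensions<br />
x_rec = u(:,1:p) * y_new;        % reconstruct it back in the original d dimensions<br />
</pre><br />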
<br />
===3 and 2 digits example===<br />
The data X is a 64 by 400 matrix. Every column can be displayed as an image of either a "2" or a "3". The first 200 columns are "2"s and the last 200 columns are "3"s.<br />
We can first center the data, and then take the first p (p<d) columns of u from the singular value decomposition.<br />
<br />
MATLAB CODE:<br />
MU=repmat(mean(X,2),1,400);<br />
% mean(X,2) is the average of each row <br />
%In order to center the data, we replicate mean(X,2), which is a 64 by 1 vector, into a 64 by 400 matrix<br />
Xt=X-MU;<br />
% modify the data to zero mean data<br />
[u s v]=svd(Xt);<br />
%note that size(u)=64*64, and the columns of u are eigenvectors of the covariance matrix<br />
Y=u(:,1:2)'*X;<br />
%use the first two PCs to transform the high dimensional points to lower dimensional ones<br />
One way to look at this case is to plot Principal Component #1 against Principal Component #2 in a two dimensional space.<br />
plot(Y(1,:)',Y(2,:)')<br />
The result is as follows; we can clearly see that there are two classes.<br />
<br />
[[file:pca2.png|350px|400px]]<br />
<br />
To examine the difference between these two classes more closely, we can separate the first 200 columns from the last 200 columns and check whether the two types of digits form clearly different groups.<br />
plot(Y(1,1:200)',Y(2,1:200)','d')<br />
% Note that the first 200 columns represent digit "2",and are in the form of "diamond"<br />
hold on<br />
% draw different graphs in one figure<br />
plot(Y(1,201:400)',Y(2,201:400)','ro')<br />
% Note that the first 200 columns represent digit "3",and are in the form of "o"<br />
<br />
[[file:pca3.png|350px|400px]]<br />
<br />
image=reshape(X,8,8,400);<br />
plotdigits(image,Y,.1,1);<br />
The result can be seen more clearly in the following picture.<br />
The "3"s and the "2"s are clearly separated.<br />
<br />
[[file:Pca.png|350px|400px]]<br />
<br />
===Introduction to Kernel Function===<br />
PCA is useful when the data points lie in, or close to, a linear subspace such as a plane; in other words, PCA is powerful when dealing with linear problems. But when the data points lie on a nonlinear manifold, PCA is hard to apply directly. There is a solution to this problem: we can use a "trick" to change nonlinear classification problems into linear ones. This is called the "Kernel Trick".<br />
<br />
'''An intuitive example'''<br />
<br />
[[File:Kernel trick.png|400px|300px]]<br />
<br />
From the picture, we can see that the red dots lie in the middle of the blue ones. It is impossible to separate the two classes using a line (i.e. a linear boundary in the two dimensional space). But we can lift the red points out of the two dimensional space into a three dimensional space, in which case we can easily tell the classes apart.<br />
<br />
For more details about this trick, please see http://omega.albany.edu:8008/machine-learning-dir/notes-dir/ker1/ker1.pdf<br />
<br />
More precisely, the significance of a kernel function is that it lets us work with the data in a high-dimensional feature space implicitly, without ever computing the mapping explicitly.<br />
Let's look at how this is possible:<br />
<br />
<math>Z_1=<br />
\begin{bmatrix}<br />
x_1\\<br />
y_1<br />
\end{bmatrix}\xrightarrow{\phi}<br />
</math><br />
<math>\phi(Z_1)=<br />
\begin{bmatrix}<br />
x_1^2\\<br />
y_1^2\\<br />
\sqrt2x_1y_1<br />
\end{bmatrix}.<br />
<br />
</math><br />
<math>Z_2=<br />
\begin{bmatrix}<br />
x_2\\<br />
y_2<br />
\end{bmatrix}\xrightarrow{\phi}<br />
</math><br />
<math>\phi(Z_2)=<br />
\begin{bmatrix}<br />
x_2^2\\<br />
y_2^2\\<br />
\sqrt2x_2y_2<br />
\end{bmatrix}<br />
</math><br />
<br />
The inner product of <math>\displaystyle \phi(Z_1)</math> and <math>\displaystyle\phi(Z_2)</math>, which is written as <math>\displaystyle\phi(Z_1)^T\phi(Z_2)</math>, is equal to:<br />
<math><br />
\begin{bmatrix}<br />
x_1^2&y_1^2&\sqrt2x_1y_1 <br />
\end{bmatrix}<br />
\begin{bmatrix}<br />
x_2^2\\<br />
y_2^2\\<br />
\sqrt2x_2y_2 <br />
\end{bmatrix}=</math> <math>\displaystyle (x_1x_2+y_1y_2)^2=K(Z_1,Z_2)</math>.<br />
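<br />
We can verify this identity numerically; a small MATLAB sketch with two arbitrary two-dimensional points:<br />
<br />
<pre><br />
z1 = [1; 2];  z2 = [3; -1];                       % two arbitrary points<br />
phi = @(z) [z(1)^2; z(2)^2; sqrt(2)*z(1)*z(2)];   % the explicit feature map above<br />
phi(z1)' * phi(z2)                                % inner product in the feature space<br />
(z1' * z2)^2                                      % the kernel K(z1,z2) computed directly; same value<br />
</pre><br />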
<br />
'''The most common Kernel functions are as follows:'''<br />
*Linear: <math>\displaystyle K_{ij}=<X_i,X_j></math><br />
*Polynomial: <math>\displaystyle K_{ij}=(1+<X_i,X_j>)^p</math><br />
*Gaussian: <math>\displaystyle K_{ij}=e^\frac{-{\left\Vert X_i-X_j\right\Vert}^2}{2\sigma^2}</math>,<br />
where <math>\displaystyle <X_i,X_j></math> denotes the inner product of <math>\displaystyle X_i</math> and <math>\displaystyle X_j</math>, and <math>{\left\Vert X_i-X_j\right\Vert}^2</math> denotes the squared Euclidean distance between the vectors <math>\displaystyle X_i</math> and <math>\displaystyle X_j</math>.<br />
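<br />
For instance, a Gaussian kernel matrix for a data matrix X whose columns are the data points can be built as in the following sketch (the data and the bandwidth sigma are hypothetical):<br />
<br />
<pre><br />
X = randn(2, 100);                         % 100 two-dimensional points stored as columns<br />
sigma = 1;                                 % hypothetical kernel bandwidth<br />
sq = sum(X.^2, 1);                         % squared norm of each column<br />
D2 = bsxfun(@plus, sq', sq) - 2*(X'*X);    % matrix of pairwise squared distances<br />
K = exp(-D2 / (2*sigma^2));                % 100 by 100 Gaussian kernel matrix<br />
</pre><br />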
<br />
<br />
==''' Kernel PCA -November,15,2011'''==<br />
<br />
First we look at the algorithm for PCA and see how we can kernelize PCA:<br />
<br />
=== PCA ===<br />
<br />
Find the eigenvectors of <math>XX^T</math>, and collect them as the columns of U.<br />
<br />
<math><br />
\begin{align}<br />
Y &= U^{T}X \\<br />
\hat{X} & = UY<br />
\end{align}<br />
</math><br />
<br />
=== To solve PCA ===<br />
<br />
<math><br />
\begin{align}<br />
[ U \Sigma V ] & = svd(X) \\<br />
X & = U\Sigma{V^T}<br />
\end{align}<br />
</math><br />
<br />
The columns of U are the eigenvectors of <math>XX^T</math>.<br />
<br />
The columns of V are the eigenvectors of <math>X^T{X}</math>.<br />
<br />
Now we want to kernelize this classical version of PCA.<br />
<br />
We would like to express everything in terms of V, the matrix of eigenvectors of <math>X^T{X}</math>, because that is the part that can be kernelized. This is called Dual PCA.<br />
<br />
<math><br />
\begin{align}<br />
X&= U \Sigma V^T \\<br />
XV&=U \Sigma V^T V = U\Sigma \\<br />
U&=XV\Sigma^{-1}<br />
\end{align}<br />
</math><br />
<br />
Find the eigenvectors of <math>X^TX</math>, and call them V.<br />
<math><br />
\begin{align}<br />
X &= U \Sigma V^T \\<br />
U^T X &= U^T U \Sigma V^T \\<br />
U^T X &= \Sigma V^T \\<br />
Y &= \Sigma V^T<br />
\end{align}<br />
</math><br />
<br />
Reconstruct Points<br />
<br />
<math><br />
\begin{align}<br />
\hat{X}&=UY \\<br />
\hat{X}&=XV\Sigma^{-1}\Sigma{V^T} \\<br />
\hat{X} &= XVV^T<br />
\end{align}<br />
</math><br />
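<br />
A short MATLAB sketch of dual PCA on an example data matrix (columns are observations), assuming d > n so that <math>X^T{X}</math> has full rank:<br />
<br />
<pre><br />
X = randn(50, 20);                 % example data: d=50 dimensions, n=20 points<br />
[V, D] = eig(X'*X);                % eigen-decomposition of the small n-by-n matrix X'X<br />
[d2, idx] = sort(diag(D), 'descend');<br />
V = V(:, idx);                     % right singular vectors of X<br />
Sigma = diag(sqrt(d2));            % singular values of X<br />
Y = Sigma * V';                    % low-dimensional representation Y = Sigma V^T<br />
Xhat = X * (V * V');               % reconstruction X V V^T (equals X when all n components are kept)<br />
</pre><br />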
<br />
Map an out of sample point x to low-dimensional space<br />
<math>Y=U^TX = (XV\Sigma^{-1})^TX = \Sigma^{-1}{V^T}{X^T}X</math><br />
Reconstruct an out of sample point <br />
<math>\hat{X}=UY=XV\Sigma^{-1}\Sigma^{-1}V^T{X^T}X = XV\Sigma^{-2}V^T{X^T}X</math></div>S9huhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341f11&diff=14854stat341f112011-11-15T18:57:19Z<p>S9hu: </p>
<hr />
<div>Please contribute to the discussion of splitting up this page into multiple pages on the [[{{TALKPAGENAME}}|talk page]].<br />
<br />
==[[signupformStat341F11| Editor Sign Up]]==<br />
<br />
==Notation==<br />
<br />
The following guidelines on notation were posted on the Wiki Course Note page for [[Stat946f11|STAT 946]]. Add to them as necessary for consistent notation on this page.<br />
<br />
Capital letters will be used to denote random variables and lower case letters denote observations for those random variables:<br />
<br />
* <math>\{X_1,\ X_2,\ \dots,\ X_n\}</math> random variables<br />
* <math>\{x_1,\ x_2,\ \dots,\ x_n\}</math> observations of the random variables<br />
<br />
The joint ''probability mass function'' can be written as:<br />
<center><math> P( X_1 = x_1, X_2 = x_2, \dots, X_n = x_n )</math></center><br />
or as shorthand, we can write this as <math>p( x_1, x_2, \dots, x_n )</math>. In these notes both types of notation will be used.<br />
We can also define a set of random variables <math>X_Q</math> where <math>Q</math> represents a set of subscripts.<br />
<br />
<br />
==Sampling - September 20, 2011==<br />
<br />
The meaning of sampling is to generate data points or numbers such that these data follow a certain distribution.<br /><br />
i.e. From <math>x \sim~f(x)</math> sample <math>\,x_{1}, x_{2}, ..., x_{1000}</math><br />
<br />
In practice, it maybe difficult to find the joint distribution of random variables. Through simulating the random variables, we can make an inference from the data.<br />
<br />
===Sampling from Uniform Distribution===<br />
Computers cannot generate random numbers as they are deterministic; however they can produce pseudo random numbers using algorithms. Generated numbers mimic the properties of random numbers but they are never truly random. One famous algorithm that is considered highly reliable is the Mersenne twister[http://en.wikipedia.org/wiki/Mersenne_twister], which generates random numbers in an almost uniform distribution. <br />
<br />
<br />
====Multiplicative Congruential====<br />
*involves four parameters: integers <math>\,a, b, m</math>, and an initial value <math>\,x_0</math> which we call the seed<br />
*a sequence of integers is defined as<br />
:<math>x_{k+1} \equiv (ax_{k} + b) \mod{m}</math><br />
<br />
'''Example:''' <math>\,a=13, b=0, m=31, x_0=1</math> creates a uniform histogram.<br />
<br />
MATLAB code for generating 1000 random numbers using the multiplicative congruential method:<br />
<br />
<pre><br />
a = 13;<br />
b = 0;<br />
m = 31;<br />
x(1) = 1;<br />
<br />
for ii = 2:1000<br />
x(ii) = mod(a*x(ii-1)+b, m);<br />
end<br />
</pre><br />
<br />
MATLAB code for displaying the values of x generated:<br />
<br />
<pre><br />
x<br />
</pre><br />
<br />
MATLAB code for plotting the histogram of x:<br />
<br />
<pre><br />
hist(x)<br />
</pre><br />
<br />
Histogram Output:<br />
<br />
[[File:uniform.jpg]]<br />
<br />
Facts about this algorithm:<br />
*In this example, the first 30 terms in the sequence are a permutation of integers from 1 to 30 and then the sequence repeats itself.<br />
*Values are between <b>0</b> and <b>m-1</b>, inclusive.<br />
*Dividing the numbers by <b> m-1 </b> yields numbers in the interval <b>[0,1]</b>.<br />
*MATLAB's <code>rand</code> function once used this algorithm with <b>a= 7<sup>5</sup></b>, <b>b= 0</b>, <b>m= 2<sup>31</sup>-1</b>,for reasons described in Park and Miller's 1988 paper "Random Number Generators: Good Ones are Hard to Find" (available [http://www.firstpr.com.au/dsp/rand31/p1192-park.pdf online]).<br />
*Visual Basic's <code>RND</code> function also used this algorithm with <b>a= 1140671485</b>, <b>b= 12820163</b>, <b>m= 2<sup>24</sup></b>. ([http://support.microsoft.com/kb/231847 Reference])<br />
<br />
===Inverse Transform Method===<br />
This is a basic method for sampling. Theoretically using this method we can generate sample numbers at random from any probability distribution once we know its cumulative distribution function (cdf).<br />
<br />
====Theorem====<br />
Take <math>U \sim~ \mathrm{Unif}[0, 1]</math> and let <math>X = F^{-1}(U) </math>. Then <math>X</math> has distribution function <math>F(\cdot)</math>, where <math>F(x)=P(X \leq x)</math> and <math>F^{-1}(\cdot)</math> is the inverse of <math>F(\cdot)</math>.<br />
<br />
Therefore <math>F(x)=u\implies x=F^{-1}(u)</math><br />
<br />
'''Proof'''<br />
<br />
Recall that<br />
<br />
:<math>P(a \leq X<b)=\int_a^{b} f(x) dx</math><br />
<br />
:<math>cdf=F(x)=P(X \leq x)=\int_{-\infty}^{x} f(x) dx</math><br />
<br />
Note that if <math>U \sim~ \mathrm{Unif}[0, 1]</math>, we have <math>P(U \leq a)=a</math><br />
<br />
:<math>\begin{align}<br />
<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
====Continuous Case====<br />
Generally it takes two steps to get random numbers using this method.<br />
<br />
*Step 1. Draw <math>U \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <b><i>X=F <sup>&minus;1</sup>(U)</i></b><br />
<br />
'''Example'''<br />
<br />
Take the exponential distribution for example<br />
:<math>\,f(x)={\lambda}e^{-{\lambda}x}</math><br />
:<math>\,F(x)=\int_0^x {\lambda}e^{-{\lambda}u} du=[-e^{-{\lambda}u}]_0^x=1-e^{-{\lambda}x}</math><br />
<br />
Let: <math>\,F(x)=y</math><br />
:<math>\,y=1-e^{-{\lambda}x}</math><br />
:<math>\,ln(1-y)={-{\lambda}x}</math><br />
:<math>\,x=\frac{ln(1-y)}{-\lambda}</math><br />
:<math>\,F^{-1}(x)=\frac{-ln(1-x)}{\lambda}</math><br />
<br />
Therefore, to get an exponential distribution from a uniform distribution takes 2 steps.<br />
*Step 1. Draw <math>U \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <math>x=\frac{-ln(1-U)}{\lambda}</math><br />
<br />
Note: If U~Unif[0, 1], then (1 - U) and U have the same distribution. This allows us to slightly simplify step 2 into an alternate form:<br />
*Alternate Step 2. <math>x=\frac{-ln(U)}{\lambda}</math><br />
<br />
'''MATLAB code'''<br />
for the exponential distribution case, assuming <math>\lambda=0.5</math><br />
<br />
<pre><br />
for ii = 1:1000<br />
u = rand;<br />
x(ii) = -log(1-u)/0.5;<br />
end<br />
hist(x)<br />
</pre><br />
<br />
MATLAB result<br />
<br />
[[File:MATLAB_Exp.jpg|center|300px]]<br />
<br />
====Discrete Case - September 22, 2011====<br />
This same technique can be applied to the discrete case. Generate a discrete random variable <math>\,x</math> that has probability mass function <math>\,P(X=x_i)=P_i </math> where <math>\,x_0<x_1<x_2...</math> and <math>\,\sum_i P_i=1</math><br />
*Step 1. Draw <math>u \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <math>\,x=x_i</math> if <math>\,F(x_{i-1})<u \leq F(x_i)</math><br />
<br />
'''Example'''<br />
<br />
Let x be a discrete random variable with the following probability mass function:<br />
<br />
:<math>\begin{align}<br />
P(X=0) = 0.3 \\<br />
P(X=1) = 0.2 \\<br />
P(X=2) = 0.5<br />
\end{align}</math><br />
<br />
Given the pmf, we now need to find the cdf.<br />
<br />
We have:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0 & x < 0 \\<br />
0.3 & 0 \leq x < 1 \\<br />
0.5 & 1 \leq x < 2 \\<br />
1 & 2 \leq x<br />
\end{cases}</math><br />
<br />
We can apply the inverse transform method to obtain our random numbers from this distribution.<br />
<br />
'''Pseudo Code for generating the random numbers:'''<br />
<pre><br />
Draw U ~ Unif[0,1] <br />
if U <= 0.3 <br />
return 0 <br />
else if 0.3 < U <= 0.5 <br />
return 1<br />
else if 0.5 < U <= 1 <br />
return 2<br />
</pre><br />
<br />
'''MATLAB code for generating 1000 random numbers in the discrete case:'''<br />
<br />
<pre><br />
for ii = 1:1000<br />
u = rand;<br />
<br />
if u <= 0.3<br />
x(ii) = 0;<br />
elseif u <= 0.5<br />
x(ii) = 1;<br />
else<br />
x(ii) = 2;<br />
end<br />
end<br />
</pre><br />
<br />
Matlab Output:<br />
<br />
[[File:Discreteinv.jpg]]<br />
<br />
'''Pseudo code for the Discrete Case:'''<br />
<br />
1. Draw U ~ Unif [0,1]<br />
<br />
2. If <math> U \leq P_0 </math>, deliver <b><i>X= x<sub>0</sub></i></b><br />
<br />
3. Else if <math> U \leq P_0 + P_1 </math>, deliver <b><i>X= x<sub>1</sub></i></b><br />
<br />
4. Else If <math> U \leq P_0 +....+ P_k </math>, deliver <b><i>X= x<sub>k</sub></i></b><br />
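<br />
A general MATLAB sketch of this pseudo code, using the cumulative sums of the probabilities (the vector p below is the hypothetical pmf from the example above):<br />
<br />
<pre><br />
p = [0.3 0.2 0.5];                   % hypothetical pmf over the values 0, 1, 2<br />
F = cumsum(p);                       % cdf values F(x_0), F(x_1), F(x_2)<br />
x = zeros(1, 1000);<br />
for ii = 1:1000<br />
    u = rand;<br />
    x(ii) = find(u <= F, 1) - 1;     % smallest index i with u <= F(x_i), shifted to the values 0,1,2<br />
end<br />
hist(x)<br />
</pre><br />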
<br />
====Limitations====<br />
<br />
Although this method is useful, it isn't practical in many cases since we can't always obtain <math>F</math> or <math> F^{-1} </math> in closed form: some densities do not have an elementary antiderivative, some cdfs cannot be inverted analytically, and sometimes even <math>f(x)</math> itself cannot be obtained in closed form. Let's look at some examples:<br />
*Continuous case<br />
If we want to use this method to sample from the '''normal distribution''', we get stuck trying to find its ''cdf''. <br />
The simplest case, the standard '''normal distribution''', has density <math>f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}</math>,<br />
whose ''cdf'' is <math>F(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}{e^{-\frac{u^2}{2}}}du</math>. This integral cannot be expressed in terms of elementary functions. So evaluating it and then finding the inverse is a very difficult task.<br />
*Discrete case <br />
It is easy for us to simulate when there are only a few values taken by the particular random variable, like the case above.<br />
And it is easy to simulate the '''binomial distribution''' <math>X \sim~ \mathrm{B}(n,p)</math> when the parameter n is not too large.<br />
But when n takes on values that are very large, say 50, it is hard to do so.<br />
<br />
===Acceptance/Rejection Method===<br />
<br />
<br />
The aforementioned difficulties of the inverse transform method motivate a sampling method that does not require analytically calculating cdf's and their inverses, which is the acceptance/rejection sampling method. Here, <math> \displaystyle f(x)</math> is approximated by another function, say <math>\displaystyle g(x)</math>, with the idea being that <math>\displaystyle g(x)</math> is a "nicer" function to work with than <math>\displaystyle f(x)</math>.<br />
<br />
Suppose we assume the following:<br />
<br />
1. There exists another distribution <math>\displaystyle g(x)</math> that is easier to work with and that you know how to sample from, and<br />
<br />
2. There exists a constant c such that <math>f(x) \leq c \cdot g(x)</math> for all x<br />
<br />
Under these assumptions, we can sample from <math>\displaystyle f(x)</math> by sampling from <math>\displaystyle g(x)</math><br />
<br />
====General Idea====<br />
<br />
Looking at the image below we have graphed <math> c \cdot g(x) </math> and <math>\displaystyle f(x)</math>.<br />
<br />
[[File:Graph_updated.jpg]]<br />
<br />
Using the acceptance/rejection method we will accept some of the points from <math>\displaystyle g(x)</math> and reject some of the points from <math>\displaystyle g(x)</math>. The points that are accepted from <math>\displaystyle g(x)</math> will follow the distribution <math>\displaystyle f(x)</math>. We can see from the image that the values around <math>\displaystyle x_1</math> will be sampled more often under <math>c \cdot g(x)</math> than under <math>\displaystyle f(x)</math>, so we will have to reject more samples taken at x<sub>1</sub>. Around <math>\displaystyle x_2</math> the number of samples that are drawn and the number of samples we need are much closer, so we accept a larger proportion of the samples we get at <math>\displaystyle x_2</math>.<br />
<br />
====Procedure====<br />
<br />
1. Draw y ~ g<br />
<br />
2. Draw U ~ Unif [0,1]<br />
<br />
3. If <math> U \leq \frac{f(y)}{c \cdot g(y)}</math> then x=y; else return to 1<br />
<br />
Note that the choice of <math> c </math> plays an important role in the efficiency of the algorithm. We want <math> c \cdot g(x) </math> to be "tightly fit" over <math> f(x) </math> to increase the probability of accepting points, and therefore reducing the number of sampling attempts. Mathematically, we want to minimize <math> c </math> such that <math>f(x) \leq c \cdot g(x) \ \forall x</math>. We do this by setting<br />
<br />
<math> \frac{d}{dx}(\frac{f(x)}{g(x)}) = 0 </math>, solving for a maximum point <math> x_0 </math> and setting <math> c = \frac{f(x_0)}{g(x_0)}. </math><br />
<br />
====Proof====<br />
<br />
Mathematically, we need to show that the sample points given that they are accepted have a distribution of f(x).<br />
<br />
<math>\begin{align} P(y|accepted) &= \frac{P(y, accepted)}{P(accepted)} \\<br />
<br />
&= \frac{P(accepted|y) P(y)}{P(accepted)}\end{align} </math> (Bayes' Rule)<br />
<br />
<br />
<br />
<math>\displaystyle P(y) = g(y)</math><br />
<br />
<math>P(accepted|y) =P(u\leq \frac{f(y)}{c \cdot g(y)}) =\frac{f(y)}{c \cdot g(y)} </math>,where u ~ Unif [0,1]<br />
<br />
<math>P(accepted) = \sum P(accepted|y)\cdot P(y)=\int^{}_y \frac{f(y)}{c \cdot g(y)}g(y) dy=\int^{}_y \frac{f(y)}{c} dy=\frac{1}{c} \cdot\int^{}_y f(y) dy=\frac{1}{c}</math><br />
<br />
So,<br />
<br />
<math> P(y|accepted) = \frac{ \frac {f(y)}{c \cdot g(y)} \cdot g(y)}{\frac{1}{c}} =f(y) </math><br />
<br />
====Continuous Case====<br />
<br />
'''Example'''<br />
<br />
Sample from Beta(2,1)<br />
<br />
In general:<br />
<br />
Beta(<math>\alpha, \beta) = \frac{\Gamma (\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}</math> <math>\displaystyle x^{\alpha-1}</math> <math>\displaystyle(1-x)^{\beta-1}</math>, <math>\displaystyle 0<x<1</math><br />
<br />
Note: <math>\!\Gamma(n) = (n-1)!</math> if n is a positive integer<br />
<br />
<math>\begin{align} f(x) &= Beta(2,1) \\<br />
&= \frac{\Gamma(3)}{\Gamma(2)\Gamma(1)} x^1(1-x)^0 \\<br />
&= \frac{2!}{1! 0!}\cdot (1) x \\<br />
&= 2x \end{align}</math><br />
<br />
We want to choose <math>\displaystyle g(x)</math> that is easy to sample from. So we choose <math>\displaystyle g(x)</math> to be uniform distribution.<br />
<br />
We now want a constant c such that <math>f(x) \leq c \cdot g(x) </math> for all x from Unif(0,1)<br />
<br />
<br />
So,<br /><br />
<br />
<math>c \geq \frac{f(x)}{g(x)}</math>, for all x from (0,1)<br />
<br />
<br />
<math>\begin{align}c &\geq max (\frac {f(x)}{g(x)}, 0<x<1) \\<br />
<br />
<br />
&= max (\frac {2x}{1},0<x<1) \\<br />
<br />
<br />
&= 2 \end{align}</math><br />
<br />
<br />
<br />
Now that we have c =2,<br />
<br />
1. Draw y ~ g(x) => Draw y ~ Unif [0,1] <br />
<br />
2. Draw u ~ Unif [0,1] <br />
<br />
3. if <math>u \leq \frac{2y}{2 \cdot 1}</math> then x=y; else return to 1<br />
<br />
<br />
'''MATLAB code for generating 1000 samples following Beta(2,1):'''<br />
<br />
<pre><br />
close all<br />
clear all<br />
ii=1;<br />
while ii <= 1000<br />
y = rand;<br />
u = rand;<br />
<br />
if u <= y<br />
x(ii)=y;<br />
ii=ii+1;<br />
end<br />
end<br />
hist(x)<br />
</pre><br />
<br />
'''MATLAB result'''<br />
<br />
[[File:MATLAB_Beta.jpg]]<br />
<br />
====Discrete Example====<br />
<br />
Generate random variables according to the p.m.f:<br />
<br />
:<math>\begin{align}<br />
P(Y=1) = 0.15 \\<br />
P(Y=2) = 0.22 \\<br />
P(Y=3) = 0.33 \\<br />
P(Y=4) = 0.10 \\<br />
P(Y=5) = 0.20 <br />
\end{align}</math><br />
<br />
Take g(y) to be the discrete uniform distribution on {1, 2, 3, 4, 5}.<br />
<br />
<math>c \geq \frac{P(y)}{g(y)} </math><br><br />
<math>c = \max \left(\frac{P(y)}{g(y)} \right)</math><br><br />
<math>c = \max \left(\frac{0.33}{0.2} \right) = 1.65</math> Since P(Y=3) is the max of P(Y) and g(y) = 0.2 for all y.<br><br />
<br />
1. Generate Y according to the discrete uniform between 1 - 5<br />
<br />
2. U ~ unif[0,1]<br />
<br />
3. If <math>U \leq \frac{P(y)}{1.65 \times 0.2} = \frac{P(y)}{0.33} </math>, then x = y; else return to 1.<br />
<br />
In MATLAB, the code would be:<br />
<br />
py = [0.15 0.22 0.33 0.1 0.2];<br />
ii =1;<br />
while ii <= 1000<br />
y = unidrnd(5);<br />
u = rand;<br />
if u <= py(y)/0.33<br />
x(ii) = y;<br />
ii = ii+1;<br />
end<br />
end<br />
hist(x);<br />
<br />
MATLAB result<br />
<br />
[[File:MATLAB_Y.jpg]]<br />
<br />
====Limitations====<br />
<br />
Most of the time we have to sample many more points from g(x) before we can obtain an acceptable amount of samples from f(x), hence this method may not be computationally efficient. It depends on our choice of g(x). For example, in the example above to sample from Beta(2,1), we need roughly 2000 samples from g(X) to get 1000 acceptable samples of f(x).<br />
<br />
In addition, in situations where a g(x) function is chosen and used, there can be a discrepancy between the functional behaviors of f(x) and g(x) that renders this method inefficient. For example, given the normal distribution function as g(x) and an f(x) with a "fat" mid-section and "thin tails", the method becomes very inefficient, as most points sampled in the tails of g(x) will be rejected, resulting in an overwhelming number of sampling attempts due to the high rejection rate.<br />
<br />
===Sampling From Gamma and Normal Distribution - September 27, 2011===<br />
<br />
====Sampling From Gamma====<br />
<br />
'''Gamma Distribution'''<br />
<br />
The Gamma distribution is written as <math>X \sim~ Gamma (t, \lambda) </math><br />
<br />
:<math> F(x) = \int_{0}^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If you have t samples of the exponential distribution,<br><br />
<br> <math> \begin{align} X_1 \sim~ Exp(\lambda)\\ \vdots \\ X_t \sim~ Exp(\lambda) \end{align}<br />
</math><br />
<br />
The sum of these t samples has a gamma distribution,<br />
<br />
:<math> X_1+X_2+ ... + X_t \sim~ Gamma (t, \lambda) </math><br><br />
:<math> \sum_{i=1}^{t} X_i \sim~ Gamma (t, \lambda) </math> where <math>X_i \sim~Exp(\lambda)</math><br><br />
<br />
'''Method'''<br />
<br />
We can sample the exponential distribution using the inverse transform method from previous class,<br><br />
:<math>\,f(x)={\lambda}e^{-{\lambda}x}</math><br><br />
:<math>\,F^{-1}(u)=\frac{-ln(1-u)}{\lambda}</math><br><br />
:<math>\,F^{-1}(u)=\frac{-ln(u)}{\lambda}</math> <br />
(1 - u) has the same distribution as u, since <math>U \sim~ unif [0,1] </math><br><br />
:<math> \begin{align} \frac{-ln(u_1)}{\lambda} = x_1\\ \vdots \\ \frac{-ln(u_t)}{\lambda} = x_t \end{align}<br />
:</math><br><br />
:<math> x_1 + x_2 + \dots + x_t = \frac {-\sum_{i=1}^{t} ln(u_i)}{\lambda} = x</math><br />
<br />
'''MATLAB code''' for a Gamma(3,1) is<br />
<br />
<pre><br />
x = sum(-log(rand(1000,3)),2); <br />
hist(x)<br />
</pre><br />
<br />
And the Histogram of X follows a Gamma distribution with long tail: <br />
<br />
[[File:Hist.PNG|center|500px]]<br />
<br />
We can improve the quality of the histogram by specifying the number of bins we want, e.g. hist(x, number_of_bins)<br />
<br />
<pre><br />
x = sum(-log(rand(20000,3)),2); <br />
hist(x,40)<br />
</pre><br />
<br />
[[File:untitled.jpg|center|500px]]<br />
<br />
''' R code''' for a Gamma(3,1) is<br />
<pre><br />
a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);<br />
hist(a);<br />
</pre><br />
And the histogram is <br />
<br />
[[File:hist_gamma.png|center|500px]]<br />
<br />
Here is another histogram of Gamma coding with R<br />
<pre><br />
a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);<br />
hist(a,freq=F);<br />
lines(density(a),col="blue");<br />
rug(jitter(a));<br />
</pre><br />
[[File:hist_gamma_2.png|center|500px]]<br />
<br />
====Sampling from Normal Distribution using Box-Muller Transform - September 29, 2011====<br />
<br />
=====Procedure=====<br />
<br />
# Generate <math>\displaystyle u_1</math> and <math>\displaystyle u_2</math>, two values sampled from a uniform distribution between 0 and 1.<br />
# Set <math>\displaystyle R^2 = -2log(u_1)</math> so that <math>\displaystyle R^2</math> is exponential with rate 1/2 (i.e. mean 2) <br> Set <math>\!\theta = 2*\pi*u_2</math> so that <math>\!\theta</math> ~ Unif[0, 2<math>\displaystyle\pi</math>]<br />
# Set <math>\displaystyle X = R cos(\theta)</math> <br> Set <math>\displaystyle Y = R sin(\theta)</math><br />
<br />
=====Justification=====<br />
<br />
Suppose we have X ~ N(0, 1) and Y ~ N(0, 1) where X and Y are independent normal random variables. The relative probability density function of these two random variables using Cartesian coordinates is:<br />
<br />
<math> f(X, Y) dxdy= f(X) f(Y) dxdy= \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \frac{1}{\sqrt{2\pi}}e^{-y^2/2} dxdy= \frac{1}{2\pi}e^{-(x^2+y^2)/2}dxdy </math> <br><br />
<br />
In polar coordinates <math>\displaystyle R^2 = x^2 + y^2</math>, so the relative probability density function of these two random variables using polar coordinates is:<br />
<br />
<math> f(R, \theta) = \frac{1}{2\pi}e^{-R^2/2} </math> <br><br />
<br />
If we have <math>\displaystyle R^2 \sim exp(1/2)</math> and <math>\!\theta \sim unif[0, 2\pi]</math> we get an equivalent relative probability density function. Notice that for this two-dimensional change of variables the determinant of the Jacobian must be included, according to the change of variable rule, where<br />
<br />
<math> |J|=\left|\frac{\partial(x,y)}{\partial(R,\theta)}\right|= \left|\begin{matrix}\frac{\partial x}{\partial R}&\frac{\partial x}{\partial \theta}\\\frac{\partial y}{\partial R}&\frac{\partial y}{\partial \theta}\end{matrix}\right|=R </math> <br><br />
<br />
<math> f(X, Y) dxdy = f(R, \theta)|J|dRd\theta = \frac{1}{2\pi}e^{-R^2/2}R dRd\theta= \frac{1}{4\pi}e^{-\frac{s}{2}} dSd\theta </math> <br>where <math> S=R^2. </math> <br><br />
<br />
Therefore we can generate a point in polar coordinates using the uniform and exponential distributions, then convert the point to Cartesian coordinates and the resulting X and Y values will be equivalent to samples generated from N(0, 1).<br />
<br />
'''MATLAB code'''<br />
<br />
In MatLab this algorithm can be implemented with the following code, which generates 20,000 samples from N(0, 1):<br />
<br />
<pre><br />
x = zeros(10000, 1);<br />
y = zeros(10000, 1);<br />
for ii = 1:10000<br />
u1 = rand;<br />
u2 = rand;<br />
R2 = -2 * log(u1);<br />
theta = 2 * pi * u2;<br />
x(ii) = sqrt(R2) * cos(theta);<br />
y(ii) = sqrt(R2) * sin(theta);<br />
end<br />
hist(x)<br />
</pre><br />
<br />
In one execution of this script, the following histogram for x was generated:<br />
<br />
[[File:Hist standard normal.jpg|center|500px]]<br />
<br />
=====Non-Standard Normal Distributions=====<br />
<br />
'''Example 1: Single-variate Normal'''<br />
<br />
If X ~ Norm(0, 1) then (a + bX) has a normal distribution with a mean of <math>\displaystyle a</math> and a standard deviation of <math>\displaystyle b</math> (which is equivalent to a variance of <math>\displaystyle b^2</math>). Using this information with the Box-Muller transform, we can generate values sampled from some random variable <math>\displaystyle Y\sim N(a,b^2) </math> for arbitrary values of <math>\displaystyle a,b</math>.<br />
<br />
# Generate a sample u from Norm(0, 1) using the Box-Muller transform.<br />
# Set v = a + bu.<br />
<br />
The values for v generated in this way will be equivalent to sample from a <math>\displaystyle N(a, b^2)</math>distribution. We can modify the MatLab code used in the last section to demonstrate this. We just need to add one line before we generate the histogram:<br />
<br />
<pre><br />
x = a + b * x;<br />
</pre><br />
<br />
For instance, this is the histogram generated when b = 15, a = 125:<br />
<br />
[[File:Hist normal.jpg|center|500px]]<br />
<br />
'''Example 2: Multi-variate Normal'''<br />
<br />
The Box-Muller method can be extended to higher dimensions to generate multivariate normals. The objects generated will be nx1 vectors, and their variance will be described by nxn covariance matrices.<br />
<br />
<math>\mathbf{z} \sim N(\mathbf{u}, \Sigma)</math> denotes an n by 1 random vector <math>\mathbf{z}</math> such that:<br />
<br />
* <math>\displaystyle u_i</math> is the average of <math>\displaystyle z_i</math><br />
* <math>\!\Sigma_{ii}</math> is the variance of <math>\displaystyle z_i</math><br />
* <math>\!\Sigma_{ij}</math> is the co-variance of <math>\displaystyle z_i</math> and <math>\displaystyle z_j</math><br />
<br />
If <math>\displaystyle z_1, z_2, ..., z_d</math> are normal variables with mean 0 and variance 1, then the vector <math>\displaystyle (z_1, z_2,..., z_d) </math> has mean 0 and variance <math>\!I</math>, where 0 is the zero vector and <math>\!I</math> is the identity matrix. This fact suggests that the method for generating a multivariate normal is to generate each component individually as single normal variables.<br />
<br />
The mean and the covariance matrix of a multivariate normal distribution can be adjusted in ways analogous to the single variable case. If <math>\mathbf{z} \sim N(0,I)</math>, then <math>\Sigma^{1/2}\mathbf{z}+\mu \sim N(\mu,\Sigma)</math>. Note here that the covariance matrix is symmetric and positive semi-definite, so its square root always exists.<br />
<br />
We can compute <math>\mathbf{z}</math> in the following way:<br />
<br />
# Generate an n by 1 vector <math>\mathbf{x} = \begin{bmatrix}x_{1} & x_{2} & ... & x_{n}\end{bmatrix}</math> where <math>x_{i}</math> ~ Norm(0, 1) using the Box-Muller transform.<br />
# Calculate <math>\!\Sigma^{1/2}</math> using singular value decomposition.<br />
# Set <math>\mathbf{z} = \Sigma^{1/2} \mathbf{x} + \mathbf{u}</math>.<br />
<br />
The following MatLab code provides an example, where a scatter plot of 10000 random points is generated. In this case x and y have a co-variance of 0.9 - a very strong positive correlation.<br />
<br />
<pre><br />
x = zeros(10000, 1);<br />
y = zeros(10000, 1);<br />
for ii = 1:10000<br />
u1 = rand;<br />
u2 = rand;<br />
R2 = -2 * log(u1);<br />
theta = 2 * pi * u2;<br />
x(ii) = sqrt(R2) * cos(theta);<br />
y(ii) = sqrt(R2) * sin(theta);<br />
end<br />
<br />
E = [1, 0.9; 0.9, 1];<br />
[u s v] = svd(E);<br />
root_E = u * (s ^ (1 / 2));<br />
<br />
z = (root_E * [x y]);<br />
z(:,1) = z(:,1) + 5;<br />
z(:,2) = z(:,2) + -8;<br />
<br />
scatter(z(:,1), z(:,2))<br />
</pre><br />
<br />
This code generated the following scatter plot:<br />
<br />
[[File:scatter covar.jpg|center|500px]]<br />
<br />
In Matlab, we can also use the function "sqrtm()" or "chol()" (Cholesky Decomposition) to calculate the square root of a matrix directly. Note that the resulting root matrices may be different, but this does not materially affect the simulation.<br />
Here is an example:<br />
<br />
<pre><br />
E = [1, 0.9; 0.9, 1];<br />
r1 = sqrtm(E);<br />
r2 = chol(E);<br />
</pre><br />
<br />
R code for a multivariate normal distribution:<br />
<br />
<pre><br />
n=10000;<br />
r2<--2*log(runif(n));<br />
theta<-2*pi*(runif(n));<br />
x<-sqrt(r2)*cos(theta);<br />
<br />
y<-sqrt(r2)*sin(theta);<br />
a<-matrix(c(x,y),nrow=n,byrow=F);<br />
e<-matrix(c(1,.9,.9,1),nrow=2,byrow=T);<br />
svde<-svd(e);<br />
root_e<-svde$u %*% diag(sqrt(svde$d));<br />
z<-t(root_e %*%t(a));<br />
z[,1]=z[,1]+5;<br />
z[,2]=z[,2]+ -8;<br />
par(pch=19);<br />
plot(z,col=rgb(1,0,0,alpha=0.06))<br />
</pre><br />
<br />
[[File:m_normal.png|center|500px]]<br />
<br />
=====Remarks=====<br />
MATLAB's randn function uses the ziggurat method to generate normally distributed samples. It is an efficient rejection method based on covering the probability density function with a set of horizontal rectangles so as to obtain points within each rectangle. It is reported that an 800 MHz Pentium III laptop can generate over 10 million random numbers from the normal distribution in less than one second. ([http://www.mathworks.com/company/newsletters/news_notes/clevescorner/spring01_cleve.html Reference])<br />
<br />
===Sampling From Binomial Distributions===<br />
<br />
In order to generate a sample x from <math>\displaystyle X \sim Bin(n, p)</math>, we can follow the following procedure:<br />
<br />
1. Generate n uniform random numbers sampled from <math>\displaystyle Unif [0, 1] </math>: <math>\displaystyle u_1, u_2, ..., u_n</math>.<br />
<br />
2. Set x to be the total number of indices <math>\displaystyle i</math>, <math>\displaystyle 1 <= i <= n</math>, for which <math>\displaystyle u_i <= p</math>.<br />
<br />
In MatLab this can be coded with a single line. The following generates a sample from <math>\displaystyle X \sim Bin(n, p)</math> <br />
<br />
>> sum(rand(n, 1) <= p, 1)<br />
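<br />
For example, a quick sanity check of this one-liner, drawing 1000 samples from Bin(10, 0.3) and comparing the sample mean with np = 3 (the specific n, p and sample size here are arbitrary):<br />
<br />
<pre><br />
n = 10; p = 0.3;<br />
x = sum(rand(n, 1000) <= p, 1);   % 1000 samples from Bin(10, 0.3), one per column<br />
hist(x)<br />
mean(x)                           % should be close to n*p = 3<br />
</pre><br />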
<br />
==Bayesian Inference and Frequentist Inference - October 4, 2011==<br />
<br />
===Bayesian inference vs Frequentist inference===<br />
The Bayesian method has become popular in the last few decades as simulation and computer technology makes it more applicable. For more information about its history and application, please refer to http://en.wikipedia.org/wiki/Bayesian_inference.<br />
As for frequentists, please refer to http://en.wikipedia.org/wiki/Frequentist_inference.<br />
<br />
====Example====<br />
Consider: A person drinks a cup of coffee on a specific day.<br />
<br><br><br />
Frequentist: There is no explanation to this situation. It is essentially meaningless since it has only occurred once. Therefore, it is not a probability.<br />
<br><br />
Bayesian: Probability is not just about the frequent occurrences but it is what you believe about this probability.<br />
<br />
<br />
====Example of face identification====<br />
Take the face as the input x and the person as the output y. The person can be either Ali or Tom: if it is Ali, y=1; otherwise, y=0. We can divide the picture into 100*100 pixels and then stack them into a 10,000*1 column vector, which is x.<br />
<br />
If you are a frequentist, you would compare Pr(X=x|y=1) with Pr(X=x|y=0) and see which one is higher. But if you are a Bayesian, you would compare Pr(y=1|X=x) with Pr(y=0|X=x).<br />
<br />
====Summary of differences between two schools====<br />
*Frequentist: Probability refers to limiting relative frequency. (objective)<br />
*Bayesian: Probability describes degree of belief not frequency. (subjective)<br />
e.g. The probability that you drank a cup of tea on May 20, 2001 is 0.62 does not refer to any frequency.<br />
----<br />
*Frequentist: Parameters are fixed, unknown constants.<br />
*Bayesian: Parameters are random variables and we can make probabilistic statement about them.<br />
----<br />
*Frequentist: Statistical procedures should have long run frequency probabilities.<br />
e.g. a 95% confidence interval should trap the true value of the parameter in at least 95% of repetitions, in the long-run frequency sense<br />
*Bayesian: It makes inferences about <math>\theta</math> by producing a probability distribution for <math>\theta</math>. Inference (e.g. point estimation) will be extracted from this distribution.<br />
<br />
====Bayesian inference====<br />
<br />
Bayesian inference is usually carried out in the following way:<br />
<br />
1. Choose a prior probability density function of <math>\!\theta</math> which is <math>f(\!\theta)</math>. This is our belief about <math>\theta</math> before we see any data.<br />
<br />
2. Choose a statistical model <math>\displaystyle f(x|\theta)</math> that reflects our beliefs about X.<br />
<br />
3. After observing data <math>\displaystyle x_1,...,x_n</math>, we update our beliefs and calculate the posterior probability.<br />
<br />
<math>f(\theta|x) = \frac{f(\theta,x)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{\int^{}_\theta f(x|\theta) \cdot f(\theta) d\theta}</math>, where <math>\displaystyle f(\theta|x)</math> is the posterior probability, <math>\displaystyle f(\theta)</math> is the prior probability, <math>\displaystyle f(x|\theta)</math> is the likelihood of observing X=x given <math>\!\theta</math> and f(x) is the marginal probability of X=x.<br />
<br />
If we have i.i.d. observations <math>\displaystyle x_1,...,x_n</math>, we can replace <math>\displaystyle f(x|\theta)</math> with <math>f({x_1,...,x_n}|\theta)=\prod_{i=1}^n f(x_i|\theta)</math> because of independence.<br />
<br />
We denote <math>\displaystyle f({x_1,...,x_n}|\theta)</math> as <math>\displaystyle L_n(\theta)</math> which is called likelihood. And we use <math>\displaystyle x^n</math> to denote <math>\displaystyle (x_1,...,x_n)</math>.<br />
<br />
<math>f(\theta|x^n) = \frac{f(x^n|\theta) \cdot f(\theta)}{f(x^n)}=\frac{f(x^n|\theta) \cdot f(\theta)}{\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta}</math> , where <math>\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta</math> is a constant <math>\displaystyle c_n</math>. So <math>f(\theta|x^n) \propto f(x^n|\theta) \cdot f(\theta)</math>. The posterior probability is proportional to the likelihood times prior probability.<br />
<br />
<math>E(\theta)=\int^{}_\theta \theta \cdot f(\theta|x^n) d\theta</math> which is the posterior mean of <math>\!\theta</math>.<br />
<br />
Let <math>\tilde{\theta}=(\theta_1,...,\theta_d)^T</math>, then <math>f(\theta_1|x^n) = \int^{} \int^{} \dots \int^{}f(\theta|X)d\theta_2d\theta_3 \dots d\theta_d </math> and <math>E(\theta_1)=\int^{}\theta_1 \cdot f(\theta_1|x^n) d\theta_1</math><br />
<br />
====Example 1: Estimating parameters of a univariate Gaussian distribution====<br />
<br />
Suppose X follows a univariate Gaussian distribution (i.e. a Normal distribution) with parameters <math>\!\mu</math> and <br />
<math>\displaystyle {\sigma^2}</math>.<br />
<br />
(a) For Frequentists:<br />
<br />
<math>f(x|\theta)= \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}</math><br />
<br />
<math>L_n(\theta)= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2}</math><br />
<br />
<br />
<math>\ln L_n(\theta) = l(\theta) = \sum_{i=1}^n -\frac{1}{2}\ln 2\pi-\ln \sigma-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2</math><br />
<br />
To get the maximum likelihood estimator of <math>\!\mu</math> (mle), we find the <math>\hat{\mu}</math> which maximizes <math>\displaystyle L_n(\theta)</math>:<br />
<br />
<math>\frac{\partial l(\theta)}{\partial \mu}= \sum_{i=1}^n \frac{1}{\sigma}(\frac{x_i-\mu}{\sigma})=0 \Rightarrow \sum_{i=1}^n x_i = n\mu \Rightarrow \hat{\mu}_{mle}=\bar{x}</math><br />
<br />
(b) For Bayesians:<br />
<br />
<math>f(\theta|x) \propto f(x|\theta) \cdot f(\theta)</math><br />
<br />
We assume that the mean of the above normal distribution is itself distributed normally with mean <math>\!\mu_0</math> and variance <math>\!\Gamma^2</math>.<br />
<br />
Suppose <math>\!\mu\sim N(\mu_0, \Gamma^2)</math>,<br />
<br />
so <math>f(\mu) = \frac{1}{\sqrt{2\pi}\Gamma} \cdot e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\Gamma})^2}</math><br />
<br />
<math>f(\mu|x) = \frac{1}{\sqrt{2\pi}\tilde{\sigma}} \cdot e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
<br />
<math>\tilde{\mu} = \frac{\frac{n}{\sigma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\bar{x}+\frac{\frac{1}{\Gamma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\mu_0</math>, where <math>\tilde{\mu}</math> is the estimator of <math>\!\mu</math>.<br />
<br />
* If prior belief about <math>\!\mu_0</math> is strong, then <math>\!\Gamma</math> is small and <math>\frac{1}{\Gamma^2}</math> is large. <math>\tilde{\mu}</math> is close to <math>\!\mu_0</math> and the observations will not affect too much. On the contrary, if prior belief about <math>\!\mu_0</math> is weak, <math>\!\Gamma</math> is large and <math>\frac{1}{\Gamma^2}</math> is small. <math>\tilde{\mu}</math> depends more on observations.(This is intuitive, when our original belief is reliable, then the sample is not important in improving the result; when the belief is not reliable, then we depend a lot on the sample.)<br />
<br />
* When the sample is large (i.e. n <math>\to \infty</math>), <math>\tilde{\mu} \to \bar{x}</math> and the impact of prior belief about <math>\!\mu</math> is weakened.<br />
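<br />
A small MATLAB sketch of the posterior mean formula above, using hypothetical values for <math>\!\sigma</math>, <math>\!\Gamma</math>, <math>\!\mu_0</math> and simulated data:<br />
<br />
<pre><br />
sigma = 2; Gamma = 1; mu0 = 0;                 % hypothetical known sd, prior sd and prior mean<br />
x = 3 + sigma*randn(50, 1);                    % 50 observations with true mean 3<br />
n = length(x);<br />
w = (n/sigma^2) / (n/sigma^2 + 1/Gamma^2);     % weight on the sample mean<br />
mu_tilde = w*mean(x) + (1 - w)*mu0             % posterior mean; close to mean(x) since n is fairly large<br />
</pre><br />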
<br />
=='''Basic Monte Carlo Integration - October 6th, 2011'''==<br />
<br />
Three integration methods would be taught in this course:<br />
*Basic Monte Carlo Integration<br />
*Importance Sampling<br />
*Markov Chain Monte Carlo (MCMC)<br />
<br />
The first, and most basic, method of numerical integration we will see is Monte Carlo Integration. We use this to solve an integral of the form: <math> I = \int_{a}^{b} h(x) dx </math><br />
<br />
Note the following derivation: <br />
<br />
<math>\begin{align}<br />
\displaystyle I & = \int_{a}^{b} h(x)dx \\<br />
& = \int_{a}^{b} h(x)((b-a)/(b-a))dx \\<br />
& = \int_{a}^{b} (h(x)(b-a))(1/(b-a))dx \\<br />
& = \int_{a}^{b} w(x)f(x)dx \\<br />
& = E[w(x)] \\<br />
\end{align}<br />
</math><br />
<br />
<math>\approx \frac{1}{n} \sum_{i=1}^{n} w(x_i) </math><br />
<br />
Where w(x) = h(x)(b-a) and f(x) is the probability density function of a uniform random variable on the interval [a,b]. The expectation of w with respect to f is then estimated by averaging w over n samples of x drawn from f.<br />
<br />
<br />
===='''General Procedure'''====<br />
<br />
i) Draw n samples <math> x_i \sim~ U[a,b] </math><br />
<br />
ii) Compute <math> \ w(x_i) </math> for every sample<br />
<br />
iii) Obtain an estimate of the integral, <math> \hat{I} </math>, as follows:<br />
<br />
<math> \hat{I} = \frac{1}{n} \sum_{i=1}^{n} w(x_i)</math>. Clearly, this is just the average of the simulation results.<br />
<br />
By the strong law of large numbers <math> \hat{I} </math> converges to <math> \ I </math> as <math> \ n \rightarrow \infty </math>. Because of this, we can compute all sorts of useful information, such as variance, standard error, and confidence intervals.<br />
<br />
Standard Error: <math> SE = \frac{SD}{\sqrt{n}} </math>, where SD is the sample standard deviation of the <math> w(x_i) </math><br />
<br />
Variance: <math> V = \frac{\sum_{i=1}^{n} (w(x_i)-\hat{I})^2}{n-1} </math><br />
<br />
Confidence Interval: <math> \hat{I} \pm t_{(\alpha/2)} SE </math><br />
<br />
==='''Example: Uniform Distribution'''===<br />
<br />
Consider the integral, <math> \int_{0}^{1} x^3dx </math>, which is easily solved through standard analytical integration methods, and is equal to .25. Now, let us check this answer with a numerical approximation using Monte Carlo Integration. <br />
<br />
We generate a 1 by 10000 vector of uniform (on the interval [0,1]) random variables and call that vector 'u'. We see that our 'w' in this case is <math> x^3 </math>, so we set <math> w = u^3 </math>. Our <math>\hat{I}</math> is equal to the mean of w.<br />
<br />
In Matlab, we can solve this integration problem with the following code:<br />
<br />
<pre><br />
u = rand(1,10000);<br />
w = u.^3;<br />
mean(w)<br />
ans = 0.2475<br />
</pre><br />
<br />
Note the '.' after 'u' in the second line of code, indicating that each entry in the matrix is cubed. Also, our approximation is close to the actual value of .25. Now let's try to get an even better approximation by generating more sample points. <br />
<br />
<pre><br />
u= rand(1,100000);<br />
w= u.^3;<br />
mean(w)<br />
ans = .2503<br />
</pre><br />
<br />
We see that when the number of sample points is increased, our approximation improves, as one would expect.<br />
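<br />
Since <math>\hat{I}</math> is an average of independent draws, we can also attach a standard error and an approximate confidence interval to the estimate, as described in the General Procedure above (a sketch for the same integral; 1.96 approximates <math>t_{(\alpha/2)}</math> for a large sample):<br />
<br />
<pre><br />
u = rand(1, 10000);<br />
w = u.^3;<br />
I_hat = mean(w)                              % Monte Carlo estimate of the integral<br />
SE = std(w)/sqrt(length(w))                  % standard error of the estimate<br />
CI = [I_hat - 1.96*SE, I_hat + 1.96*SE]      % approximate 95% confidence interval<br />
</pre><br />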
<br />
==='''Generalization'''===<br />
<br />
Up to this point we have seen how to numerically approximate an integral when the distribution of f is uniform. Now we will see how to generalize this to other distributions.<br />
<br />
<math> I = \int h(x)f(x)dx </math> <br />
<br />
If f is a probability density function (pdf), then <math> I </math> is the expectation E<sub>f</sub>[h(X)], i.e. the expectation of h with respect to the distribution f, and it can be estimated by averaging h over samples drawn from f. Our previous example is the case where f is the uniform distribution on [a,b].<br />
<br />
'''Procedure for the General Case'''<br />
<br />
i) Draw n samples from f <br />
<br />
ii) Compute h(x<sub>i</sub>)<br />
<br />
iii) <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)</math><br />
<br />
==='''Example: Exponential Distribution'''===<br />
<br />
Find <math> E[\sqrt{x}] </math> for <math> \displaystyle f = e^{-x} </math>, which is the exponential distribution with mean 1.<br />
<br />
<math> I = \int_{0}^{\infty} \sqrt{x} e^{-x}dx </math><br />
<br />
We can see that we must draw samples from f, the exponential distribution.<br />
<br />
To find a numerical solution using Monte Carlo Integration we see that: <br />
<br />
u= rand(1,10000)<br />
X= -log(u)<br />
h= <math> \sqrt{x} </math> <br />
I= mean(h)<br />
<br />
To implement this procedure in Matlab, use the following code:<br />
<br />
<pre><br />
u = rand(1,10000);<br />
X = -log(u);<br />
h = X.^.5;<br />
mean(h)<br />
ans = .8841<br />
</pre><br />
<br />
An easy way to check whether your approximation is correct is to use the built in Matlab function 'quadl' which takes a function and bounds for the integral and returns a solution for the definite integral of that function. For this specific example, we can enter:<br />
<br />
<pre><br />
f = @(x) sqrt(x).*exp(-x);<br />
% quadl runs into computational problems when the upper bound is "inf" or an extremely large number, <br />
% so choose just a moderately large number.<br />
quadl(f,0,100)<br />
ans =<br />
0.8862<br />
</pre><br />
<br />
From the above result, we see that our approximation was quite close.<br />
<br />
==='''Example: Normal Distribution'''===<br />
<br />
Let <math> f(x) = (1/(2 \pi)^{1/2}) e^{(-x^2)/2} </math>. Compute the cumulative distribution function at some point x.<br />
<br />
<math> F(x)= \int_{-\infty}^{x} f(s)ds = \int_{-\infty}^{x}(1)(1/(2 \pi)^{1/2}) e^{(-s^2)/2}ds </math>. The (1) is inserted to illustrate that our h(x) will be the constant function 1, and our f(x) is the normal distribution. To take into account the upper bound of integration, x, any values sampled that are greater than x will be set to zero. <br />
<br />
This is the Matlab code for solving F(2):<br />
<br />
<pre><br />
<br />
u = randn(1,10000)<br />
h = u < 2;<br />
mean(h)<br />
ans = .9756<br />
<br />
</pre><br />
<br />
We generate a 1 by 10000 vector of standard normal random variables and we return a value of 1 if u is less than 2, and 0 otherwise.<br />
<br />
We can also build the function F(x) in matlab in the following way:<br />
<br />
<pre><br />
function F(x)<br />
u=randn(1,1000000);<br />
h=u<x;<br />
mean(h)<br />
</pre><br />
<br />
<br />
==='''Example: Binomial Distribution'''===<br />
<br />
In this example we will see the Bayesian Inference for 2 Binomial Distributions.<br />
<br />
Let <math> X \sim Bin(n,p) </math> and <math> Y \sim Bin(m,q) </math>, and let <math> \!\delta = p-q </math>.<br />
<br />
Therefore, the frequentist estimate is <math> \displaystyle \!\hat{\delta} = x/n - y/m </math>.<br />
<br />
Bayesian wants <math> \displaystyle f(p,q|x,y) = f(x,y|p,q)f(p,q)/f(x,y) </math>, where <math> f(x,y)=\iint\limits_{\!\theta} f(x,y|p,q)f(p,q)\,dp\,dq</math> is a constant.<br />
<br />
Thus, <math> \displaystyle f(p,q|x,y)\propto f(x,y|p,q)f(p,q) </math>. Now we assume that <math>\displaystyle f(p,q) = f(p)f(q) = 1 </math> and f(p) and f(q) are uniform.<br />
<br />
Therefore, <math> \displaystyle f(p,q|x,y)\propto p^x(1-p)^{n-x}q^y(1-q)^{m-y} </math>.<br />
<br />
<math> E[\delta] = \int_{0}^{1} \int_{0}^{1} (p-q)f(p,q|x,y)\,dp\,dq </math>.<br />
<br />
As you can see this is much tougher than the frequentist approach.<br />
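<br />
One way to carry out this computation is again by simulation: with uniform priors, the posterior factorizes into <math> \displaystyle p|x \sim Beta(x+1, n-x+1) </math> and <math> \displaystyle q|y \sim Beta(y+1, m-y+1) </math>, so we can draw from these posteriors and average the differences. A MATLAB sketch, assuming the Statistics Toolbox function betarnd is available and using hypothetical observed counts x and y:<br />
<br />
<pre><br />
n = 100; m = 100; x = 50; y = 80;        % hypothetical observed counts<br />
p = betarnd(x+1, n-x+1, 10000, 1);       % draws from the posterior of p<br />
q = betarnd(y+1, m-y+1, 10000, 1);       % draws from the posterior of q<br />
delta = p - q;<br />
mean(delta)                              % Monte Carlo estimate of E[delta]<br />
</pre><br />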
<br />
=='''Importance Sampling and Basic Monte Carlo Integration - October 11th, 2011'''==<br />
<br />
==='''Example: Binomial Distribution (Continued)'''===<br />
<br />
Suppose we are given two independent Binomial Distributions <math>\displaystyle X \sim Bin(n, p_1)</math>, <math>\displaystyle Y \sim Bin(m, p_2)</math>. We would like to give a Monte Carlo estimate of <math>\displaystyle \delta = p_1 - p_2</math><br><br />
<br />
Frequentist approach: <br><br><math>\displaystyle \hat{p_1} = \frac{X}{n}</math> ; <math>\displaystyle \hat{p_2} = \frac{Y}{m}</math><br><br><math>\displaystyle \hat{\delta} = \hat{p_1} - \hat{p_2} = \frac{X}{n} - \frac{Y}{m}</math><br><br><br />
Bayesian approach to compute the expected value of <math>\displaystyle \delta</math>:<br><br><br />
<math>\displaystyle E(\delta) = \int\int(p_1-p_2) f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Assume that <math>\displaystyle n = 100, m = 100, p_1 = 0.5, p_2 = 0.8</math> and the sample size is 1000.<br><br />
MATLAB code of the above example:<br />
<pre><br />
n = 100;<br />
m = 100;<br />
p_1 = 0.5;<br />
p_2 = 0.8;<br />
p1 = mean(rand(n,1000)<p_1);<br />
p2 = mean(rand(m,1000)<p_2);<br />
delta = p2 - p1;<br />
hist(delta)<br />
mean(delta)<br />
</pre><br />
<br />
In one execution of the code, the mean of delta was 0.3017. The histogram of delta generated was:<br />
[[File:Hist delta.jpg|center|]]<br />
<br />
Through Monte Carlo simulation, we can obtain an empirical distribution of delta and carry out inference on the data obtained, such as computing the mean, maximum, variance, standard deviation and the standard error of delta.<br />
<br />
==='''Importance Sampling'''===<br />
<br />
====Motivation====<br />
<br />
Consider the integral <math>\displaystyle I = \int h(x)f(x)\,dx</math><br><br><br />
According to basic Monte Carlo Integration, if we can sample from the probability density function <math>\displaystyle f(x)</math> and feed the samples of <math>\displaystyle f(x)</math> back to <math>\displaystyle h(x)</math>, <math>\displaystyle I</math> can be estimated as an average of <math>\displaystyle h(x)</math> ( i.e. <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)</math> )<br><br />
However, the Monte Carlo method works when we know how to sample from <math>\displaystyle f(x)</math>. In the case where it is difficult to sample from <math>\displaystyle f(x)</math>, importance sampling is a technique that we can apply. Importance Sampling relies on another function <math>\displaystyle g(x)</math> which we know how to sample from.<br />
<br />
The above integral can be rewritten as follow:<br><br />
<math>\begin{align}<br />
\displaystyle I & = \int h(x)f(x)\,dx \\<br />
& = \int h(x)f(x)\frac{g(x)}{g(x)}\,dx \\<br />
& = \int \frac{h(x)f(x)}{g(x)}g(x)\,dx \\<br />
& = \int y(x)g(x)\,dx \\<br />
& = E_g(y(x)) \\<br />
\end{align}<br />
</math><br><br />
<math>where \ y(x) = \frac{h(x)f(x)}{g(x)}</math><br><br />
<br />
The integral can thus be simulated as <math>\displaystyle \hat{I} = \frac{1}{n} \sum_{i=1}^{n} Y_i \ , \ where \ Y_i = \frac{h(x_i)f(x_i)}{g(x_i)}</math><br><br />
<br />
====Procedure====<br />
<br />
Suppose we know how to sample from <math>\displaystyle g(x)</math><br><br />
#Choose a suitable <math>\displaystyle g(x)</math> and draw n samples <math>x_1,x_2....,x_n \sim~ g(x)</math><br />
#Set <math>Y_i =\frac{h(x_i)f(x_i)}{g(x_i)}</math><br />
#Compute <math> \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i </math><br><br />
<br />
By the Law of large numbers, <math>\displaystyle \hat{I} \rightarrow I </math> provided that the sample size n is large enough.<br><br><br />
<br />
'''Remarks:''' One can think of <math>\frac{f(x)}{g(x)}</math> as a weight to <math>\displaystyle h(x)</math> in the computation of <math>\hat{I}</math><br><br><br />
<math>\displaystyle i.e. \ \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n}\sum_{i=1}^{n} (\frac{f(x_i)}{g(x_i)})h(x_i)</math><br><br><br />
Therefore, <math>\displaystyle \hat{I} </math> is a weighted average of <math>\displaystyle h(x_i)</math><br><br><br />
<br />
====Problem====<br />
<br />
If <math>\displaystyle g(x)</math> is not chosen appropriately, then the variance of the estimate <math>\hat{I}</math> may be very large. Here we face a problem similar to the one in the acceptance-rejection method. Consider the second moment of <math>\displaystyle y(x)</math> under <math>\displaystyle g</math>:<br><br><br />
<math>\begin{align}<br />
\displaystyle E_g\left[(y(x))^2\right] & = \int \left(\frac{h(x)f(x)}{g(x)}\right)^2 g(x)\, dx \\<br />
& = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x)\, dx \\<br />
& = \int \frac{h^2(x)f^2(x)}{g(x)}\, dx \\<br />
\end{align}<br />
</math><br><br><br />
<br />
If <math>\displaystyle g(x)</math> is very small where <math>\displaystyle h(x)f(x)</math> is not, the above integral can be very large, and hence the variance of the estimate can be very large when g is not chosen appropriately. This typically occurs when <math>\displaystyle g(x)</math> has a thinner tail than <math>\displaystyle f(x)</math>, so that the quantity <math>\displaystyle \frac{h^2(x)f^2(x)}{g(x)}</math> becomes large.<br />
<br />
'''Remarks:''' <br />
<br />
1. We can actually compute the form of <math>\displaystyle g(x)</math> to have optimal variance. <br>Mathematically, it is to find <math>\displaystyle g(x)</math> subject to <math>\displaystyle \min_g [\ E_g([y(x)]^2) - (E_g[y(x)])^2\ ]</math><br><br />
It can be shown that the optimal <math>\displaystyle g(x)</math> is <math>\displaystyle \frac{|h(x)|f(x)}{\int_{-\infty}^{\infty}|h(s)|f(s)ds}</math>. Using the optimal <math>\displaystyle g(x)</math> minimizes the variance of the Importance Sampling estimate. This is of theoretical interest but not useful in practice: to write down the optimal <math>\displaystyle g(x)</math> we would already need the value of the integral, which is exactly the quantity we are trying to estimate.<br />
<br />
2. In practice, we shall choose <math>\displaystyle g(x)</math> which has similar shape as <math>\displaystyle f(x)</math> but with a thicker tail than <math>\displaystyle f(x)</math> in order to avoid the problem mentioned above.<br><br />
<br />
====Example====<br />
<br />
Estimate <math>\displaystyle I = Pr(Z>3),\ where\ Z \sim N(0,1)</math><br><br><br />
'''Method 1: Basic Monte Carlo'''<br />
<br />
<math>\begin{align} Pr(Z>3) & = \int^\infty_3 f(x)\,dx \\<br />
& = \int^\infty_{-\infty} h(x)f(x)\,dx \end{align}</math><br /><br />
<math> where \ <br />
h(x) = \begin{cases}<br />
0, & \text{if } x \le 3 \\<br />
1, & \text{if } x > 3<br />
\end{cases}</math><br />
<math>\ ,\ f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}</math><br />
<br />
MATLAB code to compute <math>\displaystyle I</math> from 100 samples of standard normal distribution:<br />
<pre><br />
h = randn(100,1) > 3;<br />
I = mean(h)<br />
</pre><br />
<br />
In one execution of the code, it returns a value of 0 for <math>\displaystyle I</math>, which differs significantly from the true value of <math>\displaystyle I \approx 0.0013 </math>. The problem of using Basic Monte Carlo in this example is that <math>\displaystyle Pr(Z>3)</math> has a small value, and hence many points sampled from the standard normal distribution will be wasted. Therefore, although Basic Monte Carlo is a feasible method to compute <math>\displaystyle I</math>, it gives a poor estimation.<br />
<br />
'''Method 2: Importance Sampling'''<br />
<br />
<math>\displaystyle I = Pr(Z>3)= \int^\infty_3 f(x)\,dx </math><br><br />
<br />
To apply importance sampling, we have to choose a <math>\displaystyle g(x)</math> which we will sample from. In this example, we could choose <math>\displaystyle g(x)</math> to be the probability density function of an exponential distribution, a normal distribution with mean 0 and variance greater than 1, or a normal distribution with mean greater than 0 and variance 1, among others. For the following, we take <math>\displaystyle g(x)</math> to be the pdf of <math>\displaystyle N(4,1)</math>.<br><br />
<br />
Procedure:<br />
#Draw n samples <math>x_1,x_2....,x_n \sim~ g(x)</math><br />
#Calculate <math>\begin{align} \frac{f(x)}{g(x)} & = \frac{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}<br />
}{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x-4)^2} } \\<br />
& = e^{8-4x} \end{align} </math><br><br />
#Set <math> Y_i = h(x_i)e^{8-4x_i}\ with\ h(x) = \begin{cases}<br />
0, & \text{if } x \le 3 \\<br />
1, & \text{if } x > 3<br />
\end{cases}<br />
</math><br><br />
#Compute <math> \hat{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i </math><br><br />
<br />
The above procedure, using 100 samples of <math>\displaystyle g(x)</math>, can be implemented in MATLAB as follows:<br />
<pre><br />
for ii = 1:100<br />
x = randn + 4 ;<br />
h = x > 3 ;<br />
y(ii) = h * exp(8-4*x) ;<br />
end<br />
mean(y)<br />
</pre><br />
<br />
In one execution of the code, it returns a value of 0.001271 for <math> \hat{Y} </math>, which is much closer to the true value of <math>\displaystyle I \approx 0.0013 </math>. Over many executions of the code, the variance of the Basic Monte Carlo estimate is approximately 150 times that of the importance sampling estimate. This demonstrates that importance sampling can provide a much better estimate than the Basic Monte Carlo method here.<br />
<br />
==''' Importance Sampling with Normalized Weight and Markov Chain Monte Carlo - October 13th, 2011'''==<br />
==='''Importance Sampling with Normalized Weight'''===<br />
<br />
Recall that we can think of <math>\displaystyle b(x) = \frac{f(x)}{g(x)}</math> as a weight applied to the samples <math>\displaystyle h(x)</math>. If the form of <math>\displaystyle f(x)</math> is known only up to a constant, we can use an alternate, normalized form of the weight, <math>\displaystyle b^*(x)</math>. (This situation arises in Bayesian inference.) Importance sampling with normalized or standard weight is also called indirect importance sampling.<br />
<br />
We derive the normalized weight as follows:<br><br />
<math>\begin{align}<br />
\displaystyle I & = \int h(x)f(x)\,dx \\<br />
&= \int h(x)\frac{f(x)}{g(x)}g(x)\,dx \\<br />
&= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int f(x) dx} \\<br />
&= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int \frac{f(x)}{g(x)}g(x) dx} \\<br />
&= \frac{\int h(x)b(x)g(x)\,dx}{\int\ b(x)g(x) dx} <br />
\end{align}</math><br />
<br />
<math>\hat{I}= \frac{\sum_{i=1}^{n} h(x_i)b(x_i)}{\sum_{i=1}^{n} b(x_i)} </math><br />
<br />
Then, the normalized weight is <math>b^*(x_i) = \displaystyle \frac{b(x_i)}{\sum_{j=1}^{n} b(x_j)}</math><br />
<br />
Note that <math> \int f(x)\, dx = \int b(x)g(x)\, dx = 1 </math><br />
<br />
We can also determine the associated Monte Carlo variance of this estimate by<br />
<br />
<math> Var(\hat{I})= \frac{\sum_{i=1}^{n} b(x_i)(h(x_i) - \hat{I})^2}{\sum_{i=1}^{n} b(x_i)} </math><br />
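<br />
'''Example (sketch):''' As a minimal illustration (not from the lecture), we can redo the earlier tail-probability example <math>\displaystyle I = Pr(Z>3)</math> with the proposal <math>\displaystyle g</math> taken as the pdf of <math>\displaystyle N(4,1)</math>, pretending that <math>\displaystyle f</math> is known only up to a constant. The variable names below (f_u, b, I_hat) are only for illustration; since both densities omit the same constant, the weights are the same <math>\displaystyle e^{8-4x}</math> as before.<br />
<br />
<pre><br />
% Minimal sketch: normalized importance sampling for I = Pr(Z>3), Z ~ N(0,1),<br />
% pretending the target density is known only up to a constant.<br />
n = 10000;<br />
x = randn(n,1) + 4;             % sample from the proposal g = N(4,1)<br />
f_u = exp(-x.^2/2);             % unnormalized target density<br />
g = exp(-(x-4).^2/2);           % proposal density with the same constant omitted<br />
b = f_u ./ g;                   % unnormalized weights b(x) = f(x)/g(x) = exp(8-4x)<br />
h = x > 3;                      % h(x) = indicator of {x > 3}<br />
I_hat = sum(h .* b) / sum(b)    % normalized importance sampling estimate, about 0.0013<br />
varI = sum(b .* (h - I_hat).^2) / sum(b)   % associated Monte Carlo variance<br />
</pre><br />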
<br />
==='''Markov Chain Monte Carlo'''===<br />
We still want to compute <math> I = \displaystyle\int h(x)f(x)\,dx </math><br />
<br />
====Stochastic Process====<br />
A stochastic process <math> \{ x_t : t \in T \}</math> is a collection of random variables. The variables <math>\displaystyle x_t</math> take values in some set <math>\displaystyle X</math> called the '''state space'''. The set <math>\displaystyle T</math> is called the '''index set'''.<br />
<br />
====Markov Chain====<br />
A Markov Chain is a stochastic process for which the distribution of <math>\displaystyle x_t</math> depends only on <math>\displaystyle x_{t-1}</math>. It is a random process characterized as being memoryless; meaning that the next occurrence of a defined event is only dependent on the current event and not on the sequence of events that preceded it. <br />
Formal Definition: The process <math> \{ x_t : t \in T \}</math> is a Markov Chain if <math>\displaystyle Pr(x_t|x_0, x_1,..., x_{t-1})= Pr(x_t|x_{t-1})</math> for all <math> \{t \in T \}</math> and for all <math> \{x \in X \}</math><br />
For a Markov Chain, <math>\displaystyle f(x_1,...x_n)= f(x_1)f(x_2|x_1)f(x_3|x_2)...f(x_n|x_{n-1})</math><br />
<br><br>Real Life Example:<br />
<br>When going for an interview, the employer only looks at your highest education achieved. The employer does not look at the education received before that (elementary school, high school, etc.) because the employer believes that the highest education achieved summarizes your previous education. Therefore, anything before your most recent education is irrelevant. In other words, <math> x_t </math> is regarded as a summary of <math>x_{t-1},...,x_2,x_1</math>, so when we need to determine <math>x_{t+1}</math>, we only need to pay attention to <math>x_{t}</math>.<br />
<br />
====Transition Probabilities====<br />
A Transition Probability is the probability of jumping from one state to another state.<br />
Formal Definition: We call <math>\displaystyle P_{ij} = Pr(x_{t+1}=j|x_t=i)</math> the transition probability.<br />
That is, P(i,j) is the probability of going to state j from state i. The matrix P whose (i,j) element is <math>\displaystyle P_{ij}</math> is called the Transition Matrix.<br />
<br />
Properties of P: <br />
:1) <math>\displaystyle P_{ij} \geq 0</math>: the probability of going from one state to another cannot be negative<br />
:2) <math>\displaystyle \sum_{\forall j}P_{ij} = 1</math>: the probability of going from state i to some state (including remaining in state i) is a certainty, so each row of P sums to one<br />
<br />
====Random Walk====<br />
Example: Start at one point and flip a coin where <math>\displaystyle Pr(H)=p</math> and <math>\displaystyle Pr(T)=1-p=q</math>. Take one step right if heads and one step left if tails. If at an endpoint, stay there.<br />
The transition matrix is<br />
<math>P=\left(\begin{matrix}1&0&0&\dots&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\<br />
\vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&\dots&\dots&1<br />
\end{matrix}\right)</math><br />
<br />
Let <math>\displaystyle P_n</math> be the matrix such that its (i,j) element is <math>\displaystyle P_{ij}(n)</math>. This is called n-step probability.<br />
<br />
:<math>\displaystyle P_n = P^n</math><br />
:<math>\displaystyle P_1 = P</math><br />
:<math>\displaystyle P_2 = P^2</math><br />
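<br />
As a quick illustration (the number of states and the value of p below are arbitrary choices, not from the lecture), the following MATLAB sketch builds the random walk transition matrix for 5 states, checks that each row sums to 1, and computes an n-step transition matrix as a matrix power.<br />
<br />
<pre><br />
% Minimal sketch: random walk on states 1..5 with absorbing endpoints, Pr(H) = p = 0.6<br />
p = 0.6;  q = 1 - p;<br />
N = 5;<br />
P = zeros(N);<br />
P(1,1) = 1;  P(N,N) = 1;      % absorbing endpoints<br />
for i = 2:N-1<br />
    P(i,i-1) = q;             % step left on tails<br />
    P(i,i+1) = p;             % step right on heads<br />
end<br />
sum(P,2)                      % each row sums to 1<br />
P10 = P^10                    % 10-step transition probabilities, P_n = P^n<br />
</pre><br />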
<br />
<br />
==''' Markov Chain Properties and Page Rank - October 18th, 2011'''==<br />
<br />
===Summary of Terminology===<br />
<br />
====Transition Matrix====<br />
<br />
A matrix <math>\!P</math> that defines a Markov Chain has the form:<br />
<br />
<math>P = \begin{bmatrix}<br />
P_{11} & \cdots & P_{1N} \\<br />
\vdots & \ddots & \vdots \\ <br />
P_{N1} & \cdots & P_{NN}<br />
\end{bmatrix}</math><br />
<br />
where <math>\!P(i,j) = P_{ij} = Pr(x_{t+1} = j | x_t = i) </math> is the probability of transitioning from state i to state j in the Markov Chain in a single step. Note that this implies that all rows add up to one.<br />
<br />
====n-step Transition matrix====<br />
<br />
A matrix <math>\!P_n</math> whose (i,j)<sup>th</sup> entry is the probability of moving from state i to state j after n transitions:<br />
<br />
<math>\!P_n(i,j) = Pr(x_{m+n}=j|x_m = i)</math><br />
<br />
This probability is called the n-step transition probability. A nice property of this matrix is that<br />
<br />
<math>\!P_n = P^n</math><br />
<br />
for all <math>n \ge 0</math>, where P is the transition matrix. Note that the rows of <math>P_n</math> should still add up to one.<br />
<br />
====Marginal distribution of a Markov Chain====<br />
<br />
We represent the state at time t as a vector.<br />
<br />
<math>\mu_t = (\mu_t(1) \; \mu_t(2) \; ... \; \mu_t(n))</math><br />
<br />
Consider this Markov Chain:<br />
<br />
[[File:MarkovSample.png|300px]]<br />
<br />
<math>\mu_t = (A \; B)</math>, where A is the probability of being in state a at time t, and B is the probability of being in state b at time t.<br />
<br />
For example if <math>\mu_t = (0.1 \; 0.9)</math>, we have a 10% chance of being in state a at time t, and a 90% chance of being in state b at time t.<br />
<br />
Suppose we run this Markov chain many times, and record the state at each step.<br />
<br />
In this example, we run 4 trials, up until t=5.<br />
<br />
{| class="wikitable"<br />
|-<br />
! t<br />
! Trial 1<br />
! Trial 2<br />
! Trial 3<br />
! Trial 4<br />
! Observed <math>\mu</math><br />
|-<br />
| 1<br />
| a<br />
| b<br />
| b<br />
| a<br />
| (0.5, 0.5)<br />
|-<br />
| 2<br />
| b<br />
| a<br />
| a<br />
| a<br />
| (0.75, 0.25)<br />
|-<br />
| 3<br />
| a<br />
| a<br />
| b<br />
| a<br />
| (0.75, 0.25)<br />
|-<br />
| 4<br />
| b<br />
| b<br />
| a<br />
| b<br />
| (0.25, 0.75)<br />
|-<br />
| 5<br />
| b<br />
| b<br />
| b<br />
| a<br />
| (0.25, 0.75)<br />
|}<br />
<br />
Imagine simulating the chain many times. If we collect all the outcomes at time t from all the chains, the histogram of this data would look like <math>\!\mu_t</math>.<br />
<br />
We can find the marginal probabilities as <math>\!\mu_n = \mu_0 P^n</math><br />
<br />
====Stationary Distribution====<br />
<br />
Let <math>\pi = (\pi_i \mid i \in \chi)</math> be a vector of non-negative numbers that sum to 1. (i.e. <math>\!\pi</math> is a pmf)<br />
<br />
If <math>\!\pi = \pi P</math>, then <math>\!\pi</math> is a stationary distribution, also known as an invariant distribution.<br />
<br />
====Limiting Distribution====<br />
<br />
A Markov chain has limiting distribution <math>\!\pi </math> if <math>\lim_{n \to \infty} P^n = \begin{bmatrix} \pi \\ \vdots \\ \pi \end{bmatrix}</math><br />
<br />
That is, <math>\!\pi_j = \lim_{n \to \infty}\left [ P^n \right ]_{ij}</math> exists and is independent of i.<br />
<br />
Here is an example:<br />
<br />
Suppose we want to find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/3&1/3&1/3\\<br />
1/4&3/4&0\\<br />
1/2&0&1/2<br />
\end{matrix}\right)</math><br />
<br />
We want to solve <math>\pi=\pi P</math> and we want <math>\displaystyle \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
<math>\displaystyle \pi_0 = 1/3\pi_0 + 1/4\pi_1 + 1/2\pi_2</math><br /><br />
<math>\displaystyle \pi_1 = 1/3\pi_0 + 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_2 = 1/3\pi_0 + 1/2\pi_2</math><br /><br />
<br />
Solving the system of equations, we get <br /> <br />
<math>\displaystyle \pi_1 = 4/3\pi_0</math><br /><br />
<math>\displaystyle \pi_2 = 2/3\pi_0</math><br /><br />
<br />
So using our condition above, we have <math>\displaystyle \pi_0 + 4/3\pi_0 + 2/3\pi_0 = 1</math> and by solving we get <math>\displaystyle \pi_0 = 1/3</math><br />
<br />
Using this in our system of equations, we obtain: <br /><br />
<math>\displaystyle \pi_1 = 4/9</math><br /><br />
<math>\displaystyle \pi_2 = 2/9</math><br />
<br />
Thus, the limiting distribution is <math>\displaystyle \pi = (1/3, 4/9, 2/9)</math><br />
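<br />
We can verify this numerically: raising P to a large power should give a matrix whose rows are all approximately <math>\displaystyle \pi = (1/3, 4/9, 2/9)</math>. A minimal MATLAB check:<br />
<br />
<pre><br />
% Numerical check of the limiting distribution pi = (1/3, 4/9, 2/9)<br />
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];<br />
P^50                          % every row is approximately (0.3333, 0.4444, 0.2222)<br />
p_stat = [1/3 4/9 2/9];<br />
p_stat * P                    % returns p_stat again, confirming p_stat = p_stat*P<br />
</pre><br />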
<br />
====Detailed Balance====<br />
<br />
<math>\!\pi</math> has the detailed balance property if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math><br />
<br />
'''Theorem'''<br />
<br />
If <math>\!\pi</math> satisfies detailed balance, then <math>\!\pi</math> is a stationary distribution.<br />
<br />
In other words, if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math>, then <math>\!\pi = \pi P</math><br />
<br />
'''Proof:''' <br />
<br />
<math>\!\pi P =<br />
\begin{bmatrix}\pi_1 & \pi_2 & \cdots & \pi_N\end{bmatrix} \begin{bmatrix}P_{11} & \cdots & P_{1N} \\ \vdots & \ddots & \vdots \\ P_{N1} & \cdots & P_{NN}\end{bmatrix}</math><br />
<br />
Observe that the j<sup>th</sup> element of <math>\!\pi P</math> is<br />
<br />
<math>\!\left [ \pi P \right ]_j = \pi_1 P_{1j} + \pi_2 P_{2j} + \dots + \pi_N P_{Nj}</math><br />
<br />
::<math>\! = \sum_{i=1}^N \pi_i P_{ij}</math><br />
<br />
::<math>\! = \sum_{i=1}^N P_{ji} \pi_j</math>, by the definition of detailed balance.<br />
<br />
::<math>\! = \pi_j \sum_{i=1}^N P_{ji}</math><br />
<br />
::<math>\! = \pi_j</math>, as the entries in each row of P sum to 1.<br />
<br />
So <math>\!\pi = \pi P</math>.<br />
<br />
<br />
'''Example'''<br />
<br />
Find the marginal distribution of <br />
<br />
[[File:MarkovSample.png|300px]]<br />
<br />
Start by generating the matrix P.<br />
<br />
<math>\!P = \begin{pmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{pmatrix}</math><br />
<br />
We must assume some starting value for <math>\mu_0</math><br />
<br />
<math>\!\mu_0 = \begin{pmatrix} 0.1 & 0.9 \end{pmatrix}</math><br />
<br />
For t = 1, the marginal distribution is<br />
<br />
<math>\!\mu_1 = \mu_0 P</math><br />
<br />
Notice that this <math>\mu</math> converges. <br />
<br />
If you repeatedly run:<br />
<br />
<math>\!\mu_{i+1} = \mu_i P</math><br />
<br />
It converges to <math>\mu = \begin{pmatrix} 0.4286 & 0.5714 \end{pmatrix}</math><br />
<br />
This can be seen by running the following Matlab code:<br />
P = [0.2 0.8; 0.6 0.4];<br />
mu = [0.1 0.9]; <br />
while 1 <br />
mu_old = mu; <br />
mu = mu * P;<br />
if mu_old == mu <br />
disp(mu);<br />
break;<br />
end<br />
end<br />
<br />
Another way of looking at this question is to check whether the empirical pmf of the simulated chain converges:<br />
<br />
Let <math>\hat{p_n}(1)=\frac{1}{n}\sum_{k=1}^n I(X_k=1)</math> denote the estimator of the stationary probability of state 1,<math>\hat{p_n}(2)=\frac{1}{n}\sum_{k=1}^n I(X_k=2)</math> denote the estimator of the stationary probability of state 2, where <math>\displaystyle I(X_k=1)</math> and <math>\displaystyle I(X_k=2)</math> are indicator variables which equal 1 if <math>X_k=1</math>(or <math>X_k=2</math> for the latter one).<br />
<br />
The MATLAB code for this is:<br />
<br />
n=1;<br />
if rand<0.1<br />
x(1)=1;<br />
else<br />
x(1)=0;<br />
end<br />
p1(1)=sum(x)/n;<br />
p2(1)=1-p1(1);<br />
for i=2:10000<br />
n=n+1;<br />
if (x(i-1)==1&rand<0.2)|(x(i-1)==0&rand<0.6)<br />
x(i)=1;<br />
else<br />
x(i)=0;<br />
end<br />
p1(i)=sum(x)/n;<br />
p2(i)=1-p1(i); <br />
end<br />
plot(p1,'red');<br />
hold on;<br />
plot(p2)<br />
<br />
The results can be easily seen from the graph below:<br />
<br />
[[File:Stationary distribution.png|300px]]<br />
<br />
Additionally, we can plot the marginal distribution as it converges without estimating it. The following Matlab code shows this:<br />
<br />
%transition matrix<br />
P=[0.2 0.8; 0.6 0.4];<br />
%mu at time 0<br />
mu=[0.1 0.9];<br />
%number of points for simulation<br />
n=20;<br />
for i=1:n<br />
mu_a(i)=mu(1);<br />
mu_b(i)=mu(2);<br />
mu=mu*P;<br />
end<br />
t=[1:n];<br />
plot(t, mu_a, t, mu_b);<br />
hleg1=legend('state a', 'state b');<br />
<br />
[[File:Marginal distribution convergence.png|300px]]<br />
<br />
Note that there are chains with stationary distributions that don't converge (the chain might not naturally reach the stationary distribution, and isn't limiting). An example of this is:<br />
<br />
<math>P = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \mu_0 = \begin{pmatrix} 1/3 & 1/3 & 1/3 \end{pmatrix}</math><br />
<br />
<math>\!\mu_0</math> is a stationary distribution, so <math>\!\mu P</math> is the same for all iterations.<br />
<br />
But,<br />
<br />
<math>P^{1000} = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \ne \begin{pmatrix} \mu \\ \mu \\ \mu \end{pmatrix}</math><br />
<br />
So <math>\!\mu</math> is not a limiting distribution. Also, if<br />
<br />
<math>\mu = \begin{pmatrix} 0.2 & 0.1 & 0.7 \end{pmatrix}</math><br />
<br />
Then <math>\!\mu = \mu P</math> does not converge.<br />
<br />
This can be observed through the following Matlab code.<br />
<br />
P = [0 0 1; 1 0 0; 0 1 0];<br />
mu = [0.2 0.1 0.7]; <br />
for i= 1:4 <br />
mu = mu * P;<br />
disp(mu);<br />
end<br />
<br />
This outputs<br />
0.1000 0.7000 0.2000<br />
0.7000 0.2000 0.1000<br />
0.2000 0.1000 0.7000<br />
0.1000 0.7000 0.2000<br />
<br />
Note that <math>\!\mu_1 = \!\mu_4</math>, which indicates that <math>\!\mu</math> will cycle forever.<br />
<br />
This means that this chain has a stationary distribution, but is not limiting.<br />
<br />
===Page Rank===<br />
<br />
Page Rank was the original ranking algorithm used by Google's search engine to rank web pages.<ref><br />
http://ilpubs.stanford.edu:8090/422/<br />
</ref> The algorithm was created by the founders of Google, Larry Page and Sergey Brin as part of Page's PhD thesis. When a query is entered in a search engine, there are a set of web pages which are matched by this query, but this set of pages must be ordered by their "importance" in order to identify the most meaningful results first. Page Rank is an algorithm which assigns importance to every web page based on the links in each page.<br />
<br />
==== Intuition ====<br />
<br />
We can represent web pages by a set of nodes, where web links are represented as edges connecting these nodes. Based on our intuition, there are three main factors in deciding whether a web page is important or not.<br />
<br />
# A web page is important if many other pages point to it.<br />
# The more important a webpage is, the more weight is placed on its links.<br />
# The more links a webpage has, the less weight is placed on its links.<br />
<br />
====Modelling====<br />
<br />
We can model the set of links as a N-by-N matrix L, where N is the number of web pages we are interested in:<br />
<br />
<math>L_{ij} =<br />
\left\{<br />
\begin{array}{lr}<br />
1 : \text{if page j points to i}\\<br />
0 : \text{otherwise}<br />
\end{array}<br />
\right. <br />
</math><br />
<br />
<br />
<br />
The number of outgoing links from page j is<br />
<br />
<math>c_j = \sum_{i=1}^N L_{ij}</math><br />
<br />
For example, consider the following set of links between web pages:<br />
<br />
[[File:PageRank.png|250px]]<br />
<br />
According to the factors relating to importance of links, we can consider two possible rankings :<br />
<br />
<br />
<math>\displaystyle 3 > 2 > 1 > 4 </math> <br />
<br />
or<br />
<br />
<math>\displaystyle 3>1>2>4 </math> <br />
if we consider that the high importance of the link from page 3 to page 1 has more influence than the fact that there are two outgoing links from page 1 and only one from page 2.<br />
<br />
<br />
We have <math>L = \begin{bmatrix} <br />
0 & 0 & 1 & 0 \\ <br />
1 & 0 & 0 & 0 \\ <br />
1 & 1 & 0 & 1 \\<br />
0 & 0 & 0 & 0<br />
\end{bmatrix}</math>, and <math>c = \begin{pmatrix}2 & 1 & 1 & 1\end{pmatrix} </math><br />
<br />
We can represent the ranks of web pages as the vector P, where the i<sup>th</sup> element is the rank of page i:<br />
<br />
<math>P_i = (1-d) + d\sum_j \frac{L_{ij}}{c_j} P_j</math><br />
<br />
Here we take the sum of the weights of the incoming links: a link counts for less if the linking page has many outgoing links, and counts for more if the linking page itself has a high rank <math>P_j</math> (which it typically gets from having many incoming links). <br />
<br />
We don't want to completely ignore pages with no incoming links, which is why we add the constant (1 - d).<br />
<br />
If <br />
<br />
<math>L = \begin{bmatrix} L_{11} & \cdots & L_{1N} \\<br />
\vdots & \ddots & \vdots \\<br />
L_{N1} & \cdots & L_{NN} \end{bmatrix}</math><br />
<br />
<math>D = \begin{bmatrix} c_1 & \cdots & 0 \\<br />
\vdots & \ddots & \vdots \\<br />
0 & \cdots & c_N \end{bmatrix}</math><br />
<br />
Then <math>D^{-1} = \begin{bmatrix} c_1^{-1} & \cdots & 0 \\<br />
\vdots & \ddots & \vdots \\<br />
0 & \cdots & c_N^{-1} \end{bmatrix}</math><br />
<br />
<math>\!P = (1-d)e + dLD^{-1}P</math><br />
<br />
where <math>\!e = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}</math> is the vector with all 1's<br />
<br />
To simplify the problem, we let <math>\!e^T P = N \Rightarrow \frac{e^T P}{N} = 1</math>. This means that the average importance of all pages on the internet is 1.<br />
<br />
Then<br />
<math>\!P = (1-d)\frac{ee^TP}{N} + dLD^{-1}P</math><br />
::<math>\! = \left [ (1-d)\frac{ee^T}{N} + dLD^{-1} \right ] P</math><br />
::<math>\! = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ] P</math>, where <math> E </math> is an NxN matrix filled with ones.<br />
<br />
Let <math>\!A = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ]</math><br />
<br />
Then <math>\!P = AP</math>.<br />
<br />
<br />
Note that P is a stationary distribution and, more importantly, P is an eigenvector of A with eigenvalue 1. Therefore, we can find the ranks of all web pages by solving this equation for P. <br />
<br />
We can find the vector P for the example above, using the following Matlab code:<br />
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];<br />
D = [2 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1];<br />
d = 0.8 ;% pages with no links get a weight of 0.2<br />
N = 4 ;<br />
<br />
A = ((1-d)/N) * ones(N) + d * L * inv(D);<br />
[EigenVectors, EigenValues] = eigs(A)<br />
s=sum(EigenVectors(:,1));% we should note that the average entry of P should be 1 according to our assumption<br />
P=(EigenVectors(:,1))/s*N<br />
<br />
This outputs:<br />
<br />
EigenVectors =<br />
-0.6363 0.7071 0.7071 -0.0000 <br />
-0.3421 -0.3536 + 0.3536i -0.3536 - 0.3536i -0.7071 <br />
-0.6859 -0.3536 - 0.3536i -0.3536 + 0.3536i 0.0000 <br />
-0.0876 0.0000 + 0.0000i 0.0000 - 0.0000i 0.7071 <br />
<br />
<br />
EigenValues =<br />
1.0000 0 0 0 <br />
0 -0.4000 - 0.4000i 0 0 <br />
0 0 -0.4000 + 0.4000i 0 <br />
0 0 0 0.0000 <br />
<br />
P =<br />
<br />
1.4528<br />
0.7811<br />
1.5660<br />
0.2000<br />
<br />
Note that there is an eigenvector with eigenvalue 1. <br />
The reason an eigenvector with eigenvalue 1 always exists is that A is a column-stochastic matrix (each of its columns sums to one). <br />
<br />
Thus our vector P is <math> <br />
\begin{bmatrix}1.4528 \\ 0.7811 \\ 1.5660\\ 0.2000 \end{bmatrix}</math><br />
<br />
However, this method is not practical, because there are simply too many web pages on the internet. So instead Google uses a method to approximate an eigenvector with eigenvalue 1.<br />
<br />
Note that page three has the rank with highest magnitude and page four has the rank with lowest magnitude, as expected.<br />
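<br />
One standard way to approximate this eigenvector is power iteration: repeatedly multiply a starting vector by A and rescale; because 1 is the dominant eigenvalue of A here, the iterates converge to the rank vector P. The following is a minimal sketch for the small example above (the number of iterations is an arbitrary choice).<br />
<br />
<pre><br />
% Minimal sketch: power iteration for the 4-page example<br />
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];<br />
D = diag([2 1 1 1]);<br />
d = 0.8;  N = 4;<br />
A = ((1-d)/N) * ones(N) + d * L / D;   % same matrix A as before (L/D = L*inv(D))<br />
P = ones(N,1);                         % start from the uniform ranking<br />
for k = 1:100<br />
    P = A * P;<br />
    P = P / sum(P) * N;                % rescale so the average rank is 1<br />
end<br />
P                                      % approaches (1.4528, 0.7811, 1.5660, 0.2000)<br />
</pre><br />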
<br />
==''' Markov Chain Monte Carlo - Metropolis-Hastings - October 25th, 2011'''==<br />
<br />
We want to find <math> \int h(x)f(x)\, \mathrm dx </math>, but we don't know how to sample from <math>\,f</math>.<br />
<br />
We have seen simple techniques before; this one is used in real applications.<br />
It consists of constructing a Markov Chain whose stationary distribution is <math>\,f</math>.<br />
<br />
==== Main procedure ====<br />
<br />
Let us suppose that <math>\,q(y|x)</math> is a friendly distribution: we can sample from this function.<br />
<br />
1. Initialize the chain with some starting value <math>\,x_{0}</math> and set <math>\,i=0</math>.<br />
<br />
2. Draw a point from <math>\,q(y|x)</math> i.e. <math>\,Y \backsim q(y|x_{i})</math>.<br />
<br />
3. Evaluate <math>\,r(x,y)=min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br />
<br />
<br />
4. Draw a point <math>\,U \backsim Unif[0,1]</math>.<br />
<br />
5. <math>\,x_{i+1}=\begin{cases}y & \text{ if } U<r \\x_{i} & \text{ otherwise } \end{cases} </math>.<br />
<br />
6. <math>\,i=i+1</math>. Go back to 2.<br />
<br />
==== Remark 1 ====<br />
<br />
A very common choice for <math>\,q(y|x)</math> is <math>\,N(y;x,b^{2})</math>, a normal distribution centered at the current point.<br />
<br />
Note : In this case <math>\,q(y|x)</math> is symmetric i.e. <math>\,q(y|x)=q(x|y)</math>.<br />
<br />
(Because <math>\,q(y|x)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math> and <math>\,(y-x)^{2}=(x-y)^{2}</math>).<br />
<br />
Thus we have <math>\,\frac{q(x|y)}{q(y|x)}=1</math>, which implies :<br />
<br />
<math>\,r(x,y)=min\left\{\frac{f(y)}{f(x)},1\right\}</math>.<br />
<br />
In general, if <math>\,q(x|y)</math> is symmetric then the algorithm is called Metropolis in reference to the original algorithm (made in 1953)<ref>http://en.wikipedia.org/wiki/Equations_of_State_Calculations_by_Fast_Computing_Machines</ref>.<br />
<br />
<br />
<br />
====Remark 2====<br />
<br />
The value y is accepted if <math>\,u<min\left\{\frac{f(y)}{f(x)},1\right\}</math> so it is accepted with the probability <math>\,min\left\{\frac{f(y)}{f(x)},1\right\}</math>.<br />
<br />
Thus, if <math>\,f(y)>f(x)</math>, then <math>\,y</math> is always accepted.<br />
<br />
The higher the value of the pdf is in the vicinity of a point <math>\,y_1</math>, the more likely it is that the random variable takes values around <math>\,y_1</math>. As a result, it makes sense to have a high probability of acceptance for points generated near <math>\,y_1</math>.<br />
<br />
====Remark 3====<br />
<br />
One strength of the Metropolis-Hastings algorithm is that normalizing constants, which are often quite difficult to determine, can be cancelled out in the ratio <math> r </math>. For example, consider the case where we want to sample from the beta distribution, which has the pdf:<br />
<br />
<math><br />
\begin{align}<br />
f(x;\alpha,\beta)& = \frac{1}{\mathrm{B}(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\end{align}<br />
</math><br />
<br />
The beta function, ''B'', appears as a normalizing constant, but it cancels in the ratio <math> r </math> by construction of the method.<br />
<br />
====Example====<br />
<br />
<math>\,f(x)=\frac{1}{\pi}\frac{1}{1+x^{2}}</math> (the standard Cauchy density)<br />
<br />
Then, we have <math>\,f(x)\propto\frac{1}{1+x^{2}}</math>.<br />
<br />
And let us take <math>\,q(x|y)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math>.<br />
<br />
Then <math>\,q(x|y)</math> is symmetric.<br />
<br />
Therefore <math>\,r(x,y)</math> can be simplified.<br />
<br />
<br />
We get :<br />
<br />
<math>\,\begin{align}<br />
\displaystyle r(x,y) <br />
& =min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} \\<br />
& =min\left\{\frac{f(y)}{f(x)},1\right\} \\<br />
& =min\left\{ \frac{ \frac{1}{1+y^{2}} }{ \frac{1}{1+x^{2}} },1\right\}\\<br />
& =min\left\{ \frac{1+x^{2}}{1+y^{2}},1\right\}\\<br />
\end{align}<br />
</math>.<br />
<br />
<br />
<br />
The Matlab code of the algorithm is the following :<br />
<br />
<pre><br />
clear all<br />
close all<br />
clc<br />
b=2;<br />
x(1)=randn;<br />
for i=2:10000<br />
y=b*randn+x(i-1);<br />
r=min((1+x(i-1)^2)/(1+y^2),1);<br />
u=rand;<br />
if u<r<br />
x(i)=y;<br />
else<br />
x(i)=x(i-1);<br />
end<br />
<br />
end<br />
hist(x(5000:end));<br />
%The Markov Chain usually takes some time to converge; this initial period is known as the "burn-in" time.<br />
%Therefore, we don't display the first 5000 points because they don't show the limiting behaviour of the Markov Chain.<br />
</pre><br />
<br />
The value of b is a tuning parameter that we choose ourselves.<br />
<br />
Changing this value has a significant impact on the results we obtain. There is a pitfall when b is too big or too small.<br />
<br />
Example with <math>\,b=0.1</math> (the second graph is obtained by running j=5000:10000; plot(j,x(5000:10000))):<br />
<br />
[[File:redaccoursb01.JPG|300px]] [[File:001Metr.PNG|300px]]<br />
<br />
With <math>\,b=0.1</math>, the chain takes small steps so the chain doesn't explore enough of the sample space. It doesn't give an accurate report of the function we want to sample.<br />
<br />
<br />
<br />
Example with <math>\,b=10</math> :<br />
<br />
[[File:redaccoursb10.JPG|300px]] [[File:010metro.PNG|300px]]<br />
<br />
With <math>\,b=10</math>, jumps are large and very unlikely to be accepted, as the proposed points deviate far from the region where <math>\,f</math> has most of its mass (i.e. <math>\,y</math> is rejected whenever <math>\ u>r </math>, so <math>\,x(i)=x(i-1)</math> most of the time); hence most sample points stay fairly close to the origin.<br />
The third graph, which resembles white noise (as in the case of <math>\,b=2</math>), indicates better sampling, as more of the sample space is explored and proposals are accepted at a reasonable rate. For <math>\,b=0.1</math>, we have lots of accepted jumps, but they are all small, so the chain explores the space slowly and the stationary distribution is less apparent; whereas in the <math>\,b=10</math> case many points are repeated and the chain remains around 0. Approximately 73% of the proposals were rejected, i.e. x(i) was set to x(i-1).<br />
<br />
<br />
Example with <math>\,b=2</math> :<br />
<br />
[[File:redaccoursb2.JPG|300px]] [[File:100metr.PNG|300px]]<br />
<br />
With <math>\,b=2</math>, we get a more accurate result as we avoid these extremes. Approximately 37% of the proposals were rejected, i.e. x(i) was set to x(i-1).<br />
<br />
<br />
If the sample from the Markov Chain starts to look like the target distribution quickly, we say the chain is mixing well.<br />
<br />
==''' Theory and Applications of Metropolis-Hastings - October 27th, 2011'''==<br />
<br />
As mentioned in the previous section, the idea of the Metropolis-Hastings (MH) algorithm is to produce a Markov chain that converges to a stationary distribution <math>f</math> which we are interested in sampling from.<br />
<br />
====Convergence====<br />
<br />
One important fact to check is that <math>\displaystyle f</math> is indeed a stationary distribution in the MH scheme. For this, we can appeal to the implications of the detailed balance property:<br />
<br />
Given a probability vector <math>\!\pi</math> and a transition matrix <math>\displaystyle P</math>, <math>\!\pi</math> has the detailed balance property if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math><br />
<br />
If <math>\!\pi</math> satisfies detailed balance, then it is a stationary distribution.<br />
<br />
The above definition applies to the case where the states are discrete. In the continuous case, <math>\displaystyle f</math> satisfies detailed balance if <math>\displaystyle f(x)p(x,y)=f(y)p(y,x)</math>, where <math>\displaystyle p(x,y)</math> and <math>\displaystyle p(y,x)</math> are the transition densities from x to y and from y to x respectively. If we can show that <math>\displaystyle f</math> has the detailed balance property, we can conclude that it is a stationary distribution, because <math>\int f(y)p(y,x)\,dy=\int f(x)p(x,y)\,dy=f(x)</math>.<br />
<br />
In the MH algorithm, we use a proposal distribution to generate y~<math>\displaystyle q(y|x)</math>, and accept y with probability <math>min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br />
<br />
Suppose, without loss of generality, that <math>\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)} \le 1</math>. This implies that <math>\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)} \ge 1</math><br />
<br />
Let <math>\,r(x,y)</math> be the chance of accepting point y given that we are at point x.<br />
<br />
So <math>\,r(x,y) = min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} = \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)}</math><br />
<br />
Let <math>\,r(y,x)</math> be the chance of accepting point x given that we are at point y.<br />
<br />
So <math>\,r(y,x) = min\left\{\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)},1\right\} = 1</math><br />
<br />
<br />
<math>\,p(x,y)</math> is the probability of generating and accepting y, while at point x.<br />
<br />
So <math>\,p(x,y) = q(y|x)r(x,y) = q(y|x) \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)} = \frac{f(y)q(x|y)}{f(x)}</math><br />
<br />
<br />
<math>\,p(y,x)</math> is the probability of generating and accepting x, while at point y.<br />
<br />
So <math>\,p(y,x) = q(x|y)r(y,x) = q(x|y)</math><br />
<br />
<br />
<math>\,f(x)p(x,y) = f(x)\frac{f(y)q(x|y)}{f(x)} = f(y)q(x|y) = f(y)p(y,x)</math><br />
<br />
Thus, detailed balance holds.<br />
:i.e. <math>\,f(x)</math> is a stationary distribution of the chain<br />
<br />
It can be shown (although not here) that <math>f</math> is a limiting distribution as well. Therefore, the MH algorithm generates a sequence whose distribution converges to <math>f</math>, the target.<br />
<br />
====Implementation====<br />
<br />
In the implementation of MH, the proposal distribution is commonly chosen to be symmetric, which simplifies the calculations and makes the algorithm more intuitively understandable. The MH algorithm can usually be regarded as a random walk along the distribution we want to sample from. Suppose we have a distribution <math>f</math>:<br />
<br />
[[File:Standard normal distribution.gif]]<br />
<br />
Suppose we start the walk at point <math>x</math>. The point <math>y_{1}</math> is in a denser region than <math>x</math>, therefore, the walk will always progress from <math>x</math> to <math>y_{1}</math>. On the other hand, <math>y_{2}</math> is in a less dense region, so it is not certain that the walk will progress from <math>x</math> to <math>y_{2}</math>. In terms of the MH algorithm:<br />
<br />
<math>r(x,y_{1})=min(\frac{f(y_{1})}{f(x)},1)=1</math> since <math>f(y_{1})>f(x)</math>. Thus, any generated value with a higher density will be accepted.<br />
<br />
<math>r(x,y_{2})=\frac{f(y_{2})}{f(x)}</math>. The lower the density of <math>y_{2}</math> is, the less chance it will have of being accepted.<br />
<br />
A certain class of proposal distributions can be written in the form:<br />
<br />
<math>\,y|x_i = x_i + \epsilon_i</math><br />
<br />
where <math>\,\epsilon_i</math> has a density of the form <math>\,g(|x-y|)</math><br />
<br />
The density depends only on the distance between the current point and the next one (which can be seen as the "step" being taken). These proposal distributions give the Markov chain the random walk nature. The normal distribution that we frequently use in our examples satisfies the above definition.<br />
<br />
In actual implementations of the MH algorithm, the proposal distribution needs to be chosen judiciously, because not all proposals will work well with all target distributions we want to sample from. Take a trimodal distribution for example:<br />
<br />
[[File:trimodal.jpg]]<br />
<br />
If we choose the proposal distribution to be a standard normal as we have done before, problems will arise. The low densities between the peaks means that the MH algorithm will almost never walk to any points generated in these regions and get stuck at one peak. One way to address this issue is to increase the variance, so that the steps will be large enough to cross the gaps. Of course, in this case, it would probably be beneficial to come up with a different proposal function. As a rule of thumb, such functions should result in an approximately 50% acceptance rate for generated points.<br />
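<br />
As a rough illustration of this issue (the target below is an arbitrary unnormalized mixture of three normals centred at -10, 0 and 10, not the distribution in the figure), the following sketch runs the Metropolis algorithm with a normal proposal whose standard deviation b can be varied; with b around 1 the chain tends to stay in a single mode, while a larger b lets it cross between the modes.<br />
<br />
<pre><br />
% Minimal sketch: Metropolis on a trimodal target (unnormalized mixture of normals)<br />
f = @(x) exp(-(x+10).^2/2) + exp(-x.^2/2) + exp(-(x-10).^2/2);<br />
b = 10;                        % proposal standard deviation; try b = 1 to see the chain get stuck<br />
x = zeros(1,10000);<br />
accept = 0;<br />
for i = 2:10000<br />
    y = b*randn + x(i-1);      % symmetric proposal N(x(i-1), b^2)<br />
    r = min(f(y)/f(x(i-1)), 1);<br />
    if rand < r<br />
        x(i) = y;  accept = accept + 1;<br />
    else<br />
        x(i) = x(i-1);<br />
    end<br />
end<br />
hist(x(5000:end), 50)          % with b large enough, all three modes are visited<br />
accept/9999                    % acceptance rate<br />
</pre><br />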
<br />
====Simulated Annealing====<br />
<br />
Metropolis-Hastings is very useful in simulation methods for solving optimization problems. One such application is simulated annealing, which addresses the problems of minimizing a function <math>h(x)</math>. This method will not always produce the global solution, but it is intuitively simple and easy to implement.<br />
<br />
Consider <math>e^{\frac{-h(x)}{T}}</math>; maximizing this expression is equivalent to minimizing <math>h(x)</math>. Suppose <math>\mu</math> is the minimizer of <math>h</math> and <math>h(x)=(x-\mu)^2</math>; then the function to be maximized is proportional to a Gaussian density, <math>e^{-\frac{(x-\mu)^2}{T}}</math>. When many samples are taken from this distribution, their mean converges to the desired value <math>\mu</math>. The annealing comes into play by lowering T (the temperature) as the sampling progresses, making the distribution narrower. The steps of simulated annealing are outlined below:<br />
<br />
1. start with a random <math>x</math> and set T to a large number<br />
<br />
2. generate <math>y</math> from a proposal distribution <math>q(y|x)</math>, which should be symmetric<br />
<br />
3. accept <math>y</math> with probability <math>min(\frac{f(y)}{f(x)},1)</math><br />
<br />
4. decrease T, and then go to step 2<br />
<br />
The following plot and Matlab code illustrate the simulated annealing idea as the temperature ''T'', which plays the role of the variance, decreases for a Gaussian distribution with zero mean. Starting off with a large value for the temperature ''T'' allows the Metropolis-Hastings component of the procedure to capture the mean, before gradually decreasing the temperature ''T'' in order to converge to the mean. <br />
<br />
[[File:Simulated annealing illustration.png]]<br />
<br />
x=-10:0.1:10;<br />
mu=0;<br />
T=5;<br />
colour = ['b', 'g', 'm', 'r', 'k'];<br />
for i=1:5<br />
pdfNormal=normpdf(x, mu, T);<br />
plot(x, pdfNormal, colour(i));<br />
T=T-1;<br />
hold on<br />
end<br />
hleg1=legend('T=5', 'T=4', 'T=3', 'T=2', 'T=1');<br />
title('Simulated Annealing Illustration');<br />
<br />
=='''References'''==<br />
<br />
<references/><br />
<br />
=='''Simulated Annealing and Gibbs Sampling - November 1, 2011'''==<br />
<br />
continued from previous lecture...<br />
<br />
We will now look at a couple cases where <math> \displaystyle h(y) > h(x) </math> or <math> \displaystyle h(y) < h(x) </math>, and explore whether to accept or reject <math> y </math>.<br />
<br />
Recall <math>\,r(x,y)=\min\left\{\frac{f(y)}{f(x)},1\right\}</math>, where <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}}</math>. Here r(x,y) represents the probability of accepting <math>y</math>.<br />
<br />
====Cases====<br />
<br />
Case a)<br />
Suppose <math> \displaystyle h(y) < h(x) </math>. Since we want to find the minimum value for <math>\displaystyle h(x) </math>, and the point <math>\displaystyle y </math> creates a lower value than our previous point, we accept the new point. Mathematically, <math>\displaystyle h(y) < h(x) </math> implies that:<br />
<br />
<math> \frac{f(y)}{f(x)} > 1 </math>. Therefore,<br />
<math> \displaystyle r = 1 </math>.<br />
So, we will always accept <math>\displaystyle y </math>.<br />
<br />
Case b)<br />
Suppose <math> \displaystyle h(y) > h(x) </math>. This is bad, since our goal is to minimize <math>\displaystyle h(x) </math>. However, we may still accept <math>\displaystyle y </math> with some chance:<br />
<br />
<math> \frac{f(y)}{f(x)} < 1 </math>. Therefore,<br />
<math>\displaystyle r < 1 </math>.<br />
So, we may accept <math>\displaystyle y </math> with probability <math>\displaystyle r </math>.<br />
<br />
<br />
Next, we will look at these cases as <math>\displaystyle T\to0 </math>.<br />
<br />
As <math>\displaystyle T\to0 </math> and case a) happens, <math> e^{\frac{h(x)-h(y)}{T}} </math> approaches infinity, so we will always accept <math>\displaystyle y </math>.<br />
<br />
As <math>\displaystyle T\to0 </math> and case b) happens, <math> e^{\frac{h(x)-h(y)}{T}} </math> approaches zero, so the probability that <math>\displaystyle y </math> will be accepted gets extremely small.<br />
<br />
It is worth noting that if we simply start with a small value of T, we may end up rejecting almost all of the generated points and hence get stuck somewhere in the domain (due to case b)); this point may be only a local optimum rather than the global one. It is therefore necessary to start with a large value of T in order to explore the whole function. At the same time, a reasonable initial guess x0 helps (it should not be too far from the optimum). <br />
<br />
=====Example=====<br />
<br />
Let <math>\displaystyle h(x) = (x-2)^2 </math>.<br />
The graph of it is:<br />
[[File:PCh(x).jpg|center|500]]<br />
<br />
Then, <math> e^{\frac{-h(x)}{T}} = e^{\frac{-(x-2)^2}{T}} </math> . Take an initial value of T = 20. A graph of this is:<br />
[[File:PC-highT.jpg|center|500]]<br />
<br />
<br />
In comparison, we look a graph of T = 0.2:<br />
[[File:PC-lowT.jpg|center|500]]<br />
<br />
One can see that with a low T value the function is sharply peaked, so the ratio <math>\frac{f(y)}{f(x)}</math> is either close to 0 or very large and the acceptance probability r jumps between 0 and 1, while a bigger T value gives smoother transitions.<br />
<br />
The MATLAB code for the above graphs are:<br />
<pre><br />
ezplot('(x-2)^2',[-6,10])<br />
ezplot('exp((-(x-2)^2)/20)',[-6,10])<br />
ezplot('exp((-(x-2)^2)/0.2)',[-6,10])<br />
</pre><br />
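<br />
A minimal simulated annealing sketch for this example is given below; the normal proposal and the geometric cooling schedule are illustrative choices, not the only possibilities. The chain should end up close to the minimizer x = 2.<br />
<br />
<pre><br />
% Minimal sketch: simulated annealing for h(x) = (x-2)^2<br />
h = @(x) (x-2).^2;<br />
T = 20;                        % start with a large temperature<br />
x = 10;                        % arbitrary starting point<br />
for k = 1:2000<br />
    y = x + randn;             % symmetric proposal<br />
    r = min(exp((h(x) - h(y))/T), 1);<br />
    if rand < r<br />
        x = y;<br />
    end<br />
    T = 0.995 * T;             % slowly decrease the temperature<br />
end<br />
x                              % close to the minimizer x = 2<br />
</pre><br />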
<br />
=====Travelling Salesman Problem=====<br />
<br />
The simulated annealing method can be applied to compute a solution to the travelling salesman problem. Suppose there are N cities and the salesman has to visit each city exactly once. The objective is to find the shortest path (i.e. the shortest total length of the journey) connecting the cities. An algorithm using simulated annealing for this problem can be found here ([http://www.cs.ubbcluj.ro/~csatol/mestint/pdfs/Numerical_Recipes_Simulated_Annealing.pdf Reference]).<br />
<br />
===Gibbs Sampling===<br />
<br />
Gibbs sampling is another Markov chain Monte Carlo method, similar to Metropolis-Hastings. There are two main differences between Metropolis-Hastings and Gibbs sampling. First, the candidate state is always accepted as the next state in Gibbs sampling. Second, it is assumed that the full conditional distributions are known, i.e. <math>P(X_i=x|X_j=x_j, \forall j\neq i)</math> for all <math>\displaystyle i</math>. The idea is that it is easier to sample from conditional distributions, which are one-dimensional, than from a joint distribution, which is higher dimensional. Gibbs sampling is a way to turn the joint distribution into multiple conditional distributions.<br />
<br />
<b>Advantages:</b><br /><br />
- sampling from conditional distributions may be easier than sampling from joint distributions<br />
<br />
<b>Disadvantages:</b><br /><br />
- we do not necessarily know the conditional distributions<br />
<br />
For example, if we want to sample from <math>\, f_{X,Y}(x,y)</math>, we need to know how to sample from <math>\, f_{X|Y}(x|y)</math> and <math>\, f_{Y|X}(y|x)</math>. Suppose the chain starts with <math>\,(X_0,Y_0)</math> and <math>(X_1,Y_1), \dots , (X_n,Y_n)</math> have been sampled. Then,<br />
<br />
<math>\, X_{n+1} \sim f_{X|Y}(x|Y_n), \qquad Y_{n+1} \sim f_{Y|X}(y|X_{n+1})</math><br />
<br />
Gibbs sampling turns a multi-dimensional distribution into a set of one-dimensional distributions. If we want to sample from <br />
<br />
<math>P_{X^1,\dots ,X^p}(x^1,\dots ,x^p)</math> <br />
<br />
and the full conditionals are known, then:<br />
<br />
<math>X^1_{n+1}\sim f(X^1|X^2_n,\dots ,X^p_n)</math><br />
<br />
<math>X^2_{n+1}\sim f(X^2|X^1_{n+1},X^3_n,\dots ,X^p_n)</math><br />
<br />
<math>\vdots</math><br />
<br />
<math>X^{p-1}_{n+1}\sim f(X^{p-1}|X^1_{n+1},\dots ,X^{p-2}_{n+1},X^p_n)</math><br />
<br />
<math>X^p_{n+1}\sim f(X^p|X^1_{n+1},\dots ,X^{p-1}_{n+1})</math><br />
<br />
With Gibbs sampling, we can simulate <math>\displaystyle n</math> random variables sequentially from <math>\displaystyle n</math> univariate conditionals rather than generating one <math>n</math>-dimensional vector using the full joint distribution, which could be a lot more complicated.<br />
<br />
Computational inference deals with probabilistic graphical models. Gibbs sampling is useful here: graphical models show the dependence relations among random variables. For instance, Bayesian networks are graphical models represented using directed acyclic graphs. Looking at such a graphical model tells us on which random variable the distribution of a certain random variable depends (i.e. its parent). The model can be used to "factor" a joint distribution into conditional distributions.<br />
<br />
[[File:stat341_nov_1_graphical_model.png|200px|thumb|left|Sample graphical model of five RVs]]<br />
<br />
For example, consider the five random variables A, B, C, D, and E. Without making any assumptions about dependence relations among them, all we know is <br />
<br />
<math>\, P(A,B,C,D,E)=</math><math>\, P(A|B,C,D,E) P(B|C,D,E) P(C|D,E) P(D|E) P(E)</math><br />
<br />
However, if we know the relation between the random variables, e.g. given the graphical model on the left, we can simplify this expression:<br />
<br />
<math>\, P(A,B,C,D,E)=P(A) P(B|A) P(C|A) P(D|C) P(E|C)</math><br />
<br />
Although the joint distribution may be very complicated, the conditional distributions may not be.<br />
<br />
Check out the following notes on Gibbs sampling:<br />
<br />
* [http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf MCMC and Gibbs Sampling, MIT Lecture Notes]<br />
* chapter 7.4 in [http://stat.fsu.edu/~anuj/pdf/classes/CompStatI09/BOOK.pdf Notes on Computational Methods in Statistics]<br />
* chapter 4.9 in [http://www.ma.hw.ac.uk/~foss/StochMod/Ross_S.pdf Introduction to Probability Models] by Sheldon Ross<br />
<br />
====Example of Gibbs sampling: Multi-variate normal====<br />
<br />
We'd like to generate samples from a bivariate normal with parameters<br />
<br />
<math>\mu = \begin{bmatrix}1\\ 2 \end{bmatrix} = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}</math> <br />
and <math>\Sigma = \begin{bmatrix}1 && 0.9 \\ 0.9 && 1 \end{bmatrix}= \begin{bmatrix}1 && \rho \\ \rho && 1 \end{bmatrix}</math><br />
<br />
The conditional distributions of multi-variate normal random variables are also normal:<br />
<br />
<math>\, f(x_1|x_2)=N(\mu_1 + \rho(x_2-\mu_2), 1-\rho^2)</math><br />
<br />
<math>\, f(x_2|x_1)=N(\mu_2 + \rho(x_1-\mu_1), 1-\rho^2)</math><br />
<br />
(In general, if the joint distribution has parameters<br />
<br />
<math>\mu = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}</math> and <math>\Sigma = \begin{bmatrix} \Sigma _{1,1} && \Sigma _{1,2} \\ \Sigma _{2,1} && \Sigma _{2,2} \end{bmatrix}</math><br />
<br />
then the conditional distribution <math>\, f(x_1|x_2)</math> has mean <math>\, \mu_1 + \Sigma _{1,2}(\Sigma _{2,2})^{-1}(x_2-\mu_2)</math> and variance <math>\, \Sigma _{1,1}-\Sigma _{1,2}(\Sigma _{2,2})^{-1}\Sigma _{2,1}</math>.)<br />
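<br />
A minimal MATLAB sketch of the Gibbs sampler for this bivariate normal (with <math>\mu = (1,2)</math> and <math>\rho = 0.9</math>) is given below; the starting point and the chain length are arbitrary choices, and the first part of the chain is discarded as burn-in.<br />
<br />
<pre><br />
% Minimal sketch: Gibbs sampler for the bivariate normal with mu = (1,2), rho = 0.9<br />
mu1 = 1;  mu2 = 2;  rho = 0.9;<br />
n = 5000;<br />
x1 = zeros(1,n);  x2 = zeros(1,n);           % start the chain at (0,0)<br />
sd = sqrt(1 - rho^2);                        % conditional standard deviation<br />
for i = 2:n<br />
    x1(i) = mu1 + rho*(x2(i-1) - mu2) + sd*randn;   % draw from f(x1|x2)<br />
    x2(i) = mu2 + rho*(x1(i)   - mu1) + sd*randn;   % draw from f(x2|x1)<br />
end<br />
mean(x1(1000:end))                           % approximately 1<br />
mean(x2(1000:end))                           % approximately 2<br />
corrcoef(x1(1000:end), x2(1000:end))         % off-diagonal entries approximately 0.9<br />
</pre><br />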
<br />
=='''Principal Component Analysis (PCA) - November 8, 2011'''==<br />
<br />
Principal component analysis is a century-old technique used for the dimensionality reduction of data. As the dimension increases, the number of data points needed to sample the space accurately grows exponentially.<br />
<br />
<math>\, x\in \mathbb{R}^D \rightarrow y\in \mathbb{R}^d</math><br />
<br />
<math>\ d \le D </math><br />
<br />
We want to transform <math>\, x</math> to <math>\, y</math> by reducing dimensionality yet losing little information.<br />
<br />
For example, consider dots in a three dimensional space. By unrolling the 2D manifold that they lie on, we can reduce the data to 2D while losing little information. Note: This is not an application of PCA, but it simply illustrates one way we can reduce dimensionality.<br />
<br />
Principal Component Analysis lets us reduce data to a linear subspace of its original space. It works best when the data lies in, or close to, a lower dimensional linear subspace of its original space.<br />
<br />
<br />
'''Probabilistic View'''<br />
<br />
We can see data set <math>\, x</math> as a high dimensional random variable governed by a low dimensional random variable <math>\, y</math>. Given <math>\, x</math>, we are trying to estimate <math>\, y</math>.<br />
<br />
We can see this in 2D linear regression, as the locations of data points in a scatter plot are governed by its approximate linear regression. The subspace that we have reduced the data to here is in the direction of variation in the data.<br />
<br />
'''Principal Component Analysis'''<br />
<br />
Principal component analysis is an orthogonal linear transform on a data set. It transforms the data coordinates to associate with a new set of orthogonal vectors, each representing the direction of maximum variance of the data. E.g. the first principal component is the direction of maximum variance, the second principal component is the direction of maximum variance orthogonal to the first vector, the third principal component is the direction of maximum variance orthogonal to the first and second vectors, and so on, until we have D vectors, where D is the dimension of the original data.<br />
<br />
Suppose we have data represented by <math>\, X = \begin{bmatrix}<br />
x^1\\<br />
x^2\\<br />
\vdots \\ <br />
x^D<br />
\end{bmatrix}<br />
\in \mathbb{R}^{D \times n} </math><br />
<br />
For some <math>\, W = \begin{bmatrix}<br />
w^1\\<br />
w^2\\<br />
\vdots \\ <br />
w^D<br />
\end{bmatrix}<br />
\in \mathbb{R}^{D} </math><br />
<br />
The projection of the data onto the direction <math>\, W</math> is<br />
<br />
<math>\, W^TX = w^1x^1 + w^2x^2 + \cdots + w^Dx^D</math><br />
<br />
To find the first principal component, we want to maximize the variance of <math>\,W^TX</math>.<br />
<br />
The variance of <math>\,W^TX</math> is <math>\,W^TSW</math> where <math>\,S</math> is the covariance matrix of X.<br />
<br />
<math>\, S = \operatorname{E}\left[(x-\mu)(x-\mu)^T\right]</math>, which in practice is estimated by the sample covariance of the data.<br />
<br />
<br />
So we have to solve the problem<br />
<br />
<math>\, \text {Max } W^TSW</math><br />
<br />
<math>\, \text{such that } W^TW = 1</math>.<br />
<br />
<br />
We restrict W to unit vectors, as otherwise the maximum is unbounded. We are only looking for the direction of the vector; its actual scale is unnecessary.<br />
<br />
Using the method of Lagrange multipliers, we have<br />
<br />
<math>\,L(W, \lambda) = W^TSW - \lambda(W^TW - 1) </math><br />
<br />
We set<br />
<br />
<math>\, \frac{\partial L}{\partial W} = 0 </math><br />
<br />
<br />
<br />
Note that <math>\, W^TSW</math> is a quadratic form. So we have<br />
<br />
<br />
<br />
<math>\, \frac{\partial L}{\partial W} = 2SW - 2\lambda W = 0 </math><br />
<br />
<math>\, SW = \lambda W </math><br />
<br />
Since S is a matrix and <math>\, \lambda</math> is a scalar, W is an eigenvector of S and <math>\, \lambda</math> is its corresponding eigenvalue.<br />
<br />
Suppose that<br />
<br />
<math>\, \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d</math><br />
are eigenvalues of S and <math>\, u_1, u_2, \cdots u_d</math> are their corresponding eigenvectors.<br />
<br />
We want to choose some <math>\, W = u </math><br />
<br />
<math>\,u^TSu =u^T\lambda u = \lambda u^Tu = \lambda</math><br />
<br />
So to maximize <math>\, u^TSu</math>, choose the eigenvector corresponding to the maximum eigenvalue, i.e. <math>\, u_1</math>.<br />
<br />
So we let <math>\, W = u_1 </math> be the first principal component.<br />
<br />
The principal components decompose the total variance in the data:<br />
<br />
<math>\, \sum_{i=1}^D \text{Var}(u_i^Tx) = \sum_{i=1}^D \lambda_i = \text{Tr}(S) = \sum_{i=1}^D \text{Var}(x^i)</math><br />
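<br />
A minimal MATLAB sketch of PCA on synthetic 2-D data is given below (the data-generating recipe is an arbitrary choice used only to produce correlated data): the principal components are obtained as eigenvectors of the sample covariance matrix, and the eigenvector with the largest eigenvalue is the first principal component.<br />
<br />
<pre><br />
% Minimal sketch: PCA via eigendecomposition of the sample covariance matrix<br />
n = 1000;<br />
X = randn(2,n);<br />
X(2,:) = 0.9*X(1,:) + 0.3*X(2,:);     % synthetic correlated data, D = 2<br />
mu = mean(X,2);<br />
Xc = X - repmat(mu,1,n);              % centre the data<br />
S = (Xc*Xc')/n;                       % sample covariance matrix<br />
[U, Lambda] = eig(S);<br />
[lambda, idx] = sort(diag(Lambda), 'descend');<br />
U = U(:,idx);                         % columns ordered by decreasing variance<br />
u1 = U(:,1)                           % first principal component<br />
Y = u1' * Xc;                         % 1-D representation of the data<br />
var(Y)                                % approximately lambda(1)<br />
</pre><br />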
<br />
<br><br />
===Singular Value Decomposition===<br />
Singular value decomposition is a "generalization" of eigenvalue decomposition "to rectangular matrices of size ''mxn''."<ref name="Abdel_SVD">Abdel-Rahman, E. (2011). Singular Value Decomposition [Lecture notes]. Retrieved from http://uwace.uwaterloo.ca</ref> Singular value decomposition solves:<br><br><br />
:<math>\ A_{mxn}\ v_{nx1}=s\ u_{mx1}</math><br><br><br />
"for the right singular vector ''v'', the singular value ''s'', and the left singular vector ''u''. There are ''n'' singular values ''s''<sub>''i''</sub> and ''n'' right and left singular vectors that must satisfy the following conditions"<ref name="Abdel_SVD"/>:<br />
# "All singular values are non-negative"<ref name="Abdel_SVD"/>, <br> <math>\ s_i \ge 0.</math><br />
# All "right singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>, <br> <math>\ v_iv_j=\delta_{i,j}.</math><br />
# All "left singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>, <br> <math>\ u_iu_j=\delta_{i,j}.</math><br />
where<br />
:<math>\delta_{i,j}=\left\{\begin{matrix}1 & \mathrm{if}\ i=j \\ 0 & \mathrm{if}\ i\neq j\end{matrix}\right.</math><br><br><br />
<br />
'''Procedure to find the singular values and vectors'''<br><br />
Observe the following about the eigenvalue decomposition of a real square matrix ''A'' where ''v'' is the unit eigenvector:<br><br />
::<math><br />
\begin{align}<br />
& Av=\lambda v \\<br />
& (Av)^T=(\lambda v)^T \\<br />
& (Av)^TAv=(\lambda v)^T\lambda v \\<br />
& v^TA^TAv=\lambda^2v^Tv \\<br />
& vv^TA^TAv=v\lambda^2 \\<br />
& A^TAv=\lambda^2v<br />
\end{align}<br />
</math><br />
As a result:<br />
# "The matrices ''A'' and ''A''<sup>''T''</sup>''A'' have the same eigenvectors."<ref name="Abdel_SVD"/><br />
# "The eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are the square of the eigenvalues of matrix ''A''."<ref name="Abdel_SVD"/><br />
# Since matrix ''A''<sup>''T''</sup>''A'' is symmetric,<br />
## "all the eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are real and distinct."<ref name="Abdel_SVD"/><br />
## "the eigenvectors of matrix ''A''<sup>''T''</sup>''A'' are orthogonal and can be chosen to be orthonormal."<ref name="Abdel_SVD"/><br />
# "The eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are non-negative"<ref name="Abdel_SVD"/> since <math>\ \lambda^2_i \ge 0.</math><br />
Conclusions 3 and 4 are "true even for a rectangular matrix ''A'' since ''A''<sup>''T''</sup>''A'' is still a square symmetric matrix"<ref name="Abdel_SVD"/> and its eigenvalues and eigenvectors can be found.<br><br><br />
Therefore, for a rectangular matrix ''A'', assuming ''m>n'', the singular values and vectors can be found by:<br />
# "Form the ''nxn'' symmetric matrix ''A''<sup>''T''</sup>''A''."<ref name="Abdel_SVD"/><br />
# Perform an eigenvalue decomposition to get ''n'' eigenvalues and their "corresponding eigenvectors, ordered such that"<ref name="Abdel_SVD"/> <br><math>\lambda^2_1 \ge \lambda^2_2 \ge \dots \ge \lambda^2_n \ge 0</math> and <math>\{v_1, v_2, \dots, v_n\}.</math><br />
# "The singular values are"<ref name="Abdel_SVD"/>: <br><math>s_1=\sqrt{\lambda^2_1} \ge s_2=\sqrt{\lambda^2_2} \ge \dots \ge s_n=\sqrt{\lambda^2_n} \ge 0.</math><br>"The non-zero singular values are distinct; the equal sign applies only to the singular values that are equal to zero."<ref name="Abdel_SVD"/><br />
# "The ''n''-dimensional right singular vectors are"<ref name="Abdel_SVD"/><br><math>\{v_1, v_2, \dots, v_n\}.</math><br />
# "For the first <math>r \le n</math> singular values such that ''s''<sub>''i''</sub> ''> 0'', the left singular vectors are obtained as unit vectors"<ref name="Abdel_SVD"/> by <math>\tfrac{1}{s_i}Av_i=u_i.</math><br />
# Select "the <math>\ m-r</math> left singular vectors corresponding to the zero singular values such that they are unit vectors orthogonal to each other and to the first ''r'' left singular vectors"<ref name="Abdel_SVD"/> <math>\{u_1, u_2, \dots, u_r\}.</math><br><br><br />
<br />
'''Finding the Singular Value Decomposition Using MATLAB Code'''<br />
Please refer to the following link: http://www.mathworks.com/help/techdoc/ref/svd-singular-value-decomposition.html<br />
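As a brief illustration, the following sketch (using a small made-up 4 by 2 matrix) calls MATLAB's built-in svd function and checks that the product of the returned factors reproduces the original matrix:<br />
<br />
<pre><br />
A = [3 1; 1 3; 2 2; 0 1];      % hypothetical 4-by-2 rectangular matrix<br />
[U, S, V] = svd(A);            % U is 4x4, S is 4x2, V is 2x2<br />
diag(S)'                       % the singular values, in decreasing order<br />
norm(A - U*S*V')               % reconstruction error, essentially zero<br />
</pre><br />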
<br />
'''Formal definition'''<br><br />
"We can now decompose the rectangular matrix ''A'' in terms of singular values and vectors as follows"<ref name="Abdel_SVD"/>:<br><br><br />
<math>A_{mxn} \begin{bmatrix} v_1 & | & \cdots & | & v_n \end{bmatrix}_{nxn} = \begin{bmatrix} u_1 & | & \cdots & | & u_n & | & u_{n+1} & | & \cdots & | & u_m \end{bmatrix}_{mxm} \begin{bmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_n \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}_{mxn}</math><br><br />
:<math>\ AV=US</math><br><br><br />
Since "the matrices ''V'' and ''U'' are orthogonal"<ref name="Abdel_SVD"/>, ''V ''<sup>''-1''</sup>=''V''<sup>T</sup> and ''U ''<sup>''-1''</sup>=''U''<sup>T</sup>:<br><br><br />
:<math>\ A=USV^T</math><br><br><br />
"which is the formal definition of the singular value decomposition."<ref name="Abdel_SVD"/><br><br><br />
<br />
'''Relevance to PCA'''<br><br />
In order to perform PCA, one needs to do eigenvalue decomposition on the covariance matrix. By transforming the mean for all attributes to zero, the covariance matrix can be simplified to:<br><br><br />
<math>\ S=XX^T</math><br><br><br />
Since the eigenvalue decomposition of ''A''<sup>''T''</sup>''A'' gives the same eigenvectors as the singular value decomposition of ''A'', the singular value decomposition of ''X'' provides an additional, and more reliable, method for performing PCA (if a matrix's matrix of eigenvectors is not invertible, i.e. the matrix is not diagonalizable, then its eigenvalue decomposition does not exist).<br />
<br />
The following MATLAB code uses singular value decomposition for performing PCA; 20 principal components, and thus the top 20 maximum variation directions, are selected for reconstructing facial images that have had noise applied to them:<br />
<br />
load noisy.mat<br />
%first noisy image; each image has a resolution of 20x28<br />
imagesc(reshape(X(:,1),20,28)')<br />
%to grayscale<br />
colormap gray<br />
%singular value decomposition <br />
[u s v]=svd(X);<br />
%reduced feature space: 20 principal components<br />
Xh=u(:,1:20)*s(1:20,1:20)*v(:,1:20)';<br />
figure<br />
imagesc(reshape(Xh(:,1),20,28)')<br />
colormap gray<br />
<br />
Since the image reconstructed from the reduced feature space is essentially noiseless, the added noise must account for less of the variation than the top 20 principal components do.<br />
<br />
=='''References'''==<br />
<br />
<references/><br />
<br />
==''' PCA and Introduction to Kernel Function-November,10,2011'''==<br />
===Continue with the last lecture===<br />
Some notations:<br />
Let <math>\displaystyle X_{d\times n}</math> be a matrix. <br />
<br />
Let <math>\displaystyle X_j,j=1,2,...,n</math> be the j-th data point, with <math>\displaystyle X_j\in\R^d</math>.<br />
<br />
Let <math>\displaystyle Q=\sum_{j=1}^n(X_j-\bar{X})(X_j-\bar{X})^T</math>, where <math> \bar{X}=\frac{1}{n}\sum_{j=1}^n X_j</math>.<br />
<br />
But now, we are assuming that we have already centered the data, which means our <math>\displaystyle Q=\sum_{j=1}^n(X_j)(X_j)^T=X X^T </math>.<br />
<br />
*Find the principal components, which means finding the eigenvectors of Q, or equivalently doing the singular value decomposition, [u s v]=svd(X), where the columns of u are eigenvectors of <math>\displaystyle Q=X X^T</math>.<br />
<br />
*Map the data into a lower dimensional space.<br />
We can choose the first p (p<d) eigenvectors, which means <math>\displaystyle u^T</math> is a <math>\displaystyle p\times d</math> matrix.<br />
Thus, we can project our original data points <math>\displaystyle x_j</math> into p dimensions.<br />
Mathematically, it is <math>\displaystyle Y_{p\times n}={u^T}_{p\times d} X_{d\times n}</math>. This means that we can reduce our original d variables to p principal components.<br />
<br />
*Reconstruct points.<br />
We can also project the dimension-reduced data back into the high dimensional space.<br />
However, we will lose some information, because when we map the points into the lower dimensional space we throw away the last (d-p) eigenvectors, which contain some of the original information.<br />
Since the columns of <math>\displaystyle u</math> are orthonormal, the reconstruction is <math> \hat{x}_{d\times n}=u_{d\times p} Y_{p\times n}=u_{d\times p}{u^T}_{p\times d} x_{d\times n} \approx x_{d\times n} </math>.<br />
<br />
*Map a new data point into the lower dimensional space, <math>\displaystyle y_{p\times 1}={u^T}_{p\times d}\, x_{d\times 1}</math>, and reconstruct it in the high dimension, <math>\displaystyle \hat{x}_{d\times 1}=u_{d\times p}\, y_{p\times 1}</math> (a short MATLAB sketch of these steps is given below).<br />
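Here is a minimal MATLAB sketch of the projection and reconstruction steps (the data matrix, the dimensions, and the choice of p below are all made up for illustration, and the data are assumed to be already centered):<br />
<br />
<pre><br />
d = 10; n = 50; p = 2;            % illustrative sizes<br />
X = randn(d, n);                  % hypothetical centered data matrix<br />
[u s v] = svd(X);                 % columns of u are eigenvectors of X*X'<br />
Y = u(:,1:p)' * X;                % project onto the first p principal components<br />
Xhat = u(:,1:p) * Y;              % reconstruct the data in d dimensions<br />
xnew = randn(d, 1);               % a hypothetical new data point<br />
ynew = u(:,1:p)' * xnew;          % map the new point to p dimensions<br />
xnew_hat = u(:,1:p) * ynew;       % reconstruct the new point<br />
</pre><br />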
<br />
===3 and 2 digits example===<br />
The data X is a 64 by 400 matrix. Every column can be displayed as an 8 by 8 image of either a "2" or a "3". The first 200 columns are "2"s and the last 200 columns are "3"s.<br />
We can first center the data, and then take the first p (p<d) columns of u from the singular value decomposition.<br />
<br />
MATLAB CODE:<br />
MU=repmat(mean(X,2),1,400);<br />
% mean(X,2) is the average of each row<br />
% In order to center the data, we expand mean(X,2), which is a 64 by 1 vector, into a 64 by 400 matrix<br />
Xt=X-MU;<br />
% modify the data to zero-mean data<br />
[u s v]=svd(Xt);<br />
% note that size(u)=64*64, and the columns of u are eigenvectors of the covariance matrix<br />
Y=u(:,1:2)'*X;<br />
% use the first two PCs to map the high dimensional points to lower dimensions<br />
One way to look at this is to plot Principal Component #1 against Principal Component #2 in a two dimensional space.<br />
plot(Y(1,:)',Y(2,:)')<br />
The result is as follows; we can clearly see that there are two classes.<br />
<br />
[[file:pca2.png|350px|400px]]<br />
<br />
To examine the difference between these two classes further, we can separate the first 200 columns from the last 200 columns to see whether there is a significant difference between the two types of digits.<br />
plot(Y(1,1:200)',Y(2,1:200)','d')<br />
% Note that the first 200 columns represent the digit "2", and are plotted as diamonds<br />
hold on<br />
% draw different graphs in one figure<br />
plot(Y(1,201:400)',Y(2,201:400)','ro')<br />
% Note that the last 200 columns represent the digit "3", and are plotted as red circles<br />
<br />
[[file:pca3.png|350px|400px]]<br />
<br />
image=reshape(X,8,8,400);<br />
plotdigits(image,Y,.1,1);<br />
The result can be seen more clearly in the following picture.<br />
It is clear that the "3"s and the "2"s are separated.<br />
<br />
[[file:Pca.png|350px|400px]]<br />
<br />
===Introduction to Kernel Function===<br />
PCA is useful when the data points lie in, or close to, a hyperplane; that is, PCA is powerful when dealing with linear structure. When the data points lie on a nonlinear manifold, however, plain PCA performs poorly. There is a solution to this problem: we can use a "trick" to turn a nonlinear classification problem into a linear one. This is called the "Kernel Trick".<br />
<br />
'''An intuitive example'''<br />
<br />
[[File:Kernel trick.png|400px|300px]]<br />
<br />
From the picture, we can see that the red dots are in the middle of the blue ones. Hence, it is hard to separate the two classes with any line (that is, with a linear boundary in the two dimensional space). But we can pull the red ones out of the two dimensional space to form a three dimensional space, in which case we can easily tell them apart.<br />
<br />
For more details about this trick, please see http://omega.albany.edu:8008/machine-learning-dir/notes-dir/ker1/ker1.pdf<br />
<br />
More precisely, the significance of a kernel function is that it allows us to work with the data in a higher dimensional space implicitly.<br />
Let's look at how this is possible:<br />
<br />
<math>Z_1=<br />
\begin{bmatrix}<br />
x_1\\<br />
y_1<br />
\end{bmatrix}\xrightarrow{\phi}<br />
</math><br />
<math>\phi(Z_1)=<br />
\begin{bmatrix}<br />
x_1^2\\<br />
y_1^2\\<br />
\sqrt2x_1y_1<br />
\end{bmatrix}.<br />
<br />
</math><br />
<math>Z_2=<br />
\begin{bmatrix}<br />
x_2\\<br />
y_2<br />
\end{bmatrix}\xrightarrow{\phi}<br />
</math><br />
<math>\phi(Z_2)=<br />
\begin{bmatrix}<br />
x_2^2\\<br />
y_2^2\\<br />
\sqrt2x_2y_2<br />
\end{bmatrix}<br />
</math><br />
<br />
The inner product of <math>\displaystyle \phi(Z_1)</math> and <math>\displaystyle\phi(Z_2)</math>, which is denoted <math>\displaystyle\phi(Z_1)^T\phi(Z_2)</math>, is equal to:<br />
<math><br />
\begin{bmatrix}<br />
x_1^2&y_1^2&\sqrt2x_1y_1 <br />
\end{bmatrix}<br />
\begin{bmatrix}<br />
x_2^2\\<br />
y_2^2\\<br />
\sqrt2x_2y_2 <br />
\end{bmatrix}=</math> <math>\displaystyle (x_1x_2+y_1y_2)^2=K(Z_1,Z_2)</math>.<br />
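As a quick numerical check of this identity, the following MATLAB sketch (with two arbitrary made-up points) compares the explicit inner product in the three dimensional feature space with the kernel value <math>\displaystyle (x_1x_2+y_1y_2)^2</math> computed in the original space:<br />
<br />
<pre><br />
Z1 = [0.3; -1.2];  Z2 = [2.0; 0.7];              % two hypothetical 2-D points<br />
phi = @(z) [z(1)^2; z(2)^2; sqrt(2)*z(1)*z(2)];  % the explicit feature map<br />
phi(Z1)' * phi(Z2)                               % inner product in the feature space<br />
(Z1' * Z2)^2                                     % kernel value in the original space; the two agree<br />
</pre><br />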
<br />
'''The most common kernel functions are as follows:'''<br />
*Linear: <math>\displaystyle K_{ij}=<X_i,X_j></math><br />
*Polynomial: <math>\displaystyle K_{ij}=(1+<X_i,X_j>)^p</math><br />
*Gaussian: <math>\displaystyle K_{ij}=e^{-\frac{{\left\Vert X_i-X_j\right\Vert}^2}{2\sigma^2}}</math>,<br />
where <math>\displaystyle <X_i,X_j></math> denotes the inner product of <math>\displaystyle X_i</math> and <math>\displaystyle X_j</math>, and <math>{\left\Vert X_i-X_j\right\Vert}^2</math> denotes the squared Euclidean distance between the vectors <math>\displaystyle X_i</math> and <math>\displaystyle X_j</math>.<br />
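The sketch below (a rough illustration with a made-up data matrix whose columns are the points, and arbitrary values for p and sigma) builds the corresponding n by n kernel matrices in MATLAB:<br />
<br />
<pre><br />
X = randn(2, 30);                     % hypothetical data: 30 points in 2 dimensions<br />
p = 2;  sigma = 1;                    % illustrative kernel parameters<br />
G = X' * X;                           % Gram matrix of pairwise inner products<br />
K_linear = G;                         % linear kernel<br />
K_poly   = (1 + G).^p;                % polynomial kernel<br />
sq = sum(X.^2, 1);                    % squared norm of each column<br />
D  = bsxfun(@plus, sq', sq) - 2*G;    % pairwise squared distances<br />
K_gauss  = exp(-D / (2*sigma^2));     % Gaussian kernel<br />
</pre><br />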
<br />
<br />
==''' Kernel PCA -November,15,2011'''==<br />
<br />
First we look at the algorithm for PCA and see how we can kernelize PCA:<br />
<br />
Find the eigenvectors of <math>XX^T</math> and collect them as the columns of U. Then project and reconstruct:<br />
<math><br />
\begin{align}<br />
Y &= U^TX \\<br />
\hat{X} &= UY<br />
\end{align}<br />
</math><br />
<br />
To solve PCA with the singular value decomposition:<br />
<math><br />
\begin{align}<br />
[U, \Sigma, V] &= \mathrm{svd}(X) \\<br />
X &= U\Sigma{V^T}<br />
\end{align}<br />
</math><br />
The columns of U are the eigenvectors of <math>XX^T</math>.<br />
The columns of V are the eigenvectors of <math>X^T{X}</math>.<br />
<br />
Now we want to kernelize this classical version of PCA.<br />
<br />
We would like to express everything in terms of V, whose columns are the eigenvectors of <math>X^T{X}</math>, since <math>X^T{X}</math> can be kernelized. This is called Dual PCA.<br />
<br />
<math><br />
\begin{align}<br />
X &= U\Sigma V^T \\<br />
XV &= U\Sigma V^T{V} = U\Sigma \\<br />
U &= XV\Sigma^{-1}<br />
\end{align}<br />
</math><br />
<br />
Find the eigenvectors of <math>X^TX</math> and collect them as the columns of V. The projection is then<br />
<math><br />
\begin{align}<br />
X &= U\Sigma V^T \\<br />
U^T{X} &= U^T{U}\Sigma V^T \\<br />
U^T{X} &= \Sigma{V^T} \\<br />
Y &= \Sigma{V^T}<br />
\end{align}<br />
</math><br />
<br />
Reconstruct the points:<br />
<br />
<math><br />
\begin{align}<br />
\hat{X} &= UY \\<br />
\hat{X} &= XV\Sigma^{-1}\Sigma{V^T} \\<br />
\hat{X} &= XVV^T<br />
\end{align}<br />
</math><br />
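A rough MATLAB sketch of this dual formulation (with a small made-up centered data matrix; the variable names are for illustration only) checks that the primal projection U^T X and the dual projection Sigma V^T coincide:<br />
<br />
<pre><br />
X = randn(5, 20);  X = X - repmat(mean(X,2), 1, 20);   % hypothetical centered data<br />
[U, S, V] = svd(X, 'econ');                            % economy-size SVD of X<br />
Y_primal = U' * X;                                     % primal projection  Y = U'X<br />
Y_dual   = S * V';                                     % dual projection    Y = Sigma V'<br />
norm(Y_primal - Y_dual)                                % essentially zero: the two coincide<br />
</pre><br />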
<br />
Map an out-of-sample point x to the low-dimensional space:<br />
<math>y=U^Tx = (XV\Sigma^{-1})^Tx = \Sigma^{-1}{V^T}{X^T}x</math><br />
Reconstruct an out-of-sample point:<br />
<math>\hat{x}=Uy=XV\Sigma^{-1}\Sigma^{-1}{V^T}{X^T}x = XV\Sigma^{-2}{V^T}{X^T}x</math></div>S9huhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat341f11&diff=14853stat341f112011-11-15T18:09:43Z<p>S9hu: /* Continue with the last lecture */</p>
<hr />
<div>Please contribute to the discussion of splitting up this page into multiple pages on the [[{{TALKPAGENAME}}|talk page]].<br />
<br />
==[[signupformStat341F11| Editor Sign Up]]==<br />
<br />
==Notation==<br />
<br />
The following guidelines on notation were posted on the Wiki Course Note page for [[Stat946f11|STAT 946]]. Add to them as necessary for consistent notation on this page.<br />
<br />
Capital letters will be used to denote random variables and lower case letters denote observations for those random variables:<br />
<br />
* <math>\{X_1,\ X_2,\ \dots,\ X_n\}</math> random variables<br />
* <math>\{x_1,\ x_2,\ \dots,\ x_n\}</math> observations of the random variables<br />
<br />
The joint ''probability mass function'' can be written as:<br />
<center><math> P( X_1 = x_1, X_2 = x_2, \dots, X_n = x_n )</math></center><br />
or as shorthand, we can write this as <math>p( x_1, x_2, \dots, x_n )</math>. In these notes both types of notation will be used.<br />
We can also define a set of random variables <math>X_Q</math> where <math>Q</math> represents a set of subscripts.<br />
<br />
<br />
==Sampling - September 20, 2011==<br />
<br />
The meaning of sampling is to generate data points or numbers such that these data follow a certain distribution.<br /><br />
i.e. From <math>x \sim~f(x)</math> sample <math>\,x_{1}, x_{2}, ..., x_{1000}</math><br />
<br />
In practice, it maybe difficult to find the joint distribution of random variables. Through simulating the random variables, we can make an inference from the data.<br />
<br />
===Sampling from Uniform Distribution===<br />
Computers cannot generate random numbers as they are deterministic; however they can produce pseudo random numbers using algorithms. Generated numbers mimic the properties of random numbers but they are never truly random. One famous algorithm that is considered highly reliable is the Mersenne twister[http://en.wikipedia.org/wiki/Mersenne_twister], which generates random numbers in an almost uniform distribution. <br />
<br />
<br />
====Multiplicative Congruential====<br />
*involves four parameters: integers <math>\,a, b, m</math>, and an initial value <math>\,x_0</math> which we call the seed<br />
*a sequence of integers is defined as<br />
:<math>x_{k+1} \equiv (ax_{k} + b) \mod{m}</math><br />
<br />
'''Example:''' <math>\,a=13, b=0, m=31, x_0=1</math> creates a uniform histogram.<br />
<br />
MATLAB code for generating 1000 random numbers using the multiplicative congruential method:<br />
<br />
<pre><br />
a = 13;<br />
b = 0;<br />
m = 31;<br />
x(1) = 1;<br />
<br />
for ii = 2:1000<br />
x(ii) = mod(a*x(ii-1)+b, m);<br />
end<br />
</pre><br />
<br />
MATLAB code for displaying the values of x generated:<br />
<br />
<pre><br />
x<br />
</pre><br />
<br />
MATLAB code for plotting the histogram of x:<br />
<br />
<pre><br />
hist(x)<br />
</pre><br />
<br />
Histogram Output:<br />
<br />
[[File:uniform.jpg]]<br />
<br />
Facts about this algorithm:<br />
*In this example, the first 30 terms in the sequence are a permutation of integers from 1 to 30 and then the sequence repeats itself.<br />
*Values are between <b>0</b> and <b>m-1</b>, inclusive.<br />
*Dividing the numbers by <b> m-1 </b> yields numbers in the interval <b>[0,1]</b>.<br />
*MATLAB's <code>rand</code> function once used this algorithm with <b>a= 7<sup>5</sup></b>, <b>b= 0</b>, <b>m= 2<sup>31</sup>-1</b>,for reasons described in Park and Miller's 1988 paper "Random Number Generators: Good Ones are Hard to Find" (available [http://www.firstpr.com.au/dsp/rand31/p1192-park.pdf online]).<br />
*Visual Basic's <code>RND</code> function also used this algorithm with <b>a= 1140671485</b>, <b>b= 12820163</b>, <b>m= 2<sup>24</sup></b>. ([http://support.microsoft.com/kb/231847 Reference])<br />
<br />
===Inverse Transform Method===<br />
This is a basic method for sampling. Theoretically using this method we can generate sample numbers at random from any probability distribution once we know its cumulative distribution function (cdf).<br />
<br />
====Theorem====<br />
Take <math>U \sim~ \mathrm{Unif}[0, 1]</math> and let <math>X = F^{-1}(U) </math>. Then <math>X</math> has distribution function <math>F(\cdot)</math>, where <math>F(x)=P(X \leq x)</math> and <math>F^{-1}(\cdot)</math> is the inverse of <math>F(\cdot)</math>.<br />
<br />
Therefore <math>F(x)=u\implies x=F^{-1}(u)</math><br />
<br />
'''Proof'''<br />
<br />
Recall that<br />
<br />
:<math>P(a \leq X<b)=\int_a^{b} f(x) dx</math><br />
<br />
:<math>cdf=F(x)=P(X \leq x)=\int_{-\infty}^{x} f(x) dx</math><br />
<br />
Note that if <math>U \sim~ \mathrm{Unif}[0, 1]</math>, we have <math>P(U \leq a)=a</math><br />
<br />
:<math>\begin{align}<br />
<br />
P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\<br />
&{}= P(U \leq F(x)) \\<br />
&{}= F(x)<br />
\end{align}</math><br />
<br />
Completing the proof.<br />
<br />
====Continuous Case====<br />
Generally it takes two steps to get random numbers using this method.<br />
<br />
*Step 1. Draw <math>U \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <b><i>X=F <sup>&minus;1</sup>(U)</i></b><br />
<br />
'''Example'''<br />
<br />
Take the exponential distribution for example<br />
:<math>\,f(x)={\lambda}e^{-{\lambda}x}</math><br />
:<math>\,F(x)=\int_0^x {\lambda}e^{-{\lambda}u} du=[-e^{-{\lambda}u}]_0^x=1-e^{-{\lambda}x}</math><br />
<br />
Let: <math>\,F(x)=y</math><br />
:<math>\,y=1-e^{-{\lambda}x}</math><br />
:<math>\,ln(1-y)={-{\lambda}x}</math><br />
:<math>\,x=\frac{ln(1-y)}{-\lambda}</math><br />
:<math>\,F^{-1}(x)=\frac{-ln(1-x)}{\lambda}</math><br />
<br />
Therefore, to get an exponential distribution from a uniform distribution takes 2 steps.<br />
*Step 1. Draw <math>U \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <math>x=\frac{-ln(1-U)}{\lambda}</math><br />
<br />
Note: If U~Unif[0, 1], then (1 - U) and U have the same distribution. This allows us to slightly simplify step 2 into an alternate form:<br />
*Alternate Step 2. <math>x=\frac{-ln(U)}{\lambda}</math><br />
<br />
'''MATLAB code'''<br />
for the exponential distribution case, assuming <math>\lambda=0.5</math><br />
<br />
<pre><br />
for ii = 1:1000<br />
u = rand;<br />
x(ii) = -log(1-u)/0.5;<br />
end<br />
hist(x)<br />
</pre><br />
<br />
MATLAB result<br />
<br />
[[File:MATLAB_Exp.jpg|center|300px]]<br />
<br />
====Discrete Case - September 22, 2011====<br />
This same technique can be applied to the discrete case. Generate a discrete random variable <math>\,x</math> that has probability mass function <math>\,P(X=x_i)=P_i </math> where <math>\,x_0<x_1<x_2...</math> and <math>\,\sum_i P_i=1</math><br />
*Step 1. Draw <math>u \sim~ \mathrm{Unif}[0, 1]</math><br />
*Step 2. <math>\,x=x_i</math> if <math>\,F(x_{i-1})<u \leq F(x_i)</math><br />
<br />
'''Example'''<br />
<br />
Let x be a discrete random variable with the following probability mass function:<br />
<br />
:<math>\begin{align}<br />
P(X=0) = 0.3 \\<br />
P(X=1) = 0.2 \\<br />
P(X=2) = 0.5<br />
\end{align}</math><br />
<br />
Given the pmf, we now need to find the cdf.<br />
<br />
We have:<br />
<br />
:<math><br />
F(x) = \begin{cases}<br />
0 & x < 0 \\<br />
0.3 & 0 \leq x < 1 \\<br />
0.5 & 1 \leq x < 2 \\<br />
1 & 2 \leq x<br />
\end{cases}</math><br />
<br />
We can apply the inverse transform method to obtain our random numbers from this distribution.<br />
<br />
'''Pseudo Code for generating the random numbers:'''<br />
<pre><br />
Draw U ~ Unif[0,1] <br />
if U <= 0.3 <br />
return 0 <br />
else if 0.3 < U <= 0.5 <br />
return 1<br />
else if 0.5 < U <= 1 <br />
return 2<br />
</pre><br />
<br />
'''MATLAB code for generating 1000 random numbers in the discrete case:'''<br />
<br />
<pre><br />
for ii = 1:1000<br />
u = rand;<br />
<br />
if u <= 0.3<br />
x(ii) = 0;<br />
    elseif u <= 0.5<br />
x(ii) = 1;<br />
else<br />
x(ii) = 2;<br />
end<br />
end<br />
</pre><br />
<br />
Matlab Output:<br />
<br />
[[File:Discreteinv.jpg]]<br />
<br />
'''Pseudo code for the Discrete Case:'''<br />
<br />
1. Draw U ~ Unif [0,1]<br />
<br />
2. If <math> U \leq P_0 </math>, deliver <b><i>X= x<sub>0</sub></i></b><br />
<br />
3. Else if <math> U \leq P_0 + P_1 </math>, deliver <b><i>X= x<sub>1</sub></i></b><br />
<br />
4. Else If <math> U \leq P_0 +....+ P_k </math>, deliver <b><i>X= x<sub>k</sub></i></b><br />
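This general procedure can be written compactly in MATLAB using a cumulative sum; the sketch below is only an illustration, using the same pmf as the example above (the variable names are made up):<br />
<br />
<pre><br />
p = [0.3 0.2 0.5];                   % pmf P(X = x_i) from the example above<br />
vals = [0 1 2];                      % corresponding support points x_0, x_1, x_2<br />
F = cumsum(p);                       % cdf values F(x_0), F(x_1), F(x_2)<br />
x = zeros(1, 1000);<br />
for ii = 1:1000<br />
    u = rand;<br />
    x(ii) = vals(find(u <= F, 1));   % smallest x_i with F(x_{i-1}) < u <= F(x_i)<br />
end<br />
hist(x)<br />
</pre><br />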
<br />
====Limitations====<br />
<br />
Although this method is useful, it isn't practical in many cases since we can't always obtain <math>F</math> or <math> F^{-1} </math> as some functions are not integrable or invertible, and sometimes even <math>f(x)</math> itself cannot be obtained in closed form. Let's look at some examples:<br />
*Continuous case<br />
If we want to use this method to sample from the '''normal distribution''', we may find ourselves stuck when trying to find its ''cdf''. <br />
The simplest case of '''normal distribution''' is <math>f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}</math>,<br />
whose ''cdf'' is <math>F(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}{e^{-\frac{u^2}{2}}}du</math>. This integral cannot be expressed in terms of elementary functions. So evaluating it and then finding the inverse is a very difficult task.<br />
*Discrete case <br />
It is easy for us to simulate when there are only a few values taken by the particular random variable, like the case above.<br />
And it is easy to simulate the '''binomial distribution''' <math>X \sim~ \mathrm{B}(n,p)</math> when the parameter n is not too large.<br />
But when n takes on values that are very large, say 50, it is hard to do so.<br />
<br />
===Acceptance/Rejection Method===<br />
<br />
<br />
The aforementioned difficulties of the inverse transform method motivates a sampling method that does not require analytically calculating cdf's and their inverses, which is the acceptance/rejection sampling method. Here, <math> \displaystyle f(x)</math> is approximated by another function, say <math>\displaystyle g(x)</math>, with the idea being that <math>\displaystyle g(x)</math> is a "nicer" function to work with than <math>\displaystyle f(x)</math>.<br />
<br />
Suppose we assume the following:<br />
<br />
1. There exists another distribution <math>\displaystyle g(x)</math> that is easier to work with and that you know how to sample from, and<br />
<br />
2. There exists a constant c such that <math>f(x) \leq c \cdot g(x)</math> for all x<br />
<br />
Under these assumptions, we can sample from <math>\displaystyle f(x)</math> by sampling from <math>\displaystyle g(x)</math><br />
<br />
====General Idea====<br />
<br />
Looking at the image below we have graphed <math> c \cdot g(x) </math> and <math>\displaystyle f(x)</math>.<br />
<br />
[[File:Graph_updated.jpg]]<br />
<br />
Using the acceptance/rejection method we will accept some of the points from <math>\displaystyle g(x)</math> and reject some of the points from <math>\displaystyle g(x)</math>. The points that will be accepted from <math>\displaystyle g(x)</math> will have a distribution similar to <math>\displaystyle f(x)</math>. We can see from the image that the values around <math>\displaystyle x_1</math> will be sampled more often under <math>c \cdot g(x)</math> than under <math>\displaystyle f(x)</math>, so we will have to reject more samples taken at x<sub>1</sub>. Around <math>\displaystyle x_2</math> the number of samples that are drawn and the number of samples we need are much closer, so we accept more samples that we get at <math>\displaystyle x_2</math><br />
<br />
====Procedure====<br />
<br />
1. Draw y ~ g<br />
<br />
2. Draw U ~ Unif [0,1]<br />
<br />
3. If <math> U \leq \frac{f(y)}{c \cdot g(y)}</math> then x=y; else return to 1<br />
<br />
Note that the choice of <math> c </math> plays an important role in the efficiency of the algorithm. We want <math> c \cdot g(x) </math> to be "tightly fit" over <math> f(x) </math> to increase the probability of accepting points, and therefore reducing the number of sampling attempts. Mathematically, we want to minimize <math> c </math> such that <math>f(x) \leq c \cdot g(x) \ \forall x</math>. We do this by setting<br />
<br />
<math> \frac{d}{dx}(\frac{f(x)}{g(x)}) = 0 </math>, solving for a maximum point <math> x_0 </math> and setting <math> c = \frac{f(x_0)}{g(x_0)}. </math><br />
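When this maximization cannot be done analytically, a crude numerical approximation of <math> c </math> is often enough. The sketch below (using the Beta(2,1) target of the next example with a Unif(0,1) proposal; the grid is an arbitrary choice) illustrates this:<br />
<br />
<pre><br />
xx = linspace(0.001, 0.999, 10000);   % grid over the support, avoiding the endpoints<br />
f  = 2 * xx;                          % target density f(x) = 2x, i.e. Beta(2,1)<br />
g  = ones(size(xx));                  % proposal density g(x) = 1, i.e. Unif(0,1)<br />
c  = max(f ./ g)                      % approximately 2, matching the analytic value<br />
</pre><br />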
<br />
====Proof====<br />
<br />
Mathematically, we need to show that the sample points given that they are accepted have a distribution of f(x).<br />
<br />
<math>\begin{align} P(y|accepted) &= \frac{P(y, accepted)}{P(accepted)} \\<br />
<br />
&= \frac{P(accepted|y) P(y)}{P(accepted)}\end{align} </math> (Bayes' Rule)<br />
<br />
<br />
<br />
<math>\displaystyle P(y) = g(y)</math><br />
<br />
<math>P(accepted|y) =P(u\leq \frac{f(y)}{c \cdot g(y)}) =\frac{f(y)}{c \cdot g(y)} </math>,where u ~ Unif [0,1]<br />
<br />
<math>P(accepted) = \sum P(accepted|y)\cdot P(y)=\int^{}_y \frac{f(y)}{c \cdot g(y)}g(y) dy=\int^{}_y \frac{f(y)}{c} dy=\frac{1}{c} \cdot\int^{}_y f(y) dy=\frac{1}{c}</math><br />
<br />
So,<br />
<br />
<math> P(y|accepted) = \frac{ \frac {f(y)}{c \cdot g(y)} \cdot g(y)}{\frac{1}{c}} =f(y) </math><br />
<br />
====Continuous Case====<br />
<br />
'''Example'''<br />
<br />
Sample from Beta(2,1)<br />
<br />
In general:<br />
<br />
<math>\displaystyle \mathrm{Beta}(\alpha, \beta) = \frac{\Gamma (\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}</math>, <math>\displaystyle 0<x<1</math><br />
<br />
Note: <math>\!\Gamma(n) = (n-1)!</math> if n is a positive integer<br />
<br />
<math>\begin{align} f(x) &= Beta(2,1) \\<br />
&= \frac{\Gamma(3)}{\Gamma(2)\Gamma(1)} x^1(1-x)^0 \\<br />
&= \frac{2!}{1! 0!}\cdot (1) x \\<br />
&= 2x \end{align}</math><br />
<br />
We want to choose a <math>\displaystyle g(x)</math> that is easy to sample from, so we choose <math>\displaystyle g(x)</math> to be the uniform distribution on (0,1).<br />
<br />
We now want a constant c such that <math>f(x) \leq c \cdot g(x) </math> for all x from Unif(0,1)<br />
<br />
<br />
So,<br /><br />
<br />
<math>c \geq \frac{f(x)}{g(x)}</math>, for all x from (0,1)<br />
<br />
<br />
<math>\begin{align}c &\geq max (\frac {f(x)}{g(x)}, 0<x<1) \\<br />
<br />
<br />
&= max (\frac {2x}{1},0<x<1) \\<br />
<br />
<br />
&= 2 \end{align}</math><br />
<br />
<br />
<br />
Now that we have c =2,<br />
<br />
1. Draw y ~ g(x) => Draw y ~ Unif [0,1] <br />
<br />
2. Draw u ~ Unif [0,1] <br />
<br />
3. if <math>u \leq \frac{2y}{2 \cdot 1}</math> then x=y; else return to 1<br />
<br />
<br />
'''MATLAB code for generating 1000 samples following Beta(2,1):'''<br />
<br />
<pre><br />
close all<br />
clear all<br />
ii=1;<br />
while ii <= 1000<br />
y = rand;<br />
u = rand;<br />
<br />
if u <= y<br />
x(ii)=y;<br />
ii=ii+1;<br />
end<br />
end<br />
hist(x)<br />
</pre><br />
<br />
'''MATLAB result'''<br />
<br />
[[File:MATLAB_Beta.jpg]]<br />
<br />
====Discrete Example====<br />
<br />
Generate random variables according to the p.m.f:<br />
<br />
:<math>\begin{align}<br />
P(Y=1) = 0.15 \\<br />
P(Y=2) = 0.22 \\<br />
P(Y=3) = 0.33 \\<br />
P(Y=4) = 0.10 \\<br />
P(Y=5) = 0.20 <br />
\end{align}</math><br />
<br />
Take g(y) to be the discrete uniform distribution on the integers 1 to 5.<br />
<br />
<math>c \geq \frac{P(y)}{g(y)} </math><br><br />
<math>c = \max \left(\frac{P(y)}{g(y)} \right)</math><br><br />
<math>c = \max \left(\frac{0.33}{0.2} \right) = 1.65</math> Since P(Y=3) is the max of P(Y) and g(y) = 0.2 for all y.<br><br />
<br />
1. Generate Y according to the discrete uniform between 1 - 5<br />
<br />
2. U ~ unif[0,1]<br />
<br />
3. If <math>U \leq \frac{P(y)}{1.65 \times 0.2} = \frac{P(y)}{0.33} </math>, then x = y; else return to 1.<br />
<br />
In MATLAB, the code would be:<br />
<br />
py = [0.15 0.22 0.33 0.1 0.2];<br />
ii =1;<br />
while ii <= 1000<br />
y = unidrnd(5);<br />
u = rand;<br />
if u <= py(y)/0.33<br />
x(ii) = y;<br />
ii = ii+1;<br />
end<br />
end<br />
hist(x);<br />
<br />
MATLAB result<br />
<br />
[[File:MATLAB_Y.jpg]]<br />
<br />
====Limitations====<br />
<br />
Most of the time we have to sample many more points from g(x) before we can obtain an acceptable amount of samples from f(x), hence this method may not be computationally efficient. It depends on our choice of g(x). For example, in the example above to sample from Beta(2,1), we need roughly 2000 samples from g(X) to get 1000 acceptable samples of f(x).<br />
<br />
In addition, in situations where a g(x) function is chosen and used, there can be a discrepancy between the functional behaviors of f(x) and g(x) that render this method unreliable. For example, given the normal distribution function as g(x) and a function of f(x) with a "fat" mid-section and "thin tails", this method becomes useless as more points near the two ends of f(x) will be rejected, resulting in a tedious and overwhelming number of sampling points having to be sampled due to the high rejection rate of such a method.<br />
<br />
===Sampling From Gamma and Normal Distribution - September 27, 2011===<br />
<br />
====Sampling From Gamma====<br />
<br />
'''Gamma Distribution'''<br />
<br />
The Gamma distribution is written as <math>X \sim~ Gamma (t, \lambda) </math><br />
<br />
:<math> F(x) = \int_{0}^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy </math><br />
<br />
If you have t samples of the exponential distribution,<br><br />
<br> <math> \begin{align} X_1 \sim~ Exp(\lambda)\\ \vdots \\ X_t \sim~ Exp(\lambda) \end{align}<br />
</math><br />
<br />
The sum of these t samples has a gamma distribution,<br />
<br />
:<math> X_1+X_2+ ... + X_t \sim~ Gamma (t, \lambda) </math><br><br />
:<math> \sum_{i=1}^{t} X_i \sim~ Gamma (t, \lambda) </math> where <math>X_i \sim~Exp(\lambda)</math><br><br />
<br />
'''Method'''<br />
<br />
We can sample the exponential distribution using the inverse transform method from previous class,<br><br />
:<math>\,f(x)={\lambda}e^{-{\lambda}x}</math><br><br />
:<math>\,F^{-1}(u)=\frac{-ln(1-u)}{\lambda}</math><br><br />
:<math>\,F^{-1}(u)=\frac{-ln(u)}{\lambda}</math> <br />
1 - u has the same distribution as u since <math>U \sim~ unif [0,1] </math><br><br />
:<math> \begin{align} x_1 = \frac{-ln(u_1)}{\lambda}\\ \vdots \\ x_t = \frac{-ln(u_t)}{\lambda} \end{align}<br />
:</math><br><br />
:<math> x = x_1 + x_2 + \cdots + x_t = \frac {-\sum_{i=1}^{t} ln(u_i)}{\lambda}</math><br />
<br />
'''MATLAB code''' for a Gamma(3,1) is<br />
<br />
<pre><br />
x = sum(-log(rand(1000,3)),2); <br />
hist(x)<br />
</pre><br />
<br />
And the Histogram of X follows a Gamma distribution with long tail: <br />
<br />
[[File:Hist.PNG|center|500px]]<br />
<br />
We can improve the quality of the histogram by specifying the number of bins we want, as in hist(x, number_of_bins)<br />
<br />
<pre><br />
x = sum(-log(rand(20000,3)),2); <br />
hist(x,40)<br />
</pre><br />
<br />
[[File:untitled.jpg|center|500px]]<br />
<br />
''' R code''' for a Gamma(3,1) is<br />
<pre><br />
a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);<br />
hist(a);<br />
</pre><br />
And the histogram is <br />
<br />
[[File:hist_gamma.png|center|500px]]<br />
<br />
Here is another histogram of Gamma coding with R<br />
<pre><br />
a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);<br />
hist(a,freq=F);<br />
lines(density(a),col="blue");<br />
rug(jitter(a));<br />
</pre><br />
[[File:hist_gamma_2.png|center|500px]]<br />
<br />
====Sampling from Normal Distribution using Box-Muller Transform - September 29, 2011====<br />
<br />
=====Procedure=====<br />
<br />
# Generate <math>\displaystyle u_1</math> and <math>\displaystyle u_2</math>, two values sampled from a uniform distribution between 0 and 1.<br />
# Set <math>\displaystyle R^2 = -2log(u_1)</math> so that <math>\displaystyle R^2</math> is exponential with rate 1/2 (i.e. mean 2) <br> Set <math>\!\theta = 2*\pi*u_2</math> so that <math>\!\theta</math> ~ Unif[0, 2<math>\displaystyle\pi</math>]<br />
# Set <math>\displaystyle X = R cos(\theta)</math> <br> Set <math>\displaystyle Y = R sin(\theta)</math><br />
<br />
=====Justification=====<br />
<br />
Suppose we have X ~ N(0, 1) and Y ~ N(0, 1) where X and Y are independent normal random variables. The relative probability density function of these two random variables using Cartesian coordinates is:<br />
<br />
<math> f(X, Y) dxdy= f(X) f(Y) dxdy= \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \frac{1}{\sqrt{2\pi}}e^{-y^2/2} dxdy= \frac{1}{2\pi}e^{-(x^2+y^2)/2}dxdy </math> <br><br />
<br />
In polar coordinates <math>\displaystyle R^2 = x^2 + y^2</math>, so the relative probability density function of these two random variables using polar coordinates is:<br />
<br />
<math> f(R, \theta) = \frac{1}{2\pi}e^{-R^2/2} </math> <br><br />
<br />
If we have <math>\displaystyle R^2 \sim exp(1/2)</math> and <math>\!\theta \sim unif[0, 2\pi]</math> we get an equivalent relative probability density function. Notice that for this two-dimensional change of variables, the determinant of the Jacobian must be included in the change-of-variable formula, where<br />
<br />
<math> |J|=\left|\frac{\partial(x,y)}{\partial(R,\theta)}\right|= \left|\begin{matrix}\frac{\partial x}{\partial R}&\frac{\partial x}{\partial \theta}\\\frac{\partial y}{\partial R}&\frac{\partial y}{\partial \theta}\end{matrix}\right|=R </math> <br><br />
<br />
<math> f(X, Y) dxdy = f(R, \theta)|J|dRd\theta = \frac{1}{2\pi}e^{-R^2/2}R dRd\theta= \frac{1}{4\pi}e^{-\frac{s}{2}} dSd\theta </math> <br>where <math> S=R^2. </math> <br><br />
<br />
Therefore we can generate a point in polar coordinates using the uniform and exponential distributions, then convert the point to Cartesian coordinates and the resulting X and Y values will be equivalent to samples generated from N(0, 1).<br />
<br />
'''MATLAB code'''<br />
<br />
In MatLab this algorithm can be implemented with the following code, which generates 20,000 samples from N(0, 1):<br />
<br />
<pre><br />
x = zeros(10000, 1);<br />
y = zeros(10000, 1);<br />
for ii = 1:10000<br />
u1 = rand;<br />
u2 = rand;<br />
R2 = -2 * log(u1);<br />
theta = 2 * pi * u2;<br />
x(ii) = sqrt(R2) * cos(theta);<br />
y(ii) = sqrt(R2) * sin(theta);<br />
end<br />
hist(x)<br />
</pre><br />
<br />
In one execution of this script, the following histogram for x was generated:<br />
<br />
[[File:Hist standard normal.jpg|center|500px]]<br />
<br />
=====Non-Standard Normal Distributions=====<br />
<br />
'''Example 1: Single-variate Normal'''<br />
<br />
If X ~ Norm(0, 1) then (a + bX) has a normal distribution with a mean of <math>\displaystyle a</math> and a standard deviation of <math>\displaystyle b</math> (which is equivalent to a variance of <math>\displaystyle b^2</math>). Using this information with the Box-Muller transform, we can generate values sampled from some random variable <math>\displaystyle Y\sim N(a,b^2) </math> for arbitrary values of <math>\displaystyle a,b</math>.<br />
<br />
# Generate a sample u from Norm(0, 1) using the Box-Muller transform.<br />
# Set v = a + bu.<br />
<br />
The values for v generated in this way will be equivalent to samples from a <math>\displaystyle N(a, b^2)</math> distribution. We can modify the MatLab code used in the last section to demonstrate this. We just need to add one line before we generate the histogram:<br />
<br />
<pre><br />
x = a + b * x;<br />
</pre><br />
<br />
For instance, this is the histogram generated when b = 15, a = 125:<br />
<br />
[[File:Hist normal.jpg|center|500px]]<br />
<br />
'''Example 2: Multi-variate Normal'''<br />
<br />
The Box-Muller method can be extended to higher dimensions to generate multivariate normals. The objects generated will be nx1 vectors, and their variance will be described by nxn covariance matrices.<br />
<br />
<math>\mathbf{z} \sim N(\mathbf{u}, \Sigma)</math> defines the n by 1 vector <math>\mathbf{z}</math> such that:<br />
<br />
* <math>\displaystyle u_i</math> is the average of <math>\displaystyle z_i</math><br />
* <math>\!\Sigma_{ii}</math> is the variance of <math>\displaystyle z_i</math><br />
* <math>\!\Sigma_{ij}</math> is the co-variance of <math>\displaystyle z_i</math> and <math>\displaystyle z_j</math><br />
<br />
If <math>\displaystyle z_1, z_2, ..., z_d</math> are normal variables with mean 0 and variance 1, then the vector <math>\displaystyle (z_1, z_2,..., z_d) </math> has mean 0 and variance <math>\!I</math>, where 0 is the zero vector and <math>\!I</math> is the identity matrix. This fact suggests that the method for generating a multivariate normal is to generate each component individually as single normal variables.<br />
<br />
The mean and the covariance matrix of a multivariate normal distribution can be adjusted in ways analogous to the single variable case. If <math>\mathbf{z} \sim N(0,I)</math>, then <math>\Sigma^{1/2}\mathbf{z}+\mu \sim N(\mu,\Sigma)</math>. Note here that the covariance matrix is symmetric and positive semi-definite, so its square root always exists.<br />
<br />
We can compute <math>\mathbf{z}</math> in the following way:<br />
<br />
# Generate an n by 1 vector <math>\mathbf{x} = \begin{bmatrix}x_{1} & x_{2} & ... & x_{n}\end{bmatrix}</math> where <math>x_{i}</math> ~ Norm(0, 1) using the Box-Muller transform.<br />
# Calculate <math>\!\Sigma^{1/2}</math> using singular value decomposition.<br />
# Set <math>\mathbf{z} = \Sigma^{1/2} \mathbf{x} + \mathbf{u}</math>.<br />
<br />
The following MatLab code provides an example, where a scatter plot of 10000 random points is generated. In this case x and y have a co-variance of 0.9 - a very strong positive correlation.<br />
<br />
<pre><br />
x = zeros(10000, 1);<br />
y = zeros(10000, 1);<br />
for ii = 1:10000<br />
u1 = rand;<br />
u2 = rand;<br />
R2 = -2 * log(u1);<br />
theta = 2 * pi * u2;<br />
x(ii) = sqrt(R2) * cos(theta);<br />
y(ii) = sqrt(R2) * sin(theta);<br />
end<br />
<br />
E = [1, 0.9; 0.9, 1];<br />
[u s v] = svd(E);<br />
root_E = u * (s ^ (1 / 2));<br />
<br />
z = (root_E * [x y]')';  % transpose so that each sample is a 2-by-1 column when multiplied<br />
z(:,1) = z(:,1) + 5;<br />
z(:,2) = z(:,2) + -8;<br />
<br />
scatter(z(:,1), z(:,2))<br />
</pre><br />
<br />
This code generated the following scatter plot:<br />
<br />
[[File:scatter covar.jpg|center|500px]]<br />
<br />
In Matlab, we can also use the function "sqrtm()" or "chol()" (Cholesky Decomposition) to calculate the square root of a matrix directly. Note that the resulting root matrices may be different, but this does not materially affect the simulation.<br />
Here is an example:<br />
<br />
<pre><br />
E = [1, 0.9; 0.9, 1];<br />
r1 = sqrtm(E);<br />
r2 = chol(E);<br />
</pre><br />
<br />
R code for a multivariate normal distribution:<br />
<br />
<pre><br />
n=10000;<br />
r2<--2*log(runif(n));<br />
theta<-2*pi*(runif(n));<br />
x<-sqrt(r2)*cos(theta);<br />
<br />
y<-sqrt(r2)*sin(theta);<br />
a<-matrix(c(x,y),nrow=n,byrow=F);<br />
e<-matrix(c(1,.9,.9,1),nrow=2,byrow=T);<br />
svde<-svd(e);<br />
root_e<-svde$u %*% diag(sqrt(svde$d));<br />
z<-t(root_e %*%t(a));<br />
z[,1]=z[,1]+5;<br />
z[,2]=z[,2]+ -8;<br />
par(pch=19);<br />
plot(z,col=rgb(1,0,0,alpha=0.06))<br />
</pre><br />
<br />
[[File:m_normal.png|center|500px]]<br />
<br />
=====Remarks=====<br />
MATLAB's randn function uses the ziggurat method to generate normal distributed samples. It is an efficient rejection method based on covering the probability density function with a set of horizontal rectangles so as to obtain points within each rectangle. It is reported that a 800 MHz Pentium III laptop can generate over 10 million random numbers from normal distribution in less than one second. ([http://www.mathworks.com/company/newsletters/news_notes/clevescorner/spring01_cleve.html Reference])<br />
<br />
===Sampling From Binomial Distributions===<br />
<br />
In order to generate a sample x from <math>\displaystyle X \sim Bin(n, p)</math>, we can follow the following procedure:<br />
<br />
1. Generate n uniform random numbers sampled from <math>\displaystyle Unif [0, 1] </math>: <math>\displaystyle u_1, u_2, ..., u_n</math>.<br />
<br />
2. Set x to be the total number of cases where <math>\displaystyle u_i <= p</math> for all <math>\displaystyle 1 <= i <= n</math>.<br />
<br />
In MatLab this can be coded with a single line. The following generates a sample from <math>\displaystyle X \sim Bin(n, p)</math> <br />
<br />
>> sum(rand(n, 1) <= p, 1)<br />
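For example, with illustrative values of n and p (made up here), repeating this many times and comparing the empirical moments with np and np(1-p) gives a quick sanity check:<br />
<br />
<pre><br />
n = 20;  p = 0.3;                     % illustrative parameters<br />
N = 10000;                            % number of Bin(n, p) samples to draw<br />
x = sum(rand(n, N) <= p, 1);          % each column gives one binomial sample<br />
mean(x)                               % should be close to n*p = 6<br />
var(x)                                % should be close to n*p*(1-p) = 4.2<br />
</pre><br />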
<br />
==Bayesian Inference and Frequentist Inference - October 4, 2011==<br />
<br />
===Bayesian inference vs Frequentist inference===<br />
The Bayesian method has become popular in the last few decades as simulation and computer technology makes it more applicable. For more information about its history and application, please refer to http://en.wikipedia.org/wiki/Bayesian_inference.<br />
As for frequentists, please refer to http://en.wikipedia.org/wiki/Frequentist_inference.<br />
<br />
====Example====<br />
Consider: A person drinks a cup of coffee on a specific day.<br />
<br><br><br />
Frequentist: There is no explanation to this situation. It is essentially meaningless since it has only occurred once. Therefore, it is not a probability.<br />
<br><br />
Bayesian: Probability is not just about the frequent occurrences but it is what you believe about this probability.<br />
<br />
<br />
====Example of face identification====<br />
Take the face as input x. And the person as output y. The person can be either Ali or Tom. If it is Ali, y=1. Otherwise, y=0. We can divide the picture into 100*100 pixels and then list them into a 10,000*1 column vector which is x.<br />
<br />
If you are a frequentist, you would compare Pr(X=x|y=1) with Pr(X=x|y=0) and see which one is higher. But if you are a Bayesian, you would compare Pr(y=1|X=x) with Pr(y=0|X=x).<br />
<br />
====Summary of differences between two schools====<br />
*Frequentist: Probability refers to limiting relative frequency. (objective)<br />
*Bayesian: Probability describes degree of belief not frequency. (subjective)<br />
e.g. The probability that you drank a cup of tea on May 20, 2001 is 0.62 does not refer to any frequency.<br />
----<br />
*Frequentist: Parameters are fixed, unknown constants.<br />
*Bayesian: Parameters are random variables and we can make probabilistic statement about them.<br />
----<br />
*Frequentist: Statistical procedures should have long run frequency probabilities.<br />
e.g. a 95% confidence interval should trap the true value of the parameter in at least 95% of repetitions, in the long-run frequency sense<br />
*Bayesian: It makes inferences about <math>\theta</math> by producing a probability distribution for <math>\theta</math>. Inference (e.g. point estimation) will be extracted from this distribution.<br />
<br />
====Bayesian inference====<br />
<br />
Bayesian inference is usually carried out in the following way:<br />
<br />
1. Choose a prior probability density function of <math>\!\theta</math> which is <math>f(\!\theta)</math>. This is our belief about <math>\theta</math> before we see any data.<br />
<br />
2. Choose a statistical model <math>\displaystyle f(x|\theta)</math> that reflects our beliefs about X.<br />
<br />
3. After observing data <math>\displaystyle x_1,...,x_n</math>, we update our beliefs and calculate the posterior probability.<br />
<br />
<math>f(\theta|x) = \frac{f(\theta,x)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{\int^{}_\theta f(x|\theta) \cdot f(\theta) d\theta}</math>, where <math>\displaystyle f(\theta|x)</math> is the posterior probability, <math>\displaystyle f(\theta)</math> is the prior probability, <math>\displaystyle f(x|\theta)</math> is the likelihood of observing X=x given <math>\!\theta</math> and f(x) is the marginal probability of X=x.<br />
<br />
If we have i.i.d. observations <math>\displaystyle x_1,...,x_n</math>, we can replace <math>\displaystyle f(x|\theta)</math> with <math>f({x_1,...,x_n}|\theta)=\prod_{i=1}^n f(x_i|\theta)</math> by independence.<br />
<br />
We denote <math>\displaystyle f({x_1,...,x_n}|\theta)</math> as <math>\displaystyle L_n(\theta)</math> which is called likelihood. And we use <math>\displaystyle x^n</math> to denote <math>\displaystyle (x_1,...,x_n)</math>.<br />
<br />
<math>f(\theta|x^n) = \frac{f(x^n|\theta) \cdot f(\theta)}{f(x^n)}=\frac{f(x^n|\theta) \cdot f(\theta)}{\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta}</math> , where <math>\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta</math> is a constant <math>\displaystyle c_n</math>. So <math>f(\theta|x^n) \propto f(x^n|\theta) \cdot f(\theta)</math>. The posterior probability is proportional to the likelihood times prior probability.<br />
<br />
<math>E(\theta)=\int^{}_\theta \theta \cdot f(\theta|x^n) d\theta</math> which is the posterior mean of <math>\!\theta</math>.<br />
<br />
Let <math>\tilde{\theta}=(\theta_1,...,\theta_d)^T</math>, then <math>f(\theta_1|x^n) = \int^{} \int^{} \dots \int^{}f(\theta|X)d\theta_2d\theta_3 \dots d\theta_d </math> and <math>E(\theta_1)=\int^{}\theta_1 \cdot f(\theta_1|x^n) d\theta_1</math><br />
<br />
====Example 1: Estimating parameters of a univariate Gaussian distribution====<br />
<br />
Suppose X follows a univariate Gaussian distribution (i.e. a Normal distribution) with parameters <math>\!\mu</math> and <br />
<math>\displaystyle {\sigma^2}</math>.<br />
<br />
(a) For Frequentists:<br />
<br />
<math>f(x|\theta)= \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}</math><br />
<br />
<math>L_n(\theta)= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2}</math><br />
<br />
<br />
<math>\ln L_n(\theta) = l(\theta) = \sum_{i=1}^n -\frac{1}{2}\ln 2\pi-\ln \sigma-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2</math><br />
<br />
To get the maximum likelihood estimator of <math>\!\mu</math> (mle), we find the <math>\hat{\mu}</math> which maximizes <math>\displaystyle L_n(\theta)</math>:<br />
<br />
<math>\frac{\partial l(\theta)}{\partial \mu}= \sum_{i=1}^n \frac{1}{\sigma}(\frac{x_i-\mu}{\sigma})=0 \Rightarrow \sum_{i=1}^n x_i = n\mu \Rightarrow \hat{\mu}_{mle}=\bar{x}</math><br />
<br />
(b) For Bayesians:<br />
<br />
<math>f(\theta|x) \propto f(x|\theta) \cdot f(\theta)</math><br />
<br />
We assume that the mean of the above normal distribution is itself distributed normally with mean <math>\!\mu_0</math> and variance <math>\!\Gamma^2</math>.<br />
<br />
Suppose <math>\!\mu\sim N(\mu_0, \!\Gamma^2)</math>,<br />
<br />
so <math>f(\mu) = \frac{1}{\sqrt{2\pi}\Gamma} \cdot e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\Gamma})^2}</math><br />
<br />
<math>f(\mu|x) = \frac{1}{\sqrt{2\pi}\tilde{\sigma}} \cdot e^{-\frac{1}{2}(\frac{\mu-\tilde{\mu}}{\tilde{\sigma}})^2}</math><br />
<br />
<math>\tilde{\mu} = \frac{\frac{n}{\sigma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\bar{x}+\frac{\frac{1}{\Gamma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\mu_0</math>, where <math>\tilde{\mu}</math> is the estimator of <math>\!\mu</math>.<br />
<br />
* If prior belief about <math>\!\mu_0</math> is strong, then <math>\!\Gamma</math> is small and <math>\frac{1}{\Gamma^2}</math> is large. <math>\tilde{\mu}</math> is close to <math>\!\mu_0</math> and the observations will not affect too much. On the contrary, if prior belief about <math>\!\mu_0</math> is weak, <math>\!\Gamma</math> is large and <math>\frac{1}{\Gamma^2}</math> is small. <math>\tilde{\mu}</math> depends more on observations.(This is intuitive, when our original belief is reliable, then the sample is not important in improving the result; when the belief is not reliable, then we depend a lot on the sample.)<br />
<br />
* When the sample is large (i.e. n <math>\to \infty</math>), <math>\tilde{\mu} \to \bar{x}</math> and the impact of prior belief about <math>\!\mu</math> is weakened.<br />
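To make this shrinkage behaviour concrete, here is a small MATLAB sketch (the observations, <math>\sigma</math>, <math>\mu_0</math> and <math>\Gamma</math> below are all made-up illustrative values) that evaluates <math>\tilde{\mu}</math> for a strong and a weak prior:<br />
<br />
<pre><br />
x = [4.8 5.3 5.1 4.9 5.4];            % hypothetical observations<br />
n = length(x);  xbar = mean(x);<br />
sigma = 1;  mu0 = 0;                  % known sampling sd and prior mean (illustrative)<br />
for Gamma = [0.1 10]                  % strong prior (small Gamma) vs weak prior (large Gamma)<br />
    w = (n/sigma^2) / (n/sigma^2 + 1/Gamma^2);<br />
    mu_tilde = w*xbar + (1-w)*mu0;    % posterior mean: weighted average of xbar and mu0<br />
    fprintf('Gamma = %.1f : posterior mean = %.4f\n', Gamma, mu_tilde)<br />
end<br />
</pre><br />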
<br />
=='''Basic Monte Carlo Integration - October 6th, 2011'''==<br />
<br />
Three integration methods would be taught in this course:<br />
*Basic Monte Carlo Integration<br />
*Importance Sampling<br />
*Markov Chain Monte Carlo (MCMC)<br />
<br />
The first, and most basic, method of numerical integration we will see is Monte Carlo Integration. We use this to solve an integral of the form: <math> I = \int_{a}^{b} h(x) dx </math><br />
<br />
Note the following derivation: <br />
<br />
<math>\begin{align}<br />
\displaystyle I & = \int_{a}^{b} h(x)dx \\<br />
& = \int_{a}^{b} h(x)((b-a)/(b-a))dx \\<br />
& = \int_{a}^{b} (h(x)(b-a))(1/(b-a))dx \\<br />
& = \int_{a}^{b} w(x)f(x)dx \\<br />
& = E[w(x)] \\<br />
\end{align}<br />
</math><br />
<br />
<math>\approx \frac{1}{n} \sum_{i=1}^{n} w(x_i) </math><br />
<br />
Where w(x) = h(x)(b-a) and f(x) is the probability density function of a uniform random variable on the interval [a,b]. The expectation of w with respect to the distribution f is approximated by averaging w over n samples of x drawn from f.<br />
<br />
<br />
===='''General Procedure'''====<br />
<br />
i) Draw n samples <math> x_i \sim~ U[a,b] </math><br />
<br />
ii) Compute <math> \ w(x_i) </math> for every sample<br />
<br />
iii) Obtain an estimate of the integral, <math> \hat{I} </math>, as follows:<br />
<br />
<math> \hat{I} = \frac{1}{n} \sum_{i=1}^{n} w(x_i)</math>. Clearly, this is just the average of the simulation results.<br />
<br />
By the strong law of large numbers <math> \hat{I} </math> converges to <math> \ I </math> as <math> \ n \rightarrow \infty </math>. Because of this, we can compute all sorts of useful information, such as variance, standard error, and confidence intervals.<br />
<br />
Standard Error: <math> SE = Standard Deviation / \sqrt{n} </math><br />
<br />
Variance: <math> V = \frac{\sum_{i=1}^{n} (w(x_i)-\hat{I})^2}{n-1} </math><br />
<br />
Confidence Interval: <math> I \pm t_{(\alpha/2)} SE </math><br />
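As a rough illustration of these quantities, the sketch below estimates the integral <math> \int_{0}^{1} x^3dx </math> (the example of the next subsection) together with its standard error and an approximate 95% confidence interval; using 1.96 in place of the t quantile assumes n is large:<br />
<br />
<pre><br />
n = 10000;<br />
u = rand(1, n);                          % samples from Unif[0,1]<br />
w = u.^3;                                % w(x) = h(x)(b-a), with b-a = 1 here<br />
I_hat = mean(w)                          % Monte Carlo estimate of the integral<br />
SE = std(w) / sqrt(n)                    % standard error of the estimate<br />
CI = [I_hat - 1.96*SE, I_hat + 1.96*SE]  % approximate 95% confidence interval<br />
</pre><br />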
<br />
==='''Example: Uniform Distribution'''===<br />
<br />
Consider the integral, <math> \int_{0}^{1} x^3dx </math>, which is easily solved through standard analytical integration methods, and is equal to .25. Now, let us check this answer with a numerical approximation using Monte Carlo Integration. <br />
<br />
We generate a 1 by 10000 vector of uniform (on the interval [0,1]) random variables and call that vector 'u'. We see that our 'w' in this case is <math> x^3 </math>, so we set <math> w = u^3 </math>. Our <math>\hat{I}</math> is equal to the mean of w.<br />
<br />
In Matlab, we can solve this integration problem with the following code:<br />
<br />
<pre><br />
u = rand(1,10000);<br />
w = u.^3;<br />
mean(w)<br />
ans = 0.2475<br />
</pre><br />
<br />
Note the '.' after 'u' in the second line of code, indicating that each entry in the matrix is cubed. Also, our approximation is close to the actual value of .25. Now let's try to get an even better approximation by generating more sample points. <br />
<br />
<pre><br />
u= rand(1,100000);<br />
w= u.^3;<br />
mean(w)<br />
ans = .2503<br />
</pre><br />
<br />
We see that when the number of sample points is increased, our approximation improves, as one would expect.<br />
<br />
==='''Generalization'''===<br />
<br />
Up to this point we have seen how to numerically approximate an integral when the distribution of f is uniform. Now we will see how to generalize this to other distributions.<br />
<br />
<math> I = \int h(x)f(x)dx </math> <br />
<br />
If f is a distribution function (pdf), then <math> I </math> can be estimated as E<sub>f</sub>[h(x)]. This means taking the expectation of h with respect to the distribution of f. Our previous example is the case where f is the uniform distribution between [a,b].<br />
<br />
'''Procedure for the General Case'''<br />
<br />
i) Draw n samples from f <br />
<br />
ii) Compute h(x<sub>i</sub>)<br />
<br />
iii) <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)</math><br />
<br />
==='''Example: Exponential Distribution'''===<br />
<br />
Find <math> E[\sqrt{x}] </math> for <math> \displaystyle f = e^{-x} </math>, which is the exponential distribution with mean 1.<br />
<br />
<math> I = \int_{0}^{\infty} \sqrt{x} e^{-x}dx </math><br />
<br />
We can see that we must draw samples from f, the exponential distribution.<br />
<br />
To find a numerical solution using Monte Carlo Integration we see that: <br />
<br />
u= rand(1,10000)<br />
X= -log(u)<br />
h= <math> \sqrt{x} </math> <br />
I= mean(h)<br />
<br />
To implement this procedure in Matlab, use the following code:<br />
<br />
<pre><br />
u = rand(1,10000);<br />
X = -log(u);<br />
h = X.^.5;<br />
mean(h)<br />
ans = .8841<br />
</pre><br />
<br />
An easy way to check whether your approximation is correct is to use the built in Matlab function 'quadl' which takes a function and bounds for the integral and returns a solution for the definite integral of that function. For this specific example, we can enter:<br />
<br />
<pre><br />
f = @(x) sqrt(x).*exp(-x);<br />
% quadl runs into computational problems when the upper bound is "inf" or an extremely large number, <br />
% so choose just a moderately large number.<br />
quadl(f,0,100)<br />
ans =<br />
0.8862<br />
</pre><br />
<br />
From the above result, we see that our approximation was quite close.<br />
<br />
==='''Example: Normal Distribution'''===<br />
<br />
Let <math> f(x) = (1/(2 \pi)^{1/2}) e^{(-x^2)/2} </math>. Compute the cumulative distribution function at some point x.<br />
<br />
<math> F(x)= \int_{-\infty}^{x} f(s)ds = \int_{-\infty}^{x}(1)(1/(2 \pi)^{1/2}) e^{(-s^2)/2}ds </math>. The (1) is inserted to illustrate that our h(x) will be the constant function 1, and our f(x) is the normal distribution. To take into account the upper bound of integration, x, any values sampled that are greater than x will be set to zero. <br />
<br />
This is the Matlab code for solving F(2):<br />
<br />
<pre><br />
<br />
u = randn(1,10000);<br />
h = u < 2;<br />
mean(h)<br />
ans = .9756<br />
<br />
</pre><br />
<br />
We generate a 1 by 10000 vector of standard normal random variables and we return a value of 1 if u is less than 2, and 0 otherwise.<br />
<br />
We can also build the function F(x) in matlab in the following way:<br />
<br />
<pre><br />
function F(x)<br />
u=randn(1,1000000);<br />
h=u<x;<br />
mean(h)<br />
</pre><br />
<br />
<br />
==='''Example: Binomial Distribution'''===<br />
<br />
In this example we will see the Bayesian Inference for 2 Binomial Distributions.<br />
<br />
Let <math> X ~ Bin(n,p) </math> and <math> Y ~ Bin(m,q) </math>, and let <math> \!\delta = p-q </math>.<br />
<br />
Therefore, the frequentist estimate is <math> \displaystyle \hat{\delta} = x/n - y/m </math>.<br />
<br />
Bayesian wants <math> \displaystyle f(p,q|x,y) = f(x,y|p,q)f(p,q)/f(x,y) </math>, where <math> f(x,y)=\iint\limits_{\!\theta} f(x,y|p,q)f(p,q)\,dp\,dq</math> is a constant.<br />
<br />
Thus, <math> \displaystyle f(p,q|x,y)\propto f(x,y|p,q)f(p,q) </math>. Now we assume that <math>\displaystyle f(p,q) = f(p)f(q) = 1 </math> and f(p) and f(q) are uniform.<br />
<br />
Therefore, <math> \displaystyle f(p,q|x,y)\propto p^x(1-p)^{n-x}q^y(1-q)^{m-y} </math>.<br />
<br />
<math> E[\delta] = \int_{0}^{1} \int_{0}^{1} (p-q)f(p,q|x,y)\,dp\,dq </math>.<br />
<br />
As you can see this is much tougher than the frequentist approach.<br />
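One way to approximate this expectation is by Monte Carlo. With the uniform priors above, the posterior factors into <math>p|x \sim Beta(x+1, n-x+1)</math> and <math>q|y \sim Beta(y+1, m-y+1)</math>, which can be sampled directly. The sketch below uses made-up values of n, m, x and y, and assumes MATLAB's betarnd (Statistics Toolbox) is available:<br />
<br />
<pre><br />
n = 100;  x = 60;                     % hypothetical data for X ~ Bin(n, p)<br />
m = 100;  y = 45;                     % hypothetical data for Y ~ Bin(m, q)<br />
N = 100000;                           % number of Monte Carlo samples<br />
p = betarnd(x+1, n-x+1, N, 1);        % samples from the posterior of p<br />
q = betarnd(y+1, m-y+1, N, 1);        % samples from the posterior of q<br />
delta = p - q;<br />
mean(delta)                           % Monte Carlo estimate of E[delta | x, y]<br />
</pre><br />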
<br />
=='''Importance Sampling and Basic Monte Carlo Integration - October 11th, 2011'''==<br />
<br />
==='''Example: Binomial Distribution (Continued)'''===<br />
<br />
Suppose we are given two independent Binomial Distributions <math>\displaystyle X \sim Bin(n, p_1)</math>, <math>\displaystyle Y \sim Bin(m, p_2)</math>. We would like to give an Monte Carlo estimate of <math>\displaystyle \delta = p_1 - p_2</math><br><br />
<br />
Frequentist approach: <br><br><math>\displaystyle \hat{p_1} = \frac{X}{n}</math> ; <math>\displaystyle \hat{p_2} = \frac{Y}{m}</math><br><br><math>\displaystyle \hat{\delta} = \hat{p_1} - \hat{p_2} = \frac{X}{n} - \frac{Y}{m}</math><br><br><br />
Bayesian approach to compute the expected value of <math>\displaystyle \delta</math>:<br><br><br />
<math>\displaystyle E(\delta) = \int\int(p_1-p_2) f(p_1,p_2|X,Y)\,dp_1dp_2</math><br><br><br />
<br />
Assume that <math>\displaystyle n = 100, m = 100, p_1 = 0.5, p_2 = 0.8</math> and the sample size is 1000.<br><br />
MATLAB code of the above example:<br />
<pre><br />
n = 100;<br />
m = 100;<br />
p_1 = 0.5;<br />
p_2 = 0.8;<br />
p1 = mean(rand(n,1000)<p_1);<br />
p2 = mean(rand(m,1000)<p_2);<br />
delta = p2 - p1;<br />
hist(delta)<br />
mean(delta)<br />
</pre><br />
<br />
In one execution of the code, the mean of delta was 0.3017. The histogram of delta generated was:<br />
[[File:Hist delta.jpg|center|]]<br />
<br />
Through Monte Carlo simulation, we can obtain an empirical distribution of delta and carry out inference on the data obtained, such as computing the mean, maximum, variance, standard deviation and the standard error of delta.<br />
<br />
==='''Importance Sampling'''===<br />
<br />
====Motivation====<br />
<br />
Consider the integral <math>\displaystyle I = \int h(x)f(x)\,dx</math><br><br><br />
According to basic Monte Carlo Integration, if we can sample from the probability density function <math>\displaystyle f(x)</math> and evaluate <math>\displaystyle h(x)</math> at those samples, then <math>\displaystyle I</math> can be estimated as the average of the <math>\displaystyle h(x_i)</math> ( i.e. <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)</math> )<br><br />
However, the Monte Carlo method works when we know how to sample from <math>\displaystyle f(x)</math>. In the case where it is difficult to sample from <math>\displaystyle f(x)</math>, importance sampling is a technique that we can apply. Importance Sampling relies on another function <math>\displaystyle g(x)</math> which we know how to sample from.<br />
<br />
The above integral can be rewritten as follow:<br><br />
<math>\begin{align}<br />
\displaystyle I & = \int h(x)f(x)\,dx \\<br />
& = \int h(x)f(x)\frac{g(x)}{g(x)}\,dx \\<br />
& = \int \frac{h(x)f(x)}{g(x)}g(x)\,dx \\<br />
& = \int y(x)g(x)\,dx \\<br />
& = E_g(y(x)) \\<br />
\end{align}<br />
</math><br><br />
<math>where \ y(x) = \frac{h(x)f(x)}{g(x)}</math><br><br />
<br />
The integral can thus be simulated as <math>\displaystyle \hat{I} = \frac{1}{n} \sum_{i=1}^{n} Y_i \ , \ where \ Y_i = \frac{h(x_i)f(x_i)}{g(x_i)}</math><br><br />
<br />
====Procedure====<br />
<br />
Suppose we know how to sample from <math>\displaystyle g(x)</math><br><br />
#Choose a suitable <math>\displaystyle g(x)</math> and draw n samples <math>x_1,x_2....,x_n \sim~ g(x)</math><br />
#Set <math>Y_i =\frac{h(x_i)f(x_i)}{g(x_i)}</math><br />
#Compute <math> \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i </math><br><br />
<br />
By the Law of large numbers, <math>\displaystyle \hat{I} \rightarrow I </math> provided that the sample size n is large enough.<br><br><br />
<br />
'''Remarks:''' One can think of <math>\frac{f(x)}{g(x)}</math> as a weight to <math>\displaystyle h(x)</math> in the computation of <math>\hat{I}</math><br><br><br />
<math>\displaystyle i.e. \ \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n}\sum_{i=1}^{n} (\frac{f(x_i)}{g(x_i)})h(x_i)</math><br><br><br />
Therefore, <math>\displaystyle \hat{I} </math> is a weighted average of <math>\displaystyle h(x_i)</math><br><br><br />
<br />
====Problem====<br />
<br />
If <math>\displaystyle g(x)</math> is not chosen appropriately, then the variance of the estimate <math>\hat{I}</math> may be very large. Here we actually face a problem similar to the one in the Acceptance-Rejection approach. Consider the second moment of <math>\displaystyle y(X)</math> under <math>\displaystyle g</math> (this, together with the first moment <math>\displaystyle I</math>, determines the variance of <math>\hat{I}</math>):<br><br><br />
<math>\begin{align}<br />
\displaystyle E_g[(y(x))^2] & = \int (y(x))^2 g(x) dx \\<br />
& = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx \\<br />
& = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx \\<br />
& = \int \frac{h^2(x)f^2(x)}{g(x)} dx \\<br />
\end{align}<br />
</math><br><br><br />
<br />
When <math>\displaystyle g(x)</math> is very small, then the above integral could be very large, hence the variance can be very large when g is not chosen appropriately. This occurs when <math>\displaystyle g(x)</math> has a thinner tail than <math>\displaystyle f(x)</math> such that the quantity <math>\displaystyle \frac{h^2(x)f^2(x)}{g(x)}</math> is large.<br />
<br />
'''Remarks:''' <br />
<br />
1. We can actually compute the form of <math>\displaystyle g(x)</math> to have optimal variance. <br>Mathematically, it is to find <math>\displaystyle g(x)</math> subject to <math>\displaystyle \min_g [\ E_g([y(x)]^2) - (E_g[y(x)])^2\ ]</math><br><br />
It can be shown that the optimal <math>\displaystyle g(x)</math> is <math>\displaystyle \frac{|h(x)|f(x)}{\int_{-\infty}^{\infty}|h(s)|f(s)ds}</math>. Using the optimal <math>\displaystyle g(x)</math> minimizes the variance of the Importance Sampling estimate. This is of theoretical interest but not useful in practice: to write down this optimal g(x) we would need the value of <math>\displaystyle \int_{-\infty}^{\infty}|h(s)|f(s)\,ds</math>, which is essentially the integral we are trying to estimate in the first place.<br />
<br />
2. In practice, we shall choose <math>\displaystyle g(x)</math> which has similar shape as <math>\displaystyle f(x)</math> but with a thicker tail than <math>\displaystyle f(x)</math> in order to avoid the problem mentioned above.<br><br />
<br />
====Example====<br />
<br />
Estimate <math>\displaystyle I = Pr(Z>3),\ where\ Z \sim N(0,1)</math><br><br><br />
'''Method 1: Basic Monte Carlo'''<br />
<br />
<math>\begin{align} Pr(Z>3) & = \int^\infty_3 f(x)\,dx \\<br />
& = \int^\infty_{-\infty} h(x)f(x)\,dx \end{align}</math><br /><br />
<math> where \ <br />
h(x) = \begin{cases}<br />
0, & \text{if } x \le 3 \\<br />
1, & \text{if } x > 3<br />
\end{cases}</math><br />
<math>\ ,\ f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}</math><br />
<br />
MATLAB code to compute <math>\displaystyle I</math> from 100 samples of standard normal distribution:<br />
<pre><br />
h = randn(100,1) > 3;<br />
I = mean(h)<br />
</pre><br />
<br />
In one execution of the code, it returns a value of 0 for <math>\displaystyle I</math>, which differs significantly from the true value of <math>\displaystyle I \approx 0.0013 </math>. The problem of using Basic Monte Carlo in this example is that <math>\displaystyle Pr(Z>3)</math> has a small value, and hence many points sampled from the standard normal distribution will be wasted. Therefore, although Basic Monte Carlo is a feasible method to compute <math>\displaystyle I</math>, it gives a poor estimation.<br />
<br />
'''Method 2: Importance Sampling'''<br />
<br />
<math>\displaystyle I = Pr(Z>3)= \int^\infty_3 f(x)\,dx </math><br><br />
<br />
To apply importance sampling, we have to choose a <math>\displaystyle g(x)</math> which we will sample from. In this example, we can choose <math>\displaystyle g(x)</math> to be the probability density function of exponential distribution, normal distribution with mean 0 and variance greater than 1 or normal distribution with mean greater than 0 and variance 1 etc.. For the following, we take <math>\displaystyle g(x)</math> to be the pdf of <math>\displaystyle N(4,1)</math>.<br><br />
<br />
Procedure:<br />
#Draw n samples <math>x_1,x_2....,x_n \sim~ g(x)</math><br />
#Calculate <math>\begin{align} \frac{f(x)}{g(x)} & = \frac{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}<br />
}{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x-4)^2} } \\<br />
& = e^{8-4x} \end{align} </math><br><br />
#Set <math> Y_i = h(x_i)e^{8-4x_i}\ with\ h(x) = \begin{cases}<br />
0, & \text{if } x \le 3 \\<br />
1, & \text{if } x > 3<br />
\end{cases}<br />
</math><br><br />
#Compute <math> \hat{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i </math><br><br />
<br />
The above procedure, using 100 samples of <math>\displaystyle g(x)</math>, can be implemented in MATLAB as follows:<br />
<pre><br />
for ii = 1:100<br />
x = randn + 4 ;<br />
h = x > 3 ;<br />
y(ii) = h * exp(8-4*x) ;<br />
end<br />
mean(y)<br />
</pre><br />
<br />
In one execution of the code, it returns a value of 0.001271 for <math> \hat{Y} </math>, which is much closer to the true value of <math>\displaystyle I \approx 0.0013 </math>. From many executions of the code, the variance of basic monte carlo is approximately 150 times that of importance sampling. This demonstrates that this method can provide a better estimate than the Basic Monte Carlo method.<br />
<br />
==''' Importance Sampling with Normalized Weight and Markov Chain Monte Carlo - October 13th, 2011'''==<br />
==='''Importance Sampling with Normalized Weight'''===<br />
<br />
Recall that we can think of <math>\displaystyle b(x) = \frac{f(x)}{g(x)}</math> as a weight applied to the samples <math>\displaystyle h(x)</math>. If the form of <math>\displaystyle f(x)</math> is known only up to a constant, we can use an alternate, normalized form of the weight, <math>\displaystyle b^*(x)</math>. (This situation arises in Bayesian inference.) Importance sampling with normalized or standard weight is also called indirect importance sampling.<br />
<br />
We derive the normalized weight as follows:<br><br />
<math>\begin{align}<br />
\displaystyle I & = \int h(x)f(x)\,dx \\<br />
&= \int h(x)\frac{f(x)}{g(x)}g(x)\,dx \\<br />
&= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int f(x) dx} \\<br />
&= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int \frac{f(x)}{g(x)}g(x) dx} \\<br />
&= \frac{\int h(x)b(x)g(x)\,dx}{\int\ b(x)g(x) dx} <br />
\end{align}</math><br />
<br />
<math>\hat{I}= \frac{\sum_{i=1}^{n} h(x_i)b(x_i)}{\sum_{i=1}^{n} b(x_i)} </math><br />
<br />
Then, the normalized weight is <math>b^*(x_i) = \displaystyle \frac{b(x_i)}{\sum_{i=1}^{n} b(x_i)}</math><br />
<br />
Note that <math> \int f(x)\, dx = \int b(x)g(x)\, dx = 1 </math>, which is why the denominator in the derivation above could be written as <math> \int b(x)g(x)\, dx </math>.<br />
<br />
We can also determine the associated Monte Carlo variance of this estimate by<br />
<br />
<math> Var(\hat{I})= \frac{\sum_{i=1}^{n} b(x_i)(h(x_i) - \hat{I})^2}{\sum_{i=1}^{n} b(x_i)} </math><br />
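The following MATLAB sketch illustrates importance sampling with normalized weights, reusing the <math>\displaystyle Pr(Z>3)</math> example from the previous lecture and pretending that <math>\displaystyle f</math> is known only up to a constant (the normalizing constants of both densities are dropped on purpose; they cancel in the normalized weights). The proposal <math>\displaystyle g</math> is again the pdf of <math>\displaystyle N(4,1)</math>.<br />
<pre><br />
n = 10000;<br />
x = randn(n,1) + 4;                   % samples from g = N(4,1)<br />
f_un = exp(-x.^2/2);                  % unnormalized target (standard normal without 1/sqrt(2*pi))<br />
g_un = exp(-(x-4).^2/2);              % unnormalized proposal (same constant dropped)<br />
b = f_un ./ g_un;                     % weights b(x) = f(x)/g(x); the constants cancel<br />
h = x > 3;                            % h(x) = indicator of x > 3<br />
I_hat = sum(h .* b) / sum(b)          % normalized-weight estimate of Pr(Z > 3)<br />
V_hat = sum(b .* (h - I_hat).^2) / sum(b)   % Monte Carlo variance formula given above<br />
</pre><br />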
<br />
==='''Markov Chain Monte Carlo'''===<br />
We still want to solve <math> I = \displaystyle\int^\ h(x)f(x)\,dx </math><br />
<br />
====Stochastic Process====<br />
A stochastic process <math> \{ x_t : t \in T \}</math> is a collection of random variables. The variables <math>\displaystyle x_t</math> take values in some set <math>\displaystyle X</math> called the '''state space.''' The set <math>\displaystyle T</math> is called the '''index set.'''<br />
<br />
====Markov Chain====<br />
A Markov Chain is a stochastic process for which the distribution of <math>\displaystyle x_t</math> depends only on <math>\displaystyle x_{t-1}</math>. It is a random process characterized as being memoryless; meaning that the next occurrence of a defined event is only dependent on the current event and not on the sequence of events that preceded it. <br />
Formal Definition: The process <math> \{ x_t : t \in T \}</math> is a Markov Chain if <math>\displaystyle Pr(x_t|x_0, x_1,..., x_{t-1})= Pr(x_t|x_{t-1})</math> for all <math> \{t \in T \}</math> and for all <math> \{x \in X \}</math><br />
For a Markov Chain, <math>\displaystyle f(x_1,...x_n)= f(x_1)f(x_2|x_1)f(x_3|x_2)...f(x_n|x_{n-1})</math><br />
<br><br>Real Life Example:<br />
<br>When going for an interview, the employer only looks at your highest education achieved. The employer would not look at the past education received (elementary school, high school, etc.) because the employer believes that the highest education achieved summarizes your previous education. Therefore, anything before your most recent education is irrelevant. In other words, <math> x_t </math> is regarded as a summary of <math>x_{t-1},...,x_2,x_1</math>, so to determine <math>x_{t+1}</math> we only need to pay attention to <math>x_{t}</math>.<br />
<br />
====Transition Probabilities====<br />
A Transition Probability is the probability of jumping from one state to another state.<br />
Formal Definition: We call <math>\displaystyle P_{ij} = Pr(x_{t+1}=j|x_t=i)</math> the transition probability.<br />
That is, P(i,j) is the probability of going to state j from state i. The matrix P whose (i,j) element is <math>\displaystyle P_{ij}</math> is called the Transition Matrix.<br />
<br />
Properties of P: <br />
:1) <math>\displaystyle P_{ij} \geq 0</math>: the probability of going to another state cannot be negative<br />
:2) <math>\displaystyle \sum_{j}P_{ij} = 1</math>: starting from state i, the chain must move to some state j (possibly i itself), so each row of P sums to 1<br />
<br />
====Random Walk====<br />
Example: Start at one point and flip a coin where <math>\displaystyle Pr(H)=p</math> and <math>\displaystyle Pr(T)=1-p=q</math>. Take one step right if heads and one step left if tails. If at an endpoint, stay there.<br />
The transition matrix is<br />
<math>P=\left(\begin{matrix}1&0&0&\dots&\dots&0\\<br />
q&0&p&0&\dots&0\\<br />
0&q&0&p&\dots&0\\<br />
\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\<br />
\vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\<br />
0&0&\dots&\dots&\dots&1<br />
\end{matrix}\right)</math><br />
<br />
Let <math>\displaystyle P_n</math> be the matrix such that its (i,j) element is <math>\displaystyle P_{ij}(n)</math>. This is called n-step probability.<br />
<br />
:<math>\displaystyle P_n = P^n</math><br />
:<math>\displaystyle P_1 = P</math><br />
:<math>\displaystyle P_2 = P^2</math><br />
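A small MATLAB sketch of the random walk chain above (the number of states and the coin probability are arbitrary choices): it builds the transition matrix for 5 states with absorbing endpoints and checks that the 3-step transition probabilities are given by the third matrix power.<br />
<pre><br />
p = 0.6; q = 1 - p;         % coin probabilities<br />
N = 5;                      % number of states<br />
P = zeros(N);<br />
P(1,1) = 1; P(N,N) = 1;     % absorbing endpoints: stay there<br />
for i = 2:N-1<br />
    P(i,i-1) = q;           % one step left with probability q<br />
    P(i,i+1) = p;           % one step right with probability p<br />
end<br />
P3 = P^3                    % 3-step transition matrix, P_3 = P^3<br />
sum(P3,2)                   % each row still sums to 1<br />
</pre><br />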
<br />
<br />
==''' Markov Chain Properties and Page Rank - October 18th, 2011'''==<br />
<br />
===Summary of Terminology===<br />
<br />
====Transition Matrix====<br />
<br />
A matrix <math>\!P</math> that defines a Markov Chain has the form:<br />
<br />
<math>P = \begin{bmatrix}<br />
P_{11} & \cdots & P_{1N} \\<br />
\vdots & \ddots & \vdots \\ <br />
P_{N1} & \cdots & P_{NN}<br />
\end{bmatrix}</math><br />
<br />
where <math>\!P(i,j) = P_{ij} = Pr(x_{t+1} = j | x_t = i) </math> is the probability of transitioning from state i to state j in the Markov Chain in a single step. Note that this implies that all rows add up to one.<br />
<br />
====n-step Transition matrix====<br />
<br />
A matrix <math>\!P_n</math> whose (i,j)<sup>th</sup> entry is the probability of moving from state i to state j after n transitions:<br />
<br />
<math>\!P_n(i,j) = Pr(x_{m+n}=j|x_m = i)</math><br />
<br />
This probability is called the n-step transition probability. A nice property of this matrix is that<br />
<br />
<math>\!P_n = P^n</math><br />
<br />
for all <math>n \ge 0</math>, where P is the transition matrix. Note that the rows of <math>P_n</math> still add up to one.<br />
<br />
====Marginal distribution of a Markov Chain====<br />
<br />
We represent the state at time t as a vector.<br />
<br />
<math>\mu_t = (\mu_t(1) \; \mu_t(2) \; ... \; \mu_t(n))</math><br />
<br />
Consider this Markov Chain:<br />
<br />
[[File:MarkovSample.png|300px]]<br />
<br />
<math>\mu_t = (A \; B)</math>, where A is the probability of being in state a at time t, and B is the probability of being in state b at time t.<br />
<br />
For example if <math>\mu_t = (0.1 \; 0.9)</math>, we have a 10% chance of being in state a at time t, and a 90% chance of being in state b at time t.<br />
<br />
Suppose we run this Markov chain many times, and record the state at each step.<br />
<br />
In this example, we run 4 trials, up until t=5.<br />
<br />
{| class="wikitable"<br />
|-<br />
! t<br />
! Trial 1<br />
! Trial 2<br />
! Trial 3<br />
! Trial 4<br />
! Observed <math>\mu</math><br />
|-<br />
| 1<br />
| a<br />
| b<br />
| b<br />
| a<br />
| (0.5, 0.5)<br />
|-<br />
| 2<br />
| b<br />
| a<br />
| a<br />
| a<br />
| (0.75, 0.25)<br />
|-<br />
| 3<br />
| a<br />
| a<br />
| b<br />
| a<br />
| (0.75, 0.25)<br />
|-<br />
| 4<br />
| b<br />
| b<br />
| a<br />
| b<br />
| (0.25, 0.75)<br />
|-<br />
| 5<br />
| b<br />
| b<br />
| b<br />
| a<br />
| (0.25, 0.75)<br />
|}<br />
<br />
Imagine simulating the chain many times. If we collect all the outcomes at time t from all the chains, the histogram of this data would look like <math>\!\mu_t</math>.<br />
<br />
We can find the marginal probabilities as <math>\!\mu_n = \mu_0 P^n</math><br />
<br />
====Stationary Distribution====<br />
<br />
Let <math>\pi = (\pi_i \mid i \in \chi)</math> be a vector of non-negative numbers that sum to 1. (i.e. <math>\!\pi</math> is a pmf)<br />
<br />
If <math>\!\pi = \pi P</math>, then <math>\!\pi</math> is a stationary distribution, also known as an invariant distribution.<br />
<br />
====Limiting Distribution====<br />
<br />
A Markov chain has limiting distribution <math>\!\pi </math> if <math>\lim_{n \to \infty} P^n = \begin{bmatrix} \pi \\ \vdots \\ \pi \end{bmatrix}</math><br />
<br />
That is, <math>\!\pi_j = \lim_{n \to \infty}\left [ P^n \right ]_{ij}</math> exists and is independent of i.<br />
<br />
Here is an example:<br />
<br />
Suppose we want to find the limiting distribution of<br />
<math>P=\left(\begin{matrix}<br />
1/3&1/3&1/3\\<br />
1/4&3/4&0\\<br />
1/2&0&1/2<br />
\end{matrix}\right)</math><br />
<br />
We want to solve <math>\pi=\pi P</math> and we want <math>\displaystyle \pi_0 + \pi_1 + \pi_2 = 1</math><br /><br />
<br />
<math>\displaystyle \pi_0 = 1/3\pi_0 + 1/4\pi_1 + 1/2\pi_2</math><br /><br />
<math>\displaystyle \pi_1 = 1/3\pi_0 + 3/4\pi_1</math><br /><br />
<math>\displaystyle \pi_2 = 1/3\pi_0 + 1/2\pi_2</math><br /><br />
<br />
Solving the system of equations, we get <br /> <br />
<math>\displaystyle \pi_1 = 4/3\pi_0</math><br /><br />
<math>\displaystyle \pi_2 = 2/3\pi_0</math><br /><br />
<br />
So using our condition above, we have <math>\displaystyle \pi_0 + 4/3\pi_0 + 2/3\pi_0 = 1</math> and by solving we get <math>\displaystyle \pi_0 = 1/3</math><br />
<br />
Using this in our system of equations, we obtain: <br /><br />
<math>\displaystyle \pi_1 = 4/9</math><br /><br />
<math>\displaystyle \pi_2 = 2/9</math><br />
<br />
Thus, the limiting distribution is <math>\displaystyle \pi = (1/3, 4/9, 2/9)</math><br />
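This can be checked quickly in MATLAB (a small verification sketch): raising P to a large power gives a matrix whose rows are all approximately <math>\displaystyle \pi</math>, and <math>\displaystyle \pi</math> satisfies <math>\displaystyle \pi = \pi P</math>.<br />
<pre><br />
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];<br />
P^50                        % every row is approximately (1/3, 4/9, 2/9)<br />
pi_vec = [1/3 4/9 2/9];<br />
pi_vec * P                  % returns pi_vec again, so pi is stationary<br />
</pre><br />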
<br />
====Detailed Balance====<br />
<br />
<math>\!\pi</math> has the detailed balance property if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math><br />
<br />
'''Theorem'''<br />
<br />
If <math>\!\pi</math> satisfies detailed balance, then <math>\!\pi</math> is a stationary distribution.<br />
<br />
In other words, if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math>, then <math>\!\pi = \pi P</math><br />
<br />
'''Proof:''' <br />
<br />
<math>\!\pi P =<br />
\begin{bmatrix}\pi_1 & \pi_2 & \cdots & \pi_N\end{bmatrix} \begin{bmatrix}P_{11} & \cdots & P_{1N} \\ \vdots & \ddots & \vdots \\ P_{N1} & \cdots & P_{NN}\end{bmatrix}</math><br />
<br />
Observe that the j<sup>th</sup> element of <math>\!\pi P</math> is<br />
<br />
<math>\!\left [ \pi P \right ]_j = \pi_1 P_{1j} + \pi_2 P_{2j} + \dots + \pi_N P_{Nj}</math><br />
<br />
::<math>\! = \sum_{i=1}^N \pi_i P_{ij}</math><br />
<br />
::<math>\! = \sum_{i=1}^N P_{ji} \pi_j</math>, by the definition of detailed balance.<br />
<br />
::<math>\! = \pi_j \sum_{i=1}^N P_{ji}</math><br />
<br />
::<math>\! = \pi_j</math>, since <math>\textstyle\sum_{i=1}^N P_{ji} = 1</math> (the entries in each row of P sum to 1).<br />
<br />
So <math>\!\pi = \pi P</math>.<br />
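As a quick sanity check (a sketch using the two-state chain from the example below), any two-state chain satisfies detailed balance with its stationary distribution, and the theorem then guarantees <math>\!\pi = \pi P</math>:<br />
<pre><br />
P = [0.2 0.8; 0.6 0.4];<br />
pi_vec = [0.4286 0.5714];     % stationary distribution of this chain<br />
pi_vec(1)*P(1,2)              % pi_1 * P_12<br />
pi_vec(2)*P(2,1)              % pi_2 * P_21, equal to the line above (detailed balance)<br />
pi_vec * P                    % returns pi_vec (up to rounding), so pi = pi P<br />
</pre><br />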
<br />
<br />
'''Example'''<br />
<br />
Find the marginal distribution of <br />
<br />
[[File:MarkovSample.png|300px]]<br />
<br />
Start by generating the matrix P.<br />
<br />
<math>\!P = \begin{pmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{pmatrix}</math><br />
<br />
We must assume some starting value for <math>\mu_0</math><br />
<br />
<math>\!\mu_0 = \begin{pmatrix} 0.1 & 0.9 \end{pmatrix}</math><br />
<br />
For t = 1, the marginal distribution is<br />
<br />
<math>\!\mu_1 = \mu_0 P</math><br />
<br />
Notice that this <math>\mu</math> converges. <br />
<br />
If you repeatedly run:<br />
<br />
<math>\!\mu_{i+1} = \mu_i P</math><br />
<br />
It converges to <math>\mu = \begin{pmatrix} 0.4286 & 0.5714 \end{pmatrix}</math><br />
<br />
This can be seen by running the following Matlab code:<br />
P = [0.2 0.8; 0.6 0.4];<br />
mu = [0.1 0.9]; <br />
while 1 <br />
mu_old = mu; <br />
mu = mu * P;<br />
if mu_old == mu <br />
disp(mu);<br />
break;<br />
end<br />
end<br />
<br />
Another way of looking at this simple question is that we can see whether the ultimate pmf converges:<br />
<br />
Let <math>\hat{p_n}(1)=\frac{1}{n}\sum_{k=1}^n I(X_k=1)</math> denote the estimator of the stationary probability of state 1,<math>\hat{p_n}(2)=\frac{1}{n}\sum_{k=1}^n I(X_k=2)</math> denote the estimator of the stationary probability of state 2, where <math>\displaystyle I(X_k=1)</math> and <math>\displaystyle I(X_k=2)</math> are indicator variables which equal 1 if <math>X_k=1</math>(or <math>X_k=2</math> for the latter one).<br />
<br />
Matlab codes for this explanation is<br />
<br />
n=1;<br />
if rand<0.1<br />
x(1)=1;<br />
else<br />
x(1)=0;<br />
end<br />
p1(1)=sum(x)/n;<br />
p2(1)=1-p1(1);<br />
for i=2:10000<br />
n=n+1;<br />
if (x(i-1)==1&rand<0.2)|(x(i-1)==0&rand<0.6)<br />
x(i)=1;<br />
else<br />
x(i)=0;<br />
end<br />
p1(i)=sum(x)/n;<br />
p2(i)=1-p1(i); <br />
end<br />
plot(p1,'red');<br />
hold on;<br />
plot(p2)<br />
<br />
The results can be easily seen from the graph below:<br />
<br />
[[File:Stationary distribution.png|300px]]<br />
<br />
Additionally, we can plot the marginal distribution as it converges without estimating it. The following Matlab code shows this:<br />
<br />
%transition matrix<br />
P=[0.2 0.8; 0.6 0.4];<br />
%mu at time 0<br />
mu=[0.1 0.9];<br />
%number of points for simulation<br />
n=20;<br />
for i=1:n<br />
mu_a(i)=mu(1);<br />
mu_b(i)=mu(2);<br />
mu=mu*P;<br />
end<br />
t=[1:n];<br />
plot(t, mu_a, t, mu_b);<br />
hleg1=legend('state a', 'state b');<br />
<br />
[[File:Marginal distribution convergence.png|300px]]<br />
<br />
Note that there are chains with a stationary distribution whose marginal distribution does not converge to it: the chain need not ever approach the stationary distribution, so the stationary distribution is not a limiting distribution. An example of this is:<br />
<br />
<math>P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \mu_0 = \begin{pmatrix} 1/3 & 1/3 & 1/3 \end{pmatrix}</math><br />
<br />
<math>\!\mu_0</math> is a stationary distribution, so <math>\!\mu P</math> is the same for all iterations.<br />
<br />
But,<br />
<br />
<math>P^{1000} = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \ne \begin{pmatrix} \mu \\ \mu \\ \mu \end{pmatrix}</math><br />
<br />
So <math>\!\mu</math> is not a limiting distribution. Also, if<br />
<br />
<math>\mu = \begin{pmatrix} 0.2 & 0.1 & 0.7 \end{pmatrix}</math><br />
<br />
Then <math>\!\mu = \mu P</math> does not converge.<br />
<br />
This can be observed through the following Matlab code.<br />
<br />
P = [0 0 1; 1 0 0; 0 1 0];<br />
mu = [0.2 0.1 0.7]; <br />
for i= 1:4 <br />
mu = mu * P;<br />
disp(mu);<br />
end<br />
<br />
This outputs<br />
0.1000 0.7000 0.2000<br />
0.7000 0.2000 0.1000<br />
0.2000 0.1000 0.7000<br />
0.1000 0.7000 0.2000<br />
<br />
Note that <math>\!\mu_1 = \!\mu_4</math>, which indicates that <math>\!\mu</math> will cycle forever.<br />
<br />
This means that this chain has a stationary distribution, but is not limiting.<br />
<br />
===Page Rank===<br />
<br />
Page Rank was the original ranking algorithm used by Google's search engine to rank web pages.<ref><br />
http://ilpubs.stanford.edu:8090/422/<br />
</ref> The algorithm was created by the founders of Google, Larry Page and Sergey Brin as part of Page's PhD thesis. When a query is entered in a search engine, there are a set of web pages which are matched by this query, but this set of pages must be ordered by their "importance" in order to identify the most meaningful results first. Page Rank is an algorithm which assigns importance to every web page based on the links in each page.<br />
<br />
==== Intuition ====<br />
<br />
We can represent web pages by a set of nodes, where web links are represented as edges connecting these nodes. Based on our intuition, there are three main factors in deciding whether a web page is important or not.<br />
<br />
# A web page is important if many other pages point to it.<br />
# The more important a webpage is, the more weight is placed on its links.<br />
# The more links a webpage has, the less weight is placed on its links.<br />
<br />
====Modelling====<br />
<br />
We can model the set of links as a N-by-N matrix L, where N is the number of web pages we are interested in:<br />
<br />
<math>L_{ij} =<br />
\left\{<br />
\begin{array}{lr}<br />
1 : \text{if page j points to i}\\<br />
0 : \text{otherwise}<br />
\end{array}<br />
\right. <br />
</math><br />
<br />
<br />
<br />
The number of outgoing links from page j is<br />
<br />
<math>c_j = \sum_{i=1}^N L_{ij}</math><br />
<br />
For example, consider the following set of links between web pages:<br />
<br />
[[File:PageRank.png|250px]]<br />
<br />
According to the factors relating to importance of links, we can consider two possible rankings :<br />
<br />
<br />
<math>\displaystyle 3 > 2 > 1 > 4 </math> <br />
<br />
or<br />
<br />
<math>\displaystyle 3>1>2>4 </math> <br />
if we consider that the high importance of the link from 3 to 1 is more influential than the fact that there are two outgoing links from page 1 and only one from page 2.<br />
<br />
<br />
We have <math>L = \begin{bmatrix} <br />
0 & 0 & 1 & 0 \\ <br />
1 & 0 & 0 & 0 \\ <br />
1 & 1 & 0 & 1 \\<br />
0 & 0 & 0 & 0<br />
\end{bmatrix}</math>, and <math>c = \begin{pmatrix}2 & 1 & 1 & 1\end{pmatrix} </math><br />
<br />
We can represent the ranks of web pages as the vector P, where the i<sup>th</sup> element is the rank of page i:<br />
<br />
<math>P_i = (1-d) + d\sum_j \frac{L_{ij}}{c_j} P_j</math><br />
<br />
Here we take the sum of the weights of the incoming links, where links are reduced in weight if the linking page has a lot of outgoing links, and links are increased in weight if the linking page has a lot of incoming links. <br />
<br />
We don't want to completely ignore pages with no incoming links, which is why we add the constant (1 - d).<br />
<br />
If <br />
<br />
<math>L = \begin{bmatrix} L_{11} & \cdots & L_{1N} \\<br />
\vdots & \ddots & \vdots \\<br />
L_{N1} & \cdots & L_{NN} \end{bmatrix}</math><br />
<br />
<math>D = \begin{bmatrix} c_1 & \cdots & 0 \\<br />
\vdots & \ddots & \vdots \\<br />
0 & \cdots & c_N \end{bmatrix}</math><br />
<br />
Then <math>D^{-1} = \begin{bmatrix} c_1^{-1} & \cdots & 0 \\<br />
\vdots & \ddots & \vdots \\<br />
0 & \cdots & c_N^{-1} \end{bmatrix}</math><br />
<br />
<math>\!P = (1-d)e + dLD^{-1}P</math><br />
<br />
where <math>\!e = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}</math> is the vector with all 1's<br />
<br />
To simplify the problem, we let <math>\!e^T P = N \Rightarrow \frac{e^T P}{N} = 1</math>. This means that the average importance of all pages on the internet is 1.<br />
<br />
Then<br />
<math>\!P = (1-d)\frac{ee^TP}{N} + dLD^{-1}P</math><br />
::<math>\! = \left [ (1-d)\frac{ee^T}{N} + dLD^{-1} \right ] P</math><br />
::<math>\! = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ] P</math>, where <math> E </math> is an NxN matrix filled with ones.<br />
<br />
Let <math>\!A = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ]</math><br />
<br />
Then <math>\!P = AP</math>.<br />
<br />
<br />
Note that P is a stationary distribution and, more importantly, P is an eigenvector of A with eigenvalue 1. Therefore, we can find the ranks of all web pages by solving this equation for P. <br />
<br />
We can find the vector P for the example above, using the following Matlab code:<br />
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];<br />
D = [2 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1];<br />
d = 0.8 ;% pages with no links get a weight of 0.2<br />
N = 4 ;<br />
<br />
A = ((1-d)/N) * ones(N) + d * L * inv(D);<br />
[EigenVectors, EigenValues] = eigs(A)<br />
s=sum(EigenVectors(:,1));% we should note that the average entry of P should be 1 according to our assumption<br />
P=(EigenVectors(:,1))/s*N<br />
<br />
This outputs:<br />
<br />
EigenVectors =<br />
-0.6363 0.7071 0.7071 -0.0000 <br />
-0.3421 -0.3536 + 0.3536i -0.3536 - 0.3536i -0.7071 <br />
-0.6859 -0.3536 - 0.3536i -0.3536 + 0.3536i 0.0000 <br />
-0.0876 0.0000 + 0.0000i 0.0000 - 0.0000i 0.7071 <br />
<br />
<br />
EigenValues =<br />
1.0000 0 0 0 <br />
0 -0.4000 - 0.4000i 0 0 <br />
0 0 -0.4000 + 0.4000i 0 <br />
0 0 0 0.0000 <br />
<br />
P =<br />
<br />
1.4528<br />
0.7811<br />
1.5660<br />
0.2000<br />
<br />
Note that there is an eigenvector with eigenvalue 1. <br />
The reason an eigenvector with eigenvalue 1 always exists is that A is a stochastic matrix (in this example each of its columns sums to 1). <br />
<br />
Thus our vector P is <math> <br />
\begin{bmatrix}1.4528 \\ 0.7811 \\ 1.5660\\ 0.2000 \end{bmatrix}</math><br />
<br />
However, this method is not practical, because there are simply too many web pages on the internet. So instead Google uses a method to approximate an eigenvector with eigenvalue 1.<br />
<br />
Note that page three has the rank with highest magnitude and page four has the rank with lowest magnitude, as expected.<br />
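A minimal sketch of the kind of iterative approximation referred to above, assuming the standard power-iteration idea (repeatedly apply A to a starting vector and rescale, so the component along the eigenvalue-1 eigenvector eventually dominates); the number of iterations and the starting vector are arbitrary choices. It reuses the 4-page example.<br />
<pre><br />
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];<br />
D = diag([2 1 1 1]);<br />
d = 0.8; N = 4;<br />
A = ((1-d)/N) * ones(N) + d * L * inv(D);<br />
<br />
P = ones(N,1);              % start from "all pages equally important"<br />
for k = 1:100               % power iteration<br />
    P = A * P;<br />
    P = P / sum(P) * N;     % rescale so the average importance stays 1<br />
end<br />
P                           % approximately (1.4528, 0.7811, 1.5660, 0.2000)<br />
</pre><br />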
<br />
==''' Markov Chain Monte Carlo - Metropolis-Hastings - October 25th, 2011'''==<br />
<br />
We want to find <math> \int h(x)f(x)\, \mathrm dx </math>, but we don't know how to sample from <math>\,f</math>.<br />
<br />
We have seen simple techniques before. This one is used in real life.<br />
It consists of the search of a Markov Chain such that its stationary distribution is <math>\,f</math>.<br />
<br />
==== Main procedure ====<br />
<br />
Let us suppose that <math>\,q(y|x)</math> is a friendly distribution: we can sample from this function.<br />
<br />
1. Initialize the chain with an arbitrary starting point <math>\,x_{0}</math> and set <math>\,i=0</math>.<br />
<br />
2. Draw a point from <math>\,q(y|x)</math> i.e. <math>\,Y \backsim q(y|x_{i})</math>.<br />
<br />
3. Evaluate <math>\,r(x,y)=min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br />
<br />
<br />
4. Draw a point <math>\,U \backsim Unif[0,1]</math>.<br />
<br />
5. <math>\,x_{i+1}=\begin{cases}y & \text{ if } U<r \\x_{i} & \text{ otherwise } \end{cases} </math>.<br />
<br />
6. <math>\,i=i+1</math>. Go back to 2.<br />
<br />
==== Remark 1 ====<br />
<br />
A very common choice for <math>\,q(y|x)</math> is <math>\,N(y;x,b^{2})</math>, a normal distribution centered at the current point.<br />
<br />
Note : In this case <math>\,q(y|x)</math> is symmetric i.e. <math>\,q(y|x)=q(x|y)</math>.<br />
<br />
(Because <math>\,q(y|x)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math> and <math>\,(y-x)^{2}=(x-y)^{2}</math>).<br />
<br />
Thus we have <math>\,\frac{q(x|y)}{q(y|x)}=1</math>, which implies :<br />
<br />
<math>\,r(x,y)=min\left\{\frac{f(y)}{f(x)},1\right\}</math>.<br />
<br />
In general, if <math>\,q(x|y)</math> is symmetric then the algorithm is called the Metropolis algorithm, in reference to the original algorithm published in 1953<ref>http://en.wikipedia.org/wiki/Equations_of_State_Calculations_by_Fast_Computing_Machines</ref>.<br />
<br />
<br />
<br />
====Remark 2====<br />
<br />
The value y is accepted if <math>\,u<min\left\{\frac{f(y)}{f(x)},1\right\}</math> so it is accepted with the probability <math>\,min\left\{\frac{f(y)}{f(x)},1\right\}</math>.<br />
<br />
Thus, if <math>\,f(y)>f(x)</math>, then <math>\,y</math> is always accepted.<br />
<br />
The higher that value of the pdf is in the vicinity of a point <math>\,y_1</math>, the more likely it is that a random variable will take on values around <math>\,y_1</math>. As a result it makes sense that we would want a high probability of acceptance for points generated near <math>\,y_1</math>.<br />
<br />
====Remark 3====<br />
<br />
One strength of the Metropolis-Hastings algorithm is that normalizing constants, which are often quite difficult to determine, can be cancelled out in the ratio <math> r </math>. For example, consider the case where we want to sample from the beta distribution, which has the pdf:<br />
<br />
<math><br />
\begin{align}<br />
f(x;\alpha,\beta)& = \frac{1}{\mathrm{B}(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\end{align}<br />
</math><br />
<br />
The beta function, ''B'', appears as a normalizing constant, but it cancels out of the ratio <math> r </math> by construction of the method.<br />
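A minimal MATLAB sketch of this remark (an illustration with arbitrary parameter choices, not from the lecture): sample from Beta(2,5) with a Metropolis-Hastings step whose proposal is Uniform(0,1). This proposal satisfies <math>\,q(x|y)=q(y|x)</math>, and only the unnormalized density <math>\,x^{\alpha-1}(1-x)^{\beta-1}</math> is ever evaluated, so <math>\,\mathrm{B}(\alpha,\beta)</math> is never needed.<br />
<pre><br />
a = 2; b = 5;                        % Beta(2,5)<br />
f = @(x) x^(a-1) * (1-x)^(b-1);      % unnormalized density; 1/B(a,b) omitted since it cancels in r<br />
x(1) = rand;<br />
for i = 2:10000<br />
    y = rand;                        % proposal: Uniform(0,1), so q(x|y)/q(y|x) = 1<br />
    r = min(f(y)/f(x(i-1)), 1);<br />
    u = rand;<br />
    if u < r<br />
        x(i) = y;                    % accept<br />
    else<br />
        x(i) = x(i-1);               % reject: stay at the current point<br />
    end<br />
end<br />
hist(x(5000:end), 30)                % discard burn-in and compare with the Beta(2,5) shape<br />
</pre><br />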
<br />
====Example====<br />
<br />
<math>\,f(x)=\frac{1}{\pi}\frac{1}{1+x^{2}}</math> (the standard Cauchy density)<br />
<br />
Then, we have <math>\,f(x)\propto\frac{1}{1+x^{2}}</math>.<br />
<br />
And let us take <math>\,q(x|y)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math>.<br />
<br />
Then <math>\,q(x|y)</math> is symmetric.<br />
<br />
Therefore <math>\,r(x,y)</math> simplifies. We get:<br />
<br />
<math>\,\begin{align}<br />
\displaystyle r(x,y) <br />
& =min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} \\<br />
& =min\left\{\frac{f(y)}{f(x)},1\right\} \\<br />
& =min\left\{ \frac{ \frac{1}{1+y^{2}} }{ \frac{1}{1+x^{2}} },1\right\}\\<br />
& =min\left\{ \frac{1+x^{2}}{1+y^{2}},1\right\}\\<br />
\end{align}<br />
</math>.<br />
<br />
<br />
<br />
The Matlab code of the algorithm is the following :<br />
<br />
<pre><br />
clear all<br />
close all<br />
clc<br />
b=2;<br />
x(1)=randn;<br />
for i=2:10000<br />
y=b*randn+x(i-1);<br />
r=min((1+x(i-1)^2)/(1+y^2),1);<br />
u=rand;<br />
if u<r<br />
x(i)=y;<br />
else<br />
x(i)=x(i-1);<br />
end<br />
<br />
end<br />
hist(x(5000:end));<br />
%The Markov Chain usually takes some time to converge; this initial period is known as the "burn-in" time.<br />
%Therefore, we don't display the first 5000 points because they don't yet show the limiting behaviour of the Markov Chain.<br />
</pre><br />
<br />
As we can see, the choice of the value of b is made by us.<br />
<br />
Changing this value has a significant impact on the results we obtain. There is a pitfall when b is too big or too small.<br />
<br />
Example with <math>\,b=0.1</math> (the second graph in each pair is the trace plot obtained by running j=5000:10000; plot(j,x(5000:10000))):<br />
<br />
[[File:redaccoursb01.JPG|300px]] [[File:001Metr.PNG|300px]]<br />
<br />
With <math>\,b=0.1</math>, the chain takes small steps so the chain doesn't explore enough of the sample space. It doesn't give an accurate report of the function we want to sample.<br />
<br />
<br />
<br />
Example with <math>\,b=10</math> :<br />
<br />
[[File:redaccoursb10.JPG|300px]] [[File:010metro.PNG|300px]]<br />
<br />
With <math>\,b=10</math>, proposed jumps are large and are very unlikely to be accepted since they deviate far from the current point (i.e. <math>\ u \ge r </math> most of the time, so <math>\,y</math> is rejected and <math>\,x(i)=x(i-1)</math>); hence most sample points stay fairly close to the origin.<br />
A trace plot that resembles white noise (as in the case of <math>\,b=2</math>) indicates better sampling, as more of the sample space is covered and proposals are accepted at a reasonable rate. For <math>\,b=0.1</math>, we have lots of small accepted jumps and most values are not repeated, but the chain explores only a narrow region, so the stationary distribution is less obvious; whereas in the <math>\,b=10</math> case, many points remain around 0. Approximately 73% of the iterations kept the previous value x(i-1) (i.e. the proposal was rejected).<br />
<br />
<br />
Example with <math>\,b=2</math> :<br />
<br />
[[File:redaccoursb2.JPG|300px]] [[File:100metr.PNG|300px]]<br />
<br />
With <math>\,b=2</math>, we get a more accurate result as we avoid these extremes. Approximately 37% were selected as x(i-1).<br />
<br />
<br />
If the sample from the Markov Chain starts to look like the target distribution quickly, we say the chain is mixing well.<br />
<br />
==''' Theory and Applications of Metropolis-Hastings - October 27th, 2011'''==<br />
<br />
As mentioned in the previous section, the idea of the Metropolis-Hastings (MH) algorithm is to produce a Markov chain that converges to a stationary distribution <math>f</math> which we are interested in sampling from.<br />
<br />
====Convergence====<br />
<br />
One important fact to check is that <math>\displaystyle f</math> is indeed a stationary distribution in the MH scheme. For this, we can appeal to the implications of the detailed balance property:<br />
<br />
Given a probability vector <math>\!\pi</math> and a transition matrix <math>\displaystyle P</math>, <math>\!\pi</math> has the detailed balance property if <math>\!\pi_iP_{ij} = P_{ji}\pi_j</math><br />
<br />
If <math>\!\pi</math> satisfies detailed balance, then it is a stationary distribution.<br />
<br />
The above definition applies to the case where the states are discrete. In the continuous case, <math>\displaystyle f</math> satisfies detailed balance if <math>\displaystyle f(x)p(x,y)=f(y)p(y,x)</math>. Where <math>\displaystyle p(x,y)</math> and <math>\displaystyle p(y,x)</math> are the probabilities of transitioning from x to y and y to x respectively. If we can show that <math>\displaystyle f</math> has the detailed balance property, we can conclude that it is a stationary distribution. Because <math>\int^{}_y f(y)p(y,x)dy=\int^{}_y f(x)p(x,y)dy=f(x)</math>.<br />
<br />
In the MH algorithm, we use a proposal distribution to generate y~<math>\displaystyle q(y|x)</math>, and accept y with probability <math>min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br />
<br />
Suppose, without loss of generality, that <math>\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)} \le 1</math>. This implies that <math>\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)} \ge 1</math><br />
<br />
Let <math>\,r(x,y)</math> be the chance of accepting point y given that we are at point x.<br />
<br />
So <math>\,r(x,y) = min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} = \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)}</math><br />
<br />
Let <math>\,r(y,x)</math> be the chance of accepting point x given that we are at point y.<br />
<br />
So <math>\,r(y,x) = min\left\{\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)},1\right\} = 1</math><br />
<br />
<br />
<math>\,p(x,y)</math> is the probability of generating and accepting y, while at point x.<br />
<br />
So <math>\,p(x,y) = q(y|x)r(x,y) = q(y|x) \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)} = \frac{f(y)q(x|y)}{f(x)}</math><br />
<br />
<br />
<math>\,p(y,x)</math> is the probability of generating and accepting x, while at point y.<br />
<br />
So <math>\,p(y,x) = q(x|y)r(y,x) = q(x|y)</math><br />
<br />
<br />
<math>\,f(x)p(x,y) = f(x)\frac{f(y)q(x|y)}{f(x)} = f(y)q(x|y) = f(y)p(y,x)</math><br />
<br />
Thus, detailed balance holds.<br />
:i.e. <math>\,f(x)</math> is stationary distribution<br />
<br />
It can be shown (although not here) that <math>f</math> is a limiting distribution as well. Therefore, the MH algorithm generates a sequence whose distribution converges to <math>f</math>, the target.<br />
<br />
====Implementation====<br />
<br />
In the implementation of MH, the proposal distribution is commonly chosen to be symmetric, which simplifies the calculations and makes the algorithm more intuitively understandable. The MH algorithm can usually be regarded as a random walk along the distribution we want to sample from. Suppose we have a distribution <math>f</math>:<br />
<br />
[[File:Standard normal distribution.gif]]<br />
<br />
Suppose we start the walk at point <math>x</math>. The point <math>y_{1}</math> is in a denser region than <math>x</math>, therefore, the walk will always progress from <math>x</math> to <math>y_{1}</math>. On the other hand, <math>y_{2}</math> is in a less dense region, so it is not certain that the walk will progress from <math>x</math> to <math>y_{2}</math>. In terms of the MH algorithm:<br />
<br />
<math>r(x,y_{1})=min(\frac{f(y_{1})}{f(x)},1)=1</math> since <math>f(y_{1})>f(x)</math>. Thus, any generated value with a higher density will be accepted.<br />
<br />
<math>r(x,y_{2})=\frac{f(y_{2})}{f(x)}</math>. The lower the density of <math>y_{2}</math> is, the less chance it will have of being accepted.<br />
<br />
A certain class of proposal distributions can be written in the form:<br />
<br />
<math>\,y|x_i = x_i + \epsilon_i</math><br />
<br />
where <math>\,\epsilon_i</math> is drawn from a distribution whose density depends only on the distance <math>\,|y-x_i|</math><br />
<br />
The density depends only on the distance between the current point and the next one (which can be seen as the "step" being taken). These proposal distributions give the Markov chain the random walk nature. The normal distribution that we frequently use in our examples satisfies the above definition.<br />
<br />
In actual implementations of the MH algorithm, the proposal distribution needs to be chosen judiciously, because not all proposals will work well with all target distributions we want to sample from. Take a trimodal distribution for example:<br />
<br />
[[File:trimodal.jpg]]<br />
<br />
If we choose the proposal distribution to be a standard normal as we have done before, problems will arise. The low densities between the peaks means that the MH algorithm will almost never walk to any points generated in these regions and get stuck at one peak. One way to address this issue is to increase the variance, so that the steps will be large enough to cross the gaps. Of course, in this case, it would probably be beneficial to come up with a different proposal function. As a rule of thumb, such functions should result in an approximately 50% acceptance rate for generated points.<br />
<br />
====Simulated Annealing====<br />
<br />
Metropolis-Hastings is very useful in simulation methods for solving optimization problems. One such application is simulated annealing, which addresses the problem of minimizing a function <math>h(x)</math>. This method will not always produce the global solution, but it is intuitively simple and easy to implement.<br />
<br />
Consider <math>e^{\frac{-h(x)}{T}}</math>; maximizing this expression is equivalent to minimizing <math>h(x)</math>. For example, if <math>h(x)=(x-\mu)^2</math>, so that <math>\mu</math> is the minimizer of <math>h</math>, then the function to maximize is proportional to a Gaussian density <math>e^{-\frac{(x-\mu)^2}{T}}</math>. When many samples are taken from this distribution, their mean converges to the desired value <math>\mu</math>. The annealing comes into play by lowering T (the temperature) as the sampling progresses, making the distribution narrower. The steps of simulated annealing are outlined below:<br />
<br />
1. start with a random <math>x</math> and set T to a large number<br />
<br />
2. generate <math>y</math> from a proposal distribution <math>q(y|x)</math>, which should be symmetric<br />
<br />
3. accept <math>y</math> with probability <math>min(\frac{f(y)}{f(x)},1)</math><br />
<br />
4. decrease T, and then go to step 2<br />
<br />
The following plot and Matlab code illustrates the simulated annealing procedure as temperature ''T'', the variance, decreases for a Gaussian distribution with zero mean. Starting off with a large value for the temperature ''T'' allows the Metropolis-Hastings component of the procedure to capture the mean, before gradually decreasing the temperature ''T'' in order to converge to the mean. <br />
<br />
[[File:Simulated annealing illustration.png]]<br />
<br />
x=-10:0.1:10;<br />
mu=0;<br />
T=5;<br />
colour = ['b', 'g', 'm', 'r', 'k'];<br />
for i=1:5<br />
pdfNormal=normpdf(x, mu, T);<br />
plot(x, pdfNormal, colour(i));<br />
T=T-1;<br />
hold on<br />
end<br />
hleg1=legend('T=5', 'T=4', 'T=3', 'T=2', 'T=1');<br />
title('Simulated Annealing Illustration');<br />
<br />
=='''References'''==<br />
<br />
<references/><br />
<br />
=='''Simulated Annealing and Gibbs Sampling - November 1, 2011'''==<br />
<br />
continued from previous lecture...<br />
<br />
We will now look at a couple cases where <math> \displaystyle h(y) > h(x) </math> or <math> \displaystyle h(y) < h(x) </math>, and explore whether to accept or reject <math> y </math>.<br />
<br />
Recall r(x,y)=min{<math>\frac{f(y)}{f(x)}</math>,1} where <math> \frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}}</math>, and r(x,y) represents the probability of accepting <math>y</math>.<br />
<br />
====Cases====<br />
<br />
Case a)<br />
Suppose <math> \displaystyle h(y) < h(x) </math>. Since we want to find the minimum value for <math>\displaystyle h(x) </math>, and the point <math>\displaystyle y </math> creates a lower value than our previous point, we accept the new point. Mathematically, <math>\displaystyle h(y) < h(x) </math> implies that:<br />
<br />
<math> \frac{f(y)}{f(x)} > 1 </math>. Therefore,<br />
<math> \displaystyle r = 1 </math>.<br />
So, we will always accept <math>\displaystyle y </math>.<br />
<br />
Case b)<br />
Suppose <math> \displaystyle h(y) > h(x) </math>. This is bad, since our goal is to minimize <math>\displaystyle h(x) </math>. However, we may still accept <math>\displaystyle y </math> with some chance:<br />
<br />
<math> \frac{f(y)}{f(x)} < 1 </math>. Therefore,<br />
<math>\displaystyle r < 1 </math>.<br />
So, we may accept <math>\displaystyle y </math> with probability <math>\displaystyle r </math>.<br />
<br />
<br />
Next, we will look at these cases as <math>\displaystyle T\to0 </math>.<br />
<br />
As <math>\displaystyle T\to0 </math> and case a) happens, <math> e^{\frac{h(x)-h(y)}{T}} </math> approaches infinity, so we will always accept <math>\displaystyle y </math>.<br />
<br />
As <math>\displaystyle T\to0 </math> and case b) happens, <math> e^{\frac{h(x)-h(y)}{T}} </math> approaches zero, so the probability that <math>\displaystyle y </math> will be accepted gets extremely small.<br />
<br />
It is worth noting that if we simply start with a small value of T, we may end up rejecting nearly all the generated points and get stuck (due to case b)) at a local minimum of <math>\displaystyle h(x) </math> that is not the global minimum over the whole domain. It is therefore necessary to start with a large value of T in order to explore the whole function. At the same time, a reasonable starting value <math>x_0</math> is helpful (at least one that does not differ from the optimum by too much). <br />
<br />
=====Example=====<br />
<br />
Let <math>\displaystyle h(x) = (x-2)^2 </math>.<br />
The graph of it is:<br />
[[File:PCh(x).jpg|center|500]]<br />
<br />
Then, <math> e^{\frac{-h(x)}{T}} = e^{\frac{-(x-2)^2}{T}} </math> . Take an initial value of T = 20. A graph of this is:<br />
[[File:PC-highT.jpg|center|500]]<br />
<br />
<br />
In comparison, we look a graph of T = 0.2:<br />
[[File:PC-lowT.jpg|center|500]]<br />
<br />
One can see that with a low value of T the curve is essentially 0 away from the peak, so the acceptance ratio <math>\frac{f(y)}{f(x)}</math> is almost always either close to 0 or greater than 1, while a bigger value of T gives smoother transitions in the graph.<br />
<br />
The MATLAB code for the above graphs are:<br />
<pre><br />
ezplot('(x-2)^2',[-6,10])<br />
ezplot('exp((-(x-2)^2)/20)',[-6,10])<br />
ezplot('exp((-(x-2)^2)/0.2)',[-6,10])<br />
</pre><br />
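A minimal MATLAB sketch of simulated annealing applied to this <math>\displaystyle h(x)=(x-2)^2 </math> (an illustration with arbitrary choices: normal proposal with b = 1, geometric cooling by a factor of 0.99, 1000 iterations; none of these settings come from the lecture):<br />
<pre><br />
h = @(x) (x-2)^2;             % function to minimize<br />
T = 20;                       % initial (large) temperature<br />
b = 1;                        % proposal standard deviation<br />
x(1) = 10*rand - 5;           % random start in [-5,5]<br />
for i = 2:1000<br />
    y = b*randn + x(i-1);     % symmetric proposal<br />
    r = min(exp((h(x(i-1)) - h(y))/T), 1);<br />
    if rand < r<br />
        x(i) = y;             % accept<br />
    else<br />
        x(i) = x(i-1);        % reject<br />
    end<br />
    T = 0.99*T;               % cool down<br />
end<br />
x(end)                        % close to the minimizer x = 2<br />
</pre><br />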
<br />
=====Travelling Salesman Problem=====<br />
<br />
The simulated annealing method can be applied to compute the solution to the travelling salesman problem. Suppose there are N cities and the salesman only have to visit each city once. The objective is to find out the shortest path (i.e. shortest total length of journey) connecting the cities. An algorithm using simulated annealing on the problem can be found here ([http://www.cs.ubbcluj.ro/~csatol/mestint/pdfs/Numerical_Recipes_Simulated_Annealing.pdf Reference]).<br />
<br />
===Gibbs Sampling===<br />
<br />
Gibbs sampling is another Markov chain Monte Carlo method, similar to Metropolis-Hastings. There are two main differences between Metropolis-Hastings and Gibbs sampling. First, the candidate state is always accepted as the next state in Gibbs sampling. Second, it is assumed that the full conditional distributions are known, i.e. <math>P(X_i=x|X_j=x_j, \forall j\neq i)</math> for all <math>\displaystyle i</math>. The idea is that it is easier to sample from conditional distributions, which are sets of one dimensional distributions, than to sample from a joint distribution, which is a higher dimensional distribution. Gibbs sampling is a way to turn sampling from the joint distribution into sampling from a set of conditional distributions. <br />
<br />
<b>Advantages:</b><br /><br />
- sampling from conditional distributions may be easier than sampling from joint distributions<br />
<br />
<b>Disadvantages:</b><br /><br />
- we do not necessarily know the conditional distributions<br />
<br />
For example, if we want to sample from <math>\, f_{X,Y}(x,y)</math>, we need to know how to sample from <math>\, f_{X|Y}(x|y)</math> and <math>\, f_{Y|X}(y|x)</math>. Suppose the chain starts with <math>\,(X_0,Y_0)</math> and <math>(X_1,Y_1), \dots , (X_n,Y_n)</math> have been sampled. Then,<br />
<br />
<math>\, X_{n+1} \sim f_{X|Y}(x|Y_n) \quad \text{and then} \quad Y_{n+1} \sim f_{Y|X}(y|X_{n+1})</math><br />
<br />
Gibbs sampling turns a multi-dimensional distribution into a set of one-dimensional distributions. If we want to sample from <br />
<br />
<math>P_{X^1,\dots ,X^p}(x^1,\dots ,x^p)</math> <br />
<br />
and the full conditionals are known, then:<br />
<br />
<math>X^1_{n+1}=f(X^1|X^2_n,\dots ,X^p_n)</math><br />
<br />
<math>X^2_{n+1}=f(X^2|X^1_{n+1},X^3_n\dots ,X^p_n)</math><br />
<br />
<math>\vdots</math><br />
<br />
<math>X^{p-1}_{n+1}=f(X^{p-1}|X^1_{n+1},\dots ,X^{p-2}_{n+1},X^p_n)</math><br />
<br />
<math>X^p_{n+1}=f(X^p|X^1_{n+1},\dots ,X^{p-1}_{n+1})</math><br />
<br />
With Gibbs sampling, we can simulate <math>\displaystyle n</math> random variables sequentially from <math>\displaystyle n</math> univariate conditionals rather than generating one <math>n</math>-dimensional vector using the full joint distribution, which could be a lot more complicated.<br />
<br />
Computational inference deals with probabilistic graphical models. Gibbs sampling is useful here: graphical models show the dependence relations among random variables. For instance, Bayesian networks are graphical models represented using directed acyclic graphs. Looking at such a graphical model tells us on which random variable the distribution of a certain random variable depends (i.e. its parent). The model can be used to "factor" a joint distribution into conditional distributions.<br />
<br />
[[File:stat341_nov_1_graphical_model.png|200px|thumb|left|Sample graphical model of five RVs]]<br />
<br />
For example, consider the five random variables A, B, C, D, and E. Without making any assumptions about dependence relations among them, all we know is <br />
<br />
<math>\, P(A,B,C,D,E)=</math><math>\, P(A|B,C,D,E) P(B|C,D,E) P(C|D,E) P(D|E) P(E)</math><br />
<br />
However, if we know the relation between the random variables, e.g. given the graphical model on the left, we can simplify this expression:<br />
<br />
<math>\, P(A,B,C,D,E)=P(A) P(B|A) P(C|A) P(D|C) P(E|C)</math><br />
<br />
Although the joint distribution may be very complicated, the conditional distributions may not be.<br />
<br />
Check out the following notes on Gibbs sampling:<br />
<br />
* [http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf MCMC and Gibbs Sampling, MIT Lecture Notes]<br />
* chapter 7.4 in [http://stat.fsu.edu/~anuj/pdf/classes/CompStatI09/BOOK.pdf Notes on Computational Methods in Statistics]<br />
* chapter 4.9 in [http://www.ma.hw.ac.uk/~foss/StochMod/Ross_S.pdf Introduction to Probability Models] by Sheldon Ross<br />
<br />
====Example of Gibbs sampling: Multi-variate normal====<br />
<br />
We'd like to generate samples from a bivariate normal with parameters<br />
<br />
<math>\mu = \begin{bmatrix}1\\ 2 \end{bmatrix} = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}</math> <br />
and <math>\sigma = \begin{bmatrix}1 && 0.9 \\ 0.9 && 1 \end{bmatrix}= \begin{bmatrix}1 && \rho \\ \rho && 1 \end{bmatrix}</math><br />
<br />
The conditional distributions of multi-variate normal random variables are also normal:<br />
<br />
<math>\, f(x_1|x_2)=N(\mu_1 + \rho(x_2-\mu_2), 1-\rho^2)</math><br />
<br />
<math>\, f(x_2|x_1)=N(\mu_2 + \rho(x_1-\mu_1), 1-\rho^2)</math><br />
<br />
(In general, if the joint distribution has parameters<br />
<br />
<math>\mu = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}</math> and <math>\Sigma = \begin{bmatrix} \Sigma _{1,1} && \Sigma _{1,2} \\ \Sigma _{2,1} && \Sigma _{2,2} \end{bmatrix}</math><br />
<br />
then the conditional distribution <math>\, f(x_1|x_2)</math> has mean <math>\, \mu_1 + \Sigma _{1,2}(\Sigma _{2,2})^{-1}(x_2-\mu_2)</math> and variance <math>\, \Sigma _{1,1}-\Sigma _{1,2}(\Sigma _{2,2})^{-1}\Sigma _{2,1}</math>.)<br />
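A minimal MATLAB sketch of the Gibbs sampler for the bivariate normal above (<math>\mu = (1,2)</math>, <math>\rho = 0.9</math>); each coordinate is drawn from its normal full conditional in turn:<br />
<pre><br />
mu = [1 2]; rho = 0.9;<br />
n = 10000;<br />
x = zeros(n,2);<br />
x(1,:) = [0 0];                 % arbitrary starting point<br />
s = sqrt(1 - rho^2);            % conditional standard deviation<br />
for i = 2:n<br />
    % sample x1 given the current x2<br />
    x(i,1) = mu(1) + rho*(x(i-1,2) - mu(2)) + s*randn;<br />
    % sample x2 given the new x1<br />
    x(i,2) = mu(2) + rho*(x(i,1) - mu(1)) + s*randn;<br />
end<br />
mean(x)                         % approximately (1, 2)<br />
C = cov(x); C(1,2)              % approximately 0.9<br />
</pre><br />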
<br />
=='''Principal Component Analysis (PCA) - November 8, 2011'''==<br />
<br />
Principal component analysis is a roughly 100-year-old technique used for dimensionality reduction of data. As the dimension increases, the number of data points needed to sample the space accurately grows exponentially.<br />
<br />
<math>\, x\in \mathbb{R}^D \rarr y\in \mathbb{R}^d</math><br />
<br />
<math>\ d \le D </math><br />
<br />
We want to transform <math>\, x</math> to <math>\, y</math> by reducing dimensionality yet losing little information.<br />
<br />
For example, consider dots in a three dimensional space. By unrolling the 2D manifold that they are on, we can reduce the data to 2D while losing little information. Note: This is not an application of PCA, but simply illustrates one way we can reduce dimensionality.<br />
<br />
Principal Component Analysis lets us reduce data to a linear subspace of its original space. It works best when the data lies in, or close to, a lower dimensional subspace of its original space.<br />
<br />
<br />
'''Probabilistic View'''<br />
<br />
We can see data set <math>\, x</math> as a high dimensional random variable governed by a low dimensional random variable <math>\, y</math>. Given <math>\, x</math>, we are trying to estimate <math>\, y</math>.<br />
<br />
We can see this in 2D linear regression, as the locations of data points in a scatter plot are governed by its approximate linear regression. The subspace that we have reduced the data to here is in the direction of variation in the data.<br />
<br />
'''Principal Component Analysis'''<br />
<br />
Principal component analysis is an orthogonal linear transform on a data set. It transforms the data coordinates to associate with a new set of orthogonal vectors, each representing the direction of maximum variance of the data. E.g. the first principal component is the direction of the maximum variance, the second principal component is the direction of the maximum variance orthogonal to the first vector, the third principal component is the direction of the maximum variance orthogonal to the first and second vectors, and so on, until we have D vectors, where D is the dimension of the original data.<br />
<br />
Suppose we have data represented by <math>\, X = \begin{bmatrix}<br />
x^1\\<br />
x^2\\<br />
\vdots \\ <br />
x^D<br />
\end{bmatrix}<br />
\in \mathbb{R}^{D \times n} </math><br />
<br />
For some <math>\, W = \begin{bmatrix}<br />
w^1\\<br />
w^2\\<br />
\vdots \\ <br />
w^D<br />
\end{bmatrix}<br />
\in \mathbb{R}^{D} </math><br />
<br />
The projection of the data onto the direction <math>\, W </math> is<br />
<br />
<math>\, w^1x^1 + w^2x^2 + \cdots + w^Dx^D = W^TX</math><br />
<br />
To find the first principal component, we want to maximize the variance of <math>\,W^TX</math>.<br />
<br />
The variance of <math>\,W^TX</math> is <math>\,W^TSW</math> where <math>\,S</math> is the covariance matrix of X.<br />
<br />
<math>\, S = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^T</math>, where <math>\, x_i </math> denotes the i-th data point (column of X) and <math>\, \mu </math> is the sample mean.<br />
<br />
<br />
So we have to solve the problem<br />
<br />
<math>\, \text {Max } W^TSW</math><br />
<br />
<math>\, \text{such that } W^TW = 1</math>.<br />
<br />
<br />
We restrict W to unit vectors, as otherwise the maximum is unbounded. We are only looking for the direction of the vector; its actual scale is unnecessary.<br />
<br />
Using the method of Lagrange multipliers, we have<br />
<br />
<math>\,L(W, \lambda) = W^TSW - \lambda(W^TW - 1) </math><br />
<br />
We set<br />
<br />
<math>\, \frac{\partial L}{\partial W} = 0 </math><br />
<br />
<br />
<br />
Note that <math>\, W^TSW</math> is a quadratic form. So we have<br />
<br />
<br />
<br />
<math>\, \frac{\partial L}{\partial W} = 2SW - 2\lambda W = 0 </math><br />
<br />
<math>\, SW = \lambda W </math><br />
<br />
Since S is a matrix and <math>\, \lambda </math> is a scalar, this says that W is an eigenvector of S and <math>\, \lambda </math> is its corresponding eigenvalue.<br />
<br />
Suppose that<br />
<br />
<math>\, \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D</math><br />
are the eigenvalues of S and <math>\, u_1, u_2, \cdots, u_D</math> are their corresponding eigenvectors.<br />
<br />
Suppose we choose <math>\, W </math> to be one of these eigenvectors, say <math>\, W = u </math>. Then<br />
<br />
<math>\,u^TSu =u^T\lambda u = \lambda u^Tu = \lambda</math><br />
<br />
So to maximize <math>\, u^TSu</math>, choose the eigenvector corresponding to the largest eigenvalue, i.e. <math>\, u_1</math>.<br />
<br />
So we let <math>\, W = u_1 </math> be the first principal component.<br />
<br />
The principal components decompose the total variance in the data:<br />
<br />
<math>\, \sum_{i=1}^D \text{Var}(u_i^TX) = \sum_{i=1}^D \lambda_i = \text{Tr}(S) = \sum_{i=1}^D \text{Var}(x^i)</math><br />
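A minimal MATLAB sketch of PCA on a toy 2-dimensional data set (all values below are illustrative): the principal components are the eigenvectors of the sample covariance matrix, ordered by eigenvalue.<br />
<pre><br />
n = 1000;<br />
X = randn(2,n);                      % 2 x n data matrix<br />
X(2,:) = 0.5*X(1,:) + 0.3*X(2,:);    % make the two coordinates correlated<br />
mu = mean(X,2);<br />
Xc = X - repmat(mu,1,n);             % centre the data<br />
S = (Xc*Xc')/n;                      % sample covariance matrix<br />
[U,L] = eig(S);                      % columns of U are eigenvectors of S<br />
[lambda,idx] = sort(diag(L),'descend');<br />
U = U(:,idx);                        % principal components, largest variance first<br />
Y = U(:,1)'*Xc;                      % 1-dimensional representation of the data<br />
lambda(1)/sum(lambda)                % fraction of total variance explained by the first component<br />
</pre><br />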
<br />
<br><br />
===Singular Value Decomposition===<br />
Singular value decomposition is a "generalization" of eigenvalue decomposition "to rectangular matrices of size ''mxn''."<ref name="Abdel_SVD">Abdel-Rahman, E. (2011). Singular Value Decomposition [Lecture notes]. Retrieved from http://uwace.uwaterloo.ca</ref> Singular value decomposition solves:<br><br><br />
:<math>\ A_{mxn}\ v_{nx1}=s\ u_{mx1}</math><br><br><br />
"for the right singular vector ''v'', the singular value ''s'', and the left singular vector ''u''. There are ''n'' singular values ''s''<sub>''i''</sub> and ''n'' right and left singular vectors that must satisfy the following conditions"<ref name="Abdel_SVD"/>:<br />
# "All singular values are non-negative"<ref name="Abdel_SVD"/>, <br> <math>\ s_i \ge 0.</math><br />
# All "right singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>, <br> <math>\ v_iv_j=\delta_{i,j}.</math><br />
# All "left singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>, <br> <math>\ u_iu_j=\delta_{i,j}.</math><br />
where<br />
:<math>\delta_{i,j}=\left\{\begin{matrix}1 & \mathrm{if}\ i=j \\ 0 & \mathrm{if}\ i\neq j\end{matrix}\right.</math><br><br><br />
<br />
'''Procedure to find the singular values and vectors'''<br><br />
Observe the following about the eigenvalue decomposition of a real square matrix ''A'' where ''v'' is the unit eigenvector:<br><br />
::<math><br />
\begin{align}<br />
& Av=\lambda v \\<br />
& (Av)^T=(\lambda v)^T \\<br />
& (Av)^TAv=(\lambda v)^T\lambda v \\<br />
& v^TA^TAv=\lambda^2v^Tv \\<br />
& vv^TA^TAv=v\lambda^2 \\<br />
& A^TAv=\lambda^2v<br />
\end{align}<br />
</math><br />
As a result:<br />
# "The matrices ''A'' and ''A''<sup>''T''</sup>''A'' have the same eigenvectors."<ref name="Abdel_SVD"/><br />
# "The eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are the square of the eigenvalues of matrix ''A''."<ref name="Abdel_SVD"/><br />
# Since matrix ''A''<sup>''T''</sup>''A'' is symmetric,<br />
## "all the eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are real and distinct."<ref name="Abdel_SVD"/><br />
## "the eigenvectors of matrix ''A''<sup>''T''</sup>''A'' are orthogonal and can be chosen to be orthonormal."<ref name="Abdel_SVD"/><br />
# "The eigenvalues of matrix ''A''<sup>''T''</sup>''A'' are non-negative"<ref name="Abdel_SVD"/> since <math>\ \lambda^2_i \ge 0.</math><br />
Conclusions 3 and 4 are "true even for a rectangular matrix ''A'' since ''A''<sup>''T''</sup>''A'' is still a square symmetric matrix"<ref name="Abdel_SVD"/> and its eigenvalues and eigenvectors can be found.<br><br><br />
Therefore, for a rectangular matrix ''A'', assuming ''m>n'', the singular values and vectors can be found by:<br />
# "Form the ''nxn'' symmetric matrix ''A''<sup>''T''</sup>''A''."<ref name="Abdel_SVD"/><br />
# Perform an eigenvalue decomposition to get ''n'' eigenvalues and their "corresponding eigenvectors, ordered such that"<ref name="Abdel_SVD"/> <br><math>\lambda^2_1 \ge \lambda^2_2 \ge \dots \ge \lambda^2_n \ge 0</math> and <math>\{v_1, v_2, \dots, v_n\}.</math><br />
# "The singular values are"<ref name="Abdel_SVD"/>: <br><math>s_1=\sqrt{\lambda^2_1} \ge s_2=\sqrt{\lambda^2_2} \ge \dots \ge s_n=\sqrt{\lambda^2_n} \ge 0.</math><br>"The non-zero singular values are distinct; the equal sign applies only to the singular values that are equal to zero."<ref name="Abdel_SVD"/><br />
# "The ''n''-dimensional right singular vectors are"<ref name="Abdel_SVD"/><br><math>\{v_1, v_2, \dots, v_n\}.</math><br />
# "For the first <math>r \le n</math> singular values such that ''s''<sub>''i''</sub> ''> 0'', the left singular vectors are obtained as unit vectors"<ref name="Abdel_SVD"/> by <math>\tfrac{1}{s_i}Av_i=u_i.</math><br />
# Select "the <math>\ m-r</math> left singular vectors corresponding to the zero singular values such that they are unit vectors orthogonal to each other and to the first ''r'' left singular vectors"<ref name="Abdel_SVD"/> <math>\{u_1, u_2, \dots, u_r\}.</math><br><br><br />
<br />
'''Finding the Singular Value Decomposition Using MATLAB Code'''<br><br />
Please refer to the following link: http://www.mathworks.com/help/techdoc/ref/svd-singular-value-decomposition.html<br />
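<br />
As a rough check of the procedure described above (an illustrative sketch only, using a random rectangular matrix), the singular values obtained from the eigenvalues of ''A''<sup>''T''</sup>''A'' can be compared with those returned by MATLAB's built-in svd:<br />
<pre><br />
A = randn(6,4);                      % a random 6x4 rectangular matrix (m>n)<br />
[V,D] = eig(A'*A);                   % eigen-decomposition of the nxn matrix A'A<br />
[lambda2,idx] = sort(diag(D),'descend');<br />
V = V(:,idx);                        % right singular vectors, ordered by decreasing eigenvalue<br />
s_from_eig = sqrt(lambda2)           % singular values from the eigenvalues of A'A<br />
s_builtin = svd(A)                   % MATLAB's singular values, for comparison<br />
u1 = A*V(:,1)/s_from_eig(1);         % first left singular vector, u_1 = (1/s_1) A v_1<br />
</pre><br />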
<br />
'''Formal definition'''<br><br />
"We can now decompose the rectangular matrix ''A'' in terms of singular values and vectors as follows"<ref name="Abdel_SVD"/>:<br><br><br />
<math>A_{mxn} \begin{bmatrix} v_1 & | & \cdots & | & v_n \end{bmatrix}_{nxn} = \begin{bmatrix} u_1 & | & \cdots & | & u_n & | & u_{n+1} & | & \cdots & | & u_m \end{bmatrix}_{mxm} \begin{bmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_n \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}_{mxn}</math><br><br />
:<math>\ AV=US</math><br><br><br />
Since "the matrices ''V'' and ''U'' are orthogonal"<ref name="Abdel_SVD"/>, ''V ''<sup>''-1''</sup>=''V''<sup>T</sup> and ''U ''<sup>''-1''</sup>=''U''<sup>T</sup>:<br><br><br />
:<math>\ A=USV^T</math><br><br><br />
"which is the formal definition of the singular value decomposition."<ref name="Abdel_SVD"/><br><br><br />
<br />
'''Relevance to PCA'''<br><br />
In order to perform PCA, one needs to do eigenvalue decomposition on the covariance matrix. By transforming the mean for all attributes to zero, the covariance matrix can be simplified to:<br><br><br />
<math>\ S=XX^T</math><br><br><br />
Since the eigenvalue decomposition of ''A''<sup>''T''</sup>''A'' gives the same eigenvectors as the singular value decomposition of ''A'', an additional and more consistent method (a matrix whose eigenvector matrix is not invertible has no eigenvalue decomposition, whereas the singular value decomposition always exists) for performing PCA is through the singular value decomposition of ''X''.<br />
<br />
The following MATLAB code uses singular value decomposition for performing PCA; 20 principal components, and thus the top 20 maximum variation directions, are selected for reconstructing facial images that have had noise applied to them:<br />
<br />
load noisy.mat<br />
%first noisy image; each image has a resolution of 20x28<br />
imagesc(reshape(X(:,1),20,28)')<br />
%to grayscale<br />
colormap gray<br />
%singular value decomposition <br />
[u s v]=svd(X);<br />
%reduced feature space: 20 principal components<br />
Xh=u(:,1:20)*s(1:20,1:20)*v(:,1:20)';<br />
figure<br />
imagesc(reshape(Xh(:,1),20,28)')<br />
colormap gray<br />
<br />
The reconstructed image is nearly noiseless because the added noise accounts for less of the variation in the data than the 20 leading principal components do.<br />
<br />
=='''References'''==<br />
<br />
<references/><br />
<br />
==''' PCA and Introduction to Kernel Function-November,10,2011'''==<br />
===Continue with the last lecture===<br />
Some notations:<br />
Let <math>\displaystyle X_{d\times n}</math> be a matrix. <br />
<br />
Let <math>\displaystyle X_j,j=1,2,...,n</math> be the j-th data point, and <math>\displaystyle X_j\in\R^d</math>.<br />
<br />
Let <math>\displaystyle Q=\sum_{j=1}^n(X_j-\bar{X})(X_j-\bar{X})^T</math>, where <math> \bar{X}=\frac{1}{n}\sum_{j=1}^n X_j</math>.<br />
<br />
But now, we are assuming that we have already centered the data, which means our <math>\displaystyle Q=\sum_{j=1}^n(X_j)(X_j)^T=X X^T </math>.<br />
<br />
*Find the PCs, which means finding the eigenvectors of Q, or equivalently doing the singular value decomposition, [u s v]=svd(X), where the columns of u are eigenvectors of <math>\displaystyle Q=X X^T</math>.<br />
<br />
*Map the data in lower dimension space.<br />
We can choose the first p (p<d) eigenvectors, which means <math>\displaystyle u^T</math> is a <math>\displaystyle p\times d</math> matrix.<br />
Thus, we can project our original data points <math>\displaystyle x_j</math> down to p dimensions.<br />
Mathematically, this is <math>\displaystyle Y_{p\times n}={u^T}_{p\times d} X_{d\times n}</math>. This means that we can reduce our original d variables to p principal components.<br />
<br />
*Reconstruct Points.<br />
We can also use those dimension-reduced data to project back to high dimension.<br />
However, we will lose some information because when we map those points into lower dimension, we throw away the last (d-p) eigenvectors which contain some of the original information.<br />
Since the columns of <math>\displaystyle u</math> are orthonormal, the reconstruction is <math> \hat{x}_{d\times n}=u_{d\times p} Y_{p\times n}=u_{d\times p}{u^T}_{p\times d}x_{d\times n} </math>, which approximates the original data.<br />
<br />
*Map a new data point to the lower dimensional space, <math>\displaystyle y_{p\times 1}={u^T}_{p\times d} x_{d\times 1}</math>, and reconstruct it in the high dimensional space, <math>\displaystyle \hat{x}_{d\times 1}=u_{d\times p} y_{p\times 1}</math><br />
<br />
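A minimal MATLAB sketch of the three steps above (for illustration only; it assumes a centered d-by-n data matrix X is already in the workspace and takes p = 2):<br />
<pre><br />
[u s v] = svd(X);        % columns of u are eigenvectors of X*X'<br />
p = 2;                   % number of principal components to keep<br />
Y = u(:,1:p)'*X;         % project the data down to p dimensions (p by n)<br />
Xhat = u(:,1:p)*Y;       % reconstruct the points in the original d-dimensional space<br />
% for a new (centered) point x of size d by 1:<br />
% y = u(:,1:p)'*x;  xhat = u(:,1:p)*y;<br />
</pre><br />
<br />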
===3 and 2 digits example===<br />
The data X is a 64 by 400 matrix. Every column can be displayed as an image of either a "3" or a "2". The first 200 columns are "2"s and the last 200 columns are "3"s.<br />
We first center the data, and then use the first p (p<d) columns of u from the singular value decomposition.<br />
<br />
MATLAB CODE:<br />
MU=repmat(mean(X,2),1,400);<br />
% mean(X,2) is the average of each row <br />
%In order to center the data, we should change mean(X,2), which is a 64 by 1 matrix, into a 64 by 400 matrix<br />
Xt=X-MU;<br />
% modify the data to zero mean data<br />
[u s v]=svd(Xt);<br />
%note that size(u)=64*64, and the columns of u are eigenvectors of VCM<br />
Y=u(:,1:2)'*X;<br />
%using the first two PCs to transform the high dimensional points to lower dimensional ones<br />
One way to look at this is to plot Principal Component #1 against Principal Component #2 in a two dimensional space.<br />
plot(Y(1,:)',Y(2,:)')<br />
The result is as follows; we can clearly see that there are two classes.<br />
<br />
[[file:pca2.png|350px|400px]]<br />
<br />
To dig further into the difference between these two classes, we can separate the first 200 columns from the last 200 columns to see whether there is a significant difference due to the different types of digits.<br />
plot(Y(1,1:200)',Y(2,1:200)','d')<br />
% Note that the first 200 columns represent digit "2",and are in the form of "diamond"<br />
hold on<br />
% draw different graphs in one figure<br />
plot(Y(1,201:400)',Y(2,201:400)','ro')<br />
% Note that the last 200 columns represent digit "3", and are in the form of "o"<br />
<br />
[[file:pca3.png|350px|400px]]<br />
<br />
image=reshape(X,8,8,400);<br />
plotdigits(image,Y,.1,1);<br />
The result can be seen more clearly in the following picture.<br />
The digits "3" and "2" are clearly separated.<br />
<br />
[[file:Pca.png|350px|400px]]<br />
<br />
===Introduction to Kernel Function===<br />
PCA is useful when the data points lie in or close to a linear subspace (e.g. a plane), which means that PCA is powerful for linear problems. But when the data points lie on a non-linear manifold, PCA does not work well. There is a solution to this problem: we can use a "trick" to change a nonlinear classification problem into a linear one. This is called the "Kernel Trick".<br />
<br />
'''An intuitive example'''<br />
<br />
[[File:Kernel trick.png|400px|300px]]<br />
<br />
From the picture, we can see that the red dots are in the middle of the blue ones. However, it is hard to separate those two classes using any line (a linear boundary in the two dimensional space). But we can pull the red ones out of the two dimensional space to form a three dimensional space, in which case we can easily tell them apart.<br />
<br />
For more details about this trick,please see http://omega.albany.edu:8008/machine-learning-dir/notes-dir/ker1/ker1.pdf<br />
<br />
In more detail, the significance of a kernel function is that it lets us work with the data in a higher dimensional space implicitly, without ever computing the coordinates in that space.<br />
Let's look at how this is possible:<br />
<br />
<math>Z_1=<br />
\begin{bmatrix}<br />
x_1\\<br />
y_1<br />
\end{bmatrix}\xrightarrow{\phi}<br />
</math><br />
<math>\phi(Z_1)=<br />
\begin{bmatrix}<br />
x_1^2\\<br />
y_1^2\\<br />
\sqrt2x_1y_1<br />
\end{bmatrix}.<br />
<br />
</math><br />
<math>Z_2=<br />
\begin{bmatrix}<br />
x_2\\<br />
y_2<br />
\end{bmatrix}\xrightarrow{\phi}<br />
</math><br />
<math>\phi(Z_2)=<br />
\begin{bmatrix}<br />
x_2^2\\<br />
y_2^2\\<br />
\sqrt2x_2y_2<br />
\end{bmatrix}<br />
</math><br />
<br />
The inner product of <math>\displaystyle \phi(Z_1)</math> and <math>\displaystyle\phi(Z_2)</math>, which is denoted <math>\displaystyle\phi(Z_1)^T\phi(Z_2)</math>, is equal to:<br />
<math><br />
\begin{bmatrix}<br />
x_1^2&y_1^2&\sqrt2x_1y_1 <br />
\end{bmatrix}<br />
\begin{bmatrix}<br />
x_2^2\\<br />
y_2^2\\<br />
\sqrt2x_2y_2 <br />
\end{bmatrix}=</math> <math>\displaystyle (x_1x_2+y_1y_2)^2=K(Z_1,Z_2)</math>.<br />
<br />
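The equality above can also be checked numerically. The following MATLAB sketch (with two arbitrary 2-D points chosen for illustration) compares the inner product of the mapped vectors with the kernel value computed directly in the original space:<br />
<pre><br />
z1 = [1; 2];  z2 = [3; -1];                       % two arbitrary 2-D points<br />
phi = @(z) [z(1)^2; z(2)^2; sqrt(2)*z(1)*z(2)];   % the explicit feature map used above<br />
lhs = phi(z1)'*phi(z2)                            % inner product in the 3-D feature space<br />
rhs = (z1'*z2)^2                                  % kernel K(z1,z2) evaluated in the original 2-D space<br />
</pre><br />
<br />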
'''The most common Kernel functions are as follows:'''<br />
*Linear: <math>\displaystyle K_{ij}=<X_i,X_j></math><br />
*Polynomial:<math>\displaystyle K_{ij}=(1+<X_i,X_j>)^p</math><br />
*Gaussian: <math>\displaystyle K_{ij}=e^\frac{-{\left\Vert X_i-X_j\right\Vert}^2}{2\sigma^2}</math>,<br />
where <math>\displaystyle <X_i,X_j></math> denotes the inner product of <math>\displaystyle X_i</math> and <math>\displaystyle X_j</math>, and <math>{\left\Vert X_i-X_j\right\Vert}^2</math> denotes the squared Euclidean distance between the vectors <math>\displaystyle X_i</math> and <math>\displaystyle X_j</math>.</div>S9huhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=signupformStat341F11&diff=13816signupformStat341F112011-11-03T17:25:55Z<p>S9hu: </p>
<hr />
<div>{| class="wikitable" border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Date<br />
|width="200pt"|Name (1)<br />
|width="200pt"|Name (2)<br />
|-<br />
|Sep 20 || acodd || <br />
|-<br />
|Sep 22 || Samantha Rahman || <br />
|-<br />
|Sep 27 || Pu Zhao || <br />
|-<br />
|Sep 29 || Adam Prins || <br />
|-<br />
|Oct 4 || Zhou Xiaojie || <br />
|-<br />
|Oct 6 || Joel Smith || <br />
|-<br />
|Oct 11 || Choi Chek Hin || <br />
|-<br />
|Oct 13 || Matthew Tacchino || <br />
|-<br />
|Oct 18 || Yin Jie Xu || Tim Dresser<br />
|-<br />
|Oct 20 || No lecture ||<br />
|-<br />
|Oct 25 || Valentin Cardinale || Sui Liang ||Pierre Baudron <br />
|-<br />
|Oct 27 || George Li || Chang Mog Lee<br />
|-<br />
|Nov 1 || Marie-Sarah Lacharité || Lichen Jia<br />
|-<br />
|Nov 3 || || Naresh Jugurnauth <br />
|-<br />
|Nov 8 || Patrick Dornian || Jacob Merkel || Abdullahi Kahiye<br />
|-<br />
|Nov 10 || Zhao Jin || Qianyi Huang<br />
|-<br />
|Nov 15 || Lindsay Millard || Fred Zhao || Samson Hu<br />
|-<br />
|Nov 17 || Hong HUANG || Xin Yao <br />
|- <br />
|Nov 22 || Karen Mok || Yuyan Lin<br />
|-<br />
|Nov 24 || Qingwei Ding || Brendan Briggs<br />
|-<br />
|Nov 29 || Han Li || Fangzhou Li<br />
|-<br />
|Dec 1 || Soo Ah Jung || James Sandham<br />
|}</div>S9huhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f11&diff=12533stat841f112011-10-13T08:28:52Z<p>S9hu: /* Comparison Between Logistic Regression And Linear Discriminant Analysis (LDA) */</p>
<hr />
<div>==[[f11Stat841proposal| Proposal for Final Project]]==<br />
<br />
==[[f11Stat841EditorSignUp| Editor Sign Up]]==<br />
<br />
= STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20 =<br />
== Wiki Course Notes ==<br />
Students will need to contribute to the wiki for 20% of their grade.<br />
Access via wikicoursenote.com<br />
Go to editor sign-up, and use your UW userid for your account name, and use your UW email.<br />
<br />
primary (10%)<br />
Post a draft of lecture notes within 48 hours. <br />
You will need to do this 1 or 2 times, depending on class size.<br />
<br />
secondary (10%)<br />
Make improvements to the notes for at least 60% of the lectures.<br />
More than half of your contributions should be technical rather than editorial.<br />
There will be a spreadsheet where students can indicate what they've done and when.<br />
The instructor will conduct random spot checks to ensure that students have contributed what they claim.<br />
<br />
<br />
== Classification (Lecture: Sep. 20, 2011) ==<br />
=== Definitions ===<br />
'''classification''': Predict a discrete random variable <math>Y</math> (a label) by using another random variable <math>X</math><br />
(new data point) picked iid from a distribution<br />
<br />
<math>X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d</math> (<math>d</math>-dimensional vector)<br />
<math>Y_i</math> in some finite set <math>\mathcal{Y}</math><br />
<br />
<br />
'''classification rule''':<br />
<math>h : \mathcal{X} \rightarrow \mathcal{Y}</math><br />
Take new observation <math>X</math> and use a classification function <math>h(x)</math> to generate a label <math>Y</math>. In other words, if we fit the function <math>h(x)</math> with a random variable <math>X</math>, it generates the label <math>Y</math> which is the class to which we predict <math>X</math> belongs.<br />
<br />
Example: Let <math> \mathcal{X}</math> be a set of 2D images and <math>\mathcal{Y}</math> be a finite set of people. We want to learn a classification rule <math>h:\mathcal{X}\rightarrow\mathcal{Y}</math> that with small ''true'' error predicts the person who appears in the image. <br />
<br />
<br />
'''true error rate''' for classifier <math>h</math> is the error with respect to the underlying distribution (that we do not know).<br />
<br />
<math>L(h) = P(h(X) \neq Y )</math><br />
<br />
<br />
'''empirical error rate''' (or training error rate) is the amount of error that our classification function <math>h(x)</math> makes on the training data.<br />
<br />
<math>\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i)</math><br />
<br />
where <math>\mathbf{I}()</math> is an indicator function. Indicator function is defined by <br />
<br />
<math>\mathbf{I}(x) = \begin{cases} <br />
1 & \text{if } x \text{ is true} \\<br />
0 & \text{if } x \text{ is false}<br />
\end{cases}</math><br />
<br />
So in this case,<br />
<math>\mathbf{I}(h(X_i)\neq Y_i) = \begin{cases}<br />
1 & \text{if } h(X_i)\neq Y_i \text{ (i.e. when misclassification happens)} \\<br />
0 & \text{if } h(X_i)=Y_i \text{ (i.e. classified properly)}<br />
\end{cases}</math><br />
<br />
e.g., 100 new data points with known (true) labels<br />
<br />
<math>y_1 = h(x_1)</math><br />
<br />
...<br />
<br />
<math>y_{100} = h(x_{100})</math><br />
<br />
To calculate the empirical error we count how many labels our function <math>h(x)</math> assigned incorrectly and divide by n=100<br />
<br />
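For concreteness, a small MATLAB sketch of the empirical error rate (illustration only; the label vectors below are made up and would normally come from the training data and from some classification rule <math>h</math>):<br />
<pre><br />
y_true = [0 1 1 0 1 0 1 1 0 0];     % true labels (assumed given)<br />
y_pred = [0 1 0 0 1 0 1 1 1 0];     % labels assigned by the classification rule h<br />
n = length(y_true);<br />
L_hat = sum(y_pred ~= y_true)/n     % empirical error rate: fraction of misclassified points<br />
</pre><br />
<br />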
=== Bayes Classifier ===<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability<ref> http://www.wikicoursenote.com/wiki/Stat841#Bayes_Classifier </ref>.<br />
<br />
First recall Bayes' Rule, in the format<br />
<math>P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} </math> <br />
<br />
P(Y|X) : ''posterior'' , ''probability of <math>Y</math> given <math>X</math>''<br />
<br />
P(X|Y) : ''likelihood'', ''probability of <math>X</math> being generated by <math>Y</math>''<br />
<br />
P(Y) : ''prior'', ''probability of <math>Y</math> being selected''<br />
<br />
P(X) : ''marginal'', ''probability of obtaining <math>X</math>''<br />
<br />
<br />
We will start with the simplest case: <math>\mathcal{Y} = \{0,1\}</math><br />
<br />
<math> r(x) <br />
= P(Y=1|X=x) <br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)}<br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
Bayes' rule can be approached by computing either:<br />
<br />
1) '''The posterior''': <math>\ P(Y=1|X=x) </math> and <math>\ P(Y=0|X=x) </math> or <br />
<br />
2) '''The likelihood''': <math>\ P(X=x|Y=1) </math> and <math>\ P(X=x|Y=0) </math><br />
<br />
<br />
The former reflects a '''Bayesian''' approach. The Bayesian approach uses previous beliefs and observed data (e.g., the random variable <math>\ X </math>) to determine the probability distribution of the parameter of interest (e.g., the random variable <math>\ Y </math>). The probability, according to Bayesians, is a ''degree of belief'' in the parameter of interest taking on a particular value (e.g., <math>\ Y=1 </math>), given a particular observation (e.g., <math>\ X=x </math>). Historically, the difficulty in this approach lies with determining the posterior distribution, however, more recent methods such as '''Markov Chain Monte Carlo (MCMC)''' allow the Bayesian approach to be implemented <ref name="PCAustin">P. C. Austin, C. D. Naylor, and J. V. Tu, "A comparison of a Bayesian vs. a frequentist method for profiling hospital performance," ''Journal of Evaluation in Clinical Practice'', 2001</ref>.<br />
<br />
The latter reflects a '''Frequentist''' approach. The Frequentist approach assumes that the probability distribution, including the mean, variance, etc., is fixed for the parameter of interest (e.g., the variable <math>\ Y </math>, which is ''not'' random). The observed data (e.g., the random variable <math>\ X </math>) is simply a ''sampling'' of a far larger population of possible observations. Thus, a certain repeatability or ''frequency'' is expected in the observed data. If it were possible to make an infinite number of observations, then the true probability distribution of the parameter of interest can be found. In general, frequentists use a technique called '''hypothesis testing''' to compare a ''null hypothesis'' (e.g. an assumption that the mean of the probability distribution is <math>\ \mu_0 </math>) to an alternative hypothesis (e.g. assuming that the mean of the probability distribution is larger than <math>\ \mu_0 </math>) <ref name="PCAustin"/>. For more information on hypothesis testing see <ref>R. Levy, "Frequency hypothesis testing, and contingency tables" class notes for LING251, Department of Linguistics, University of California, 2007. Available: [http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf] </ref>. <br />
<br />
There was some class discussion on which approach should be used. Both the ease of computation and the validity of both approaches were discussed. A main point that was brought up in class is that Frequentists consider X to be a random variable, but they do not consider Y to be a random variable because it has to take on one of the values from a fixed set (in the above case it would be either 0 or 1 and there is only one ''correct'' label for a given value X=x). Thus, from a Frequentist's perspective it does not make sense to talk about the probability of Y. This is actually a grey area and sometimes ''Bayesians'' and ''Frequentists'' use each others' approaches. So using ''Bayes' rule'' doesn't necessarily mean you're a ''Bayesian''. Overall, the question remains unresolved.<br />
<br />
<br />
The '''Bayes Classifier''' uses <math>\ P(Y=1|X=x)</math><br />
<br />
<math> P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
P(Y=1) : the prior, based on belief/evidence beforehand<br />
<br />
denominator : marginalized by summation<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The set <math>\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \} </math><br />
<br />
which defines a ''decision boundary''.<br />
<br />
<math>h^*(x) = <br />
\begin{cases}<br />
1 \ \ if \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ \ \ \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
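As a small numerical illustration of this rule (all values below are made up for the example), the posterior can be computed from assumed class-conditional densities and priors, and the label with the larger posterior chosen:<br />
<pre><br />
% assumed (made-up) quantities at a particular point X = x<br />
f1 = 0.30;  f0 = 0.10;        % class-conditional densities P(X=x|Y=1), P(X=x|Y=0)<br />
pi1 = 0.40; pi0 = 0.60;       % priors P(Y=1), P(Y=0)<br />
r = f1*pi1/(f1*pi1 + f0*pi0)  % posterior P(Y=1|X=x)<br />
if r > 1/2<br />
    h = 1;                    % classify as class 1<br />
else<br />
    h = 0;                    % classify as class 0<br />
end<br />
h<br />
</pre><br />
<br />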
''Theorem'': Bayes rule is optimal. I.e., if h is any other classification rule, <br />
then <math>L(h^*) <= L(h)</math><br />
(This is to be proved in homework.)<br />
<br />
Why then do we need other classification methods?<br />
A: Because X densities are often/typically unknown. I.e., <math>f_k(x)</math> and/or <math>\pi_k</math> unknown.<br />
<br />
<math>P(Y=k|X=x) = \frac{P(X=x|Y=k)P(Y=k)} {P(X=x)} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}</math><br />
f_k(x) is referred to as the class conditional distribution (~likelihood).<br />
<br />
Therefore, we rely on some data to estimate quantities.<br />
<br />
=== Three Main Approaches ===<br />
<br />
'''1. Empirical Risk Minimization''':<br />
Choose a set of classifiers H (e.g., line, neural network) and find <math>h^* \in H</math><br />
that minimizes (some estimate of) L(h).<br />
<br />
'''2. Regression''':<br />
Find an estimate (<math>\hat{r}</math>) of function <math>r</math> and define<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The <math> 1/2 </math> in the expression above is a threshold set for the regression prediction output. <br />
<br />
In general ''regression'' refers to finding a continuous, real valued y. The problem here is more difficult, because of the restricted domain (y is a set of discrete label values).<br />
<br />
'''3. Density Estimation''':<br />
Estimate <math>P(X=x|Y=0)</math> from <math>X_i</math>'s for which <math>Y_i = 0</math><br />
Estimate <math>P(X=x|Y=1)</math> from <math>X_i</math>'s for which <math>Y_i = 1</math><br />
and let <math>\hat{P}(Y=1) = (1/n) \sum_{i=1}^{n} Y_i</math><br />
<br />
Define <math>\hat{r}(x) = \hat{P}(Y=1|X=x)</math> and<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
It is possible that there may not be enough data to estimate from for ''density estimation''. But the main problem lies with high dimensional spaces, as the estimation results may not be good (high error rate) and sometimes even infeasible. The term ''curse of dimensionality'' was coined by Bellman <ref>R. E. Bellman, ''Dynamic Programming''. Princeton University Press,<br />
1957</ref> to describe this problem.<br />
<br />
As the dimension of the space goes up, the learning requirements go up exponentially.<br />
<br />
To Learn more about methods for handling high-dimensional data <ref> https://docs.google.com/viewer?url=http%3A%2F%2Fwww.bios.unc.edu%2F~dzeng%2FBIOS740%2Flecture_notes.pdf</ref><br />
<br />
=== Multi-Class Classification ===<br />
Generalize to the case where Y takes on k>2 values.<br />
<br />
<br />
''Theorem'': <math>Y \in \mathcal{Y} = \{1,2,..., k\} </math> optimal rule<br />
<br />
<math>\ h^{*}(x) = argmax_k P(Y=k|X=x) </math> <br />
<br />
where <math>P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum_r f_r(x) \pi_r}</math><br />
<br />
===Examples of Classification===<br />
<br />
* Face detection in images.<br />
* Medical diagnosis.<br />
* Detecting credit card fraud (fraudulent or legitimate).<br />
* Speech recognition.<br />
* Handwriting recognition.<br />
<br />
== LDA and QDA ==<br />
<br />
'''Discriminant function analysis''' finds features that best allow discrimination between two or more classes. The approach is similar to '''analysis of Variance (ANOVA)''' in that discriminant function analysis looks at the mean values to determine if two or more classes are very different and should be separated. Once the discriminant functions (that separate two or more classes) have been determined, new data points can be classified (i.e. placed in one of the classes) based on the discriminant functions <ref> StatSoft, Inc. (2011). ''Electronic Statistics Textbook.'' [Online]. Available: [http://www.statsoft.com/textbook/discriminant-function-analysis/ http://www.statsoft.com/textbook/discriminant-function-analysis/.] </ref>. '''Linear discriminant analysis (LDA)''' and '''Quadratic discriminant analysis (QDA)''' are methods of discriminant analysis that are best applied to linearly and quadratically separable classes, respectively. '''Fisher discriminant analysis (FDA)''' is another method of discriminant analysis that is different from linear discriminant analysis, but oftentimes both terms are used interchangeably.<br />
<br />
=== LDA ===<br />
<br />
The simplest method is to use approach 3 (above) and assume a parametric model for densities. Assume class conditional is Gaussian.<br />
<br />
<math>\mathcal{Y} = \{ 0,1 \}</math> assumed (i.e., 2 labels)<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
<math>P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k \pi_k} \ \ </math> (denom = P(x))<br />
<br />
1) Assume Gaussian distributions<br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} exp(-(1/2)(\mathbf{x} - \mathbf{\mu_k})^T \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) )</math><br />
<br />
must compare <br />
<math>\frac{f_1(x) \pi_1} {p(x)}</math> with <math>\frac{f_0(x) \pi_0} {p(x)}</math><br />
Note that the p(x) denom can be ignored:<br />
<math>f_1(x) \pi_1</math> with <math>f_0(x) \pi_0 </math><br />
<br />
To find the decision boundary, set <br />
<math>f_1(x) \pi_1 = f_0(x) \pi_0 </math><br />
<br />
2) Assume <math>\Sigma_1 = \Sigma_0</math>, we can use <math>\Sigma = \Sigma_0 = \Sigma_1</math>.<br />
<br />
Cancel <math>(2\pi)^{-d/2} |\Sigma_k|^{-1/2}</math> from both sides.<br />
<br />
Take log of both sides.<br />
<br />
Subtract one side from both sides, leaving zero on one side.<br />
<br />
<br />
<math>-(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)] = 0 </math><br />
<br />
<br />
<math>(1/2)[-\mathbf{x}^T \Sigma^{-1}\mathbf{x} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x}<br />
+ \mathbf{x}^T \Sigma^{-1}\mathbf{x} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ]<br />
+ log(\pi_1/\pi_0) = 0 </math><br />
<br />
<br />
Cancelling out the terms quadratic in <math>\mathbf{x}</math> and rearranging results in <br />
<br />
<math>(1/2)[-\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0}<br />
+ (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_0}^T \Sigma^{-1}) \mathbf{x}]<br />
+ log(\pi_1/\pi_0) = 0 </math><br />
<br />
<br />
We can see that the first pair of terms is constant, and the second pair is linear in x.<br />
Therefore, we end up with something of the form <br />
<math>ax + b = 0</math>.<br />
For more about LDA <ref>http://sites.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf</ref><br />
<br />
== LDA and QDA Continued (Lecture: Sep. 22, 2011) == <br />
<br />
If we relax assumption 2 (i.e. <math>\Sigma_1 \neq \Sigma_0</math>) then we get a quadratic equation that can be written as<br />
<math>{x}^Ta{x}+b{x} + c = 0</math><br />
<br />
===Generalizing LDA and QDA===<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math><br />
<br />
Where<br />
<br />
<math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math><br />
<br />
When the Gaussian covariance matrices are equal, <math>\Sigma_1 = \Sigma_0</math> (i.e. LDA), then<br />
<br />
<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math><br />
<br />
(To compute this, we need to calculate the value of <math>\,\delta </math> for each class, and then take the one with the max. value).<br />
<br />
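A minimal MATLAB sketch of this computation (illustration only; the means, covariance matrices and priors below are made up, and would in practice be estimated as described in the next section):<br />
<pre><br />
mu{1} = [0; 0];  mu{2} = [2; 1];                % class means<br />
Sigma{1} = eye(2);  Sigma{2} = [1 0.4; 0.4 2];  % class covariances (allowed to differ, as in QDA)<br />
prior = [0.5 0.5];                              % priors pi_k<br />
x = [1.5; 0.2];                                 % a new point to classify<br />
for k = 1:2<br />
    d = x - mu{k};<br />
    delta(k) = -0.5*log(det(Sigma{k})) - 0.5*d'*inv(Sigma{k})*d + log(prior(k));<br />
end<br />
[maxval, label] = max(delta)                    % assign the class with the largest delta_k<br />
</pre><br />
<br />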
===In practice===<br />
We estimate the prior to be the chance that a random item from the collection belongs to class k, e.g.<br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
The mean to be the average item in set k, e.g.<br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
and calculate the covariance of each class e.g.<br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
If we wish to use LDA we must calculate a common covariance, so we average all the covariances e.g.<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{r=1}^{k}n_r} </math><br />
<br />
Where: <math>\,n_r</math> is the number of data points in class <math>\,r</math>, <math>\,\Sigma_r</math> is the covariance of class <math>\,r</math>, <math>\,n</math> is the total number of data points, and <math>\,k</math> is the number of classes.<br />
<br />
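These estimates translate directly into MATLAB. A sketch (illustration only; it assumes a d-by-n data matrix X with points as columns and a 1-by-n label vector y with values in 1..K, and generates toy data just so the snippet runs on its own):<br />
<pre><br />
X = [randn(2,50), randn(2,50)+2];  y = [ones(1,50), 2*ones(1,50)];  % toy labelled data<br />
K = max(y);  [d, n] = size(X);<br />
Sigma_pooled = zeros(d,d);<br />
for k = 1:K<br />
    idx = (y == k);<br />
    nk = sum(idx);<br />
    prior(k) = nk/n;                              % estimated prior pi_k<br />
    mu(:,k) = mean(X(:,idx), 2);                  % estimated class mean<br />
    Xc = X(:,idx) - repmat(mu(:,k), 1, nk);<br />
    Sigma_k{k} = Xc*Xc'/nk;                       % estimated class covariance<br />
    Sigma_pooled = Sigma_pooled + nk*Sigma_k{k};  % accumulate for the common covariance (LDA)<br />
end<br />
Sigma_pooled = Sigma_pooled/n;                    % weighted average, as in the formula above<br />
</pre><br />
<br />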
===Computation===<br />
<br />
For QDA we need to calculate: <math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math><br />
<br />
Let's first consider the case when <math>\, \Sigma_k = I, \forall k </math>. This is the case where each distribution is spherical around the mean point.<br />
<br />
====Case 1====<br />
When <math>\, \Sigma_k = I </math><br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
but <math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
and <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math> is the [http://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_Distance squared Euclidean distance] between two points <math>\,x</math> and <math>\,\mu_k</math><br />
<br />
Thus in this condition, a new point can be classified by its distance away from the center of a class, adjusted by some prior.<br />
<br />
Further, for a two-class problem with equal priors, the decision boundary is the perpendicular bisector of the segment joining the two class means.<br />
<br />
====Case 2==== <br />
When <math>\, \Sigma_k \neq I </math><br />
<br />
Using the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\, \Sigma_k</math><br />
we get <math> \, \Sigma_k = U_kS_kV_k^\top</math>. In particular, <math>\, U_k</math> is a collection of eigenvectors of <math>\, \Sigma_k\Sigma_k^*</math>, and <math>\, V_k</math> is a collection of eigenvectors of <math>\,\Sigma_k^*\Sigma_k</math>.<br />
Since <math>\, \Sigma_k</math> is a symmetric matrix<ref> http://en.wikipedia.org/wiki/Covariance_matrix#Properties </ref>, <math>\, \Sigma_k = \Sigma_k^*</math>, so we have <math> \, \Sigma_k = U_kS_kU_k^\top </math>.<br />
<br />
For <math>\,\delta_k</math>, the second term becomes what is also known as the Mahalanobis distance <ref>P. C. Mahalanobis, "On The Generalised Distance in Statistics," ''Proceedings of the National Institute of Sciences of India'', 1936</ref> :<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^T(x-\mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
We can think of <math> \, S_k^{-\frac{1}{2}}U_k^\top </math> as a linear transformation that takes points in class <math>\,k</math> and distributes them spherically around a point, like in case 1. Thus when we are given a new point, we can apply the modified <math>\,\delta_k</math> values to calculate <math>\ h^*(\,x)</math>. In the transformed coordinates the covariance matrix becomes the identity, so that<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}[(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k)] + log (\pi_k) </math><br />
<br />
and,<br />
<br />
<math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
For applying the above method with classes that have different covariance matrices (for example the covariance matrices <math>\ \Sigma_0 </math> and <math>\ \Sigma_1 </math> for the two class case), each of the covariance matrices has to be decomposed using SVD to find the according transformation. Then, each new data point has to be transformed using each transformation to compare its distance to the mean of each class (for example for the two class case, the new data point would have to be transformed by the class 1 transformation and then compared to <math>\ \mu_0 </math> and the new data point would also have to be transformed by the class 2 transformation and then compared to <math>\ \mu_1 </math>).<br />
<br />
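A small MATLAB sketch of the transformation <math> \, S_k^{-\frac{1}{2}}U_k^\top </math> discussed above (illustration only, with a made-up covariance matrix), checking that the squared Euclidean distance in the transformed space equals the Mahalanobis distance:<br />
<pre><br />
Sigma = [2 0.5; 0.5 1];                      % a made-up class covariance matrix<br />
mu = [1; -1];                                % class mean<br />
x = [0.3; 0.7];                              % a new data point<br />
[U, S, V] = svd(Sigma);                      % Sigma = U*S*U' since Sigma is symmetric<br />
T = diag(1./sqrt(diag(S)))*U';               % the transformation S^(-1/2)*U'<br />
d_transformed = norm(T*x - T*mu)^2           % squared Euclidean distance after the transformation<br />
d_mahalanobis = (x-mu)'*inv(Sigma)*(x-mu)    % Mahalanobis distance, for comparison<br />
</pre><br />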
<br />
The difference between [[#Case 1 | Case 1]] and [[#Case 2 | Case 2]] (i.e. the difference between using the Euclidean and Mahalanobis distance) can be seen in the illustration below. <br />
<br />
[[File:EuclideanVsMahalonobisDistance2.PNG|frame|center|Illustration of Euclidean distance (a) and Mahalanobis distance (b) where the contours represent equidistant points from the center using each distance metric. Source: <ref>R. De Maesschalck, D. Jouan-Rimbaud and D. L. Massart, "Tutorial - The Mahalanobis distance," ''Chemometrics and Intelligent Laboratory Systems'', 2000 </ref>]]<br />
<br />
As can be seen from the illustration above, the Mahalanobis distance takes into account the distribution of the data points, whereas the Euclidean distance would treat the data as though it has a spherical distribution. Thus, the Mahalanobis distance applies for the more general classification in [[#Case 2 | Case 2]], whereas the Euclidean distance applies to the special case in [[#Case 1 | Case 1]] where the data distribution is assumed to be spherical.<br />
<br />
Generally, QDA provides a more flexible classifier than LDA because LDA assumes that the covariance matrix is identical for each class, while QDA does not. QDA still uses a Gaussian distribution as the class conditional distribution; in practice the data will not always follow a Gaussian distribution, in which case other class conditional distributions have to be used instead.<br />
<br />
== Principal Component Analysis (PCA) (Lecture: Sep. 27, 2011) ==<br />
<br />
'''Principal Component Analysis (PCA)''' is a method of dimensionality reduction/feature extraction that transforms the data from a D dimensional space into a new coordinate system of dimension d, where d <= D (the worst case would be to have d=D). The goal is to preserve as much of the variance in the original data as possible when switching the coordinate systems. Given data on D variables, the hope is that the data points will lie mainly in a linear subspace of dimension lower than D. In practice, the data will usually not lie precisely in some lower dimensional subspace.<br />
<br />
<br />
The new variables that form a new coordinate system are called '''principal components''' (PCs). PCs are denoted by <math>\ u_1, u_2, ... , u_D </math>. The principal components form a basis for the data. Since PCs are orthogonal linear transformations of the original variables, there are at most D PCs. Normally, not all of the D PCs are used but rather a subset of d PCs, <math>\ u_1, u_2, ... , u_d </math>, to approximate the space spanned by the original data points <math>\ x_1, x_2, ... , x_D </math>. We can choose d based on what percentage of the variance in the original data we would like to maintain.<br />
<br />
Let <math>\ PC_j</math> be a linear combination of <math>\ x_1, x_2, ... , x_D </math> defined by the coefficients <br />
<math>\ w^{(j)}</math> = <math> ( {w_1}^{(j)}, {w_2}^{(j)},...,{w_D}^{(j)} )^T </math><br />
<br />
Thus, <math> u_j = {w_1}^{(j)} x_1 + {w_2}^{(j)} x_2 + ... + {w_D}^{(j)} x_D = w^{(j)^T} X </math><br />
<br />
<br />
This is a unique configuration since it sets up the PCs in order from maximum to minimum variances. The first PC, <math>\ u_1 </math> is called '''first principal component''' and has the maximum variance, thus it accounts for the most significant variance in the data <math>\ x_1, x_2, ... , x_D </math>. The second PC, <math>\ u_2 </math> is called '''second principal component''' and has the second highest variance and so on until PC, <math>\ u_D </math> which has the minimum variance. <br />
<br />
<br />
To get the first principal component, we would like to use the following equation:<br />
<br />
<math>\ max (Var(w^T X)) = max (w^T S w) </math> <br />
<br />
Where <math>\ S </math> is the covariance matrix. And we solve for <math>\ w </math>.<br />
<br />
<br />
Note: we require the constraint <math>\ w^T w = 1 </math> because if there is no constraint on the length of <math>\ w </math> then there is no upper bound. With the constraint, the direction and not the length that maximizes the variance can be found. <br />
<br />
<br />
====Lagrange Multiplier====<br />
<br />
Before we proceed, we should review Lagrange multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
<br />
Lagrange multipliers are used to find the maximum or minimum of a function <math>\displaystyle f(x,y)</math> subject to constraint <math>\displaystyle g(x,y)=0</math> <br />
<br />
we define a new constant <math> \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle f(x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example :====<br />
Suppose we want to maximize the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to find the maximum value for the function <math>\displaystyle f </math>; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain two stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to understand which one is the maximum, we just need to substitute each into <math>\displaystyle f(x,y)</math> and see which one has the biggest value. In this case the maximum is <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
<br />
===Determining w :===<br />
<br />
Use the Lagrange multiplier conversion to obtain:<br />
<math>\displaystyle L(w, \lambda) = w^T Sw - \lambda (w^T w - 1)</math> where <math>\displaystyle \lambda </math> is a constant <br />
<br />
Take the derivative and set it to zero:<br />
<math>\displaystyle{\partial L \over{\partial w}} = 0 </math><br />
<br />
<br />
To obtain: <br />
<math>\displaystyle 2Sw - 2 \lambda w = 0</math><br />
<br />
<br />
Rearrange to obtain:<br />
<math>\displaystyle Sw = \lambda w</math><br />
<br />
<br />
where <math>\displaystyle w</math> is an eigenvector of <math>\displaystyle S </math> and <math>\ \lambda </math> is its corresponding eigenvalue. Since <math>\displaystyle Sw= \lambda w </math> and <math>\displaystyle w^T w=1</math>, we can write<br />
<br />
<math>\displaystyle w^T Sw= w^T\lambda w= \lambda w^T w =\lambda </math> <br />
<br />
Note that the PCs decompose the total variance in the data in the following way :<br />
<br />
<math> \sum_{i=1}^{D} Var(u_i) </math><br />
<br />
<math>= \sum_{i=1}^{D} (\lambda_i) </math> <br />
<br />
<math>\ = Tr(S) </math><br />
<br />
<math>= \sum_{i=1}^{D} Var(x_i)</math><br />
<br />
== Principal Component Analysis (PCA) Continued (Lecture: Sep. 29, 2011) == <br />
As can be seen from the above expressions, <math>\ Var(W^\top X) = W^\top S W= \lambda </math> where lambda is an eigenvalue of the sample covariance matrix <math>\ S </math> and <math>\ W</math> is its corresponding eigenvector. So <math>\ Var(u_i) </math> is maximized if <math>\ \lambda_i </math> is the maximum eigenvalue of <math>\ S </math> and the first principal component (PC) is the corresponding eigenvector. Each successive PC can be generated in the above manner by taking the eigenvectors of <math>\ S</math><ref>www.wikipedia.org/wiki/Eigenvalues_and_eigenvectors</ref> that correspond to the eigenvalues:<br />
<br />
<math>\ \lambda_1 \geq ... \geq \lambda_D </math> <br />
<br />
such that <br />
<br />
<math>\ Var(u_1) \geq ... \geq Var(u_D) </math><br />
<br />
=== Alternative Derivation ===<br />
Another way of looking at PCA is to consider PCA as a projection from a higher D-dimension space to a lower d-dimensional subspace that minimizes the squared ''reconstruction error''. The squared reconstruction error is the difference between the original data set <math>\ X </math> and the new data set <math> \hat{X} </math> obtained by first projecting the original data set into a lower d-dimensional subspace and then projecting it back into the original higher D-dimension space. Since information is (normally) lost by compressing the original data into a lower d-dimensional subspace, the new data set will (normally) differ from the original data even though both are part of the higher D-dimension space. The reconstruction error is computed as shown below.<br />
<br />
====Reconstruction Error====<br />
<br />
<math> e = \sum_{i=1}^{n} || x_i - \hat{x}_i ||^2 </math><br />
<br />
====Minimize Reconstruction Error====<br />
<br />
Suppose <math> \bar{x} = 0 </math>, i.e. the data have been centered by replacing each <math>\ x_i </math> with <math>\ x_i - \bar{x} </math><br />
<br />
Let <math>\ f(y) = U_d y </math> where <math>\ U_d </math> is a D by d matrix with d orthogonal unit vectors as columns.<br />
<br />
Fit the model to the data and minimize the reconstruction error:<br />
<br />
<math>\ min_{U_d, y_i} \sum_{i=1}^n || x_i - U_d y_i ||^2 </math><br />
<br />
Differentiate with respect to <math>\ y_i </math>:<br />
<br />
<math> \frac{\partial e}{\partial y_i} = 0 </math><br />
<br />
we can rewrite reconstruction-error as : <math>\ e = \sum_{i=1}^n(x_i - U_d y_i)^T(x_i - U_d y_i) </math><br />
<br />
<math>\ \frac{\partial e}{\partial y_i} = -2U_d^T(x_i - U_d y_i) = 0 </math><br />
<br />
Since the columns of <math>\ U_d </math> are orthonormal, <math>\ U_d^T U_d = I </math>, so expanding the above gives<br />
<br />
<math>\ U_d^T x_i - y_i = 0 </math> or equivalently,<br />
<br />
<math>\ y_i = U_d^T x_i </math><br />
<br />
Find the orthogonal matrix <math>\ U_d </math>:<br />
<br />
<math>\ min_{U_d} \sum_{i=1}^n || x_i - U_d U_d^T x_i||^2 </math><br />
<br />
====Using SVD====<br />
<br />
A unique solution can be obtained by finding the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\ X </math>:<br />
<br />
<math>\ X = U S V^T </math><br />
<br />
For each rank d, <math>\ U_d </math> consists of the first d columns of <math>\ U </math>. Also, the covariance matrix can be expressed as follows <math>\ S = \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T </math>.<br />
<br />
Simply put, by subtracting the mean of each of the data point features and then applying SVD, one can find the principal components:<br />
<br />
<math> \tilde{X} = X - \mu </math><br />
<br />
<math>\ \tilde{X} = U S V^T </math><br />
<br />
Where <math>\ X </math> is a d by n matrix of data points and the features of each data point form a column in <math>\ X </math>. Also, <math>\ \mu </math> is a d by n matrix with identical columns each equal to the mean of the <math>\ x_i</math>'s, ie <math>\mu_{:,j}=\frac{1}{n}\sum_{i=1}^n x_i </math>. Note that the arrangement of data points is a convention and indeed in Matlab or conventional statistics, the transpose of the matrices in the above formulae is used.<br />
<br />
As the <math>\ S </math> matrix from the SVD has the eigenvalues arranged from largest to smallest, the corresponding eigenvectors in the <math>\ U </math> matrix from the SVD will be such that the first column of <math>\ U </math> is the first principal component and the second column is the second principal component and so on.<br />
<br />
=== Examples ===<br />
<br />
Note that in the Matlab code in the examples below, the mean was not subtracted from the datapoints before performing SVD. This is what was shown in class. However, to properly perform PCA, the mean should be subtracted from the datapoints.<br />
<br />
==== Example 1 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 560 by 1965. 560 is the number of elements in each column. Each column is a vector representation of a 20x28 grayscale pixel image of a face (see image below) and there is a total of 1965 different images of faces. Each of the images is corrupted by noise, but the noise can be removed by projecting the data back to the original space taking as many dimensions as one likes (e.g. 2, 3, 4, or 5). The corresponding Matlab commands are shown below:<br />
[[File:FreyFaceExample.PNG|thumb|185px|An example of the face images used in [[#Example 1 | Example 1]] with noise removed. Source: <ref>S. Roweis (2011). ''Data for MATLAB.'' [Online]. Available: [http://cs.nyu.edu/~roweis/data.html http://cs.nyu.edu/~roweis/data.html.] |</ref>]]<br />
<pre style="align:left; width: 75%; padding: 2% 2%"><br />
>> % start with a 560 by 1965 matrix X that contains the data points<br />
>> load('noisy.mat');<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 10 by reshaping column 10 into a 20 by 28 matrix<br />
>> imagesc(reshape(X(:,10),20,28)')<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 560 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % project X onto the first ten principal components, then reconstruct it in the original space<br />
>> Y_pca = U(:, 1:10)'*X;<br />
>> X_hat = U(:, 1:10)*Y_pca;<br />
>> <br />
>> % show image in column 10 of X_hat which is now a 560 by 1965 matrix<br />
>> imagesc(reshape(X_hat(:,10),20,28)')<br />
</pre><br />
The reason why the noise is removed in the reconstructed image is because the noise does not create a major variation in a single direction in the original data. Hence, the first ten PCs taken from <math>\ U </math> matrix are not in the direction of the noise. Thus, reconstructing the image using the first ten PCs, will remove the noise.<br />
<br />
==== Example 2 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 64 by 400. 64 is the number of elements in each column. Each column is a vector representation of a 8x8 grayscale pixel image of either a handwritten number ''2'' or a handwritten number ''3'' (see image below) and there are a total of 400 different images, where the first 200 images show a handwritten number ''2'' and the last 200 images show a handwritten number ''3''. <br />
[[File:Handwritten23.PNG|frame|center|An example of the handwritten number images used in [[#Example 2 | Example 2]]. Source: <ref>A. Ghodsi, "PCA" class notes for STAT841, Department of Statistics and Actuarial Science, University of Waterloo, 2011. </ref>]]<br />
<br />
The corresponding Matlab commands for performing PCA on the data points are shown below:<br />
<pre><br />
>> % start with a 64 by 400 matrix X that contains the data points<br />
>> load 2_3.mat;<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 2 by reshaping column 2 into a 8 by 8 matrix<br />
>> imagesc(reshape(X(:,2),8,8))<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 64 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % project data down onto the first two PCs<br />
>> Y = U(:,1:2)'*X;<br />
>> <br />
>> % show Y as an image (can see the change in the first PC at column 200,<br />
>> % when the handwritten number changes from 2 to 3)<br />
>> imagesc(Y)<br />
>> <br />
>> % perform PCA using Matlab build-in function (do not use for assignment)<br />
>> % also note that due to the Matlab convention, the transpose of X is used<br />
>> [COEFF, Y] = princomp(X');<br />
>> <br />
>> % again, use the first two PCs<br />
>> Y = Y(:,1:2);<br />
>> <br />
>> % use plot digits to show the distribution of images on the first two PCs<br />
>> images = reshape(X, 8, 8, 400);<br />
>> plotdigits(images, Y, .1, 1);<br />
</pre><br />
Using the ''plotdigits'' function in Matlab clearly illustrates that the first PC captured the differences between the numbers ''2'' and ''3'', as they are projected onto different regions of the axis for the first PC. Also, the second PC captured the ''tilt'' of the handwritten numbers, as numbers tilted to the left or right were projected onto different regions of the axis for the second PC.<br />
<br />
==== Example 3 ====<br />
(Not discussed in class) In the news recently was a story that captures some of the ideas behind PCA. Over the past two years, Scott Golder and Michael Macy, researchers from Cornell University, collected 509 million Twitter messages from 2.4 million users in 84 different countries. The data they used were words collected at various times of day and they classified the data into two different categories: positive emotion words and negative emotion words. Then, they were able to study this new data to evaluate subjects' moods at different times of day, while the subjects were in different parts of the world. They found that the subjects generally exhibited positive emotions in the mornings and late evenings, and negative emotions mid-day. They were able to "project their data onto a smaller dimensional space" using PCA. Their paper, "Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures," is available in the journal Science.<ref>http://www.pcworld.com/article/240831/twitter_analysis_reveals_global_human_moodiness.html</ref>.<br />
<br />
Assumptions Underlying Principal Component Analysis can be found here<ref>http://support.sas.com/publishing/pubcat/chaps/55129.pdf</ref><br />
<br />
==== Example 4 ====<br />
(Not discussed in class) A somewhat well known learning rule in the field of neural networks called Oja's rule can be used to train networks of neurons to compute the principal component directions of data sets. <ref>A Simplified Neuron Model as a Principal Component Analyzer. Erkki Oja. 1982. Journal of Mathematical Biology. 15: 267-273</ref> This rule is formulated as follows<br />
<br />
<math>\,\Delta w = \eta yx -\eta y^2w </math><br />
<br />
where <math>\,\Delta w </math> is the neuron weight change, <math>\,\eta</math> is the learning rate, <math>\,y</math> is the neuron output given the current input, <math>\,x</math> is the current input and <math>\,w</math> is the current neuron weight. This learning rule shares some similarities with another method for calculating principal components: power iteration. The basic algorithm for power iteration (taken from wikipedia: <ref>Wikipedia. http://en.wikipedia.org/wiki/Principal_component_analysis#Computing_principal_components_iteratively</ref>) is shown below <br />
<br />
<br />
<math>\mathbf{p} =</math> a random vector<br />
do ''c'' times:<br />
<math>\mathbf{t} = 0</math> (a vector of length ''m'')<br />
for each row <math>\mathbf{x} \in \mathbf{X^T}</math><br />
<math>\mathbf{t} = \mathbf{t} + (\mathbf{x} \cdot \mathbf{p})\mathbf{x}</math><br />
<math>\mathbf{p} = \frac{\mathbf{t}}{|\mathbf{t}|}</math><br />
return <math>\mathbf{p}</math><br />
<br />
Comparing this with the neuron learning rule we can see that the term <math>\, \eta y x </math> is very similar to the <math>\,\mathbf{t}</math> update equation in the power iteration method, and identical if the neuron model is assumed to be linear (<math>\,y(x)=x\mathbf{p}</math>) and the learning rate is set to 1. Additionally, the <math>\, -\eta y^2w </math> term performs the normalization, the same function as the <math>\,\mathbf{p}</math> update equation in the power iteration method.<br />
<br />
=== Observations ===<br />
Some observations about the PCA were brought up in class:<br />
<br />
* '''PCA''' assumes that data is on a ''linear subspace'' or close to a linear subspace. For non-linear dimensionality reduction, other techniques are used. Amongst the first proposed techniques for non-linear dimensionality reduction are '''Locally Linear Embedding (LLE)''' and '''Isomap'''. More recent techniques include '''Maximum Variance Unfolding (MVU)''' and '''t-Distributed Stochastic Neighbor Embedding (t-SNE)'''. '''Kernel PCAs''' may also be used, but they depend on the type of kernel used and generally do not work well in practice. (Kernels will be covered in more detail later in the course.)<br />
<br />
* Finding the number of PCs to use is not straightforward. It requires knowledge about the ''intrinsic dimensionality of the data''. In practice, a heuristic approach is often adopted by looking at the eigenvalues ordered from largest to smallest. If there is a "dip" in the magnitude of the eigenvalues, the "dip" is used as a cut off point and only the large eigenvalues before the "dip" are used. Otherwise, it is possible to add up the eigenvalues from largest to smallest until a certain percentage value is reached. This percentage value represents the percentage of variance that is preserved when projecting onto the PCs corresponding to the eigenvalues that have been added together to achieve the percentage (see the Matlab sketch after this list). <br />
<br />
* It is a good idea to normalize the variance of the data before applying PCA. This will avoid PCA finding PCs in certain directions due to the scaling of the data, rather than the real variance of the data.<br />
<br />
* PCA can be considered as an unsupervised approach, since the main direction of variation is not known beforehand, i.e. it is not completely certain which dimension the first PC will capture. The PCs found may not correspond to the desired labels for the data set. There are, however, alternate methods for performing supervised dimensionality reduction.<br />
<br />
* (Not in class) The traditional PCA method does not work well on data sets that lie on a non-linear manifold. A revised PCA method, called c-PCA, has been introduced to improve the stability and convergence of intrinsic dimension estimation. The approach first finds a minimal cover (a cover of a set X is a collection of sets whose union contains X as a subset<ref>http://en.wikipedia.org/wiki/Cover_(topology)</ref>) of the data set. Since set covering is an NP-hard problem, the approach only finds an approximation of the minimal cover to reduce the run-time complexity. In each subset of the minimal cover, it applies PCA and filters out the noise in the data. Finally the global intrinsic dimension can be determined from the variance results from all the subsets. The algorithm produces robust results.<ref>Mingyu Fan, Nannan Gu, Hong Qiao, Bo Zhang, Intrinsic dimension estimation of data by principal component analysis, 2010. Available: http://arxiv.org/abs/1002.2050</ref><br />
<br />
*(Not in class) While PCA finds the mathematically optimal projection (in the sense of minimizing the squared error), it is sensitive to outliers in the data, which produce the large errors that PCA tries to avoid. It is therefore common practice to remove outliers before computing PCA. However, in some contexts, outliers can be difficult to identify. For example, in data mining algorithms like correlation clustering, the assignment of points to clusters and outliers is not known beforehand. A recently proposed generalization of PCA based on a '''Weighted PCA''' increases robustness by assigning different weights to data objects based on their estimated relevancy.<ref>http://en.wikipedia.org/wiki/Principal_component_analysis</ref><br />
<br />
* (Not in class) Comparison between PCA and LDA: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two commonly used techniques for data classification and dimensionality reduction. According to [24], "Linear Discriminant Analysis easily handles the case where the within-class frequencies are unequal and their performance has been examined on randomly generated test data. This method maximizes the ratio of between-class variance to the within-class variance in any particular data set thereby guaranteeing maximal separability. ... The prime difference between LDA and PCA is that PCA does more of feature classification and LDA does data classification. In PCA, the shape and location of the original data sets changes when transformed to a different space whereas LDA doesn't change the location but only tries to provide more class separability and draw a decision region between the given classes. This method also helps to better understand the distribution of the feature data." [[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf]]<br />
<br />
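(Not in class) As a rough Matlab illustration of the eigenvalue heuristic above, the following sketch keeps just enough PCs to preserve 95% of the variance; the data matrix X (d-by-n) and the 95% threshold are placeholders.<br />
<br />
<pre><br />
Sigma  = cov(X');                          % rows of X' are observations<br />
lambda = sort(eig(Sigma), 'descend');      % eigenvalues from largest to smallest<br />
ratio  = cumsum(lambda) / sum(lambda);     % fraction of variance preserved<br />
numPC  = find(ratio >= 0.95, 1);           % smallest number of PCs keeping 95% of the variance<br />
</pre><br />
<br />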
=== Summary ===<br />
The PCA algorithm can be summarized into the following steps:<br />
<br />
# '''Recover basis'''<br />
#: <math>\ \text{ Calculate } XX^T=\Sigma_{i=1}^{t}x_ix_{i}^{T} \text{ and let } U=\text{ eigenvectors of } XX^T \text{ corresponding to the largest } d \text{ eigenvalues.} </math><br />
# '''Encode training data'''<br />
#: <math>\ \text{Let } Y=U^TX \text{, where } Y \text{ is a } d \times t \text{ matrix of encodings of the original data.} </math><br />
# '''Reconstruct training data'''<br />
#: <math> \hat{X}=UY=UU^TX </math>.<br />
# '''Encode test example'''<br />
#: <math>\ y = U^Tx \text{ where } y \text{ is a } d\text{-dimensional encoding of } x </math>.<br />
# '''Reconstruct test example'''<br />
#: <math> \hat{x}=Uy=UU^Tx </math>.<br />
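<br />
The steps above can be written directly in Matlab. This is only a sketch: it assumes the columns of X are the (centred) training points, x is a new test point, and d has already been chosen.<br />
<br />
<pre><br />
% 1. recover basis<br />
[U, D] = eig(X * X');                      % eigenvectors and eigenvalues of X*X'<br />
[srt, idx] = sort(diag(D), 'descend');<br />
U = U(:, idx(1:d));                        % eigenvectors of the d largest eigenvalues<br />
% 2./3. encode and reconstruct the training data<br />
Y = U' * X;                                % d-by-t matrix of encodings<br />
Xhat = U * Y;                              % reconstruction U*U'*X<br />
% 4./5. encode and reconstruct a test example x<br />
y = U' * x;<br />
xhat = U * y;<br />
</pre><br />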
<br />
== Fisher Discriminant Analysis (FDA) (Lecture: Sep. 29, 2011) ==<br />
<br />
'''Fisher Discriminant Analysis (FDA)''' is sometimes called ''Fisher Linear Discriminant Analysis (FLDA)'' or just ''Linear Discriminant Analysis (LDA)''. This causes confusion with the [[#LDA | ''Linear Discriminant Analysis (LDA)'']] technique covered earlier in the course. The LDA technique covered earlier in the course has a normality assumption and is a boundary finding technique. The FDA technique outlined here is a supervised feature extraction technique. FDA differs from PCA as well because PCA does not use the class labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math> while FDA organizes data into their ''classes'' by finding the direction of maximum separation between classes.<br />
<br />
== Fisher Discriminant Analysis (FDA) Continued (Lecture: Oct. 04, 2011) ==<br />
<br />
One main drawback of the PCA technique is that the direction of greatest variation may not correspond to the classification we desire. For example, imagine if the [[#Example 2 | data set]] above had a lightening filter applied to a random subset of the images. Then the greatest variation would be the brightness and not the more important variations we wish to classify. FDA circumvents this problem by using the labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math> i.e. the FDA uses ''supervised learning''. An elementary way to see the algorithm is to imagine two classes of data projected onto a suitably chosen line that minimizes the within-class variance and maximizes the distance between the two classes, i.e. it groups similar data together and spreads different data apart. This way, newly acquired data can be compared, after a transformation, to these projections using some well-chosen metric.<br />
<br />
<br />
We first consider the case of two classes. Denote the mean and covariance matrix of class <math>i=0,1</math> by <math>\mathbf{\mu}_i</math> and <math>\mathbf{\Sigma}_i</math> respectively. We transform the data so that it is projected into 1 dimension i.e. a scalar value. To do this, we compute the inner product of our <math>d \times 1</math>-dimensional data, <math>\mathbf{x}</math>, with a to-be-determined <math>d \times 1</math>-dimensional vector <math>\mathbf{w}</math>. The new means and covariances of the transformed data are:<br />
<br />
::<math> \mu'_i:\rightarrow \mathbf{w}^{T}\mathbf{\mu}_i </math> <br/><br />
::<math> \Sigma'_i :\rightarrow \mathbf{w}^{T}\mathbf{\Sigma}_i \mathbf{w}</math><br />
<br />
The new means and variances are actually scalar values now, but we will use vector and matrix notation and arguments throughout the following derivation as the multi-class case is then just a simpler extension. <br />
<br />
===Goals of FDA===<br />
<br />
As will be shown in the objective function, the goal of FDA is to maximize the separation of the classes (between class variance) and minimize the scatter within each class (within class variance). That is, our ideal situation is that the individual classes are as far away from each other as possible and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case). An interesting note is that R. A. Fisher, after whom FDA is named, used the FDA technique for purposes of taxonomy, in particular for categorizing different species of iris flowers. <ref name="RAFisher">R. A. Fisher, "The Use of Multiple measurements in Taxonomic Problems," ''Annals of Eugenics'', 1936</ref>. It is very easy to visualize what is meant by within class variance (i.e. differences between the iris flowers of the same species) and between class variance (i.e. the differences between the iris flowers of different species) in that case.<br />
<br />
<br />
'''1)''' Our '''first''' goal is to minimize the individual classes' covariance. This will help to collapse the data together. <br />
We have two minimization problems<br />
<br />
::<math>\min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_0 \mathbf{w}^{T}</math> <br />
and <br />
::<math>\min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_1 \mathbf{w}^{T}</math>.<br />
<br />
But these can be combined:<br />
::<math> \min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_0 \mathbf{w}^{T} + \mathbf{w} \mathbf{\Sigma}_1 \mathbf{w}^{T}</math> <br />
:: <math> = \min_{\mathbf{w}} \mathbf{w} ( \mathbf{\Sigma_0} + \mathbf{\Sigma_1} ) \mathbf{w}^{T} </math><br />
<br />
Define <math> \mathbf{S}_W =\mathbf{\Sigma_0} + \mathbf{\Sigma_1} </math>, called the ''within class variance matrix''. <br />
<br />
'''2)''' Our '''second''' goal is to move the minimized classes as far away from each other as possible. One way to accomplish this is to maximize the distances between the means of the transformed data i.e.<br />
<br />
<math> \max_{\mathbf{w}} |\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1|^2 </math><br />
<br />
Simplifying:<br />
::<math> \max_{\mathbf{w}} \,(\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1)^T (\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1) </math> <br/><br />
::<math> = \max_{\mathbf{w}}\, (\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w} \mathbf{w}^{T} (\mathbf{\mu}_0-\mathbf{\mu}_1)</math> <br/><br />
::<math> = \max_{\mathbf{w}} \,\mathbf{w}^{T}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w}</math><br />
<br />
Recall that <math> \mathbf{\mu}_i </math> are known. Denote<br />
<br />
::<math> \mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math> <br />
<br />
This matrix, called the ''between class variance matrix'', is a rank 1 matrix, so an inverse does not exist. Altogether, we have two optimization problems we must solve simultaneously:<br />
<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} </math><br />
<br />
There are other metrics one can use to both minimize the data's variance and maximizes the distance between classes, and other goals we can try to accomplish (see metric learning, below...one day), but Fisher used this elegant method, hence his recognition in the name, and we will follow his method.<br />
<br />
We can combine the two optimization problems into one after noting that the negative of max is min:<br />
<br />
::<math> \max_{\mathbf{w}} \,\mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \alpha \mathbf{w} \mathbf{S_W} \mathbf{w}^{T}</math><br/><br />
<br />
The <math>\alpha</math> coefficient is a necessary scaling factor: if the scale of one of the terms is much larger than the other, the optimization problem will be dominated by the larger term. This means we have another unknown, <math>\alpha</math>, to solve for. Instead, we can circumvent the scaling problem by looking at the ratio of the quantities, the original solution Fisher proposed:<br />
<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w} \mathbf{S_B} \mathbf{w}^{T}}{\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}} </math><br />
<br />
This optimization problem can be shown<ref><br />
http://www.socher.org/uploads/Main/optimizationTutorial01.pdf<br />
</ref> to be equivalent to the following optimization problem:<br />
<br />
:: <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T}</math> <br />
<br />
subject to:<br />
<br />
:: <math> \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} = 1 </math><br />
<br />
A heuristic understanding of this equivalence is that we have two degrees of freedom: direction and scalar. The scalar value is irrelevant to our discussion. Thus, we can set one of the values to be a constant. We can use Lagrange multipliers to solve this optimization problem:<br />
<br />
::<math>L( \mathbf{w}, \lambda) = \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \lambda(\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}-1)</math><br />
:: <math> \Rightarrow \frac{\partial L}{\partial \mathbf{w}} = 2 \mathbf{S}_B \mathbf{w} - 2\lambda \mathbf{S}_W\mathbf{w} </math><br />
<br />
Setting the partial derivative to 0 gives us a ''generalized eigenvalue problem'':<br />
<br />
::<math> \mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w} </math><br />
:: <math> \Rightarrow \mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w} </math><br />
<br />
This is a generalized eigenvalue problem and <math>\mathbf{w}</math> can be computed as the eigenvector corresponding to the largest eigenvalue of <br />
:: <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math><br />
<br />
It is very likely that <math> \mathbf{S}_W </math> has an inverse. If not, the pseudo-inverse<ref><br />
http://en.wikipedia.org/wiki/Generalized_inverse<br />
</ref><ref><br />
http://www.mathworks.com/help/techdoc/ref/pinv.html<br />
</ref> can be used. In Matlab the pseudo-inverse function is named ''pinv''. Thus, we should choose <math>\mathbf{w}</math> to equal the eigenvector of the largest eigenvalue as our projection vector. <br />
<br />
In fact we can simplify the above expression further in the case of two classes. Recall the definition of <math>\mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math>. Substituting this into our expression:<br />
<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w} = \lambda \mathbf{w} </math><br />
::<math> (\mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) ) ((\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w}) = \lambda \mathbf{w} </math><br />
<br />
This second term is a scalar value, let's denote it <math>\beta</math>. Then<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) = \frac{\lambda}{\beta} \mathbf{w} </math><br />
::<math> \Rightarrow \, \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) \propto \mathbf{w} </math><br />
<br />
All we are interested in is the direction of <math>\mathbf{w}</math>, so computing this expression is sufficient to find our projection vector. Note that this shortcut does not carry over to more than two classes, where <math>\mathbf{w}</math> becomes a matrix rather than a vector, as seen below.<br />
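<br />
A minimal Matlab sketch of the two-class case, assuming X0 and X1 hold the points of each class as columns; since only the direction of <math>\mathbf{w}</math> matters, no eigen-decomposition is needed here.<br />
<br />
<pre><br />
mu0 = mean(X0, 2);   mu1 = mean(X1, 2);    % class means<br />
Sw  = cov(X0') + cov(X1');                 % within class variance matrix<br />
w   = Sw \ (mu0 - mu1);                    % direction proportional to Sw^(-1)(mu0 - mu1)<br />
w   = w / norm(w);                         % the scale of w is irrelevant<br />
z0  = w' * X0;   z1 = w' * X1;             % 1-dimensional projections of the two classes<br />
</pre><br />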
<br />
=== Extensions to Multiclass Case ===<br />
If we have <math>\ k</math> classes, we need <math>\ k-1</math> directions i.e. we need to project <math>\ k</math> 'points' onto a <math>\ k-1</math> dimensional hyperplane. What does this change in our above derivation? The most significant difference is that our projection vector,<math>\mathbf{w}</math>, is no longer a vector but instead is a matrix <math>\mathbf{W}</math>. We transform the data as:<br />
<br />
::<math> \mathbf{x}' :\rightarrow \mathbf{W}^{T} \mathbf{x}</math><br />
so our new mean and covariances for class k are:<br />
::<math> \mathbf{\mu_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\mu_k}</math><br />
::<math> \mathbf{\Sigma_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W}</math><br />
<br />
What are our new optimization sub-problems? As before, we wish to minimize the within class variance. This can be formulated as:<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{\Sigma_1} \mathbf{W} + \dots + \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W} </math><br />
<br />
Again, denoting <math>\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}</math>, we can simplify above expression:<br />
<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_W \mathbf{W} </math><br />
<br />
Similarly, the second optimization problem is:<br />
<br />
::<math>\max_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_B \mathbf{W} </math><br />
<br />
What is <math>\mathbf{S}_B</math> in this case? It can be shown that <math>\mathbf{S}_T = \mathbf{S}_B + \mathbf{S}_W </math> where <math> \mathbf{S}_T </math> is the covariance matrix of all the data. From this we can compute <math> \mathbf{S}_B </math>. <br />
<br />
Next, if we express <math> \mathbf{W} = ( \mathbf{w}_1 , \mathbf{w}_2 , \dots ,\mathbf{w}_k ) </math> observe that, for <math> \mathbf{A} = \mathbf{S}_B , \mathbf{S}_W </math>: <br />
<br />
::<math> Tr(\mathbf{W}^{T} \mathbf{A} \mathbf{W}) = \mathbf{w}_1^{T} \mathbf{A} \mathbf{w}_1 + \dots + \mathbf{w}_k^{T} \mathbf{A} \mathbf{w}_k </math><br />
<br />
where <math>\ Tr()</math> is the trace of a matrix. Thus, following the same steps as in the two-class case, we have the new optimization problem:<br />
<br />
::<math> \max_{\mathbf{W}} \frac{ Tr(\mathbf{W}^{T} \mathbf{S}_B \mathbf{W}) }{Tr(\mathbf{W}^{T} \mathbf{S}_W \mathbf{W})} </math> <br />
<br />
subject to:<br />
<br />
:: <math> \mathbf{W}^{T} \mathbf{S_W} \mathbf{W} = \mathbf{I} </math><br />
<br />
Again, in order to solve the above optimization problem, we can use the Lagrange multiplier <ref><br />
http://en.wikipedia.org/wiki/Lagrange_multiplier </ref>:<br />
<br />
:: <math>\begin{align}L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - Tr\left[\Lambda\left( \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} - \mathbf{I} \right)\right]\end{align}</math>.<br />
<br />
where <math>\ \Lambda</math> is a d by d diagonal matrix.<br />
<br />
Then, differentiating with respect to <math>\mathbf{W}</math>:<br />
<br />
:: <math>\begin{align}\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\end{align} = 0</math>.<br />
<br />
Thus:<br />
<br />
:: <math>\begin{align}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}\end{align}</math><br />
<br />
:: <math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{W}\end{align}</math><br />
<br />
where, <math> \mathbf{\Lambda} =\begin{pmatrix}\lambda_{1} & & 0\\&\ddots&\\0 & &\lambda_{d}\end{pmatrix}</math><br />
<br />
The above equation is in the form of an eigenvalue problem. Thus, for the solution, the k-1 eigenvectors corresponding to the k-1 largest eigenvalues should be chosen as the projection matrix, <math>\mathbf{W}</math>. In fact, there should only be k-1 eigenvectors corresponding to k-1 non-zero eigenvalues when using the above equation.<br />
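<br />
(Sketch, not from the lecture) In Matlab the generalized eigenvalue problem above can be handed to ''eig'' directly. Here Sw and Sb are assumed to have been computed as described, X holds the data as columns, and k is the number of classes.<br />
<br />
<pre><br />
[V, D] = eig(Sb, Sw);                      % solves Sb*V = Sw*V*D<br />
[srt, idx] = sort(diag(D), 'descend');<br />
W = V(:, idx(1:k-1));                      % projection matrix: k-1 leading eigenvectors<br />
Xproj = W' * X;                            % data projected onto k-1 dimensions<br />
</pre><br />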
<br />
=== Summary ===<br />
FDA has two optimization problems:<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} </math> <br />
<br />
where <math>\ S_W = \Sigma_0 + \Sigma_1</math> is called the within class variance and <math>\ S_B = (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T </math> is called the between class variance.<br />
<br />
The two optimization problems are combined as follows:<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w} \mathbf{S_B} \mathbf{w}^{T}}{\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}} </math><br />
<br />
By adding a constraint as shown:<br />
::<math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T}</math><br />
<br />
subject to:<br />
:: <math> \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} = 1 </math><br />
<br />
Lagrange multipliers can be used and essentially the problem becomes an eigenvalue problem:<br />
<br />
::<math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w} = \lambda\mathbf{w}\end{align}</math><br />
<br />
And <math>\ w </math> can be computed as the k-1 eigenvectors corresponding to the largest k-1 eigenvalues of <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math>.<br />
<br />
=== Variations ===<br />
<br />
Some adaptations and extensions exist for the FDA technique (Source: <ref>R. Gutierrez-Osuna, "Linear Discriminant Analysis" class notes for Intro to Pattern Analysis, Texas A&M University. Available: [http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf]</ref>):<br />
<br />
1) ''Non-Parametric LDA (NPLDA)'' by Fukunaga<br />
<br />
This method does not assume that the Gaussian distribution is unimodal and it is actually possible to extract more than k-1 features (where k is the number of classes).<br />
<br />
2) ''Orthonormal LDA (OLDA)'' by Okada and Tomita<br />
<br />
This method finds projections that are orthonormal in addition to maximizing the FDA objective function. This method can also extract more than k-1 features (where k is the number of classes).<br />
<br />
3) ''Generalized LDA (GLDA)'' by Lowe<br />
<br />
This method incorporates additional cost functions into the FDA objective function. This causes classes with a higher cost to be placed further apart in the lower dimensional representation.<br />
<br />
== Linear and Logistic Regression (Lecture: Oct. 06, 2011) ==<br />
<br />
=== Linear Regression ===<br />
<br />
In regression, <math>\ y </math> is a continuous variable. In classification, <math>\ y </math> is a discrete variable. Regression problems are easier to formulate into functions (since <math>\ y </math> is continuous) and it is possible to solve classification problems by treating them like regression problems. In order to do so, the requirement in classification that <math>\ y </math> is discrete must first be relaxed. Once <math>\ y </math> has been found using regression techniques, it is possible to determine the discrete class corresponding to the <math>\ y </math> that has been found to solve the original classification problem. The discrete class is obtained by defining a threshold where <math>\ y </math> values below the threshold belong to one class and <math>\ y </math> values above the threshold belong to another class.<br />
<br />
<br />
More formally: a more direct approach to classification is to estimate the regression function <math>\ r(\mathbf{x}) = E[Y | X]</math> without bothering to estimate <math>\ f_k(\mathbf{x}) </math>.<br />
<br />
In two-class problems, if <math>\ Y = \{0,1\}</math>, then <math>\, h^*(\mathbf{x})= \left\{\begin{matrix}<br />
1 &\text{,if } \hat r(\mathbf{x})>\frac{1}{2} \\<br />
0 &\mathrm{,otherwise.} \end{matrix}\right.</math><br />
<br />
Basically, we can use a linear function<br />
<math>\ f(x, \beta) = \mathbf{\beta\,}^T \mathbf{x_{i}} + \mathbf{\beta\,_0} </math> and use the least squares approach to fit the function to the given data. This is done by minimizing the following expression:<br />
<br />
<math>\min_{\mathbf{\beta}} \sum_{i=1}^n (y_i - \mathbf{\beta}^T<br />
\mathbf{x_{i}} - \mathbf{\beta_0})^2</math><br />
<br />
where<br />
<br />
<math>\tilde{\mathbf{\beta}} = \left( \begin{array}{c} \mathbf{\beta}_{1} \\ \vdots \\ \mathbf{\beta}_{d} \\ \mathbf{\beta}_{0} \end{array} \right)</math>.<br />
<br />
For convenience, <math>\mathbf{\beta}</math> and <math>\mathbf{\beta}_0</math> have been combined into a d+1 dimensional vector, and an extra component equal to 1 is appended to each <math>\ x_i </math>. Thus, the function to be minimized can now be expressed as:<br />
<br />
<math>\ \min_{\tilde{\beta}} \sum_{i=1}^{n} (y_i - \tilde{\beta}^T \tilde{x}_i )^2 </math><br />
<br />
<math>\ = \min_{\tilde{\beta}} | y - X^T \tilde{\beta} |^2 </math><br />
<br />
where <math>\ y </math> and <math>\tilde{\beta}</math> are vectors and <math>\ X </math> is the <math>(d+1) \times n</math> matrix whose columns are the <math>\tilde{x}_i</math>.<br />
<br />
The solution for <math>\ \tilde{\beta} </math> is<br />
<br />
<math>\ {\tilde{\beta}} = (XX^T)^{-1}Xy </math><br />
<br />
Using regression to solve classification problems is not mathematically correct, if we want to be true to classification. However, this method works well in practice, if the problem is not complicated. When we have only two classes (encoded as <math>\ \frac{-n}{n_1} </math> and <math>\ \frac{n}{n_2}) </math>, this method is identical to LDA.<br />
<br />
==== Matlab Example ====<br />
<br />
The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample;ones(1,400)];<br />
Construct x by adding a row of ones to the data.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
===Logistic Regression===<br />
<br />
Logistic regression is a more advanced method for classification, and is<br />
more commonly used. <br />
<br />
We can define a function <br /><br />
<math>f_1(x)= P(Y=1| X=x) = (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math><br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
<br />
<br />
This is a valid density function. It looks similar to a step function, but<br />
we have relaxed it so that we have a smooth curve, and can therefore take the<br />
derivative.<br />
<br />
The range of this function is (0,1) since<br /> <br />
<math>\lim_{x \to -\infty}f_1(\mathbf{x}) = 0</math> and<br />
<math>\lim_{x \to \infty}f_1(\mathbf{x}) = 1</math>.<br />
<br />
As shown on this graph:<br /><br />
http://www.wolframalpha.com/input/?i=Plot[E^x/%281+%2B+E^x%29,+{x,+-10,+10}]%29<br />
<br />
Then we compute the complement of f1(x), and get<br /><br />
<br />
<math>f_2(x)= P(Y=0| X=x) = 1-f_1(x) = (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math>, denoted f2. <br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
<br />
<br />
Function f2 is commonly called the logistic function, and it behaves like <br /><br />
<math>\lim_{x \to -\infty}f_2(\mathbf{x}) = 1</math> and<br /><br />
<math>\lim_{x \to \infty}f_2(\mathbf{x}) = 0</math>.<br />
<br />
As shown on this graph:<br /><br />
http://www.wolframalpha.com/input/?i=Plot[1/%281+%2B+E^x%29,+{x,+-10,+10}]%29<br />
<br />
From here, we can form the conditional density function. To do this, we combine <math>f_1</math> and <math>f_2</math> so that the density reduces to <math>f_1</math> when y=1 (i.e. the point is in class 1) and to <math>f_2</math> when y=0 (i.e. the point is in class 0). This is done by raising <math>f_1</math> to the power <math>y</math> and <math>f_2</math> to the power <math>1-y</math>.<br />
<br />
Eventually, we have our conditional density function formula<br /><br />
<math>f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}</math><br />
<br />
The way to use this formula is, given the training data <math>(x_i, y_i)</math>, to fit the data with <math>f(y|\mathbf{x})</math>.<br />
<br />
In general, we can think of the problem as having a box with some knobs. Inside the box is our objective function, which gives the form used to classify our input <math>x_i</math> to our output <math>y_i</math>. The knobs on the box act as the parameters of the objective function. Our job is to find the proper parameters that minimize the error between our output and the true value. So we have turned our machine learning problem into an optimization problem. <br />
<br />
Since we need to find the parameters that maximize the chance of having our observed data coming from the distribution of f(x|parameter), we need to introduce Maximum Likelihood Estimation.<br />
<br />
====Maximum Likelihood Estimation====<br />
<br />
Given iid data points <math>({\mathbf{x}_i})_{i=1}^n</math> and a density function <math>f(\mathbf{x}|\mathbf{\theta})</math>, where the form of f is known but the parameters <math>\theta</math> are unknown, the maximum likelihood estimate <math>\theta\,_{ML}</math> is the set of parameters that maximizes the probability of observing <math>({\mathbf{x}_i})_{i=1}^n</math> given <math>\theta\,_{ML}</math>.<br />
<br />
<math>\theta_\mathrm{ML} = \underset{\theta}{\operatorname{arg\,max}}\ f(\mathbf{x}|\theta)</math>.<br />
<br />
There was some discussion in class regarding the notation. In literature, Bayesians use <math>f(\mathbf{x}|\mu)</math> while Frequentists use <math>f(\mathbf{x};\mu)</math>. In practice, these two are equivalent.<br />
<br />
Our goal is to find the <math>\theta</math> that maximizes <br />
<math>\mathcal{L}(\theta\,) = f\left(({\mathbf{x}_i})_{i=1}^n \,|\; \theta\right) = \prod_{i=1}^n f(\mathbf{x_i}|\theta)</math>. (The second equality holds because the data points are iid.)<br />
<br />
In many cases, it's more convenient to work with the natural logarithm of the likelihood. (Recall that the logarithm is monotonic, so it preserves the locations of minima and maxima.)<br />
<math>\ell(\theta|\mathbf{x})=\ln\mathcal{L}(\theta\,)</math> <br />
<br />
<math>\ell(\theta\,)=\sum_{i=1}^n \ln f(\mathbf{x_i}|\theta)</math><br />
<br />
Applying Maximum Likelihood Estimation to <math>f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}</math>, gives<br />
<br />
<math>\mathcal{L}(\mathbf{\beta\,})=\prod_{i=1}^n (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y_i} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y_i}</math><br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) + (1-y_i) (\ln{1} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}))\right) \\[10pt]&{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - (1-y_i) \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) + y_i \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
<math>\begin{align} {\frac{\partial \ell}{\partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}} \mathbf{x_i} \right) \\[8pt] & {}= \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(\mathbf{x_i} | \mathbf{\beta\,}) \mathbf{x_i}\right) \end{align}</math><br />
<br />
Setting <math>\frac{\partial \ell}{\partial \mathbf{\beta\,}} = 0</math> gives an equation that can be solved numerically by Newton's method.<br />
<br />
====Newton's Method====<br />
<br />
Newton's method (or the Newton-Raphson method) is a numerical method for finding successively better approximations to the roots of a real-valued function. It is useful when the root cannot be obtained in analytical form. <br />
<br />
The goal is to find <math>\mathbf{x}</math> such that <math>f(\mathbf{x}) = 0 </math>. The recursion can be implemented by<br />
<math>\mathbf{x_1} = \mathbf{x_0} - \frac{f(\mathbf{x_0})}{f'(\mathbf{x_0})}.\,\!</math><br />
<br />
It takes an initial guess <math>\mathbf{x_0}</math> and moves in the direction given by <math>-\mathbf{f(x_{0}) / f' (x_{0})}</math> toward a better approximation <math>\mathbf{x_1}</math>. Taking this <math>\mathbf{x_1}</math> as the starting point for the next iteration, it finds an even better approximation, and by repeating the same process the iterates become sufficiently accurate approximations to the actual solution.<br />
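<br />
A small Matlab sketch of one-dimensional Newton's method, here used to find the root of a toy function; the function, its derivative, the starting point and the number of iterations are all arbitrary choices for illustration.<br />
<br />
<pre><br />
f  = @(x) x.^3 - 2;                        % toy function whose root we want<br />
fp = @(x) 3*x.^2;                          % its derivative<br />
x  = 1;                                    % initial guess x0<br />
for iter = 1:20<br />
    x = x - f(x) / fp(x);                  % Newton update<br />
end<br />
% x is now close to 2^(1/3)<br />
</pre><br />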
<br />
<br />
<br />
===Advantages of Logistic Regression===<br />
<br />
Logistic regression has several advantages over discriminant analysis: <br />
<br />
* it is more robust: the independent variables don't have to be normally distributed, or have equal variance in each group <br />
* It does not assume a linear relationship between the IV and DV <br />
* It may handle nonlinear effects <br />
* You can add explicit interaction and power terms <br />
* The DV need not be normally distributed. <br />
* There is no homogeneity of variance assumption. <br />
* Normally distributed error terms are not assumed. <br />
* It does not require that the independents be interval. <br />
* It does not require that the independents be unbounded.<br />
<br />
==Newton-Raphson Method (Lecture: Oct 11, 2011)==<br />
Previously we derived the log likelihood function for logistic regression. <br />
<br />
<math>\begin{align} L(\beta\,) = \prod_{i=1}^n \left( (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y_i}(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y_i} \right) \end{align}</math><br />
<br />
After taking log, we can have<br />
<br />
<math>\begin{align} \ell(\beta\,) = \sum_{i=1}^n \left( y_i \log{\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}} + (1 - y_i) \log{\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}} \right) \end{align}</math><br />
<br />
which implies that<br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
Our goal is to find the <math>\beta\,</math> that maximizes <math>{\ell(\mathbf{\beta\,})}</math>. We use calculus to do this, i.e. we solve <math>{\frac{\partial \ell}{\partial \mathbf{\beta\,}}}=0</math>. To do this we use the famous numerical method of Newton-Raphson. This is an iterative method where we calculate the first and second derivatives at each iteration.<br />
<br />
The first derivative is typically called the score vector.<br />
<br />
<math>\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}} \mathbf{x_i} \right) \\[8pt] \end{align}</math><br />
<br />
The negative of the second derivative is typically called the information matrix.<br />
<br />
<math>\begin{align} I(\beta\,) {}= -{\frac{\partial \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}) \right) \\[8pt] \end{align}</math><br />
<br />
We then use the following update formula to calculate continually better estimates of the optimal <math>\beta\,</math>. The choice of the initial estimate <math>\beta\,^{(1)}</math> is typically not important.<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + I^{-1}(\beta\,^{(r)} )S(\beta\,^{(r)} )</math><br />
<br />
====Matrix Notation====<br />
<br />
let <math>\mathbf{y}</math> be a (n x 1) vector of all class labels. This is called the response in other contexts.<br />
<br />
let <math>\mathbb{X}</math> be a (n x (d+1)) matrix of all your features. Each row represents a data point. Each column represents a feature/covariate.<br />
<br />
let <math>\mathbf{p}^{(r)}</math> be a (n x 1) vector with values <math> P(\mathbf{x_i} |\beta\,^{(r)} ) </math><br />
<br />
let <math>\mathbb{W}^{(r)}</math> be a (n x n) diagonal matrix with <math>\mathbb{W}_{ii}^{(r)} {}= P(\mathbf{x_i} |\beta\,^{(r)} )(1 - P(\mathbf{x_i} |\beta\,^{(r)} ))</math><br />
<br />
We can rewrite our score vector, information matrix and update equation in terms of this new matrix notation. The first derivative is<br />
<br />
<math>\begin{align} S(\beta\,^{(r)}) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})\end{align}</math><br />
<br />
and the second derivative is<br />
<br />
<math>\begin{align} I(\beta\,^{(r)}) {}= -{\frac{\partial \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X} \end{align}</math><br />
<br />
Therefore, we can fit the regression problem with the update<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + I^{-1}(\beta\,^{(r)} )S(\beta\,^{(r)} ) {}= \beta\,^{(r)} + (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})</math><br />
<br />
====Iteratively Re-weighted Least Squares====<br />
If we reorganize this updating formula we can see that it is really solving a least squares problem iteratively, each time with a new weighting.<br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}(\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X}\beta\,^{(r)} + \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)}))</math><br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T\mathbb{W}^{(r)}\mathbf{z}^{(r)}</math><br />
<br />
where <math> \mathbf{z}^{(r)} = \mathbb{X}\beta\,^{(r)} + (\mathbb{W}^{(r)})^{-1}(\mathbf{y}-\mathbf{p}^{(r)}) </math><br />
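<br />
(Not in class) A rough Matlab sketch of the iteratively re-weighted least squares update above; Xmat is assumed to be the n-by-(d+1) design matrix (with a column of ones) and y the n-by-1 vector of 0/1 labels. The number of iterations is a placeholder and no convergence check is done.<br />
<br />
<pre><br />
beta = zeros(size(Xmat, 2), 1);            % initial estimate of beta<br />
for r = 1:25<br />
    p = 1 ./ (1 + exp(-Xmat * beta));      % P(Y=1 | x_i, beta) for every point<br />
    W = diag(p .* (1 - p));                % n-by-n diagonal weight matrix<br />
    z = Xmat * beta + W \ (y - p);         % adjusted response z^(r)<br />
    beta = (Xmat' * W * Xmat) \ (Xmat' * W * z);   % weighted least squares update<br />
end<br />
</pre><br />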
<br />
<br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T \underline{\beta})^T(\underline{y}-X^T \underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{(r+1)} \leftarrow arg \min_{\underline{\beta}}(Z-X \underline{\beta})^T W (Z-X \underline{\beta})</math><br />
<br />
====Fisher Scoring Method==== <br />
<br />
Fisher scoring is a method very similar to Newton-Raphson. It uses the expected information matrix as opposed to the observed information matrix. This distinction simplifies the problem and in particular the computational complexity. To learn more about this method and logistic regression in general you can take Stat431/831 at the University of Waterloo.<br />
<br />
===Multi-class Logistic Regression===<br />
<br />
In a multi-class logistic regression we have K classes. For 2 classes ''k'' and ''l''<br />
<br />
<math>\frac{P(Y=l|X=x)}{P(Y=k|X=x)} = e^{\beta_l^T x}</math><br />
<br />
We call <math>log(\frac{P(Y=l|X=x)}{P(Y=k|X=x)}) = \beta_l^T x</math> the logit transformation. The decision boundary between the 2 classes is the set of points where the logit transformation is 0.<br />
<br />
For each class from 1 to K-1 we then have:<br />
<br />
<math>log(\frac{P(Y=1|X=x)}{P(Y=K|X=x)}) = \beta_1^T x</math><br />
<br />
<math>log(\frac{P(Y=2|X=x)}{P(Y=K|X=x)}) = \beta_2^T x</math><br />
<br />
<math>log(\frac{P(Y=K-1|X=x)}{P(Y=K|X=x)}) = \beta_{K-1}^T x</math><br />
<br />
Note that choosing ''Y=K'' is arbitrary and any other choice is equally valid.<br />
<br />
Based on the above the posterior probabilities are given by: <math>P(Y=k|X=x) = \frac{e^{\beta_k^T x}}{1 + \sum_{i=1}^{K-1}{e^{\beta_i^T x}}}</math><br />
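<br />
As a small illustration, the posterior probabilities above can be computed in Matlab as follows; B is assumed to be a d-by-(K-1) matrix whose columns are the fitted <math>\beta_k</math>, and x a single d-by-1 input.<br />
<br />
<pre><br />
num   = exp(B' * x);                       % e^(beta_k' x) for k = 1..K-1<br />
denom = 1 + sum(num);<br />
post  = [num; 1] / denom;                  % P(Y=k | X=x) for k = 1..K, class K is the reference<br />
</pre><br />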
<br />
===Sample Size Requirements===<br />
<br />
The number of adjustable components in linear discriminant analysis is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math> where d is the dimension of the data (the number of features). Similarly, the number of adjustable components in logistic regression is <math>\, d+1</math>. The number of components also corresponds to the minimum number of observations needed to compute the coefficients of each function. Techniques do exist though for handling high dimensional problems where the number of parameters exceeds the number of observations.<br />
<br />
Linear discriminant analysis involves the inversion of a d x d covariance matrix. When d is bigger than n, the number of observations, this matrix has rank at most n < d and is thus singular. When this is the case, we can either use the pseudo-inverse or perform regularized discriminant analysis, which solves this problem. In RDA, we define a new covariance matrix <math>\, \Sigma(\gamma) = \gamma\Sigma + (1 - \gamma)diag(\Sigma)</math> with <math>\gamma \in [0,1]</math>. Cross validation can be used to calculate the best <math>\, \gamma</math>. More details on RDA can be found in Guo et al. (2006).<br />
<br />
Logistic regression can also be modified using shrinkage methods to deal with the problem of having fewer observations than parameters. When maximizing the log likelihood, we can add a <math>-\frac{\lambda}{2}\sum^{K}_{k=1}\|\beta_k\|_{2}^{2}</math> penalization term, where K is the number of classes. The resulting optimization problem is convex and can be solved using Newton's method as given in Zhu and Hastie (2004).<br />
<br />
===Comparison Between Logistic Regression And Linear Discriminant Analysis (LDA)===<br />
<br />
Logistic Regression Model and Linear Discriminant Analysis are widely used to analyze data which has categorical outcome variables. Both of the models are to build linear boundaries to classify different groups. Also, the categorical outcome variables (i.e. the dependent variables) must be mutually exclusive. <br />
<br />
However, these two models differ in their basic idea. While Logistic Regression is more relaxed and flexible in its assumptions, Linear Discriminant Analysis has the requirement that its explanatory variables must be normally distributed, linearly related and have equal covariance matrix within each class. Therefore, it can be expected that linear Discriminant Analysis should be more appropriate if the normality assumptions and equal covariance assumption are fulfilled in its explanatory variables. But in all other situations Logistic Regression should be appropriate. Besides, the total number of estimates to compute between these models is different. If the explanatory variables have d dimensions, we need to estimate <math>d+1</math> parameters in Logistic Regression and the number of parameters grows linearly w.r.t. dimension, while we need to estimate <math>2d+\frac{d*(d+1)}{2}+2</math> parameters in Linear Discriminant Analysis and the number of parameters grows quadratically w.r.t. dimension.<br />
<br />
Because the logistic regression model has the form <math>log\frac{f_1(x)}{f_0(x)} = \beta^T{x}</math>, we can clearly see the role of each input variable in explaining the outcome. This is one advantage that logistic regression has over other classification methods and is why it is so popular in data analysis.<br />
<br />
== Perceptron (Lecture: Oct. 11, 2011) ==<br />
<br />
[[Image:Perceptron1.png|right|thumb|300px|Simple perceptron]]<br />
[[Image:Perceptron2.png|right|thumb|300px|Simple perceptron where <math>\beta_0</math> is defined as 1]]<br />
<br />
<br />
The perceptron is the building block for neural networks. It was invented by Rosenblatt in 1957 at Cornell Labs, and first mentioned in the paper "The Perceptron - a perceiving and recognizing automaton". The perceptron is used on linearly separable data sets.<br />
<br />
For a 2 class problem, and a set of inputs with ''d'' features, a perceptron will use a weighted sum and it will classify the information using the sign of the result. The figures on the right give an example of a perceptron. In these examples, <math>x^i</math> is the ''i''-th feature of a sample and <math>\beta_i</math> is the ''i''-th weight. <math>\beta_0</math> is defined as the bias. The bias alters the position of the decision boundary between the 2 classes.<br />
<br />
Perceptrons are generally trained using [http://en.wikipedia.org/wiki/Gradient_descent gradient descent]. This type of learning can have 2 side effects (a small training sketch follows this list):<br />
* If the data sets are well separated, the training of the perceptron can lead to multiple valid solutions,<br />
* If the data sets are not linearly separable, the learning algorithm will never finish.<br />
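<br />
A minimal Matlab sketch of training a perceptron with the classic error-correction updates (one simple form of the gradient descent mentioned above); the data X (d-by-n), the labels t in {-1, +1}, the learning rate and the number of passes are placeholders, and this is not necessarily the exact formulation used in class.<br />
<br />
<pre><br />
Xb   = [ones(1, size(X,2)); X];            % prepend a 1 so beta_0 acts as the bias<br />
beta = zeros(size(Xb,1), 1);<br />
eta  = 0.1;                                % learning rate<br />
for pass = 1:50<br />
    for i = 1:size(Xb,2)<br />
        yhat = sign(beta' * Xb(:,i));      % classify with the sign of the weighted sum<br />
        if yhat ~= t(i)<br />
            beta = beta + eta * t(i) * Xb(:,i);   % update the weights only on mistakes<br />
        end<br />
    end<br />
end<br />
</pre><br />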
<br />
Perceptrons are the simplest kind of a feedforward neural network. A perceptron is the building block for other neural networks such as:<br />
* Multi-layer perceptron<br />
* ADALINE<br />
* MADALINE<br />
<br />
==References==<br />
<references /><br />
<br />
24. Balakrishnama, S., Ganapathiraju, A. LINEAR DISCRIMINANT ANALYSIS - A BRIEF TUTORIAL. http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf [[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf]]</div>S9huhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f11&diff=12532stat841f112011-10-13T08:03:57Z<p>S9hu: /* Sample Size Requirements */</p>
<hr />
<div>==[[f11Stat841proposal| Proposal for Final Project]]==<br />
<br />
==[[f11Stat841EditorSignUp| Editor Sign Up]]==<br />
<br />
= STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20 =<br />
== Wiki Course Notes ==<br />
Students will need to contribute to the wiki for 20% of their grade.<br />
Access via wikicoursenote.com<br />
Go to editor sign-up, and use your UW userid for your account name, and use your UW email.<br />
<br />
primary (10%)<br />
Post a draft of lecture notes within 48 hours. <br />
You will need to do this 1 or 2 times, depending on class size.<br />
<br />
secondary (10%)<br />
Make improvements to the notes for at least 60% of the lectures.<br />
More than half of your contributions should be technical rather than editorial.<br />
There will be a spreadsheet where students can indicate what they've done and when.<br />
The instructor will conduct random spot checks to ensure that students have contributed what they claim.<br />
<br />
<br />
== Classification (Lecture: Sep. 20, 2011) ==<br />
=== Definitions ===<br />
'''classification''': Predict a discrete random variable <math>Y</math> (a label) by using another random variable <math>X</math><br />
(new data point) picked iid from a distribution<br />
<br />
<math>X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d</math> (<math>d</math>-dimensional vector)<br />
<math>Y_i</math> in some finite set <math>\mathcal{Y}</math><br />
<br />
<br />
'''classification rule''':<br />
<math>h : \mathcal{X} \rightarrow \mathcal{Y}</math><br />
Take a new observation <math>X</math> and use a classification function <math>h(x)</math> to generate a label <math>Y</math>. In other words, if we feed the function <math>h(x)</math> a random variable <math>X</math>, it generates the label <math>Y</math>, which is the class to which we predict <math>X</math> belongs.<br />
<br />
Example: Let <math> \mathcal{X}</math> be a set of 2D images and <math>\mathcal{Y}</math> be a finite set of people. We want to learn a classification rule <math>h:\mathcal{X}\rightarrow\mathcal{Y}</math> that with small ''true'' error predicts the person who appears in the image. <br />
<br />
<br />
'''true error rate''' for classifier <math>h</math> is the error with respect to the underlying distribution (that we do not know).<br />
<br />
<math>L(h) = P(h(X) \neq Y )</math><br />
<br />
<br />
'''empirical error rate''' (or training error rate) is the amount of error that our classification function <math>h(x)</math> makes on the training data.<br />
<br />
<math>\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i)</math><br />
<br />
where <math>\mathbf{I}()</math> is an indicator function. Indicator function is defined by <br />
<br />
<math>\mathbf{I}(x) = \begin{cases} <br />
1 & \text{if } x \text{ is true} \\<br />
0 & \text{if } x \text{ is false}<br />
\end{cases}</math><br />
<br />
So in this case,<br />
<math>\mathbf{I}(h(X_i)\neq Y_i) = \begin{cases}<br />
1 & \text{if } h(X_i)\neq Y_i \text{ (i.e. when misclassification happens)} \\<br />
0 & \text{if } h(X_i)=Y_i \text{ (i.e. classified properly)}<br />
\end{cases}</math><br />
<br />
e.g., 100 new data points with known (true) labels<br />
<br />
<math>y_1 = h(x_1)</math><br />
<br />
...<br />
<br />
<math>y_{100} = h(x_{100})</math><br />
<br />
To calculate the empirical error we count how many labels our function <math>h(x)</math> assigned incorrectly and divide by n=100<br />
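<br />
For instance, with the predicted labels in yhat and the true labels in y, the empirical error rate can be computed in Matlab as the average of the indicator (the variable names are illustrative only):<br />
<br />
<pre><br />
n    = length(y);<br />
Lhat = sum(yhat ~= y) / n;                 % fraction of misclassified points<br />
</pre><br />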
<br />
=== Bayes Classifier ===<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability<ref> http://www.wikicoursenote.com/wiki/Stat841#Bayes_Classifier </ref>.<br />
<br />
First recall Bayes' Rule, in the format<br />
<math>P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} </math> <br />
<br />
P(Y|X) : ''posterior'' , ''probability of <math>Y</math> given <math>X</math>''<br />
<br />
P(X|Y) : ''likelihood'', ''probability of <math>X</math> being generated by <math>Y</math>''<br />
<br />
P(Y) : ''prior'', ''probability of <math>Y</math> being selected''<br />
<br />
P(X) : ''marginal'', ''probability of obtaining <math>X</math>''<br />
<br />
<br />
We will start with the simplest case: <math>\mathcal{Y} = \{0,1\}</math><br />
<br />
<math> r(x) <br />
= P(Y=1|X=x) <br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)}<br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
Bayes' rule can be approached by computing either:<br />
<br />
1) '''The posterior''': <math>\ P(Y=1|X=x) </math> and <math>\ P(Y=0|X=x) </math> or <br />
<br />
2) '''The likelihood''': <math>\ P(X=x|Y=1) </math> and <math>\ P(X=x|Y=0) </math><br />
<br />
<br />
The former reflects a '''Bayesian''' approach. The Bayesian approach uses previous beliefs and observed data (e.g., the random variable <math>\ X </math>) to determine the probability distribution of the parameter of interest (e.g., the random variable <math>\ Y </math>). The probability, according to Bayesians, is a ''degree of belief'' in the parameter of interest taking on a particular value (e.g., <math>\ Y=1 </math>), given a particular observation (e.g., <math>\ X=x </math>). Historically, the difficulty in this approach lies with determining the posterior distribution, however, more recent methods such as '''Markov Chain Monte Carlo (MCMC)''' allow the Bayesian approach to be implemented <ref name="PCAustin">P. C. Austin, C. D. Naylor, and J. V. Tu, "A comparison of a Bayesian vs. a frequentist method for profiling hospital performance," ''Journal of Evaluation in Clinical Practice'', 2001</ref>.<br />
<br />
The latter reflects a '''Frequentist''' approach. The Frequentist approach assumes that the probability distribution, including the mean, variance, etc., is fixed for the parameter of interest (e.g., the variable <math>\ Y </math>, which is ''not'' random). The observed data (e.g., the random variable <math>\ X </math>) is simply a ''sampling'' of a far larger population of possible observations. Thus, a certain repeatability or ''frequency'' is expected in the observed data. If it were possible to make an infinite number of observations, then the true probability distribution of the parameter of interest can be found. In general, frequentists use a technique called '''hypothesis testing''' to compare a ''null hypothesis'' (e.g. an assumption that the mean of the probability distribution is <math>\ \mu_0 </math>) to an alternative hypothesis (e.g. assuming that the mean of the probability distribution is larger than <math>\ \mu_0 </math>) <ref name="PCAustin"/>. For more information on hypothesis testing see <ref>R. Levy, "Frequency hypothesis testing, and contingency tables" class notes for LING251, Department of Linguistics, University of California, 2007. Available: [http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf] </ref>. <br />
<br />
There was some class discussion on which approach should be used. Both the ease of computation and the validity of both approaches were discussed. A main point that was brought up in class is that Frequentists consider X to be a random variable, but they do not consider Y to be a random variable because it has to take on one of the values from a fixed set (in the above case it would be either 0 or 1 and there is only one ''correct'' label for a given value X=x). Thus, from a Frequentist's perspective it does not make sense to talk about the probability of Y. This is actually a grey area and sometimes ''Bayesians'' and ''Frequentists'' use each others' approaches. So using ''Bayes' rule'' doesn't necessarily mean you're a ''Bayesian''. Overall, the question remains unresolved.<br />
<br />
<br />
The '''Bayes Classifier''' uses <math>\ P(Y=1|X=x)</math><br />
<br />
<math> P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
P(Y=1) : the prior, based on belief/evidence beforehand<br />
<br />
denominator : marginalized by summation<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The set <math>\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \} </math><br />
<br />
which defines a ''decision boundary''.<br />
<br />
<math>h^*(x) = <br />
\begin{cases}<br />
1 \ \ if \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ \ \ \ \ otherwise<br />
\end{cases}<br />
</math><br />
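<br />
(Not from the lecture) As a toy illustration of the rule above, suppose the two class conditional densities and the priors are known. The following Matlab sketch evaluates the posterior <math>r(x)</math> on a grid and classifies by thresholding at 1/2; the one-dimensional Gaussian densities and priors chosen here are arbitrary, and ''normpdf'' is from the Statistics Toolbox.<br />
<br />
<pre><br />
pi1 = 0.3;  pi0 = 0.7;                       % priors P(Y=1), P(Y=0)<br />
f1  = @(x) normpdf(x, 2, 1);                 % class conditional P(X=x | Y=1)<br />
f0  = @(x) normpdf(x, 0, 1);                 % class conditional P(X=x | Y=0)<br />
x   = -4:0.01:6;                             % grid of test points<br />
r   = f1(x)*pi1 ./ (f1(x)*pi1 + f0(x)*pi0);  % posterior P(Y=1 | X=x)<br />
h   = double(r > 0.5);                       % Bayes classification rule<br />
</pre><br />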
<br />
''Theorem'': Bayes rule is optimal. I.e., if h is any other classification rule, <br />
then <math>L(h^*) \leq L(h)</math><br />
(This is to be proved in homework.)<br />
<br />
Why then do we need other classification methods?<br />
A: Because X densities are often/typically unknown. I.e., <math>f_k(x)</math> and/or <math>\pi_k</math> unknown.<br />
<br />
<math>P(Y=k|X=x) = \frac{P(X=x|Y=k)P(Y=k)} {P(X=x)} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}</math><br />
f_k(x) is referred to as the class conditional distribution (~likelihood).<br />
<br />
Therefore, we rely on some data to estimate quantities.<br />
<br />
=== Three Main Approaches ===<br />
<br />
'''1. Empirical Risk Minimization''':<br />
Choose a set of classifiers H (e.g., line, neural network) and find <math>h^* \in H</math><br />
that minimizes (some estimate of) L(h).<br />
<br />
'''2. Regression''':<br />
Find an estimate (<math>\hat{r}</math>) of function <math>r</math> and define<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The <math> 1/2 </math> in the expression above is a threshold set for the regression prediction output. <br />
<br />
In general ''regression'' refers to finding a continuous, real valued y. The problem here is more difficult, because of the restricted domain (y is a set of discrete label values).<br />
<br />
'''3. Density Estimation''':<br />
Estimate <math>P(X=x|Y=0)</math> from <math>X_i</math>'s for which <math>Y_i = 0</math><br />
Estimate <math>P(X=x|Y=1)</math> from <math>X_i</math>'s for which <math>Y_i = 1</math><br />
and let <math>P(Y=?) = (1/n) \sum_{i=1}^{n} Y_i</math><br />
<br />
Define <math>\hat{r}(x) = \hat{P}(Y=1|X=x)</math> and<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
It is possible that there may not be enough data to estimate from for ''density estimation''. But the main problem lies with high dimensional spaces, as the estimation results may not be good (high error rate) and sometimes even infeasible. The term ''curse of dimensionality'' was coined by Bellman <ref>R. E. Bellman, ''Dynamic Programming''. Princeton University Press,<br />
1957</ref> to describe this problem.<br />
<br />
As the dimension of the space goes up, the learning requirements go up exponentially.<br />
<br />
To learn more about methods for handling high-dimensional data, see <ref> https://docs.google.com/viewer?url=http%3A%2F%2Fwww.bios.unc.edu%2F~dzeng%2FBIOS740%2Flecture_notes.pdf</ref><br />
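<br />
As a small illustration of approach 3 in one dimension (where density estimation is easy), the following MATLAB sketch fits a Gaussian to each class and classifies a new point with the plug-in Bayes rule. The toy data, the test point and the variable names are assumptions made for illustration only; this is not code from the lecture.<br />
<br />
<pre><br />
% toy one-dimensional training data (illustrative only)<br />
x0 = randn(100,1);         % samples with Y = 0<br />
x1 = randn(100,1) + 3;     % samples with Y = 1<br />
<br />
% estimate the class conditionals (assumed Gaussian) and the prior<br />
mu0 = mean(x0);  s0 = std(x0);<br />
mu1 = mean(x1);  s1 = std(x1);<br />
p1  = length(x1) / (length(x0) + length(x1));    % estimate of P(Y = 1)<br />
<br />
% plug-in estimate of r(x) = P(Y = 1 | X = x) at a new point, and the classifier h(x)<br />
x_new = 1.2;<br />
f0 = exp(-(x_new - mu0)^2 / (2*s0^2)) / (sqrt(2*pi)*s0);<br />
f1 = exp(-(x_new - mu1)^2 / (2*s1^2)) / (sqrt(2*pi)*s1);<br />
r_hat = f1*p1 / (f1*p1 + f0*(1 - p1));<br />
h = r_hat > 1/2;            % 1 if x_new is classified into class 1, 0 otherwise<br />
</pre><br />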
<br />
=== Multi-Class Classification ===<br />
We now generalize to the case where Y takes on k>2 values.<br />
<br />
<br />
''Theorem'': For <math>Y \in \mathcal{Y} = \{1,2,\dots, k\} </math>, the optimal rule is<br />
<br />
<math>\ h^{*}(x) = \arg\max_{k} P(Y=k|X=x) </math> <br />
<br />
where <math>P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum_r f_r(x) \pi_r}</math><br />
<br />
===Examples of Classification===<br />
<br />
* Face detection in images.<br />
* Medical diagnosis.<br />
* Detecting credit card fraud (fraudulent or legitimate).<br />
* Speech recognition.<br />
* Handwriting recognition.<br />
<br />
== LDA and QDA ==<br />
<br />
'''Discriminant function analysis''' finds features that best allow discrimination between two or more classes. The approach is similar to '''analysis of variance (ANOVA)''' in that discriminant function analysis looks at the mean values to determine if two or more classes are very different and should be separated. Once the discriminant functions (that separate two or more classes) have been determined, new data points can be classified (i.e. placed in one of the classes) based on the discriminant functions <ref> StatSoft, Inc. (2011). ''Electronic Statistics Textbook.'' [Online]. Available: [http://www.statsoft.com/textbook/discriminant-function-analysis/ http://www.statsoft.com/textbook/discriminant-function-analysis/.] </ref>. '''Linear discriminant analysis (LDA)''' and '''Quadratic discriminant analysis (QDA)''' are methods of discriminant analysis that are best applied to linearly and quadratically separable classes, respectively. '''Fisher discriminant analysis (FDA)''' is another method of discriminant analysis; it is different from linear discriminant analysis, but oftentimes the two terms are used interchangeably.<br />
<br />
=== LDA ===<br />
<br />
The simplest method is to use approach 3 (above) and assume a parametric model for the densities. Here we assume the class conditional densities are Gaussian.<br />
<br />
<math>\mathcal{Y} = \{ 0,1 \}</math> assumed (i.e., 2 labels)<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
<math>P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k(x) \pi_k} \ \ </math> (the denominator is <math>P(X=x)</math>)<br />
<br />
1) Assume Gaussian distributions<br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp(-(1/2)(\mathbf{x} - \mathbf{\mu_k})^\top \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) )</math><br />
<br />
We must compare <br />
<math>\frac{f_1(x) \pi_1} {p(x)}</math> with <math>\frac{f_0(x) \pi_0} {p(x)}</math>.<br />
Since the denominator <math>\,p(x)</math> is common to both, it can be ignored, so we compare<br />
<math>f_1(x) \pi_1</math> with <math>f_0(x) \pi_0 </math><br />
<br />
To find the decision boundary, set <br />
<math>f_1(x) \pi_1 = f_0(x) \pi_0 </math><br />
<br />
2) Assume <math>\Sigma_1 = \Sigma_0</math>, so we can write <math>\Sigma = \Sigma_0 = \Sigma_1</math>.<br />
<br />
Cancel <math>(2\pi)^{-d/2} |\Sigma_k|^{-1/2}</math> from both sides.<br />
<br />
Take log of both sides.<br />
<br />
Subtract one side from both sides, leaving zero on one side.<br />
<br />
<br />
<math>-(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)] = 0 </math><br />
<br />
<br />
<math>(1/2)[-\mathbf{x}^T \Sigma^{-1}\mathbf{x} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x}<br />
+ \mathbf{x}^T \Sigma^{-1}\mathbf{x} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ]<br />
+ log(\pi_1/\pi_0) = 0 </math><br />
<br />
<br />
Cancelling out the terms quadratic in <math>\mathbf{x}</math> and rearranging results in <br />
<br />
<math>(1/2)[-\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0}<br />
+ (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_0}^T \Sigma^{-1}) \mathbf{x}]<br />
+ log(\pi_1/\pi_0) = 0 </math><br />
<br />
<br />
We can see that the first pair of terms is constant in <math>\mathbf{x}</math>, and the remaining terms are linear in <math>\mathbf{x}</math>.<br />
Therefore, the decision boundary has the form <br />
<math>a^T\mathbf{x} + b = 0</math>, i.e., it is a hyperplane.<br />
For more about LDA, see <ref>http://sites.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf</ref><br />
<br />
== LDA and QDA Continued (Lecture: Sep. 22, 2011) == <br />
<br />
If we relax assumption 2 (i.e. <math>\Sigma_1 \neq \Sigma_0</math>) then we get a quadratic equation that can be written as<br />
<math>\mathbf{x}^T a\mathbf{x}+b^T\mathbf{x} + c = 0</math><br />
<br />
===Generalizing LDA and QDA===<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>. If <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math><br />
<br />
Where<br />
<br />
<math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math><br />
<br />
When the Gaussian covariances are equal, <math>\Sigma_1 = \Sigma_0</math> (i.e. LDA), then<br />
<br />
<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math><br />
<br />
(To compute this, we need to calculate the value of <math>\,\delta </math> for each class, and then take the one with the max. value).<br />
<br />
===In practice===<br />
We estimate the prior to be the chance that a random item from the collection belongs to class k, e.g.<br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
The mean to be the average item in set k, e.g.<br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
and calculate the covariance of each class e.g.<br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
If we wish to use LDA we must calculate a common covariance, so we average all the covariances e.g.<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{r=1}^{k}n_r} </math><br />
<br />
Where: <math>\,n_r</math> is the number of data points in class <math>\,r</math>, <math>\,\Sigma_r</math> is the covariance of class <math>\,r</math>, <math>\,n</math> is the total number of data points, and <math>\,k</math> is the number of classes.<br />
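<br />
The following MATLAB sketch illustrates these estimates for a two-class problem and then evaluates the LDA discriminant <math>\,\delta_k</math> (defined above) at a new point. The toy data, the test point <math>\,x</math> and the variable names are illustrative assumptions only; this is not code from the lecture.<br />
<br />
<pre><br />
% toy data: columns of X0 and X1 are d-dimensional points from class 0 and class 1<br />
d = 2;<br />
X0 = randn(d, 50);                 % class 0<br />
X1 = randn(d, 60) + 2;             % class 1<br />
n0 = size(X0,2);  n1 = size(X1,2);  n = n0 + n1;<br />
<br />
% estimated priors and means<br />
pi0 = n0/n;            pi1 = n1/n;<br />
mu0 = mean(X0, 2);     mu1 = mean(X1, 2);<br />
<br />
% estimated class covariances and the pooled (common) covariance for LDA<br />
S0 = (X0 - repmat(mu0,1,n0)) * (X0 - repmat(mu0,1,n0))' / n0;<br />
S1 = (X1 - repmat(mu1,1,n1)) * (X1 - repmat(mu1,1,n1))' / n1;<br />
Sigma = (n0*S0 + n1*S1) / n;<br />
<br />
% LDA discriminants at a new point x; classify by the larger delta<br />
x = [1; 1];<br />
delta0 = x'*(Sigma\mu0) - 0.5*mu0'*(Sigma\mu0) + log(pi0);<br />
delta1 = x'*(Sigma\mu1) - 0.5*mu1'*(Sigma\mu1) + log(pi1);<br />
label = (delta1 > delta0);          % 1 if assigned to class 1<br />
</pre><br />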
<br />
===Computation===<br />
<br />
For QDA we need to calculate: <math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math><br />
<br />
Let's first consider when <math>\, \Sigma_k = I, \forall k </math>. This is the case where each distribution is spherical around the mean point.<br />
<br />
====Case 1====<br />
When <math>\, \Sigma_k = I </math><br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
but <math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
and <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math> is the [http://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_Distance squared Euclidean distance] between two points <math>\,x</math> and <math>\,\mu_k</math><br />
<br />
Thus in this condition, a new point can be classified by its distance away from the center of a class, adjusted by some prior.<br />
<br />
Further, for a two-class problem with equal priors, the decision boundary is the perpendicular bisector of the line segment joining the two class means.<br />
<br />
====Case 2==== <br />
When <math>\, \Sigma_k \neq I </math><br />
<br />
Using the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\, \Sigma_k</math><br />
we get <math> \, \Sigma_k = U_kS_kV_k^\top</math>. In particular, <math>\, U_k</math> is a collection of eigenvectors of <math>\, \Sigma_k\Sigma_k^*</math>, and <math>\, V_k</math> is a collection of eigenvectors of <math>\,\Sigma_k^*\Sigma_k</math>.<br />
Since <math>\, \Sigma_k</math> is a symmetric matrix<ref> http://en.wikipedia.org/wiki/Covariance_matrix#Properties </ref>, <math>\, \Sigma_k = \Sigma_k^*</math>, so we have <math> \, \Sigma_k = U_kS_kU_k^\top </math>.<br />
<br />
For <math>\,\delta_k</math>, the second term becomes what is also known as the Mahalanobis distance <ref>P. C. Mahalanobis, "On The Generalised Distance in Statistics," ''Proceedings of the National Institute of Sciences of India'', 1936</ref> :<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^T(x-\mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
We can think of <math> \, S_k^{-\frac{1}{2}}U_k^\top </math> as a linear transformation that takes points in class <math>\,k</math> and distributes them spherically around a point, like in Case 1. Thus when we are given a new point, we can apply the modified <math>\,\delta_k</math> values to calculate <math>\ h^*(\,x)</math>. After applying this transformation, the covariance effectively becomes the identity matrix, such that<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}[(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k)] + log (\pi_k) </math><br />
<br />
and,<br />
<br />
<math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
For applying the above method with classes that have different covariance matrices (for example the covariance matrices <math>\ \Sigma_0 </math> and <math>\ \Sigma_1 </math> for the two class case), each of the covariance matrices has to be decomposed using SVD to find the according transformation. Then, each new data point has to be transformed using each transformation to compare its distance to the mean of each class (for example for the two class case, the new data point would have to be transformed by the class 1 transformation and then compared to <math>\ \mu_0 </math> and the new data point would also have to be transformed by the class 2 transformation and then compared to <math>\ \mu_1 </math>).<br />
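<br />
The following MATLAB sketch illustrates this whitening idea for a single class: it compares the Mahalanobis distance computed directly with the squared Euclidean distance computed after applying <math> \, S_k^{-\frac{1}{2}}U_k^\top </math>. The covariance matrix and points below are arbitrary values chosen only for illustration; this is not code from the lecture.<br />
<br />
<pre><br />
% an arbitrary (symmetric, positive definite) covariance and two points<br />
Sigma = [2 0.8; 0.8 1];<br />
x  = [1.5; -0.5];<br />
mu = [0.2;  0.3];<br />
<br />
% Mahalanobis distance computed directly<br />
d_direct = (x - mu)' * (Sigma \ (x - mu));<br />
<br />
% the same distance via the SVD-based whitening transformation<br />
[U, S, V] = svd(Sigma);                 % Sigma = U*S*U' since Sigma is symmetric<br />
W = diag(1 ./ sqrt(diag(S))) * U';      % W = S^(-1/2) * U'<br />
d_whitened = norm(W*x - W*mu)^2;        % ordinary squared Euclidean distance after whitening<br />
<br />
% d_direct and d_whitened agree (up to numerical precision)<br />
</pre><br />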
<br />
<br />
The difference between [[#Case 1 | Case 1]] and [[#Case 2 | Case 2]] (i.e. the difference between using the Euclidean and Mahalanobis distance) can be seen in the illustration below. <br />
<br />
[[File:EuclideanVsMahalonobisDistance2.PNG|frame|center|Illustration of Euclidean distance (a) and Mahalanobis distance (b) where the contours represent equidistant points from the center using each distance metric. Source: <ref>R. De Maesschalck, D. Jouan-Rimbaud and D. L. Massart, "Tutorial - The Mahalanobis distance," ''Chemometrics and Intelligent Laboratory Systems'', 2000 </ref>]]<br />
<br />
As can be seen from the illustration above, the Mahalanobis distance takes into account the distribution of the data points, whereas the Euclidean distance would treat the data as though it has a spherical distribution. Thus, the Mahalanobis distance applies for the more general classification in [[#Case 2 | Case 2]], whereas the Euclidean distance applies to the special case in [[#Case 1 | Case 1]] where the data distribution is assumed to be spherical.<br />
<br />
Generally, we can conclude that QDA provides a more flexible classifier for the data than LDA, because LDA assumes that the covariance matrix is identical for each class while QDA does not. QDA still uses a Gaussian distribution as the class conditional distribution. In real life this assumption does not always hold, so other class conditional distributions may have to be used instead.<br />
<br />
== Principal Component Analysis (PCA) (Lecture: Sep. 27, 2011) ==<br />
<br />
'''Principal Component Analysis (PCA)''' is a method of dimensionality reduction/feature extraction that transforms the data from a D dimensional space into a new coordinate system of dimension d, where <math>d \le D</math> (the worst case would be to have d=D). The goal is to preserve as much of the variance in the original data as possible when switching the coordinate systems. Given data on D variables, the hope is that the data points will lie mainly in a linear subspace of dimension lower than D. In practice, the data will usually not lie precisely in some lower dimensional subspace.<br />
<br />
<br />
The new variables that form a new coordinate system are called '''principal components''' (PCs). PCs are denoted by <math>\ u_1, u_2, ... , u_D </math>. The principal components form a basis for the data. Since PCs are orthogonal linear transformations of the original variables, there are at most D PCs. Normally, not all of the D PCs are used but rather a subset of d PCs, <math>\ u_1, u_2, ... , u_d </math>, to approximate the space spanned by the original data points <math>\ x_1, x_2, ... , x_D </math>. We can choose d based on what percentage of the variance of the original data we would like to maintain. <br />
<br />
Let the <math>j</math>-th principal component <math>\ u_j </math> be a linear combination of <math>\ x_1, x_2, ... , x_D </math> defined by the coefficients <br />
<math>\ w^{(j)}</math> = <math> ( {w_1}^{(j)}, {w_2}^{(j)},...,{w_D}^{(j)} )^T </math><br />
<br />
Thus, <math> u_j = {w_1}^{(j)} x_1 + {w_2}^{(j)} x_2 + ... + {w_D}^{(j)} x_D = w^{(j)^T} X </math><br />
<br />
<br />
This is a unique configuration since it sets up the PCs in order from maximum to minimum variances. The first PC, <math>\ u_1 </math> is called '''first principal component''' and has the maximum variance, thus it accounts for the most significant variance in the data <math>\ x_1, x_2, ... , x_D </math>. The second PC, <math>\ u_2 </math> is called '''second principal component''' and has the second highest variance and so on until PC, <math>\ u_D </math> which has the minimum variance. <br />
<br />
<br />
To get the first principal component, we would like to use the following equation:<br />
<br />
<math>\ max (Var(w^T X)) = max (w^T S w) </math> <br />
<br />
Where <math>\ S </math> is the covariance matrix. And we solve for <math>\ w </math>.<br />
<br />
<br />
Note: we require the constraint <math>\ w^T w = 1 </math> because if there is no constraint on the length of <math>\ w </math> then there is no upper bound. With the constraint, the direction and not the length that maximizes the variance can be found. <br />
<br />
<br />
====Lagrange Multiplier====<br />
<br />
Before we proceed, we should review Lagrange multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
<br />
Lagrange multipliers are used to find the maximum or minimum of a function <math>\displaystyle f(x,y)</math> subject to the constraint <math>\displaystyle g(x,y)=0</math>. <br />
<br />
We define a new constant <math> \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle f(x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example :====<br />
Suppose we want to maximize the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to find the maximum value for the function <math>\displaystyle f </math>; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain two stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to understand which one is the maximum, we just need to substitute each one into <math>\displaystyle f(x,y)</math> and see which gives the bigger value. In this case the maximum is at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
<br />
===Determining w :===<br />
<br />
Use the Lagrange multiplier conversion to obtain:<br />
<math>\displaystyle L(w, \lambda) = w^T Sw - \lambda (w^T w - 1)</math> where <math>\displaystyle \lambda </math> is a constant <br />
<br />
Take the derivative and set it to zero:<br />
<math>\displaystyle{\partial L \over{\partial w}} = 0 </math><br />
<br />
<br />
To obtain: <br />
<math>\displaystyle 2Sw - 2 \lambda w = 0</math><br />
<br />
<br />
Rearrange to obtain:<br />
<math>\displaystyle Sw = \lambda w</math><br />
<br />
<br />
where <math>\displaystyle w</math> is an eigenvector of <math>\displaystyle S </math> and <math>\ \lambda </math> is the corresponding eigenvalue. Since <math>\displaystyle Sw= \lambda w </math> and <math>\displaystyle w^T w=1</math>, we can write<br />
<br />
<math>\displaystyle w^T Sw= w^T\lambda w= \lambda w^T w =\lambda </math> <br />
<br />
Note that the PCs decompose the total variance in the data in the following way :<br />
<br />
<math> \sum_{i=1}^{D} Var(u_i) </math><br />
<br />
<math>= \sum_{i=1}^{D} (\lambda_i) </math> <br />
<br />
<math>\ = Tr(S) </math><br />
<br />
<math>= \sum_{i=1}^{D} Var(x_i)</math><br />
<br />
== Principal Component Analysis (PCA) Continued (Lecture: Sep. 29, 2011) == <br />
As can be seen from the above expressions, <math>\ Var(w^\top X) = w^\top S w= \lambda </math> where <math>\ \lambda </math> is an eigenvalue of the sample covariance matrix <math>\ S </math> and <math>\ w</math> is its corresponding eigenvector. So <math>\ Var(u_i) </math> is maximized if <math>\ \lambda_i </math> is the maximum eigenvalue of <math>\ S </math> and the first principal component (PC) is the corresponding eigenvector. Each successive PC can be generated in the above manner by taking the eigenvectors of <math>\ S</math><ref>www.wikipedia.org/wiki/Eigenvalues_and_eigenvectors</ref> that correspond to the eigenvalues:<br />
<br />
<math>\ \lambda_1 \geq ... \geq \lambda_D </math> <br />
<br />
such that <br />
<br />
<math>\ Var(u_1) \geq ... \geq Var(u_D) </math><br />
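<br />
As a sanity check of these statements, the following MATLAB sketch finds the PCs as eigenvectors of the sample covariance matrix and verifies that the eigenvalues decompose the total variance. The random data, and the convention that rows are observations (which is what MATLAB's ''cov'' expects), are assumptions made only for illustration.<br />
<br />
<pre><br />
% toy data: n observations of D variables (rows are observations here)<br />
n = 500;  D = 4;<br />
X = randn(n, D) * [2 0 0 0; 1 1 0 0; 0 0 0.5 0; 0 0 0 0.1];  % correlated columns<br />
<br />
S = cov(X);                      % D x D sample covariance matrix<br />
[W, L] = eig(S);                 % columns of W are eigenvectors, L holds eigenvalues<br />
[lambda, order] = sort(diag(L), 'descend');<br />
W = W(:, order);                 % W(:,1) is the first principal component direction<br />
<br />
% the eigenvalues decompose the total variance: sum(lambda) = trace(S) = sum of Var(x_i)<br />
total_from_eigs = sum(lambda);<br />
total_from_S    = trace(S);<br />
total_from_vars = sum(var(X));   % var uses the same 1/(n-1) normalisation as cov<br />
</pre><br />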
<br />
=== Alternative Derivation ===<br />
Another way of looking at PCA is to consider PCA as a projection from a higher D-dimension space to a lower d-dimensional subspace that minimizes the squared ''reconstruction error''. The squared reconstruction error is the difference between the original data set <math>\ X </math> and the new data set <math> \hat{X} </math> obtained by first projecting the original data set into a lower d-dimensional subspace and then projecting it back into the original higher D-dimension space. Since information is (normally) lost by compressing the original data into a lower d-dimensional subspace, the new data set will (normally) differ from the original data even though both are part of the higher D-dimension space. The reconstruction error is computed as shown below.<br />
<br />
====Reconstruction Error====<br />
<br />
<math> e = \sum_{i=1}^{n} || x_i - \hat{x}_i ||^2 </math><br />
<br />
====Minimize Reconstruction Error====<br />
<br />
Suppose the data is centred, i.e. <math> \bar{x} = 0 </math> (otherwise replace each <math>\ x_i </math> by <math>\ x_i - \bar{x} </math>).<br />
<br />
Let <math>\ f(y) = U_d y </math> where <math>\ U_d </math> is a D by d matrix with d orthogonal unit vectors as columns.<br />
<br />
Fit the model to the data and minimize the reconstruction error:<br />
<br />
<math>\ min_{U_d, y_i} \sum_{i=1}^n || x_i - U_d y_i ||^2 </math><br />
<br />
Differentiate with respect to <math>\ y_i </math>:<br />
<br />
<math> \frac{\partial e}{\partial y_i} = 0 </math><br />
<br />
we can rewrite reconstruction-error as : <math>\ e = \sum_{i=1}^n(x_i - U_d y_i)^T(x_i - U_d y_i) </math><br />
<br />
<math>\ \frac{\partial e}{\partial y_i} = -2U_d^\top(x_i - U_d y_i) = 0 </math><br />
<br />
Since the columns of <math>\ U_d </math> are orthonormal, <math>\ U_d^\top U_d = I </math>, so<br />
<br />
<math>\ U_d^\top x_i - U_d^\top U_d y_i = U_d^\top x_i - y_i = 0 </math> or equivalently,<br />
<br />
<math>\ y_i = U_d^T x_i </math><br />
<br />
Find the orthogonal matrix <math>\ U_d </math>:<br />
<br />
<math>\ min_{U_d} \sum_{i=1}^n || x_i - U_d U_d^T x_i||^2 </math><br />
<br />
====Using SVD====<br />
<br />
A unique solution can be obtained by finding the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\ X </math>:<br />
<br />
<math>\ X = U S V^T </math><br />
<br />
For each rank d, <math>\ U_d </math> consists of the first d columns of <math>\ U </math>. Also, the sample covariance matrix (not to be confused with the diagonal matrix <math>\ S </math> of singular values above) can be expressed as <math>\ \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T </math>.<br />
<br />
Simply put, by subtracting the mean of each of the data point features and then applying SVD, one can find the principal components:<br />
<br />
<math> \tilde{X} = X - \mu </math><br />
<br />
<math>\ \tilde{X} = U S V^T </math><br />
<br />
Where <math>\ X </math> is a D by n matrix of data points and the features of each data point form a column in <math>\ X </math>. Also, <math>\ \mu </math> is a D by n matrix with identical columns each equal to the mean of the <math>\ x_i</math>'s, i.e. <math>\mu_{:,j}=\frac{1}{n}\sum_{i=1}^n x_i </math>. Note that the arrangement of data points is a convention and indeed in Matlab or conventional statistics, the transpose of the matrices in the above formulae is used.<br />
<br />
As the <math>\ S </math> matrix from the SVD has the eigenvalues arranged from largest to smallest, the corresponding eigenvectors in the <math>\ U </math> matrix from the SVD will be such that the first column of <math>\ U </math> is the first principal component and the second column is the second principal component and so on.<br />
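<br />
A minimal MATLAB sketch of this procedure (subtract the mean, then apply SVD) is shown below; it follows the column-per-data-point convention used above, and the random data and choice of two PCs are only illustrative assumptions.<br />
<br />
<pre><br />
% X is d by n: each column is a data point (illustrative random data)<br />
d = 5;  n = 200;<br />
X = randn(d, n);<br />
<br />
mu = mean(X, 2);                       % d by 1 mean of the data points<br />
Xt = X - repmat(mu, 1, n);             % centred data (X tilde)<br />
<br />
[U, S, V] = svd(Xt);                   % Xt = U*S*V'<br />
U_d = U(:, 1:2);                       % first two principal components<br />
<br />
Y     = U_d' * Xt;                     % encode: project onto the first two PCs<br />
X_hat = U_d * Y + repmat(mu, 1, n);    % reconstruct back in the original space<br />
</pre><br />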
<br />
=== Examples ===<br />
<br />
Note that in the Matlab code in the examples below, the mean was not subtracted from the datapoints before performing SVD. This is what was shown in class. However, to properly perform PCA, the mean should be subtracted from the datapoints.<br />
<br />
==== Example 1 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 560 by 1965. 560 is the number of elements in each column. Each column is a vector representation of a 20x28 grayscale pixel image of a face (see image below) and there are a total of 1965 different images of faces. Each of the images is corrupted by noise, but the noise can be removed by projecting the data onto a few principal components and then back into the original space, keeping as many dimensions as one likes (e.g. 2, 3, 4 or 5). The corresponding Matlab commands are shown below:<br />
[[File:FreyFaceExample.PNG|thumb|185px|An example of the face images used in [[#Example 1 | Example 1]] with noise removed. Source: <ref>S. Roweis (2011). ''Data for MATLAB.'' [Online]. Available: [http://cs.nyu.edu/~roweis/data.html http://cs.nyu.edu/~roweis/data.html.] |</ref>]]<br />
<pre style="align:left; width: 75%; padding: 2% 2%"><br />
>> % start with a 560 by 1965 matrix X that contains the data points<br />
>> load('noisy.mat');<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 10 by reshaping column 10 into a 20 by 28 matrix<br />
>> imagesc(reshape(X(:,10),20,28)')<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 560 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % project X onto the first ten principal components<br />
>> Y_pca = U(:, 1:10)'*X;<br />
>> <br />
>> % reconstruct X (project back into the original space) using only the first ten principal components<br />
>> X_hat = U(:, 1:10)*Y_pca;<br />
>> <br />
>> % show image in column 10 of X_hat which is now a 560 by 1965 matrix<br />
>> imagesc(reshape(X_hat(:,10),20,28)')<br />
</pre><br />
The reason why the noise is removed in the reconstructed image is because the noise does not create a major variation in a single direction in the original data. Hence, the first ten PCs taken from <math>\ U </math> matrix are not in the direction of the noise. Thus, reconstructing the image using the first ten PCs, will remove the noise.<br />
<br />
==== Example 2 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 64 by 400. 64 is the number of elements in each column. Each column is a vector representation of a 8x8 grayscale pixel image of either a handwritten number ''2'' or a handwritten number ''3'' (see image below) and there are a total of 400 different images, where the first 200 images show a handwritten number ''2'' and the last 200 images show a handwritten number ''3''. <br />
[[File:Handwritten23.PNG|frame|center|An example of the handwritten number images used in [[#Example 2 | Example 2]]. Source: <ref>A. Ghodsi, "PCA" class notes for STAT841, Department of Statistics and Actuarial Science, University of Waterloo, 2011. </ref>]]<br />
<br />
The corresponding Matlab commands for performing PCA on the data points are shown below:<br />
<pre><br />
>> % start with a 64 by 400 matrix X that contains the data points<br />
>> load 2_3.mat;<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 2 by reshaping column 2 into a 8 by 8 matrix<br />
>> imagesc(reshape(X(:,2),8,8))<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 64 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % project data down onto the first two PCs<br />
>> Y = U(:,1:2)'*X;<br />
>> <br />
>> % show Y as an image (can see the change in the first PC at column 200,<br />
>> % when the handwritten number changes from 2 to 3)<br />
>> imagesc(Y)<br />
>> <br />
>> % perform PCA using Matlab build-in function (do not use for assignment)<br />
>> % also note that due to the Matlab convention, the transpose of X is used<br />
>> [COEFF, Y] = princomp(X');<br />
>> <br />
>> % again, use the first two PCs<br />
>> Y = Y(:,1:2);<br />
>> <br />
>> % use plot digits to show the distribution of images on the first two PCs<br />
>> images = reshape(X, 8, 8, 400);<br />
>> plotdigits(images, Y, .1, 1);<br />
</pre><br />
Using the ''plotdigits'' function in Matlab clearly illustrates that the first PC captures the differences between the numbers ''2'' and ''3'', as they are projected onto different regions of the axis for the first PC. Also, the second PC captures the ''tilt'' of the handwritten numbers, as numbers tilted to the left or right are projected onto different regions of the axis for the second PC.<br />
<br />
==== Example 3 ====<br />
(Not discussed in class) In the news recently was a story that captures some of the ideas behind PCA. Over the past two years, Scott Golder and Michael Macy, researchers from Cornell University, collected 509 million Twitter messages from 2.4 million users in 84 different countries. The data they used were words collected at various times of day and they classified the data into two different categories: positive emotion words and negative emotion words. Then, they were able to study this new data to evaluate subjects' moods at different times of day, while the subjects were in different parts of the world. They found that the subjects generally exhibited positive emotions in the mornings and late evenings, and negative emotions mid-day. They were able to "project their data onto a smaller dimensional space" using PCA. Their paper, "Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures," is available in the journal Science.<ref>http://www.pcworld.com/article/240831/twitter_analysis_reveals_global_human_moodiness.html</ref>.<br />
<br />
Assumptions Underlying Principal Component Analysis can be found here<ref>http://support.sas.com/publishing/pubcat/chaps/55129.pdf</ref><br />
<br />
==== Example 4 ====<br />
(Not discussed in class) A somewhat well known learning rule in the field of neural networks called Oja's rule can be used to train networks of neurons to compute the principal component directions of data sets. <ref>A Simplified Neuron Model as a Principal Component Analyzer. Erkki Oja. 1982. Journal of Mathematical Biology. 15: 267-273</ref> This rule is formulated as follows<br />
<br />
<math>\,\Delta w = \eta yx -\eta y^2w </math><br />
<br />
where <math>\,\Delta w </math> is the neuron weight change, <math>\,\eta</math> is the learning rate, <math>\,y</math> is the neuron output given the current input, <math>\,x</math> is the current input and <math>\,w</math> is the current neuron weight. This learning rule shares some similarities with another method for calculating principal components: power iteration. The basic algorithm for power iteration (taken from wikipedia: <ref>Wikipedia. http://en.wikipedia.org/wiki/Principal_component_analysis#Computing_principal_components_iteratively</ref>) is shown below <br />
<br />
<br />
<math>\mathbf{p} =</math> a random vector<br />
do ''c'' times:<br />
<math>\mathbf{t} = 0</math> (a vector of length ''m'')<br />
for each row <math>\mathbf{x} \in \mathbf{X^T}</math><br />
<math>\mathbf{t} = \mathbf{t} + (\mathbf{x} \cdot \mathbf{p})\mathbf{x}</math><br />
<math>\mathbf{p} = \frac{\mathbf{t}}{|\mathbf{t}|}</math><br />
return <math>\mathbf{p}</math><br />
<br />
Comparing this with the neuron learning rule we can see that the term <math>\, \eta y x </math> is very similar to the <math>\,\mathbf{t}</math> update equation in the power iteration method, and identical if the neuron model is assumed to be linear (<math>\,y(x)=x\mathbf{p}</math>) and the learning rate is set to 1. Additionally, the <math>\, -\eta y^2w </math> term performs the normalization, the same function as the <math>\,\mathbf{p}</math> update equation in the power iteration method.<br />
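<br />
A MATLAB version of this power iteration sketch is given below. It assumes, as in the pseudocode, that each row of <math>\mathbf{X^T}</math> is a data point and that the data has already been centred; the toy data and the fixed number of iterations are illustrative assumptions, not material from the lecture.<br />
<br />
<pre><br />
% Xc: n by D matrix of centred data points (rows are points); illustrative data<br />
Xc = randn(200, 3) * [3 0 0; 1 1 0; 0 0 0.2];<br />
<br />
p = randn(size(Xc, 2), 1);             % start from a random vector<br />
p = p / norm(p);<br />
for c = 1:100                           % a fixed number of passes for simplicity<br />
    t = zeros(size(p));<br />
    for i = 1:size(Xc, 1)<br />
        x = Xc(i, :)';                  % one data point as a column vector<br />
        t = t + (x' * p) * x;           % accumulate (x . p) x, as in the pseudocode<br />
    end<br />
    p = t / norm(t);                    % normalise; p converges to the first PC<br />
end<br />
</pre><br />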
<br />
=== Observations ===<br />
Some observations about the PCA were brought up in class:<br />
<br />
* '''PCA''' assumes that data is on a ''linear subspace'' or close to a linear subspace. For non-linear dimensionality reduction, other techniques are used. Amongst the first proposed techniques for non-linear dimensionality reduction are '''Locally Linear Embedding (LLE)''' and '''Isomap'''. More recent techniques include '''Maximum Variance Unfolding (MVU)''' and '''t-Distributed Stochastic Neighbor Embedding (t-SNE)'''. '''Kernel PCAs''' may also be used, but they depend on the type of kernel used and generally do not work well in practice. (Kernels will be covered in more detail later in the course.)<br />
<br />
* Finding the number of PCs to use is not straightforward. It requires knowledge about the ''intrinsic dimensionality of the data''. In practice, oftentimes a heuristic approach is adopted by looking at the eigenvalues ordered from largest to smallest. If there is a "dip" in the magnitude of the eigenvalues, the "dip" is used as a cut off point and only the large eigenvalues before the "dip" are used. Otherwise, it is possible to add up the eigenvalues from largest to smallest until a certain percentage value is reached. This percentage value represents the percentage of variance that is preserved when projecting onto the PCs corresponding to the eigenvalues that have been added together to achieve the percentage. <br />
<br />
* It is a good idea to normalize the variance of the data before applying PCA. This will avoid PCA finding PCs in certain directions due to the scaling of the data, rather than the real variance of the data.<br />
<br />
* PCA can be considered as an unsupervised approach, since the main direction of variation is not known beforehand, i.e. it is not completely certain which dimension the first PC will capture. The PCs found may not correspond to the desired labels for the data set. There are, however, alternate methods for performing supervised dimensionality reduction.<br />
<br />
* (Not in class) The traditional PCA method does not work well on a data set that lies on a non-linear manifold. A revised PCA method, called c-PCA, has been introduced to improve the stability and convergence of intrinsic dimension estimation. The approach first finds a minimal cover (a cover of a set X is a collection of sets whose union contains X as a subset<ref>http://en.wikipedia.org/wiki/Cover_(topology)</ref>) of the data set. Since set covering is an NP-hard problem, the approach only finds an approximation of minimal cover to reduce the complexity of the run time. In each subset of the minimal cover, it applies PCA and filters out the noise in the data. Finally the global intrinsic dimension can be determined from the variance results from all the subsets. The algorithm produces robust results.<ref>Mingyu Fan, Nannan Gu, Hong Qiao, Bo Zhang, Intrinsic dimension estimation of data by principal component analysis, 2010. Available: http://arxiv.org/abs/1002.2050</ref><br />
<br />
*(Not in class) While PCA finds the mathematically optimal method (as in minimizing the squared error), it is sensitive to outliers in the data, which produce the large errors that PCA tries to avoid. It is therefore common practice to remove outliers before computing PCA. However, in some contexts, outliers can be difficult to identify. For example in data mining algorithms like correlation clustering, the assignment of points to clusters and outliers is not known beforehand. A recently proposed generalization of PCA based on a '''Weighted PCA''' increases robustness by assigning different weights to data objects based on their estimated relevancy.<ref>http://en.wikipedia.org/wiki/Principal_component_analysis</ref><br />
<br />
* (Not in class) Comparison between PCA and LDA: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two commonly used techniques for data classification and dimensionality reduction. Linear Discriminant Analysis easily handles the case where the within-class frequencies are unequal and its performance has been examined on randomly generated test data. This method maximizes the ratio of between-class variance to the within-class variance in any particular data set, thereby guaranteeing maximal separability. The prime difference between LDA and PCA is that PCA does more of feature classification while LDA does data classification. In PCA, the shape and location of the original data sets change when transformed to a different space, whereas LDA doesn't change the location but only tries to provide more class separability and draw a decision region between the given classes. LDA also helps to better understand the distribution of the feature data. [[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf]]<br />
<br />
=== Summary ===<br />
The PCA algorithm can be summarized into the following steps:<br />
<br />
# '''Recover basis'''<br />
#: <math>\ \text{ Calculate } XX^T=\Sigma_{i=1}^{t}x_ix_{i}^{T} \text{ and let } U=\text{ eigenvectors of } XX^T \text{ corresponding to the largest } d \text{ eigenvalues.} </math><br />
# '''Encode training data'''<br />
#: <math>\ \text{Let } Y=U^TX \text{, where } Y \text{ is a } d \times t \text{ matrix of encodings of the original data.} </math><br />
# '''Reconstruct training data'''<br />
#: <math> \hat{X}=UY=UU^TX </math>.<br />
# '''Encode test example'''<br />
#: <math>\ y = U^Tx \text{ where } y \text{ is a } d\text{-dimensional encoding of } x </math>.<br />
# '''Reconstruct test example'''<br />
#: <math> \hat{x}=Uy=UU^Tx </math>.<br />
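<br />
The summary above translates almost line-for-line into MATLAB; the sketch below is one possible realisation (assuming <math>\ X </math> is d by t with one training point per column and a test point <math>\ x </math> of length d; both are illustrative, and, as noted in the examples above, the mean should be subtracted first for a proper PCA).<br />
<br />
<pre><br />
% illustrative training data X (d by t) and a test point x (d by 1)<br />
d = 6;  t = 100;<br />
X = randn(d, t);<br />
x = randn(d, 1);<br />
k = 2;                                   % number of PCs to keep<br />
<br />
% 1) recover basis: top-k eigenvectors of X*X'<br />
[U, L] = eig(X * X');<br />
[lambda_sorted, order] = sort(diag(L), 'descend');<br />
U = U(:, order(1:k));<br />
<br />
% 2) encode training data<br />
Y = U' * X;                              % k by t encodings<br />
<br />
% 3) reconstruct training data<br />
X_hat = U * Y;                           % = U*U'*X<br />
<br />
% 4) encode a test example<br />
y = U' * x;<br />
<br />
% 5) reconstruct the test example<br />
x_hat = U * y;                           % = U*U'*x<br />
</pre><br />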
<br />
== Fisher Discriminant Analysis (FDA) (Lecture: Sep. 29, 2011) ==<br />
<br />
'''Fisher Discriminant Analysis (FDA)''' is sometimes called ''Fisher Linear Discriminant Analysis (FLDA)'' or just ''Linear Discriminant Analysis (LDA)''. This causes confusion with the [[#LDA | ''Linear Discriminant Analysis (LDA)'']] technique covered earlier in the course. The LDA technique covered earlier in the course has a normality assumption and is a boundary finding technique. The FDA technique outlined here is a supervised feature extraction technique. FDA differs from PCA as well because PCA does not use the class labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math> while FDA organizes data into their ''classes'' by finding the direction of maximum separation between classes.<br />
<br />
== Fisher Discriminant Analysis (FDA) Continued (Lecture: Oct. 04, 2011) ==<br />
<br />
One main drawback of the PCA technique is that the direction of greatest variation may not be the classification we desire. For example, imagine if the [[#Example 2 | data set]] above had a lightening filter applied to a random subset of the images. Then the greatest variation would be the brightness and not the more important variations we wish to classify. FDA circumvents this problem by using the labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math> i.e. the FDA uses ''supervised learning''. An elementary way to see the algorithm is to imagine two classes of data projected onto a suitably chosen line that minimizes the within class variance, and maximizes the distance between the two classes i.e. group similar data together and spread different data apart. This way, newly acquired data can be compared, after the same transformation, to the projections of the two classes using some well-chosen metric.<br />
<br />
<br />
We first consider the case of two classes. Denote the mean and covariance matrix of class <math>i=0,1</math> by <math>\mathbf{\mu}_i</math> and <math>\mathbf{\Sigma}_i</math> respectively. We transform the data so that it is projected into 1 dimension i.e. a scalar value. To do this, we compute the inner product of our <math>d \times 1</math>-dimensional data, <math>\mathbf{x}</math>, with a to-be-determined <math>d \times 1</math>-dimensional vector <math>\mathbf{w}</math>. The new means and covariances of the transformed data:<br />
<br />
::<math> \mu'_i:\rightarrow \mathbf{w}^{T}\mathbf{\mu}_i </math> <br/><br />
::<math> \Sigma'_i :\rightarrow \mathbf{w}^{T}\mathbf{\Sigma}_i \mathbf{w}</math><br />
<br />
The new means and variances are actually scalar values now, but we will use vector and matrix notation and arguments throughout the following derivation as the multi-class case is then just a simpler extension. <br />
<br />
===Goals of FDA===<br />
<br />
As will be shown in the objective function, the goal of FDA is to maximize the separation of the classes (between class variance) and minimize the scatter within each class (within class variance). That is, our ideal situation is that the individual classes are as far away from each other as possible and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case). An interesting note is that R. A. Fisher, after whom FDA is named, used the FDA technique for purposes of taxonomy, in particular for categorizing different species of iris flowers. <ref name="RAFisher">R. A. Fisher, "The Use of Multiple measurements in Taxonomic Problems," ''Annals of Eugenics'', 1936</ref>. It is very easy to visualize what is meant by within class variance (i.e. differences between the iris flowers of the same species) and between class variance (i.e. the differences between the iris flowers of different species) in that case.<br />
<br />
<br />
'''1)''' Our '''first''' goal is to minimize the individual classes' covariance. This will help to collapse the data together. <br />
We have two minimization problems<br />
<br />
::<math>\min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_0 \mathbf{w}^{T}</math> <br />
and <br />
::<math>\min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_1 \mathbf{w}^{T}</math>.<br />
<br />
But these can be combined:<br />
::<math> \min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_0 \mathbf{w}^{T} + \mathbf{w} \mathbf{\Sigma}_1 \mathbf{w}^{T}</math> <br />
:: <math> = \min_{\mathbf{w}} \mathbf{w} ( \mathbf{\Sigma_0} + \mathbf{\Sigma_1} ) \mathbf{w}^{T} </math><br />
<br />
Define <math> \mathbf{S}_W =\mathbf{\Sigma_0} + \mathbf{\Sigma_1} </math>, called the ''within class variance matrix''. <br />
<br />
'''2)''' Our '''second''' goal is to move the minimized classes as far away from each other as possible. One way to accomplish this is to maximize the distances between the means of the transformed data i.e.<br />
<br />
<math> \max_{\mathbf{w}} |\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1|^2 </math><br />
<br />
Simplifying:<br />
::<math> \max_{\mathbf{w}} \,(\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1)^T (\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1) </math> <br/><br />
::<math> = \max_{\mathbf{w}}\, (\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w} \mathbf{w}^{T} (\mathbf{\mu}_0-\mathbf{\mu}_1)</math> <br/><br />
::<math> = \max_{\mathbf{w}} \,\mathbf{w}^{T}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w}</math><br />
<br />
Recall that <math> \mathbf{\mu}_i </math> are known. Denote<br />
<br />
::<math> \mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math> <br />
<br />
This matrix, called the ''between class variance matrix'', is a rank 1 matrix, so an inverse does not exist. Altogether, we have two optimization problems we must solve simultaneously:<br />
<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} </math><br />
<br />
There are other metrics one can use to both minimize the data's variance and maximizes the distance between classes, and other goals we can try to accomplish (see metric learning, below...one day), but Fisher used this elegant method, hence his recognition in the name, and we will follow his method.<br />
<br />
We can combine the two optimization problems into one after noting that minimizing a quantity is the same as maximizing its negative:<br />
<br />
::<math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \alpha \mathbf{w} \mathbf{S_W} \mathbf{w}^{T}</math><br/><br />
<br />
The <math>\alpha</math> coefficient is a necessary scaling factor: if the scale of one of the terms is much larger than the other, the optimization problem will be dominated by the larger term. This means we have another unknown, <math>\alpha</math>, to solve for. Instead, we can circumvent the scaling problem by looking at the ratio of the quantities, the original solution Fisher proposed:<br />
<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w} \mathbf{S_B} \mathbf{w}^{T}}{\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}} </math><br />
<br />
This optimization problem can be shown<ref><br />
http://www.socher.org/uploads/Main/optimizationTutorial01.pdf<br />
</ref> to be equivalent to the following optimization problem:<br />
<br />
:: <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T}</math> <br />
<br />
subject to:<br />
<br />
:: <math> \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} = 1 </math><br />
<br />
A heuristic understanding of this equivalence is that we have two degrees of freedom: direction and scalar. The scalar value is irrelevant to our discussion. Thus, we can set one of the values to be a constant. We can use Lagrange multipliers to solve this optimization problem:<br />
<br />
::<math>L( \mathbf{w}, \lambda) = \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \lambda(\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}-1)</math><br />
:: <math> \Rightarrow \frac{\partial L}{\partial \mathbf{w}} = 2 \mathbf{S}_B \mathbf{w} - 2\lambda \mathbf{S}_W\mathbf{w} </math><br />
<br />
Setting the partial derivative to 0 gives us a ''generalized eigenvalue problem'':<br />
<br />
::<math> \mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w} </math><br />
:: <math> \Rightarrow \mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w} </math><br />
<br />
This is a generalized eigenvalue problem and <math>\ \mathbf{w} </math> can be computed as the eigenvector corresponding to the largest eigenvalue of <br />
:: <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math><br />
<br />
It is very likely that <math> \mathbf{S}_W </math> has an inverse. If not, the pseudo-inverse<ref><br />
http://en.wikipedia.org/wiki/Generalized_inverse<br />
</ref><ref><br />
http://www.mathworks.com/help/techdoc/ref/pinv.html<br />
</ref> can be used. In Matlab the pseudo-inverse function is named ''pinv''. Thus, we should choose <math>\mathbf{w}</math> to be the eigenvector corresponding to the largest eigenvalue as our projection vector. <br />
<br />
In fact we can simplify the above expression further in the case of two classes. Recall the definition of <math>\mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math>. Substituting this into our expression:<br />
<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w} = \lambda \mathbf{w} </math><br />
::<math> (\mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) ) ((\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w}) = \lambda \mathbf{w} </math><br />
<br />
This second term is a scalar value, let's denote it <math>\beta</math>. Then<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) = \frac{\lambda}{\beta} \mathbf{w} </math><br />
::<math> \Rightarrow \, \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) \propto \mathbf{w} </math><br />
<br />
All we are interested in is the direction of <math>\mathbf{w}</math>, so computing this quantity is sufficient for finding our projection vector. Note that this shortcut does not carry over to the multi-class case, where the projection <math>\mathbf{W}</math> is a matrix rather than a vector.<br />
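<br />
A short MATLAB sketch of the two-class case is shown below: it computes <math>\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1)</math> using ''pinv'', as discussed above. The toy data and variable names are illustrative assumptions only; this is not code from the lecture.<br />
<br />
<pre><br />
% toy two-class data: one column per d-dimensional point<br />
d = 2;<br />
X0 = randn(d, 80);                          % class 0<br />
X1 = randn(d, 70) + [3; 1] * ones(1, 70);   % class 1, shifted mean<br />
<br />
mu0 = mean(X0, 2);   mu1 = mean(X1, 2);<br />
S0 = cov(X0');       S1 = cov(X1');   % cov expects rows to be observations<br />
Sw = S0 + S1;                         % within class variance matrix<br />
<br />
w = pinv(Sw) * (mu0 - mu1);           % FDA direction (up to scale)<br />
w = w / norm(w);<br />
<br />
% project the data onto w; the classes should be well separated along this line<br />
z0 = w' * X0;<br />
z1 = w' * X1;<br />
</pre><br />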
<br />
=== Extensions to Multiclass Case ===<br />
If we have <math>\ k</math> classes, we need <math>\ k-1</math> directions i.e. we need to project <math>\ k</math> 'points' onto a <math>\ k-1</math> dimensional hyperplane. What does this change in our above derivation? The most significant difference is that our projection vector,<math>\mathbf{w}</math>, is no longer a vector but instead is a matrix <math>\mathbf{W}</math>. We transform the data as:<br />
<br />
::<math> \mathbf{x}' :\rightarrow \mathbf{W}^{T} \mathbf{x}</math><br />
so our new mean and covariances for class k are:<br />
::<math> \mathbf{\mu_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\mu_k}</math><br />
::<math> \mathbf{\Sigma_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W}</math><br />
<br />
What are our new optimization sub-problems? As before, we wish to minimize the within class variance. This can be formulated as:<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{\Sigma_1} \mathbf{W} + \dots + \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W} </math><br />
<br />
Again, denoting <math>\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}</math>, we can simplify above expression:<br />
<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_W \mathbf{W} </math><br />
<br />
Similarly, the second optimization problem is:<br />
<br />
::<math>\max_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_B \mathbf{W} </math><br />
<br />
What is <math>\mathbf{S}_B</math> in this case? It can be shown that <math>\mathbf{S}_T = \mathbf{S}_B + \mathbf{S}_W </math> where <math> \mathbf{S}_T </math> is the covariance matrix of all the data. From this we can compute <math> \mathbf{S}_B </math>. <br />
<br />
Next, if we express <math> \mathbf{W} = ( \mathbf{w}_1 , \mathbf{w}_2 , \dots ,\mathbf{w}_k ) </math> observe that, for <math> \mathbf{A} = \mathbf{S}_B , \mathbf{S}_W </math>: <br />
<br />
::<math> Tr(\mathbf{W}^{T} \mathbf{A} \mathbf{W}) = \mathbf{w}_1^{T} \mathbf{A} \mathbf{w}_1 + \dots + \mathbf{w}_k^{T} \mathbf{A} \mathbf{w}_k </math><br />
<br />
where <math>\ Tr()</math> is the trace of a matrix. Thus, following the same steps as in the two-class case, we have the new optimization problem:<br />
<br />
::<math> \max_{\mathbf{W}} \frac{ Tr(\mathbf{W}^{T} \mathbf{S}_B \mathbf{W}) }{Tr(\mathbf{W}^{T} \mathbf{S}_W \mathbf{W})} </math> <br />
<br />
subject to:<br />
<br />
:: <math> \mathbf{W}^{T} \mathbf{S_W} \mathbf{W} = \mathbf{I} </math><br />
<br />
Again, in order to solve the above optimization problem, we can use the Lagrange multiplier <ref><br />
http://en.wikipedia.org/wiki/Lagrange_multiplier </ref>:<br />
<br />
:: <math>\begin{align}L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - Tr\left[\Lambda\left( \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} - I \right)\right]\end{align}</math>.<br />
<br />
where <math>\ \Lambda</math> is a d by d diagonal matrix.<br />
<br />
Then, we differentiating with respect to <math>\mathbf{W}</math>:<br />
<br />
:: <math>\begin{align}\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\end{align} = 0</math>.<br />
<br />
Thus:<br />
<br />
:: <math>\begin{align}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}\end{align}</math><br />
<br />
:: <math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{W}\end{align}</math><br />
<br />
where, <math> \mathbf{\Lambda} =\begin{pmatrix}\lambda_{1} & & 0\\&\ddots&\\0 & &\lambda_{d}\end{pmatrix}</math><br />
<br />
The above equation is of the form of an eigenvalue problem. Thus, for the solution, the k-1 eigenvectors corresponding to the k-1 largest eigenvalues should be chosen as the projection matrix, <math>\mathbf{W}</math>. In fact, there should only be k-1 eigenvectors corresponding to k-1 non-zero eigenvalues using the above equation.<br />
<br />
=== Summary ===<br />
FDA has two optimization problems:<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} </math> <br />
<br />
where <math>\ S_W = \Sigma_0 + \Sigma_1</math> is called the within class variance and <math>\ S_B = (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T </math> is called the between class variance.<br />
<br />
The two optimization problems are combined as follows:<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w} \mathbf{S_B} \mathbf{w}^{T}}{\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}} </math><br />
<br />
By adding a constraint as shown:<br />
::<math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T}</math><br />
<br />
subject to:<br />
:: <math> \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} = 1 </math><br />
<br />
Lagrange multipliers can be used and essentially the problem becomes an eigenvalue problem:<br />
<br />
::<math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w} = \lambda\mathbf{w}\end{align}</math><br />
<br />
And <math>\ w </math> can be computed as the k-1 eigenvectors corresponding to the largest k-1 eigenvalues of <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math>.<br />
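<br />
The multi-class recipe can be written in a few lines of MATLAB: form <math>\mathbf{S}_W</math> and <math>\mathbf{S}_B</math>, solve the eigenvalue problem, and keep the k-1 leading eigenvectors. The sketch below is only one possible realisation: it uses the common scatter-matrix forms of <math>\mathbf{S}_W</math> and <math>\mathbf{S}_B</math> (sums of squared deviations rather than covariances), and the three-class toy data and variable names are illustrative assumptions.<br />
<br />
<pre><br />
% toy data for k = 3 classes in d = 4 dimensions (one column per point)<br />
d = 4;  k = 3;<br />
Xc = {randn(d,50), randn(d,50)+1, randn(d,50)-1};   % cell array of class data<br />
<br />
X_all  = [Xc{1} Xc{2} Xc{3}];<br />
mu_all = mean(X_all, 2);<br />
<br />
Sw = zeros(d);  Sb = zeros(d);<br />
for j = 1:k<br />
    nj   = size(Xc{j}, 2);<br />
    mu_j = mean(Xc{j}, 2);<br />
    Xj   = Xc{j} - repmat(mu_j, 1, nj);<br />
    Sw = Sw + Xj * Xj';                                % within-class scatter<br />
    Sb = Sb + nj * (mu_j - mu_all) * (mu_j - mu_all)'; % between-class scatter<br />
end<br />
<br />
[V, L] = eig(pinv(Sw) * Sb);<br />
[lambda, order] = sort(real(diag(L)), 'descend');<br />
W = real(V(:, order(1:k-1)));     % projection matrix: the k-1 leading eigenvectors<br />
<br />
Z1 = W' * Xc{1};                  % projected data for class 1, and so on<br />
</pre><br />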
<br />
=== Variations ===<br />
<br />
Some adaptations and extensions exist for the FDA technique (Source: <ref>R. Gutierrez-Osuna, "Linear Discriminant Analysis" class notes for Intro to Pattern Analysis, Texas A&M University. Available: [http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf]</ref>):<br />
<br />
1) ''Non-Parametric LDA (NPLDA)'' by Fukunaga<br />
<br />
This method does not assume that the Gaussian distribution is unimodal and it is actually possible to extract more than k-1 features (where k is the number of classes).<br />
<br />
2) ''Orthonormal LDA (OLDA)'' by Okada and Tomita<br />
<br />
This method finds projections that are orthonormal in addition to maximizing the FDA objective function. This method can also extract more than k-1 features (where k is the number of classes).<br />
<br />
3) ''Generalized LDA (GLDA)'' by Lowe<br />
<br />
This method incorporates additional cost functions into the FDA objective function. This causes classes with a higher cost to be placed further apart in the lower dimensional representation.<br />
<br />
== Linear and Logistic Regression (Lecture: Oct. 06, 2011) ==<br />
<br />
=== Linear Regression ===<br />
<br />
In regression, <math>\ y </math> is a continuous variable. In classification, <math>\ y </math> is a discrete variable. Regression problems are easier to formulate into functions (since <math>\ y </math> is continuous) and it is possible to solve classification problems by treating them like regression problems. In order to do so, the requirement in classification that <math>\ y </math> is discrete must first be relaxed. Once <math>\ y </math> has been found using regression techniques, it is possible to determine the discrete class corresponding to the <math>\ y </math> that has been found to solve the original classification problem. The discrete class is obtained by defining a threshold where <math>\ y </math> values below the threshold belong to one class and <math>\ y </math> values above the threshold belong to another class.<br />
<br />
<br />
More formally: a more direct approach to classification is to estimate the regression function <math>\ r(\mathbf{x}) = E[Y | X]</math> without bothering to estimate <math>\ f_k(\mathbf{x}) </math>.<br />
<br />
In two-class problems, if <math>\ Y = \{0,1\}</math>, then <math>\, h^*(\mathbf{x})= \left\{\begin{matrix}<br />
1 &\text{,if } \hat r(\mathbf{x})>\frac{1}{2} \\<br />
0 &\mathrm{,otherwise.} \end{matrix}\right.</math><br />
<br />
Basically, we can use a linear function<br />
<math>\ f(x, \beta) = \mathbf{\beta\,}^T \mathbf{x_{i}} + \mathbf{\beta\,_0} </math> and use the least squares approach to fit the function to the given data. This is done by minimizing the following expression:<br />
<br />
<math>\min_{\mathbf{\beta}} \sum_{i=1}^n (y_i - \mathbf{\beta}^T<br />
\mathbf{x_{i}} - \mathbf{\beta_0})^2</math><br />
<br />
where<br />
<br />
<math>\tilde{\mathbf{\beta}} = \left( \begin{array}{c}\mathbf{\beta}_{1} \\ \vdots \\ \mathbf{\beta}_{d} \\ \mathbf{\beta}_{0} \end{array} \right)</math>.<br />
<br />
For convenience, <math>\mathbf{\beta}</math> and <math>\mathbf{\beta}_0</math> have been combined into a d+1 dimensional vector <math>\tilde{\mathbf{\beta}}</math>, and an extra entry equal to 1 is appended to <math>\ x </math> to form <math>\tilde{x}</math>. Thus, the function to be minimized can now be expressed as:<br />
<br />
<math>\ \min_{\tilde{\beta}} \sum_{i=1}^{n} (y_i - \tilde{\beta}^T \tilde{x_i} )^2 </math><br />
<br />
<math>\ = \min_{\tilde{\beta}} \| y - X^T \tilde{\beta} \|^2 </math><br />
<br />
where <math>\ y </math> and <math>\tilde{\beta}</math> are vectors and <math>\ X </math> is a matrix.<br />
<br />
The solution for <math>\ \tilde{\beta} </math> is<br />
<br />
<math>\ {\tilde{\beta}} = (XX^T)^{-1}Xy </math><br />
<br />
Using regression to solve classification problems is not mathematically correct, if we want to be true to classification. However, this method works well in practice, if the problem is not complicated. When we have only two classes (encoded as <math>\ \frac{-n}{n_1} </math> and <math>\ \frac{n}{n_2}) </math>, this method is identical to LDA.<br />
<br />
==== Matlab Example ====<br />
<br />
The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by transposing the data and appending a row of ones (for the intercept term).<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the data points, coloured according to the class predicted by the fitted model (threshold 0.5).<br />
<br />
[[File: linearregression.png|center|frame|The figure shows the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
===Logistic Regression===<br />
<br />
Logistic regression is a more advanced method for classification, and is<br />
more commonly used. <br />
<br />
We can define a function <br /><br />
<math>f_1(x)= P(Y=1| X=x) = (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math><br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
<br />
<br />
This is a valid conditional probability. It looks similar to a step function, but<br />
we have relaxed it so that we have a smooth curve, and can therefore take the<br />
derivative.<br />
<br />
The range of this function is (0,1) since<br /> <br />
<math>\lim_{x \to -\infty}f_1(\mathbf{x}) = 0</math> and<br />
<math>\lim_{x \to \infty}f_1(\mathbf{x}) = 1</math>.<br />
<br />
As shown on this graph:<br /><br />
http://www.wolframalpha.com/input/?i=Plot[E^x/%281+%2B+E^x%29,+{x,+-10,+10}]%29<br />
<br />
Then we compute the complement of f1(x), and get<br /><br />
<br />
<math>f_2(x)= P(Y=0| X=x) = 1-f_1(x) = (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math>, denoted f2. <br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
<br />
<br />
Function <math>f_2</math> is the complement of the logistic function <math>f_1</math>, and it behaves like <br /><br />
<math>\lim_{x \to -\infty}f_2(\mathbf{x}) = 1</math> and<br /><br />
<math>\lim_{x \to \infty}f_2(\mathbf{x}) = 0</math>.<br />
<br />
As shown on this graph:<br /><br />
http://www.wolframalpha.com/input/?i=Plot[1/%281+%2B+E^x%29,+{x,+-10,+10}]%29<br />
<br />
From here, we can form the conditional density function. To do this, we must combine<br />
<math>f_1</math> and <math>f_2</math> <br />
such that the combined expression reduces to <br />
<math>f_1</math> when <math>y=1</math> (i.e. the point is in class 1), <br />
and to <math>f_2</math> when <math>y=0</math> (i.e. the point is in class 0).<br />
<br />
Eventually, we have our conditional density function formula<br /><br />
<math>f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}</math><br />
<br />
The way to use this formula is: given the training data <math>(x_i, y_i)</math>, fit <math>f(y|\mathbf{x})</math> to the data.<br />
<br />
In general, we can think of the problem as having a box with some knobs. Inside the box is our objective function, which gives the form used to map our input <math>x_i</math> to<br />
our output <math>y_i</math>. The knobs on the box act as the parameters of the objective function. Our job is to find the proper parameters that minimize the error between our output and the true value. So we have turned our machine learning problem into an optimization problem. <br />
<br />
Since we need to find the parameters that maximize the chance of having our observed data coming from the distribution of f(x|parameter), we need to introduce Maximum Likelihood Estimation.<br />
<br />
====Maximum Likelihood Estimation====<br />
<br />
Suppose we are given iid data points <math>({\mathbf{x}_i})_{i=1}^n</math> and a density function <math>f(\mathbf{x}|\mathbf{\theta})</math>, where the form of <math>f</math> is known but the parameters <math>\theta</math> are unknown. The maximum likelihood estimate <math>\theta\,_{ML}</math> is the set of parameters that maximizes the probability of observing <math>({\mathbf{x}_i})_{i=1}^n</math> given <math>\theta\,_{ML}</math>.<br />
<br />
<math>\theta_\mathrm{ML} = \underset{\theta}{\operatorname{arg\,max}}\ f(\mathbf{x}|\theta)</math>.<br />
<br />
There was some discussion in class regarding the notation. In literature, Bayesians use <math>f(\mathbf{x}|\mu)</math> while Frequentists use <math>f(\mathbf{x};\mu)</math>. In practice, these two are equivalent.<br />
<br />
Our goal is to find <math>\theta</math> to maximize <br />
<math>\mathcal{L}(\theta\,) = f\left(({\mathbf{x}_i})_{i=1}^n\;|\;\theta\right) = \prod_{i=1}^n f(\mathbf{x_i}|\theta)</math>. (The second equality holds because the data points are iid.)<br />
<br />
In many cases, it’s more convenient to work with the natural logarithm of the likelihood. (Recall that the logarithm is monotonically increasing, so it preserves maxima and minima.)<br />
<math>\ell(\theta|\mathbf{x})=\ln\mathcal{L}(\theta\,)</math> <br />
<br />
<math>\ell(\theta\,)=\sum_{i=1}^n \ln f(\mathbf{x_i}|\theta)</math><br />
<br />
Applying Maximum Likelihood Estimation to <math>f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}</math>, gives<br />
<br />
<math>\mathcal{L}(\mathbf{\beta\,})=\prod_{i=1}^n (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y_i} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y_i}</math><br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) + (1-y_i) (\ln{1} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}))\right) \\[10pt]&{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - (1-y_i) \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) + y_i \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
<math>\begin{align} {\frac{\partial \ell}{\partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}} \mathbf{x_i} \right) \\[8pt] & {}= \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(\mathbf{x_i} | \mathbf{\beta\,}) \mathbf{x_i}\right) \end{align}</math><br />
<br />
Setting <math>\frac{\partial \ell}{\partial \mathbf{\beta\,}} = 0</math> has no closed-form solution, so it is solved numerically, for example by Newton’s Method.<br />
<br />
====Newton's Method====<br />
<br />
Newton's Method (or the Newton-Raphson method) is a numerical method for finding successively better approximations to the roots of a real-valued function. The roots usually cannot be found in closed form. <br />
<br />
The goal is to find <math>\mathbf{x}</math> such that <math>f(\mathbf{x}) = 0 </math>. The recursion can be implemented by<br />
<math>\mathbf{x_1} = \mathbf{x_0} - \frac{f(\mathbf{x_0})}{f'(\mathbf{x_0})}.\,\!</math><br />
<br />
It starts with an initial guess <math>\mathbf{x_0}</math> and moves in the direction "<math>\mathbf{f(x_{0}) / f' (x_{0})}</math>" toward a better approximation <math>\mathbf{x_1}</math>. Taking this <math>\mathbf{x_1}</math> as the new <math>\mathbf{x_0}</math> and repeating the same step gives a still better approximation; after enough iterations <math>\mathbf{x_1}</math> will be sufficiently close to the actual solution.<br />
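<br />
As an illustration, the following is a minimal Matlab sketch of Newton's method for a scalar root-finding problem. It is only a sketch: the function f and its derivative fprime below are hypothetical examples chosen to make the snippet self-contained.<br />
<pre><br />
% Minimal sketch of Newton's method for a scalar root-finding problem.<br />
% f and fprime are example functions chosen for illustration only.<br />
f      = @(x) x.^3 - 2;        % we seek the root x = 2^(1/3)<br />
fprime = @(x) 3*x.^2;          % derivative of f<br />
<br />
x = 1;                         % initial guess x_0<br />
for iter = 1:20<br />
    step = f(x)/fprime(x);     % Newton step f(x)/f'(x)<br />
    x = x - step;              % update: x_1 = x_0 - f(x_0)/f'(x_0)<br />
    if abs(step) < 1e-12       % stop when the update becomes negligible<br />
        break<br />
    end<br />
end<br />
x                              % approximately 1.2599 = 2^(1/3)<br />
</pre><br />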
<br />
<br />
<br />
===Advantages of Logistic Regression===<br />
<br />
Logistic regression has several advantages over discriminant analysis: <br />
<br />
* It is more robust: the independent variables don't have to be normally distributed, or have equal variance in each group <br />
* It does not assume a linear relationship between the independent variables (IVs) and the dependent variable (DV) <br />
* It may handle nonlinear effects <br />
* You can add explicit interaction and power terms <br />
* The DV need not be normally distributed. <br />
* There is no homogeneity of variance assumption. <br />
* Normally distributed error terms are not assumed. <br />
* It does not require that the independents be interval. <br />
* It does not require that the independents be unbounded.<br />
<br />
==Newton-Raphson Method (Lecture: Oct 11, 2011)==<br />
Previously we derived the log likelihood function for logistic regression. <br />
<br />
<math>\begin{align} L(\beta\,) = \prod_{i=1}^n \left( (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y_i}(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y_i} \right) \end{align}</math><br />
<br />
After taking log, we can have<br />
<br />
<math>\begin{align} \ell(\beta\,) = \sum_{i=1}^n \left( y_i \log{\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}} + (1 - y_i) \log{\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}} \right) \end{align}</math><br />
<br />
which implies that<br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
Our goal is to find the <math>\beta\,</math> that maximizes <math>{\ell(\mathbf{\beta\,})}</math>. We use calculus to do this, i.e. we solve <math>{\frac{\partial \ell}{\partial \mathbf{\beta\,}}}=0</math>. To do this we use the well-known numerical method of Newton-Raphson. This is an iterative method where we calculate the first and second derivatives at each iteration.<br />
<br />
The first derivative is typically called the score vector.<br />
<br />
<math>\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}} \mathbf{x_i} \right) \\[8pt] \end{align}</math><br />
<br />
The negative of the second derivative is typically called the information matrix.<br />
<br />
<math>\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) \right) \\[8pt] \end{align}</math><br />
<br />
We then use the following update formula to calculate successively better estimates of the optimal <math>\beta\,</math>. The choice of the initial estimate <math>\beta\,^{(1)}</math> is typically not important.<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + I^{-1}(\beta\,^{(r)} )S(\beta\,^{(r)} )</math><br />
<br />
====Matrix Notation====<br />
<br />
let <math>\mathbf{y}</math> be a (n x 1) vector of all class labels. This is called the response in other contexts.<br />
<br />
let <math>\mathbb{X}</math> be a (n x (d+1)) matrix of all your features. Each row represents a data point. Each column represents a feature/covariate.<br />
<br />
let <math>\mathbf{p}^{(r)}</math> be a (n x 1) vector with values <math> P(\mathbf{x_i} |\beta\,^{(r)} ) </math><br />
<br />
let <math>\mathbb{W}^{(r)}</math> be a (n x n) diagonal matrix with <math>\mathbb{W}_{ii}^{(r)} {}= P(\mathbf{x_i} |\beta\,^{(r)} )(1 - P(\mathbf{x_i} |\beta\,^{(r)} ))</math><br />
<br />
We can rewrite our score vector, information matrix and update equation in this matrix notation. The first derivative is<br />
<br />
<math>\begin{align} S(\beta\,^{(r)}) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})\end{align}</math><br />
<br />
and the second derivative is<br />
<br />
<math>\begin{align} I(\beta\,^{(r)}) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X} \end{align}</math><br />
<br />
Therefore, the Newton-Raphson update can be written as<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + I^{-1}(\beta\,^{(r)} )S(\beta\,^{(r)} ) {}= \beta\,^{(r)} + (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})</math><br />
<br />
====Iteratively Re-weighted Least Squares====<br />
If we reorganize this updating formula, we can see that it really solves a weighted least squares problem at each iteration, with new weights each time.<br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}(\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X}\beta\,^{(r)} + \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)}))</math><br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T\mathbb{W}^{(r)}\mathbf{z}^{(r)}</math><br />
<br />
where <math> \mathbf{z}^{(r)} = \mathbb{X}\beta\,^{(r)} + (\mathbb{W}^{(r)})^{-1}(\mathbf{y}-\mathbf{p}^{(r)}) </math><br />
<br />
<br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T \underline{\beta})^T(\underline{y}-X^T \underline{\beta})</math><br />
<br />
we have <math>\underline{\hat{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{(r+1)}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{(r+1)} \leftarrow \arg \min_{\underline{\beta}}(\mathbf{z}^{(r)}-\mathbb{X} \underline{\beta})^T \mathbb{W}^{(r)} (\mathbf{z}^{(r)}-\mathbb{X} \underline{\beta})</math><br />
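<br />
For concreteness, the update above can be coded directly. The following Matlab sketch is illustrative only (it is not code from the lecture): it assumes X is an n by (d+1) design matrix with a column of ones appended, and y is an n by 1 vector of 0/1 labels.<br />
<pre><br />
% Minimal IRLS sketch for two-class logistic regression (illustrative only).<br />
% X: n x (d+1) design matrix (last column all ones), y: n x 1 vector of 0/1 labels.<br />
beta = zeros(size(X,2),1);                 % initial estimate beta^(1)<br />
for r = 1:25<br />
    p = exp(X*beta)./(1 + exp(X*beta));    % p_i = P(y_i = 1 | x_i, beta)<br />
    W = diag(p.*(1-p));                    % n x n diagonal weight matrix<br />
    z = X*beta + W\(y - p);                % adjusted response z^(r)<br />
    beta_new = (X'*W*X)\(X'*W*z);          % weighted least squares solve<br />
    if norm(beta_new - beta) < 1e-8        % stop when the update is tiny<br />
        beta = beta_new;<br />
        break<br />
    end<br />
    beta = beta_new;<br />
end<br />
</pre><br />
In practice a small ridge term may be added if the weight matrix becomes nearly singular, but the core of the iteration is exactly the weighted least squares solve above.<br />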
<br />
====Fisher Scoring Method==== <br />
<br />
Fisher Scoring is a method very similar to Newton-Raphson. It uses the expected information matrix instead of the observed information matrix. This distinction simplifies the problem and in particular the computational complexity. To learn more about this method and logistic regression in general you can take Stat431/831 at the University of Waterloo.<br />
<br />
===Multi-class Logistic Regression===<br />
<br />
In a multi-class logistic regression we have K classes. For 2 classes ''k'' and ''l''<br />
<br />
<math>\frac{P(Y=l|X=x)}{P(Y=k|X=x)} = e^{\beta_l^T x}</math><br />
<br />
We call <math>log(\frac{P(Y=l|X=x)}{P(Y=k|X=x)}) = \beta_l^T x</math> the logit transformation. The decision boundary between the two classes is the set of points where the logit transformation is 0.<br />
<br />
For each class from 1 to K-1 we then have:<br />
<br />
<math>log(\frac{P(Y=1|X=x)}{P(Y=K|X=x)}) = \beta_1^T x</math><br />
<br />
<math>log(\frac{P(Y=2|X=x)}{P(Y=K|X=x)}) = \beta_2^T x</math><br />
<br />
<math>\vdots</math><br />
<br />
<math>log(\frac{P(Y=K-1|X=x)}{P(Y=K|X=x)}) = \beta_{K-1}^T x</math><br />
<br />
Note that choosing ''Y=K'' is arbitrary and any other choice is equally valid.<br />
<br />
Based on the above the posterior probabilities are given by: <math>P(Y=k|X=x) = \frac{e^{\beta_k^T x}}{1 + \sum_{i=1}^{K-1}{e^{\beta_i^T x}}}</math><br />
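<br />
As a small illustration, the posterior probabilities above can be evaluated as follows. This is only a sketch: it assumes B is a d by (K-1) matrix whose columns are fitted coefficient vectors <math>\beta_1, \dots, \beta_{K-1}</math> and x is a d by 1 point.<br />
<pre><br />
% Sketch: evaluate multi-class logistic posteriors at a point x (illustrative).<br />
% B: d x (K-1) matrix with columns beta_1, ..., beta_{K-1}; x: d x 1 point.<br />
scores = exp(B'*x);               % e^{beta_k' x} for k = 1, ..., K-1<br />
denom  = 1 + sum(scores);         % 1 + sum of the exponentials<br />
post   = [scores; 1] / denom;     % posteriors for classes 1, ..., K-1 and K<br />
[~, label] = max(post);           % predicted class<br />
</pre><br />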
<br />
===Sample Size Requirements===<br />
<br />
The number of adjustable parameters in linear discriminant analysis (for two classes) is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>, where d is the dimension of the data. Similarly, the number of adjustable parameters in logistic regression is <math>\, d+1</math>. The number of parameters also corresponds to the minimum number of observations needed to compute the coefficients of each function. Techniques do exist, though, for handling high dimensional problems where the number of parameters exceeds the number of observations.<br />
<br />
Linear discriminant analysis involves the inversion of a d x d covariance matrix. When d is bigger than n, the number of observations, this matrix has rank at most n &lt; d and is thus singular. When this is the case, we can either use the pseudo-inverse or perform regularized discriminant analysis (RDA), which solves this problem. In RDA, we define a new covariance matrix <math>\, \Sigma(\gamma) = \gamma\Sigma + (1 - \gamma)diag(\Sigma)</math> with <math>\gamma \in [0,1]</math>. Cross validation can be used to calculate the best <math>\, \gamma</math>. More details on RDA can be found in Guo et al. (2006).<br />
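<br />
A regularized covariance of this form is straightforward to construct; the following one-line Matlab sketch is illustrative only, assuming Sigma is an estimated d by d covariance matrix and gamma is a value in [0,1] chosen, for example, by cross validation.<br />
<pre><br />
% Sketch of the RDA covariance (illustrative): shrink Sigma toward its diagonal.<br />
Sigma_rda = gamma*Sigma + (1-gamma)*diag(diag(Sigma));<br />
</pre><br />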
<br />
Logistic regression can also be modified using shrinkage methods to deal with the problem of having fewer observations than parameters. When maximizing the log likelihood, we can add a <math>-\frac{\lambda}{2}\sum^{K}_{k=1}\|\beta_k\|_{2}^{2}</math> penalization term, where K is the number of classes. The resulting optimization problem is convex and can be solved using Newton's method as given in Zhu and Hastie (2004).<br />
<br />
===Comparison Between Logistic Regression And Linear Discriminant Analysis (LDA)===<br />
<br />
The logistic regression model and linear discriminant analysis are both widely used to analyze data with categorical outcome variables. Both models build linear boundaries to classify the different groups. Also, the categorical outcome variables (i.e. the dependent variables) must be mutually exclusive. <br />
<br />
However, these two models differ in their basic assumptions. While logistic regression is more relaxed and flexible in its assumptions, linear discriminant analysis requires that its explanatory variables be normally distributed, linearly related and have equal covariance matrices within each class. Therefore, linear discriminant analysis can be expected to be more appropriate when the normality and equal covariance assumptions hold for the explanatory variables; in all other situations logistic regression is more appropriate. In addition, the total number of estimates to compute differs between the models. If the explanatory variables have d dimensions, we need to estimate <math>d+1</math> parameters in logistic regression, and the number of parameters grows linearly w.r.t. dimension, while we need to estimate <math>2d+\frac{d(d+1)}{2}+2</math> parameters in linear discriminant analysis, and the number of parameters grows quadratically w.r.t. dimension. <br />
<br />
== Perceptron (Lecture: Oct. 11, 2011) ==<br />
<br />
[[Image:Perceptron1.png|right|thumb|300px|Simple perceptron]]<br />
[[Image:Perceptron2.png|right|thumb|300px|Simple perceptron where <math>\beta_0</math> is defined as 1]]<br />
<br />
<br />
The perceptron is the building block for neural networks. It was invented by Rosenblatt in 1957 at Cornell Labs, and first mentioned in the paper "The Perceptron - a perceiving and recognizing automaton". The perceptron is used on linearly separable data sets.<br />
<br />
For a 2 class problem, and a set of inputs with ''d'' features, a perceptron will use a weighted sum and it will classify the information using the sign of the result. The figures on the right give an example of a perceptron. In these examples, <math>x^i</math> is the ''i''-th feature of a sample and <math>\beta_i</math> is the ''i''-th weight. <math>\beta_0</math> is defined as the bias. The bias alters the position of the decision boundary between the 2 classes.<br />
<br />
Perceptrons are generally trained using [http://en.wikipedia.org/wiki/Gradient_descent gradient descent]. This type of learning can have 2 side effects:<br />
* If the data sets are well separated, the training of the perceptron can lead to multiple valid solutions,<br />
* If the data sets are not linearly separable, the learning algorithm will never finish.<br />
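<br />
To make the training procedure concrete, below is a minimal Matlab sketch of the classic perceptron learning rule, one simple instance of this kind of iterative training. It is illustrative only: it assumes X is a d by n matrix whose columns are training points and y is a 1 by n vector of labels in {-1,+1}.<br />
<pre><br />
% Sketch of the classic perceptron learning rule (illustrative only).<br />
% X: d x n matrix (columns are points), y: 1 x n labels in {-1,+1}.<br />
Xa   = [X; ones(1, size(X,2))];         % append a 1 to each point for the bias<br />
beta = zeros(size(Xa,1), 1);            % weights, including the bias term beta_0<br />
rho  = 1;                               % learning rate<br />
for epoch = 1:100<br />
    errors = 0;<br />
    for i = 1:size(Xa,2)<br />
        if sign(beta'*Xa(:,i)) ~= y(i)          % misclassified point<br />
            beta = beta + rho*y(i)*Xa(:,i);     % move the boundary toward it<br />
            errors = errors + 1;<br />
        end<br />
    end<br />
    if errors == 0, break, end          % converged: training data separated<br />
end<br />
</pre><br />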
<br />
Perceptrons are the simplest kind of a feedforward neural network. A perceptron is the building block for other neural networks such as:<br />
* Multi-layer perceptron<br />
* ADALINE<br />
* MADALINE<br />
<br />
==References==<br />
<references /><br />
<br />
24. Balakrishnama, S., Ganapathiraju, A. LINEAR DISCRIMINANT ANALYSIS - A BRIEF TUTORIAL. http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf [[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf]]</div>S9huhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f11&diff=12531stat841f112011-10-13T07:52:18Z<p>S9hu: /* Sample Size Requirements */</p>
<hr />
<div>==[[f11Stat841proposal| Proposal for Final Project]]==<br />
<br />
==[[f11Stat841EditorSignUp| Editor Sign Up]]==<br />
<br />
= STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20 =<br />
== Wiki Course Notes ==<br />
Students will need to contribute to the wiki for 20% of their grade.<br />
Access via wikicoursenote.com<br />
Go to editor sign-up, and use your UW userid for your account name, and use your UW email.<br />
<br />
primary (10%)<br />
Post a draft of lecture notes within 48 hours. <br />
You will need to do this 1 or 2 times, depending on class size.<br />
<br />
secondary (10%)<br />
Make improvements to the notes for at least 60% of the lectures.<br />
More than half of your contributions should be technical rather than editorial.<br />
There will be a spreadsheet where students can indicate what they've done and when.<br />
The instructor will conduct random spot checks to ensure that students have contributed what they claim.<br />
<br />
<br />
== Classification (Lecture: Sep. 20, 2011) ==<br />
=== Definitions ===<br />
'''classification''': Predict a discrete random variable <math>Y</math> (a label) by using another random variable <math>X</math><br />
(new data point) picked iid from a distribution<br />
<br />
<math>X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d</math> (<math>d</math>-dimensional vector)<br />
<math>Y_i</math> in some finite set <math>\mathcal{Y}</math><br />
<br />
<br />
'''classification rule''':<br />
<math>h : \mathcal{X} \rightarrow \mathcal{Y}</math><br />
Take a new observation <math>X</math> and use a classification function <math>h(x)</math> to generate a label <math>Y</math>. In other words, if we fit the function <math>h(x)</math> with a random variable <math>X</math>, it generates the label <math>Y</math> which is the class to which we predict <math>X</math> belongs.<br />
<br />
Example: Let <math> \mathcal{X}</math> be a set of 2D images and <math>\mathcal{Y}</math> be a finite set of people. We want to learn a classification rule <math>h:\mathcal{X}\rightarrow\mathcal{Y}</math> that with small ''true'' error predicts the person who appears in the image. <br />
<br />
<br />
'''true error rate''' for classifier <math>h</math> is the error with respect to the underlying distribution (that we do not know).<br />
<br />
<math>L(h) = P(h(X) \neq Y )</math><br />
<br />
<br />
'''empirical error rate''' (or training error rate) is the amount of error that our classification function <math>h(x)</math> makes on the training data.<br />
<br />
<math>\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i)</math><br />
<br />
where <math>\mathbf{I}()</math> is an indicator function. Indicator function is defined by <br />
<br />
<math>\mathbf{I}(x) = \begin{cases} <br />
1 & \text{if } x \text{ is true} \\<br />
0 & \text{if } x \text{ is false}<br />
\end{cases}</math><br />
<br />
So in this case,<br />
<math>\mathbf{I}(h(X_i)\neq Y_i) = \begin{cases}<br />
1 & \text{if } h(X_i)\neq Y_i \text{ (i.e. when misclassification happens)} \\<br />
0 & \text{if } h(X_i)=Y_i \text{ (i.e. classified properly)}<br />
\end{cases}</math><br />
<br />
e.g., 100 new data points with known (true) labels<br />
<br />
<math>y_1 = h(x_1)</math><br />
<br />
...<br />
<br />
<math>y_{100} = h(x_{100})</math><br />
<br />
To calculate the empirical error we count how many labels our function <math>h(x)</math> assigned incorrectly and divide by n=100<br />
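<br />
In Matlab the empirical error rate can be computed in a single line. This is just a sketch; yhat is assumed to hold the predicted labels <math>h(x_i)</math> and y the true labels.<br />
<pre><br />
% Sketch: empirical (training) error rate (illustrative).<br />
% yhat: predicted labels h(x_i), y: true labels, both n x 1 vectors.<br />
L_hat = mean(yhat ~= y);    % fraction of misclassified points<br />
</pre><br />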
<br />
=== Bayes Classifier ===<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability<ref> http://www.wikicoursenote.com/wiki/Stat841#Bayes_Classifier </ref>.<br />
<br />
First recall Bayes' Rule, in the format<br />
<math>P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} </math> <br />
<br />
P(Y|X) : ''posterior'' , ''probability of <math>Y</math> given <math>X</math>''<br />
<br />
P(X|Y) : ''likelihood'', ''probability of <math>X</math> being generated by <math>Y</math>''<br />
<br />
P(Y) : ''prior'', ''probability of <math>Y</math> being selected''<br />
<br />
P(X) : ''marginal'', ''probability of obtaining <math>X</math>''<br />
<br />
<br />
We will start with the simplest case: <math>\mathcal{Y} = \{0,1\}</math><br />
<br />
<math> r(x) <br />
= P(Y=1|X=x) <br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)}<br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
Bayes' rule can be approached by computing either:<br />
<br />
1) '''The posterior''': <math>\ P(Y=1|X=x) </math> and <math>\ P(Y=0|X=x) </math> or <br />
<br />
2) '''The likelihood''': <math>\ P(X=x|Y=1) </math> and <math>\ P(X=x|Y=0) </math><br />
<br />
<br />
The former reflects a '''Bayesian''' approach. The Bayesian approach uses previous beliefs and observed data (e.g., the random variable <math>\ X </math>) to determine the probability distribution of the parameter of interest (e.g., the random variable <math>\ Y </math>). The probability, according to Bayesians, is a ''degree of belief'' in the parameter of interest taking on a particular value (e.g., <math>\ Y=1 </math>), given a particular observation (e.g., <math>\ X=x </math>). Historically, the difficulty in this approach lies with determining the posterior distribution, however, more recent methods such as '''Markov Chain Monte Carlo (MCMC)''' allow the Bayesian approach to be implemented <ref name="PCAustin">P. C. Austin, C. D. Naylor, and J. V. Tu, "A comparison of a Bayesian vs. a frequentist method for profiling hospital performance," ''Journal of Evaluation in Clinical Practice'', 2001</ref>.<br />
<br />
The latter reflects a '''Frequentist''' approach. The Frequentist approach assumes that the probability distribution, including the mean, variance, etc., is fixed for the parameter of interest (e.g., the variable <math>\ Y </math>, which is ''not'' random). The observed data (e.g., the random variable <math>\ X </math>) is simply a ''sampling'' of a far larger population of possible observations. Thus, a certain repeatability or ''frequency'' is expected in the observed data. If it were possible to make an infinite number of observations, then the true probability distribution of the parameter of interest can be found. In general, frequentists use a technique called '''hypothesis testing''' to compare a ''null hypothesis'' (e.g. an assumption that the mean of the probability distribution is <math>\ \mu_0 </math>) to an alternative hypothesis (e.g. assuming that the mean of the probability distribution is larger than <math>\ \mu_0 </math>) <ref name="PCAustin"/>. For more information on hypothesis testing see <ref>R. Levy, "Frequency hypothesis testing, and contingency tables" class notes for LING251, Department of Linguistics, University of California, 2007. Available: [http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf] </ref>. <br />
<br />
There was some class discussion on which approach should be used. Both the ease of computation and the validity of both approaches were discussed. A main point that was brought up in class is that Frequentists consider X to be a random variable, but they do not consider Y to be a random variable because it has to take on one of the values from a fixed set (in the above case it would be either 0 or 1 and there is only one ''correct'' label for a given value X=x). Thus, from a Frequentist's perspective it does not make sense to talk about the probability of Y. This is actually a grey area and sometimes ''Bayesians'' and ''Frequentists'' use each others' approaches. So using ''Bayes' rule'' doesn't necessarily mean you're a ''Bayesian''. Overall, the question remains unresolved.<br />
<br />
<br />
The '''Bayes Classifier''' uses <math>\ P(Y=1|X=x)</math><br />
<br />
<math> P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
P(Y=1) : the prior, based on belief/evidence beforehand<br />
<br />
denominator : marginalized by summation<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The set <math>\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \} </math><br />
<br />
which defines a ''decision boundary''.<br />
<br />
<math>h^*(x) = <br />
\begin{cases}<br />
1 \ \ if \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ \ \ \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
''Theorem'': Bayes rule is optimal. I.e., if h is any other classification rule, <br />
then <math>L(h^*) \leq L(h)</math><br />
(This is to be proved in homework.)<br />
<br />
Why then do we need other classification methods?<br />
A: Because X densities are often/typically unknown, i.e., <math>f_k(x)</math> and/or <math>\pi_k</math> are unknown.<br />
<br />
<math>P(Y=k|X=x) = \frac{P(X=x|Y=k)P(Y=k)} {P(X=x)} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}</math><br />
<math>f_k(x)</math> is referred to as the class conditional distribution (i.e. the likelihood).<br />
<br />
Therefore, we rely on some data to estimate quantities.<br />
<br />
=== Three Main Approaches ===<br />
<br />
'''1. Empirical Risk Minimization''':<br />
Choose a set of classifiers H (e.g., line, neural network) and find <math>h^* \in H</math><br />
that minimizes (some estimate of) L(h).<br />
<br />
'''2. Regression''':<br />
Find an estimate (<math>\hat{r}</math>) of function <math>r</math> and define<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The <math> 1/2 </math> in the expression above is a threshold set for the regression prediction output. <br />
<br />
In general ''regression'' refers to finding a continuous, real valued y. The problem here is more difficult, because of the restricted domain (y is a set of discrete label values).<br />
<br />
'''3. Density Estimation''':<br />
Estimate <math>P(X=x|Y=0)</math> from <math>X_i</math>'s for which <math>Y_i = 0</math><br />
Estimate <math>P(X=x|Y=1)</math> from <math>X_i</math>'s for which <math>Y_i = 1</math><br />
and let <math>\hat{P}(Y=1) = (1/n) \sum_{i=1}^{n} Y_i</math><br />
<br />
Define <math>\hat{r}(x) = \hat{P}(Y=1|X=x)</math> and<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
It is possible that there may not be enough data to estimate from for ''density estimation''. But the main problem lies with high dimensional spaces, as the estimation results may not be good (high error rate) and sometimes even infeasible. The term ''curse of dimensionality'' was coined by Bellman <ref>R. E. Bellman, ''Dynamic Programming''. Princeton University Press,<br />
1957</ref> to describe this problem.<br />
<br />
As the dimension of the space goes up, the learning requirements go up exponentially.<br />
<br />
To Learn more about methods for handling high-dimensional data <ref> https://docs.google.com/viewer?url=http%3A%2F%2Fwww.bios.unc.edu%2F~dzeng%2FBIOS740%2Flecture_notes.pdf</ref><br />
<br />
=== Multi-Class Classification ===<br />
Generalize to the case where Y takes on k>2 values.<br />
<br />
<br />
''Theorem'': <math>Y \in \mathcal{Y} = \{1,2,..., k\} </math> optimal rule<br />
<br />
<math>\ h^{*}(x) = argmax_k P(Y=k|X=x) </math> <br />
<br />
where <math>P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum_r f_r(x) \pi_r}</math><br />
<br />
===Examples of Classification===<br />
<br />
* Face detection in images.<br />
* Medical diagnosis.<br />
* Detecting credit card fraud (fraudulent or legitimate).<br />
* Speech recognition.<br />
* Handwriting recognition.<br />
<br />
== LDA and QDA ==<br />
<br />
'''Discriminant function analysis''' finds features that best allow discrimination between two or more classes. The approach is similar to '''analysis of variance (ANOVA)''' in that discriminant function analysis looks at the mean values to determine if two or more classes are very different and should be separated. Once the discriminant functions (that separate two or more classes) have been determined, new data points can be classified (i.e. placed in one of the classes) based on the discriminant functions <ref> StatSoft, Inc. (2011). ''Electronic Statistics Textbook.'' [Online]. Available: [http://www.statsoft.com/textbook/discriminant-function-analysis/ http://www.statsoft.com/textbook/discriminant-function-analysis/.] </ref>. '''Linear discriminant analysis (LDA)''' and '''Quadratic discriminant analysis (QDA)''' are methods of discriminant analysis that are best applied to linearly and quadratically separable classes, respectively. '''Fisher discriminant analysis (FDA)''' is another method of discriminant analysis that is different from linear discriminant analysis, but oftentimes both terms are used interchangeably.<br />
<br />
=== LDA ===<br />
<br />
The simplest method is to use approach 3 (above) and assume a parametric model for densities. Assume class conditional is Gaussian.<br />
<br />
<math>\mathcal{Y} = \{ 0,1 \}</math> assumed (i.e., 2 labels)<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
<math>P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k \pi_k} \ \ </math> (denom = P(x))<br />
<br />
1) Assume Gaussian distributions<br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} exp(-(1/2)(\mathbf{x} - \mathbf{\mu_k})^T \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) )</math><br />
<br />
must compare <br />
<math>\frac{f_1(x) \pi_1} {p(x)}</math> with <math>\frac{f_0(x) \pi_0} {p(x)}</math><br />
Note that the p(x) denom can be ignored:<br />
<math>f_1(x) \pi_1</math> with <math>f_0(x) \pi_0 </math><br />
<br />
To find the decision boundary, set <br />
<math>f_1(x) \pi_1 = f_0(x) \pi_0 </math><br />
<br />
2) Assume <math>\Sigma_1 = \Sigma_0</math>, we can use <math>\Sigma = \Sigma_0 = \Sigma_1</math>.<br />
<br />
Cancel <math>(2\pi)^{-d/2} |\Sigma_k|^{-1/2}</math> from both sides.<br />
<br />
Take log of both sides.<br />
<br />
Subtract one side from both sides, leaving zero on one side.<br />
<br />
<br />
<math>-(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)] = 0 </math><br />
<br />
<br />
<math>(1/2)[-\mathbf{x}^T \Sigma^{-1}\mathbf{x} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x}<br />
+ \mathbf{x}^T \Sigma^{-1}\mathbf{x} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ]<br />
+ log(\pi_1/\pi_0) = 0 </math><br />
<br />
<br />
Cancelling out the terms quadratic in <math>\mathbf{x}</math> and rearranging results in <br />
<br />
<math>(1/2)[-\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0}<br />
+ (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_0}^T \Sigma^{-1}) \mathbf{x}]<br />
+ log(\pi_1/\pi_0) = 0 </math><br />
<br />
<br />
We can see that the first pair of terms is constant, and the second pair is linear in x.<br />
Therefore, we end up with something of the form <br />
<math>a^T x + b = 0</math>.<br />
For more about LDA <ref>http://sites.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf</ref><br />
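<br />
To make the linear form explicit, the coefficients of the two-class boundary <math>a^Tx + b = 0</math> can be computed directly from the class parameters. The Matlab sketch below is illustrative only; it assumes mu0 and mu1 are the d by 1 class means, Sigma is the shared covariance matrix, and pi0, pi1 are the class priors.<br />
<pre><br />
% Sketch: coefficients of the two-class LDA boundary a'*x + b = 0 (illustrative).<br />
% mu0, mu1: d x 1 class means; Sigma: shared covariance; pi0, pi1: class priors.<br />
a = Sigma \ (mu1 - mu0);                                              % linear term<br />
b = -0.5*(mu1'*(Sigma\mu1)) + 0.5*(mu0'*(Sigma\mu0)) + log(pi1/pi0);  % constant term<br />
% classify a new point x: label 1 if a'*x + b > 0, label 0 otherwise<br />
</pre><br />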
<br />
== LDA and QDA Continued (Lecture: Sep. 22, 2011) == <br />
<br />
If we relax assumption 2 (i.e. <math>\Sigma_1 \neq \Sigma_0</math>) then we get a quadratic decision boundary that can be written as<br />
<math>{x}^T a {x}+b^T{x} + c = 0</math><br />
<br />
===Generalizing LDA and QDA===<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math><br />
<br />
Where<br />
<br />
<math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math><br />
<br />
When the Gaussian covariance matrices are equal, <math>\Sigma_1 = \Sigma_0</math> (i.e. LDA), then<br />
<br />
<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math><br />
<br />
(To compute this, we need to calculate the value of <math>\,\delta </math> for each class, and then take the one with the max. value).<br />
<br />
===In practice===<br />
We estimate the prior to be the chance that a random item from the collection belongs to class k, e.g.<br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
The mean to be the average item in set k, e.g.<br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
and calculate the covariance of each class e.g.<br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
If we wish to use LDA we must calculate a common covariance, so we average all the covariances e.g.<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{r=1}^{k}n_r} </math><br />
<br />
Where: <math>\,n_r</math> is the number of data points in class <math>\,r</math>, <math>\,\Sigma_r</math> is the covariance of class <math>\,r</math>, <math>\,n</math> is the total number of data points, and <math>\,k</math> is the number of classes.<br />
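<br />
These estimates translate directly into code. The Matlab sketch below is illustrative only; it assumes X is a d by n matrix whose columns are the training points and y is a 1 by n vector of class labels 1, ..., K.<br />
<pre><br />
% Sketch: estimating LDA/QDA parameters from data (illustrative only).<br />
% X: d x n matrix (columns are points), y: 1 x n labels in 1..K.<br />
[d, n] = size(X);<br />
K = max(y);<br />
Sigma_pooled = zeros(d,d);<br />
for k = 1:K<br />
    Xk           = X(:, y == k);                  % points in class k<br />
    nk           = size(Xk, 2);<br />
    pi_hat(k)    = nk / n;                        % estimated prior<br />
    mu_hat(:,k)  = mean(Xk, 2);                   % estimated mean<br />
    Xc           = Xk - repmat(mu_hat(:,k), 1, nk);<br />
    Sigma_hat{k} = (Xc*Xc') / nk;                 % class covariance (QDA)<br />
    Sigma_pooled = Sigma_pooled + nk*Sigma_hat{k};<br />
end<br />
Sigma_pooled = Sigma_pooled / n;                  % common covariance (LDA)<br />
</pre><br />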
<br />
===Computation===<br />
<br />
For QDA we need to calculate: <math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math><br />
<br />
Let's first consider the case when <math>\, \Sigma_k = I, \forall k </math>. This is the case where each distribution is spherical around its mean point.<br />
<br />
====Case 1====<br />
When <math>\, \Sigma_k = I </math><br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
but <math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
and <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math> is the [http://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_Distance squared Euclidean distance] between two points <math>\,x</math> and <math>\,\mu_k</math><br />
<br />
Thus in this condition, a new point can be classified by its distance away from the center of a class, adjusted by some prior.<br />
<br />
Further, for a two-class problem with equal priors, the decision boundary is the perpendicular bisector of the segment joining the two class means.<br />
<br />
====Case 2==== <br />
When <math>\, \Sigma_k \neq I </math><br />
<br />
Using the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\, \Sigma_k</math><br />
we get <math> \, \Sigma_k = U_kS_kV_k^\top</math>. In particular, <math>\, U_k</math> is a collection of eigenvectors of <math>\, \Sigma_k\Sigma_k^*</math>, and <math>\, V_k</math> is a collection of eigenvectors of <math>\,\Sigma_k^*\Sigma_k</math>.<br />
Since <math>\, \Sigma_k</math> is a symmetric matrix<ref> http://en.wikipedia.org/wiki/Covariance_matrix#Properties </ref>, <math>\, \Sigma_k = \Sigma_k^*</math>, so we have <math> \, \Sigma_k = U_kS_kU_k^\top </math>.<br />
<br />
For <math>\,\delta_k</math>, the second term becomes what is also known as the Mahalanobis distance <ref>P. C. Mahalanobis, "On The Generalised Distance in Statistics," ''Proceedings of the National Institute of Sciences of India'', 1936</ref> :<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^T(x-\mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
We can think of <math> \, S_k^{-\frac{1}{2}}U_k^\top </math> as a linear transformation that takes points in class <math>\,k</math> and distributes them spherically around a point, like in Case 1. Thus when we are given a new point, we can apply the modified <math>\,\delta_k</math> values to calculate <math>\ h^*(\,x)</math>. After applying the singular value decomposition, <math>\,\Sigma_k^{-1}</math> is effectively replaced by an identity matrix, such that<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}[(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k)] + log (\pi_k) </math><br />
<br />
and,<br />
<br />
<math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
For applying the above method with classes that have different covariance matrices (for example the covariance matrices <math>\ \Sigma_0 </math> and <math>\ \Sigma_1 </math> for the two class case), each of the covariance matrices has to be decomposed using SVD to find the according transformation. Then, each new data point has to be transformed using each transformation to compare its distance to the mean of each class (for example for the two class case, the new data point would have to be transformed by the class 1 transformation and then compared to <math>\ \mu_0 </math> and the new data point would also have to be transformed by the class 2 transformation and then compared to <math>\ \mu_1 </math>).<br />
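<br />
The whitening idea can be sketched in Matlab as follows. This is illustrative only; it assumes Sigma_k and mu_k are the estimated covariance and mean of class k, pi_k is its prior, and x is a new point. The transformation <math> \, S_k^{-\frac{1}{2}}U_k^\top </math> turns the class-k Mahalanobis distance into an ordinary Euclidean distance.<br />
<pre><br />
% Sketch: class-k whitening transform and discriminant value (illustrative only).<br />
% Sigma_k: class covariance, mu_k: class mean, pi_k: class prior, x: new point.<br />
[U, S, ~] = svd(Sigma_k);                    % Sigma_k = U*S*U' (symmetric matrix)<br />
T = diag(1./sqrt(diag(S))) * U';             % whitening transform S^(-1/2)*U'<br />
maha_sq = norm(T*x - T*mu_k)^2;              % squared Mahalanobis distance<br />
delta_k = -0.5*log(det(Sigma_k)) - 0.5*maha_sq + log(pi_k);   % discriminant value<br />
</pre><br />
The same computation is repeated with each class's own transformation, and the new point is assigned to the class with the largest <math>\,\delta_k</math>.<br />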
<br />
<br />
The difference between [[#Case 1 | Case 1]] and [[#Case 2 | Case 2]] (i.e. the difference between using the Euclidean and Mahalanobis distance) can be seen in the illustration below. <br />
<br />
[[File:EuclideanVsMahalonobisDistance2.PNG|frame|center|Illustration of Euclidean distance (a) and Mahalanobis distance (b) where the contours represent equidistant points from the center using each distance metric. Source: <ref>R. De Maesschalck, D. Jouan-Rimbaud and D. L. Massart, "Tutorial - The Mahalanobis distance," ''Chemometrics and Intelligent Laboratory Systems'', 2000 </ref>]]<br />
<br />
As can be seen from the illustration above, the Mahalanobis distance takes into account the distribution of the data points, whereas the Euclidean distance would treat the data as though it has a spherical distribution. Thus, the Mahalanobis distance applies for the more general classification in [[#Case 2 | Case 2]], whereas the Euclidean distance applies to the special case in [[#Case 1 | Case 1]] where the data distribution is assumed to be spherical.<br />
<br />
Generally, we can conclude that QDA provides a better fit to the data than LDA, because LDA assumes that the covariance matrix is identical for each class, while QDA does not. QDA still uses a Gaussian distribution as the class conditional distribution. In practice the data are not always Gaussian, so other class conditional distributions may be needed.<br />
<br />
== Principal Component Analysis (PCA) (Lecture: Sep. 27, 2011) ==<br />
<br />
'''Principal Component Analysis (PCA)''' is a method of dimensionality reduction/feature extraction that transforms the data from a D dimensional space into a new coordinate system of dimension d, where d <= D (the worst case would be to have d=D). The goal is to preserve as much of the variance in the original data as possible when switching coordinate systems. Given data on D variables, the hope is that the data points will lie mainly in a linear subspace of dimension lower than D. In practice, the data will usually not lie precisely in some lower dimensional subspace.<br />
<br />
<br />
The new variables that form a new coordinate system are called '''principal components''' (PCs). PCs are denoted by <math>\ u_1, u_2, ... , u_D </math>. The principal components form a basis for the data. Since PCs are orthogonal linear transformations of the original variables, there are at most D PCs. Normally, not all of the D PCs are used but rather a subset of d PCs, <math>\ u_1, u_2, ... , u_d </math>, to approximate the space spanned by the original data points <math>\ x_1, x_2, ... , x_D </math>. We can choose d based on what percentage of the variance in the original data we would like to maintain. <br />
<br />
Let <math>\ PC_j</math> be a linear combination of <math>\ x_1, x_2, ... , x_D </math> defined by the coefficients <br />
<math>\ w^{(j)}</math> = <math> ( {w_1}^{(j)}, {w_2}^{(j)},...,{w_D}^{(j)} )^T </math><br />
<br />
Thus, <math> u_j = {w_1}^{(j)} x_1 + {w_2}^{(j)} x_2 + ... + {w_D}^{(j)} x_D = w^{(j)^T} X </math><br />
<br />
<br />
This is a unique configuration since it sets up the PCs in order from maximum to minimum variances. The first PC, <math>\ u_1 </math> is called '''first principal component''' and has the maximum variance, thus it accounts for the most significant variance in the data <math>\ x_1, x_2, ... , x_D </math>. The second PC, <math>\ u_2 </math> is called '''second principal component''' and has the second highest variance and so on until PC, <math>\ u_D </math> which has the minimum variance. <br />
<br />
<br />
To get the first principal component, we would like to use the following equation:<br />
<br />
<math>\ max (Var(w^T X)) = max (w^T S w) </math> <br />
<br />
Where <math>\ S </math> is the covariance matrix. And we solve for <math>\ w </math>.<br />
<br />
<br />
Note: we require the constraint <math>\ w^T w = 1 </math> because if there is no constraint on the length of <math>\ w </math> then there is no upper bound. With the constraint, the direction and not the length that maximizes the variance can be found. <br />
<br />
<br />
====Lagrange Multiplier====<br />
<br />
Before we proceed, we should review Lagrange multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
<br />
Lagrange multipliers are used to find the maximum or minimum of a function <math>\displaystyle f(x,y)</math> subject to constraint <math>\displaystyle g(x,y)=0</math> <br />
<br />
we define a new constant <math> \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle f(x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example :====<br />
Suppose we want to maximize the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to find the maximum value for the function <math>\displaystyle f </math>; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1+2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1+2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=x^2+y^2-1=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain two stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to understand which one is the maximum, we just need to substitute them into <math>\displaystyle f(x,y)</math> and see which one gives the bigger value. In this case the maximum is <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
<br />
===Determining w :===<br />
<br />
Use the Lagrange multiplier conversion to obtain:<br />
<math>\displaystyle L(w, \lambda) = w^T Sw - \lambda (w^T w - 1)</math> where <math>\displaystyle \lambda </math> is a constant <br />
<br />
Take the derivative and set it to zero:<br />
<math>\displaystyle{\partial L \over{\partial w}} = 0 </math><br />
<br />
<br />
To obtain: <br />
<math>\displaystyle 2Sw - 2 \lambda w = 0</math><br />
<br />
<br />
Rearrange to obtain:<br />
<math>\displaystyle Sw = \lambda w</math><br />
<br />
<br />
where <math>\displaystyle w</math> is eigenvector of <math>\displaystyle S </math> and <math>\ \lambda </math> is the eigenvalue of <math>\displaystyle S </math> as <math>\displaystyle Sw= \lambda w </math> , and <math>\displaystyle w^T w=1</math> , then we can write<br />
<br />
<math>\displaystyle w^T Sw= w^T\lambda w= \lambda w^T w =\lambda </math> <br />
<br />
Note that the PCs decompose the total variance in the data in the following way :<br />
<br />
<math> \sum_{i=1}^{D} Var(u_i) </math><br />
<br />
<math>= \sum_{i=1}^{D} (\lambda_i) </math> <br />
<br />
<math>\ = Tr(S) </math><br />
<br />
<math>= \sum_{i=1}^{D} Var(x_i)</math><br />
<br />
== Principal Component Analysis (PCA) Continued (Lecture: Sep. 29, 2011) == <br />
As can be seen from the above expressions, <math>\ Var(W^\top X) = W^\top S W= \lambda </math> where lambda is an eigenvalue of the sample covariance matrix <math>\ S </math> and <math>\ W</math> is its corresponding eigenvector. So <math>\ Var(u_i) </math> is maximized if <math>\ \lambda_i </math> is the maximum eigenvalue of <math>\ S </math> and the first principal component (PC) is the corresponding eigenvector. Each successive PC can be generated in the above manner by taking the eigenvectors of <math>\ S</math><ref>www.wikipedia.org/wiki/Eigenvalues_and_eigenvectors</ref> that correspond to the eigenvalues:<br />
<br />
<math>\ \lambda_1 \geq ... \geq \lambda_D </math> <br />
<br />
such that <br />
<br />
<math>\ Var(u_1) \geq ... \geq Var(u_D) </math><br />
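<br />
This can be checked numerically. The Matlab sketch below is illustrative only; it assumes X is a d by n matrix of data points whose mean has already been subtracted, and it recovers the principal components as eigenvectors of the sample covariance matrix, ordered by decreasing eigenvalue.<br />
<pre><br />
% Sketch: principal components as eigenvectors of the sample covariance (illustrative).<br />
% X: d x n matrix of centered data points (mean already subtracted).<br />
S = (X*X') / (size(X,2) - 1);               % sample covariance matrix<br />
[W, L] = eig(S);                            % columns of W are eigenvectors<br />
[lambda, idx] = sort(diag(L), 'descend');   % order eigenvalues, largest first<br />
W  = W(:, idx);                             % reorder eigenvectors to match<br />
w1 = W(:,1);                                % first principal component direction<br />
u1 = w1' * X;                               % projection of the data onto the first PC<br />
</pre><br />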
<br />
=== Alternative Derivation ===<br />
Another way of looking at PCA is to consider PCA as a projection from a higher D-dimension space to a lower d-dimensional subspace that minimizes the squared ''reconstruction error''. The squared reconstruction error is the difference between the original data set <math>\ X </math> and the new data set <math> \hat{X} </math> obtained by first projecting the original data set into a lower d-dimensional subspace and then projecting it back into the original higher D-dimension space. Since information is (normally) lost by compressing the original data into a lower d-dimensional subspace, the new data set will (normally) differ from the original data even though both are part of the higher D-dimension space. The reconstruction error is computed as shown below.<br />
<br />
====Reconstruction Error====<br />
<br />
<math> e = \sum_{i=1}^{n} || x_i - \hat{x}_i ||^2 </math><br />
<br />
====Minimize Reconstruction Error====<br />
<br />
Suppose the data are centred, i.e. <math> \bar{x} = 0 </math> (otherwise replace each <math> x_i </math> by <math> x_i - \bar{x} </math>).<br />
<br />
Let <math>\ f(y) = U_d y </math> where <math>\ U_d </math> is a D by d matrix with d orthogonal unit vectors as columns.<br />
<br />
Fit the model to the data and minimize the reconstruction error:<br />
<br />
<math>\ min_{U_d, y_i} \sum_{i=1}^n || x_i - U_d y_i ||^2 </math><br />
<br />
Differentiate with respect to <math>\ y_i </math>:<br />
<br />
<math> \frac{\partial e}{\partial y_i} = 0 </math><br />
<br />
we can rewrite reconstruction-error as : <math>\ e = \sum_{i=1}^n(x_i - U_d y_i)^T(x_i - U_d y_i) </math><br />
<br />
<math>\ \frac{\partial e}{\partial y_i} = -2U_d^T(x_i - U_d y_i) = 0 </math><br />
<br />
Since the columns of <math>\ U_d </math> are orthonormal, <math>\ U_d^T U_d = I </math>, so the equation above gives<br />
<br />
<math>\ U_d^T x_i - y_i = 0 </math> or equivalently,<br />
<br />
<math>\ y_i = U_d^T x_i </math><br />
<br />
Find the orthogonal matrix <math>\ U_d </math>:<br />
<br />
<math>\ min_{U_d} \sum_{i=1}^n || x_i - U_d U_d^T x_i||^2 </math><br />
<br />
====Using SVD====<br />
<br />
A unique solution can be obtained by finding the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\ X </math>:<br />
<br />
<math>\ X = U S V^T </math><br />
<br />
For each rank d, <math>\ U_d </math> consists of the first d columns of <math>\ U </math>. Also, the covariance matrix can be expressed as follows <math>\ S = \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T </math>.<br />
<br />
Simply put, by subtracting the mean of each of the data point features and then applying SVD, one can find the principal components:<br />
<br />
<math> \tilde{X} = X - \mu </math><br />
<br />
<math>\ \tilde{X} = U S V^T </math><br />
<br />
Where <math>\ X </math> is a d by n matrix of data points and the features of each data point form a column in <math>\ X </math>. Also, <math>\ \mu </math> is a d by n matrix with identical columns each equal to the mean of the <math>\ x_i</math>'s, ie <math>\mu_{:,j}=\frac{1}{n}\sum_{i=1}^n x_i </math>. Note that the arrangement of data points is a convention and indeed in Matlab or conventional statistics, the transpose of the matrices in the above formulae is used.<br />
<br />
As the <math>\ S </math> matrix from the SVD has the eigenvalues arranged from largest to smallest, the corresponding eigenvectors in the <math>\ U </math> matrix from the SVD will be such that the first column of <math>\ U </math> is the first principal component and the second column is the second principal component and so on.<br />
<br />
=== Examples ===<br />
<br />
Note that in the Matlab code in the examples below, the mean was not subtracted from the datapoints before performing SVD. This is what was shown in class. However, to properly perform PCA, the mean should be subtracted from the datapoints.<br />
<br />
==== Example 1 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 560 by 1965. 560 is the number of elements in each column. Each column is a vector representation of a 20x28 grayscale pixel image of a face (see image below) and there is a total of 1965 different images of faces. Each of the images is corrupted by noise, but the noise can be removed by projecting the data onto a few principal components and then back to the original space, keeping as many dimensions as one likes (e.g. 2, 3, 4 or 5). The corresponding Matlab commands are shown below:<br />
[[File:FreyFaceExample.PNG|thumb|185px|An example of the face images used in [[#Example 1 | Example 1]] with noise removed. Source: <ref>S. Roweis (2011). ''Data for MATLAB.'' [Online]. Available: [http://cs.nyu.edu/~roweis/data.html http://cs.nyu.edu/~roweis/data.html.] |</ref>]]<br />
<pre style="align:left; width: 75%; padding: 2% 2%"><br />
>> % start with a 560 by 1965 matrix X that contains the data points<br />
>> load(noisy.mat);<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 10 by reshaping column 10 into a 20 by 28 matrix<br />
>> imagesc(reshape(X(:,10),20,28)')<br />
>> <br />
>> % perform SVD, if X matrix if full rank, will obtain 560 PCs<br />
>> [S U V] = svd(X);<br />
>> <br />
>> % reconstruct X ( project X onto the original space) using only the first ten principal components<br />
>> Y_pca = U(:, 1:10)'*X;<br />
>> <br />
>> % show image in column 10 of X_hat which is now a 560 by 1965 matrix<br />
>> imagesc(reshape(X_hat(:,10),20,28)')<br />
</pre><br />
The reason why the noise is removed in the reconstructed image is that the noise does not create a major variation in a single direction in the original data. Hence, the first ten PCs taken from the <math>\ U </math> matrix are not in the direction of the noise, and reconstructing the image using only the first ten PCs removes the noise.<br />
<br />
==== Example 2 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 64 by 400. 64 is the number of elements in each column. Each column is a vector representation of a 8x8 grayscale pixel image of either a handwritten number ''2'' or a handwritten number ''3'' (see image below) and there are a total of 400 different images, where the first 200 images show a handwritten number ''2'' and the last 200 images show a handwritten number ''3''. <br />
[[File:Handwritten23.PNG|frame|center|An example of the handwritten number images used in [[#Example 2 | Example 2]]. Source: <ref>A. Ghodsi, "PCA" class notes for STAT841, Department of Statistics and Actuarial Science, University of Waterloo, 2011. </ref>]]<br />
<br />
The corresponding Matlab commands for performing PCA on the data points are shown below:<br />
<pre><br />
>> % start with a 64 by 400 matrix X that contains the data points<br />
>> load 2_3.mat;<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 2 by reshaping column 2 into a 8 by 8 matrix<br />
>> imagesc(reshape(X(:,2),8,8))<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 64 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % project data down onto the first two PCs<br />
>> Y = U(:,1:2)'*X;<br />
>> <br />
>> % show Y as an image (can see the change in the first PC at column 200,<br />
>> % when the handwritten number changes from 2 to 3)<br />
>> imagesc(Y)<br />
>> <br />
>> % perform PCA using Matlab build-in function (do not use for assignment)<br />
>> % also note that due to the Matlab convention, the transpose of X is used<br />
>> [COEFF, Y] = princomp(X');<br />
>> <br />
>> % again, use the first two PCs<br />
>> Y = Y(:,1:2);<br />
>> <br />
>> % use plot digits to show the distribution of images on the first two PCs<br />
>> images = reshape(X, 8, 8, 400);<br />
>> plotdigits(images, Y, .1, 1);<br />
</pre><br />
Using the ''plotdigits'' function in Matlab clearly illustrates that the first PC captured the difference between the numbers ''2'' and ''3'', as they are projected onto different regions of the axis for the first PC. Also, the second PC captured the ''tilt'' of the handwritten numbers, as numbers tilted to the left or to the right were projected onto different regions of the axis for the second PC.<br />
<br />
==== Example 3 ====<br />
(Not discussed in class) In the news recently was a story that captures some of the ideas behind PCA. Over the past two years, Scott Golder and Michael Macy, researchers from Cornell University, collected 509 million Twitter messages from 2.4 million users in 84 different countries. The data they used were words collected at various times of day and they classified the data into two different categories: positive emotion words and negative emotion words. Then, they were able to study this new data to evaluate subjects' moods at different times of day, while the subjects were in different parts of the world. They found that the subjects generally exhibited positive emotions in the mornings and late evenings, and negative emotions mid-day. They were able to "project their data onto a smaller dimensional space" using PCA. Their paper, "Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures," is available in the journal Science.<ref>http://www.pcworld.com/article/240831/twitter_analysis_reveals_global_human_moodiness.html</ref>.<br />
<br />
Assumptions Underlying Principal Component Analysis can be found here<ref>http://support.sas.com/publishing/pubcat/chaps/55129.pdf</ref><br />
<br />
==== Example 4 ====<br />
(Not discussed in class) A somewhat well known learning rule in the field of neural networks called Oja's rule can be used to train networks of neurons to compute the principal component directions of data sets. <ref>A Simplified Neuron Model as a Principal Component Analyzer. Erkki Oja. 1982. Journal of Mathematical Biology. 15: 267-273</ref> This rule is formulated as follows<br />
<br />
<math>\,\Delta w = \eta yx -\eta y^2w </math><br />
<br />
where <math>\,\Delta w </math> is the neuron weight change, <math>\,\eta</math> is the learning rate, <math>\,y</math> is the neuron output given the current input, <math>\,x</math> is the current input and <math>\,w</math> is the current neuron weight. This learning rule shares some similarities with another method for calculating principal components: power iteration. The basic algorithm for power iteration (taken from wikipedia: <ref>Wikipedia. http://en.wikipedia.org/wiki/Principal_component_analysis#Computing_principal_components_iteratively</ref>) is shown below <br />
<br />
<br />
<math>\mathbf{p} =</math> a random vector<br />
do ''c'' times:<br />
<math>\mathbf{t} = 0</math> (a vector of length ''m'')<br />
for each row <math>\mathbf{x} \in \mathbf{X^T}</math><br />
<math>\mathbf{t} = \mathbf{t} + (\mathbf{x} \cdot \mathbf{p})\mathbf{x}</math><br />
<math>\mathbf{p} = \frac{\mathbf{t}}{|\mathbf{t}|}</math><br />
return <math>\mathbf{p}</math><br />
<br />
Comparing this with the neuron learning rule we can see that the term <math>\, \eta y x </math> is very similar to the <math>\,\mathbf{t}</math> update equation in the power iteration method, and identical if the neuron model is assumed to be linear (<math>\,y(x)=x\mathbf{p}</math>) and the learning rate is set to 1. Additionally, the <math>\, -\eta y^2w </math> term performs the normalization, the same function as the <math>\,\mathbf{p}</math> update equation in the power iteration method.<br />
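<br />
The power iteration pseudocode above can be written as a minimal Matlab sketch for finding the first principal component. The variable name ''Xc'' (a d by n matrix of mean-centred data points) and the fixed iteration count are assumptions made here purely for illustration.<br />
<pre><br />
% minimal sketch of power iteration for the first PC (assumes Xc is d by n and mean-centred)<br />
p = randn(size(Xc,1), 1);      % start from a random vector<br />
p = p / norm(p);<br />
for c = 1:100                  % a fixed number of iterations, for simplicity<br />
    t = Xc * (Xc' * p);        % accumulate (x . p) x over all data points<br />
    p = t / norm(t);           % normalize<br />
end<br />
% p now approximates the first column of U from svd(Xc)<br />
</pre><br />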
<br />
=== Observations ===<br />
Some observations about the PCA were brought up in class:<br />
<br />
* '''PCA''' assumes that data is on a ''linear subspace'' or close to a linear subspace. For non-linear dimensionality reduction, other techniques are used. Amongst the first proposed techniques for non-linear dimensionality reduction are '''Locally Linear Embedding (LLE)''' and '''Isomap'''. More recent techniques include '''Maximum Variance Unfolding (MVU)''' and '''t-Distributed Stochastic Neighbor Embedding (t-SNE)'''. '''Kernel PCAs''' may also be used, but they depend on the type of kernel used and generally do not work well in practice. (Kernels will be covered in more detail later in the course.)<br />
<br />
* Finding the number of PCs to use is not straightforward. It requires knowledge about the ''intrinsic dimensionality'' of the data. In practice, a heuristic approach is often adopted by looking at the eigenvalues ordered from largest to smallest. If there is a "dip" in the magnitude of the eigenvalues, the "dip" is used as a cut-off point and only the large eigenvalues before the "dip" are used. Otherwise, it is possible to add up the eigenvalues from largest to smallest until a certain percentage value is reached. This percentage represents the percentage of variance that is preserved when projecting onto the PCs corresponding to the eigenvalues that have been added together to achieve the percentage. <br />
<br />
* It is a good idea to normalize the variance of the data before applying PCA. This will avoid PCA finding PCs in certain directions due to the scaling of the data, rather than the real variance of the data.<br />
<br />
* PCA can be considered as an unsupervised approach, since the main direction of variation is not known beforehand, i.e. it is not completely certain which dimension the first PC will capture. The PCs found may not correspond to the desired labels for the data set. There are, however, alternate methods for performing supervised dimensionality reduction.<br />
<br />
* (Not in class) The traditional PCA method does not work well on data sets that lie on a non-linear manifold. A revised PCA method, called c-PCA, has been introduced to improve the stability and convergence of intrinsic dimension estimation. The approach first finds a minimal cover (a cover of a set X is a collection of sets whose union contains X as a subset<ref>http://en.wikipedia.org/wiki/Cover_(topology)</ref>) of the data set. Since set covering is an NP-hard problem, the approach only finds an approximation of the minimal cover to reduce the run-time complexity. In each subset of the minimal cover, it applies PCA and filters out the noise in the data. Finally, the global intrinsic dimension can be determined from the variance results from all the subsets. The algorithm produces robust results.<ref>Mingyu Fan, Nannan Gu, Hong Qiao, Bo Zhang, Intrinsic dimension estimation of data by principal component analysis, 2010. Available: http://arxiv.org/abs/1002.2050</ref><br />
<br />
*(Not in class) While PCA finds the mathematically optimal projection (in the sense of minimizing the squared error), it is sensitive to outliers in the data, which produce the large errors that PCA tries to avoid. It is therefore common practice to remove outliers before computing PCA. However, in some contexts, outliers can be difficult to identify. For example, in data mining algorithms like correlation clustering, the assignment of points to clusters and outliers is not known beforehand. A recently proposed generalization of PCA based on a '''Weighted PCA''' increases robustness by assigning different weights to data objects based on their estimated relevancy.<ref>http://en.wikipedia.org/wiki/Principal_component_analysis</ref><br />
<br />
* (Not in class) Comparison between PCA and LDA: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two commonly used techniques for data classification and dimensionality reduction. "Linear Discriminant Analysis easily handles the case where the within-class frequencies are unequal and their performance has been examined on randomly generated test data. This method maximizes the ratio of between-class variance to the within-class variance in any particular data set thereby guaranteeing maximal separability. ... The prime difference between LDA and PCA is that PCA does more of feature classification and LDA does data classification. In PCA, the shape and location of the original data sets changes when transformed to a different space whereas LDA doesn’t change the location but only tries to provide more class separability and draw a decision region between the given classes. This method also helps to better understand the distribution of the feature data." [24] [[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf]]<br />
<br />
=== Summary ===<br />
The PCA algorithm can be summarized into the following steps (a minimal Matlab sketch is given after the list):<br />
<br />
# '''Recover basis'''<br />
#: <math>\ \text{ Calculate } XX^T=\sum_{i=1}^{t}x_ix_{i}^{T} \text{ and let } U=\text{ eigenvectors of } XX^T \text{ corresponding to the largest } d \text{ eigenvalues.} </math><br />
# '''Encode training data'''<br />
#: <math>\ \text{Let } Y=U^TX \text{, where } Y \text{ is a } d \times t \text{ matrix of encodings of the original data.} </math><br />
# '''Reconstruct training data'''<br />
#: <math> \hat{X}=UY=UU^TX </math>.<br />
# '''Encode test example'''<br />
#: <math>\ y = U^Tx \text{ where } y \text{ is a } d\text{-dimensional encoding of } x </math>.<br />
# '''Reconstruct test example'''<br />
#: <math> \hat{x}=Uy=UU^Tx </math>.<br />
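<br />
The steps above can be expressed as a minimal Matlab sketch. The variable names (''X'' for the mean-centred training data with points as columns, ''x'' for a test example, and ''k'' for the number of components kept) are assumptions made for illustration only.<br />
<pre><br />
% minimal sketch of the PCA summary steps (X is mean-centred, columns are data points)<br />
[U, D] = eigs(X*X', k);   % recover basis: top-k eigenvectors of XX^T<br />
Y    = U' * X;            % encode training data (k by t)<br />
Xhat = U * Y;             % reconstruct training data<br />
y    = U' * x;            % encode a test example x (d by 1)<br />
xhat = U * y;             % reconstruct the test example<br />
</pre><br />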
<br />
== Fisher Discriminant Analysis (FDA) (Lecture: Sep. 29, 2011) ==<br />
<br />
'''Fisher Discriminant Analysis (FDA)''' is sometimes called ''Fisher Linear Discriminant Analysis (FLDA)'' or just ''Linear Discriminant Analysis (LDA)''. This causes confusion with the [[#LDA | ''Linear Discriminant Analysis (LDA)'']] technique covered earlier in the course. The LDA technique covered earlier in the course has a normality assumption and is a boundary finding technique. The FDA technique outlined here is a supervised feature extraction technique. FDA differs from PCA as well because PCA does not use the class labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math> while FDA organizes data into their ''classes'' by finding the direction of maximum separation between classes.<br />
<br />
== Fisher Discriminant Analysis (FDA) Continued (Lecture: Oct. 04, 2011) ==<br />
<br />
One main drawback of the PCA technique is that the direction of greatest variation may not be useful for the classification we desire. For example, imagine if the [[#Example 2 | data set]] above had a lightening filter applied to a random subset of the images. Then the greatest variation would be the brightness and not the more important variations we wish to classify. FDA circumvents this problem by using the labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math>, i.e. FDA uses ''supervised learning''. An elementary way to see the algorithm is to imagine two classes of data projected onto a suitably chosen line that minimizes the within class variance and maximizes the distance between the two classes, i.e. group similar data together and spread different data apart. This way, newly acquired data can be compared, after the same transformation, to these projections using some well-chosen metric.<br />
<br />
<br />
We first consider the case of two classes. Denote the mean and covariance matrix of class <math>i=0,1</math> by <math>\mathbf{\mu}_i</math> and <math>\mathbf{\Sigma}_i</math> respectively. We transform the data so that it is projected into 1 dimension, i.e. a scalar value. To do this, we compute the inner product of our <math>d \times 1</math>-dimensional data, <math>\mathbf{x}</math>, with a to-be-determined <math>d \times 1</math>-dimensional vector <math>\mathbf{w}</math>. The new means and covariances of the transformed data are:<br />
<br />
::<math> \mu'_i:\rightarrow \mathbf{w}^{T}\mathbf{\mu}_i </math> <br/><br />
::<math> \Sigma'_i :\rightarrow \mathbf{w}^{T}\mathbf{\Sigma}_i \mathbf{w}</math><br />
<br />
The new means and variances are actually scalar values now, but we will use vector and matrix notation and arguments throughout the following derivation as the multi-class case is then just a simpler extension. <br />
<br />
===Goals of FDA===<br />
<br />
As will be shown in the objective function, the goal of FDA is to maximize the separation of the classes (between class variance) and minimize the scatter within each class (within class variance). That is, our ideal situation is that the individual classes are as far away from each other as possible and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case). An interesting note is that R. A. Fisher, after whom FDA is named, used the FDA technique for purposes of taxonomy, in particular for categorizing different species of iris flowers. <ref name="RAFisher">R. A. Fisher, "The Use of Multiple measurements in Taxonomic Problems," ''Annals of Eugenics'', 1936</ref>. It is very easy to visualize what is meant by within class variance (i.e. differences between the iris flowers of the same species) and between class variance (i.e. the differences between the iris flowers of different species) in that case.<br />
<br />
<br />
'''1)''' Our '''first''' goal is to minimize the individual classes' covariance. This will help to collapse the data together. <br />
We have two minimization problems<br />
<br />
::<math>\min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_0 \mathbf{w}^{T}</math> <br />
and <br />
::<math>\min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_1 \mathbf{w}^{T}</math>.<br />
<br />
But these can be combined:<br />
::<math> \min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_0 \mathbf{w}^{T} + \mathbf{w} \mathbf{\Sigma}_1 \mathbf{w}^{T}</math> <br />
:: <math> = \min_{\mathbf{w}} \mathbf{w} ( \mathbf{\Sigma_0} + \mathbf{\Sigma_1} ) \mathbf{w}^{T} </math><br />
<br />
Define <math> \mathbf{S}_W =\mathbf{\Sigma_0} + \mathbf{\Sigma_1} </math>, called the ''within class variance matrix''. <br />
<br />
'''2)''' Our '''second''' goal is to move the minimized classes as far away from each other as possible. One way to accomplish this is to maximize the distances between the means of the transformed data i.e.<br />
<br />
<math> \max_{\mathbf{w}} |\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1|^2 </math><br />
<br />
Simplifying:<br />
::<math> \max_{\mathbf{w}} \,(\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1)^T (\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1) </math> <br/><br />
::<math> = \max_{\mathbf{w}}\, (\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w} \mathbf{w}^{T} (\mathbf{\mu}_0-\mathbf{\mu}_1)</math> <br/><br />
::<math> = \max_{\mathbf{w}} \,\mathbf{w}^{T}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w}</math><br />
<br />
Recall that <math> \mathbf{\mu}_i </math> are known. Denote<br />
<br />
::<math> \mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math> <br />
<br />
This matrix, called the ''between class variance matrix'', is a rank 1 matrix, so an inverse does not exist. Altogether, we have two optimization problems we must solve simultaneously:<br />
<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} </math><br />
<br />
There are other metrics one can use to both minimize the data's variance and maximizes the distance between classes, and other goals we can try to accomplish (see metric learning, below...one day), but Fisher used this elegant method, hence his recognition in the name, and we will follow his method.<br />
<br />
We can combine the two optimization problems into one after noting that the negative of max is min:<br />
<br />
::<math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \alpha \mathbf{w} \mathbf{S_W} \mathbf{w}^{T}</math><br/><br />
<br />
The <math>\alpha</math> coefficient is a necessary scaling factor: if the scale of one of the terms is much larger than the other, the optimization problem will be dominated by the larger term. This means we have another unknown, <math>\alpha</math>, to solve for. Instead, we can circumvent the scaling problem by looking at the ratio of the quantities, the original solution Fisher proposed:<br />
<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w} \mathbf{S_B} \mathbf{w}^{T}}{\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}} </math><br />
<br />
This optimization problem can be shown<ref><br />
http://www.socher.org/uploads/Main/optimizationTutorial01.pdf<br />
</ref> to be equivalent to the following optimization problem:<br />
<br />
:: <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T}</math> <br />
<br />
subject to:<br />
<br />
:: <math> \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} = 1 </math><br />
<br />
A heuristic understanding of this equivalence is that we have two degrees of freedom: direction and scalar. The scalar value is irrelevant to our discussion. Thus, we can set one of the values to be a constant. We can use Lagrange multipliers to solve this optimization problem:<br />
<br />
::<math>L( \mathbf{w}, \lambda) = \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \lambda(\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}-1)</math><br />
:: <math> \Rightarrow \frac{\partial L}{\partial \mathbf{w}} = 2 \mathbf{S}_B \mathbf{w} - 2\lambda \mathbf{S}_W\mathbf{w} </math><br />
<br />
Setting the partial derivative to 0 gives us a ''generalized eigenvalue problem'':<br />
<br />
::<math> \mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w} </math><br />
:: <math> \Rightarrow \mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w} </math><br />
<br />
This is a generalized eigenvalue problem and <math>\ \mathbf{w} </math> can be computed as the eigenvector corresponding to the largest eigenvalue of <br />
:: <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math><br />
<br />
It is very likely that <math> \mathbf{S}_W </math> has an inverse. If not, the pseudo-inverse<ref><br />
http://en.wikipedia.org/wiki/Generalized_inverse<br />
</ref><ref><br />
http://www.mathworks.com/help/techdoc/ref/pinv.html<br />
</ref> can be used. In Matlab the pseudo-inverse function is named ''pinv''. Thus, we should choose <math>\mathbf{w}</math> to equal the eigenvector of the largest eigenvalue as our projection vector. <br />
<br />
In fact we can simplify the above expression further in the case of two classes. Recall the definition of <math>\mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math>. Substituting this into our expression:<br />
<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w} = \lambda \mathbf{w} </math><br />
::<math> (\mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) ) ((\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w}) = \lambda \mathbf{w} </math><br />
<br />
This second term is a scalar value, let's denote it <math>\beta</math>. Then<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) = \frac{\lambda}{\beta} \mathbf{w} </math><br />
::<math> \Rightarrow \, \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) \propto \mathbf{w} </math><br />
<br />
All we are interested in is the direction of <math>\mathbf{w}</math>, so computing this vector is sufficient to find our projection direction. This shortcut will not work in the multi-class case, however, as <math>\mathbf{w}</math> would then be a matrix rather than a vector.<br />
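<br />
A minimal Matlab sketch of the two-class case follows; the variable names ''X0'' and ''X1'' (the d by n0 and d by n1 matrices holding the class 0 and class 1 points as columns) are assumptions made for illustration only.<br />
<pre><br />
% minimal sketch of two-class FDA (X0 and X1 hold the two classes, one point per column)<br />
mu0 = mean(X0, 2);  mu1 = mean(X1, 2);<br />
Sw  = cov(X0') + cov(X1');        % within class variance matrix<br />
w   = pinv(Sw) * (mu0 - mu1);     % FDA projection direction (up to scale)<br />
w   = w / norm(w);<br />
y0  = w' * X0;  y1 = w' * X1;     % one-dimensional projections of each class<br />
</pre><br />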
<br />
=== Extensions to Multiclass Case ===<br />
If we have <math>\ k</math> classes, we need <math>\ k-1</math> directions i.e. we need to project <math>\ k</math> 'points' onto a <math>\ k-1</math> dimensional hyperplane. What does this change in our above derivation? The most significant difference is that our projection vector,<math>\mathbf{w}</math>, is no longer a vector but instead is a matrix <math>\mathbf{W}</math>. We transform the data as:<br />
<br />
::<math> \mathbf{x}' :\rightarrow \mathbf{W}^{T} \mathbf{x}</math><br />
so our new mean and covariances for class k are:<br />
::<math> \mathbf{\mu_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\mu_k}</math><br />
::<math> \mathbf{\Sigma_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W}</math><br />
<br />
What are our new optimization sub-problems? As before, we wish to minimize the within class variance. This can be formulated as:<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{\Sigma_1} \mathbf{W} + \dots + \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W} </math><br />
<br />
Again, denoting <math>\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}</math>, we can simplify above expression:<br />
<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_W \mathbf{W} </math><br />
<br />
Similarly, the second optimization problem is:<br />
<br />
::<math>\max_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_B \mathbf{W} </math><br />
<br />
What is <math>\mathbf{S}_B</math> in this case? It can be shown that <math>\mathbf{S}_T = \mathbf{S}_B + \mathbf{S}_W </math> where <math> \mathbf{S}_T </math> is the covariance matrix of all the data. From this we can compute <math> \mathbf{S}_B </math>. <br />
<br />
Next, if we express <math> \mathbf{W} = ( \mathbf{w}_1 , \mathbf{w}_2 , \dots ,\mathbf{w}_k ) </math> observe that, for <math> \mathbf{A} = \mathbf{S}_B , \mathbf{S}_W </math>: <br />
<br />
::<math> Tr(\mathbf{W}^{T} \mathbf{A} \mathbf{W}) = \mathbf{w}_1^{T} \mathbf{A} \mathbf{w}_1 + \dots + \mathbf{w}_k^{T} \mathbf{A} \mathbf{w}_k </math><br />
<br />
where <math>\ Tr()</math> is the trace of a matrix. Thus, following the same steps as in the two-class case, we have the new optimization problem:<br />
<br />
::<math> \max_{\mathbf{W}} \frac{ Tr(\mathbf{W}^{T} \mathbf{S}_B \mathbf{W}) }{Tr(\mathbf{W}^{T} \mathbf{S}_W \mathbf{W})} </math> <br />
<br />
subject to:<br />
<br />
:: <math> \mathbf{W}^{T} \mathbf{S_W} \mathbf{W} = \mathbf{I} </math><br />
<br />
Again, in order to solve the above optimization problem, we can use the Lagrange multiplier <ref><br />
http://en.wikipedia.org/wiki/Lagrange_multiplier </ref>:<br />
<br />
:: <math>\begin{align}L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - Tr\left[ \Lambda\left( \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} - I \right)\right]\end{align}</math>.<br />
<br />
where <math>\ \Lambda</math> is a d by d diagonal matrix.<br />
<br />
Then, differentiating with respect to <math>\mathbf{W}</math>:<br />
<br />
:: <math>\begin{align}\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\end{align} = 0</math>.<br />
<br />
Thus:<br />
<br />
:: <math>\begin{align}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}\end{align}</math><br />
<br />
:: <math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{W}\end{align}</math><br />
<br />
where, <math> \mathbf{\Lambda} =\begin{pmatrix}\lambda_{1} & & 0\\&\ddots&\\0 & &\lambda_{d}\end{pmatrix}</math><br />
<br />
The above equation is in the form of an eigenvalue problem. Thus, for the solution, the k-1 eigenvectors corresponding to the k-1 largest eigenvalues should be chosen as the projection matrix, <math>\mathbf{W}</math>. In fact, there should only be k-1 eigenvectors corresponding to non-zero eigenvalues in the above equation.<br />
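<br />
A minimal Matlab sketch of the multi-class case is given below. The cell array ''Xcell'' (where ''Xcell{k}'' is a d by n_k matrix holding the points of class k as columns) is an assumption made for illustration; <math>\mathbf{S}_B</math> is obtained from <math>\mathbf{S}_T = \mathbf{S}_B + \mathbf{S}_W</math> as described above.<br />
<pre><br />
% minimal sketch of multi-class FDA (Xcell{k} is d by n_k, one point per column)<br />
K    = numel(Xcell);<br />
Xall = [Xcell{:}];<br />
St   = cov(Xall');                       % total covariance S_T<br />
Sw   = zeros(size(St));<br />
for k = 1:K<br />
    Sw = Sw + cov(Xcell{k}');            % within class variance S_W<br />
end<br />
Sb = St - Sw;                            % between class variance, from S_T = S_B + S_W<br />
[V, D]     = eig(Sb, Sw);                % generalized eigenvalue problem S_B w = lambda S_W w<br />
[srt, idx] = sort(real(diag(D)), 'descend');<br />
W = real(V(:, idx(1:K-1)));              % top K-1 eigenvectors form the projection matrix<br />
</pre><br />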
<br />
=== Summary ===<br />
FDA has two optimization problems:<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} </math> <br />
<br />
where <math>\ S_W = \Sigma_0 + \Sigma_1</math> is called the within class variance and <math>\ S_B = (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T </math> is called the between class variance.<br />
<br />
The two optimization problems are combined as follows:<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w} \mathbf{S_B} \mathbf{w}^{T}}{\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}} </math><br />
<br />
By adding a constraint as shown:<br />
::<math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T}</math><br />
<br />
subject to:<br />
:: <math> \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} = 1 </math><br />
<br />
Lagrange multipliers can be used and essentially the problem becomes an eigenvalue problem:<br />
<br />
::<math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w} = \lambda\mathbf{w}\end{align}</math><br />
<br />
And <math>\ w </math> can be computed as the k-1 eigenvectors corresponding to the largest k-1 eigenvalues of <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math>.<br />
<br />
=== Variations ===<br />
<br />
Some adaptations and extensions exist for the FDA technique (Source: <ref>R. Gutierrez-Osuna, "Linear Discriminant Analysis" class notes for Intro to Pattern Analysis, Texas A&M University. Available: [http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf]</ref>):<br />
<br />
1) ''Non-Parametric LDA (NPLDA)'' by Fukunaga<br />
<br />
This method does not assume that the Gaussian distribution is unimodal and it is actually possible to extract more than k-1 features (where k is the number of classes).<br />
<br />
2) ''Orthonormal LDA (OLDA)'' by Okada and Tomita<br />
<br />
This method finds projections that are orthonormal in addition to maximizing the FDA objective function. This method can also extract more than k-1 features (where k is the number of classes).<br />
<br />
3) ''Generalized LDA (GLDA)'' by Lowe<br />
<br />
This method incorporates additional cost functions into the FDA objective function. This causes classes with a higher cost to be placed further apart in the lower dimensional representation.<br />
<br />
== Linear and Logistic Regression (Lecture: Oct. 06, 2011) ==<br />
<br />
=== Linear Regression ===<br />
<br />
In regression, <math>\ y </math> is a continuous variable. In classification, <math>\ y </math> is a discrete variable. Regression problems are easier to formulate into functions (since <math>\ y </math> is continuous) and it is possible to solve classification problems by treating them like regression problems. In order to do so, the requirement in classification that <math>\ y </math> is discrete must first be relaxed. Once <math>\ y </math> has been found using regression techniques, it is possible to determine the discrete class corresponding to the <math>\ y </math> that has been found to solve the original classification problem. The discrete class is obtained by defining a threshold where <math>\ y </math> values below the threshold belong to one class and <math>\ y </math> values above the threshold belong to another class.<br />
<br />
<br />
More formally: a more direct approach to classification is to estimate the regression function <math>\ r(\mathbf{x}) = E[Y | X]</math> without bothering to estimate <math>\ f_k(\mathbf{x}) </math>.<br />
<br />
In two-class problems, if <math>\ Y = \{0,1\}</math>, then <math>\, h^*(\mathbf{x})= \left\{\begin{matrix}<br />
1 &\text{,if } \hat r(\mathbf{x})>\frac{1}{2} \\<br />
0 &\mathrm{,otherwise.} \end{matrix}\right.</math><br />
<br />
Basically, we can use a linear function<br />
<math>\ f(x, \beta) = \mathbf{\beta\,}^T \mathbf{x_{i}} + \mathbf{\beta\,_0} </math> and use the least squares approach to fit the function to the given data. This is done by minimizing the following expression:<br />
<br />
<math>\min_{\mathbf{\beta}} \sum_{i=1}^n (y_i - \mathbf{\beta}^T<br />
\mathbf{x_{i}} - \mathbf{\beta_0})^2</math><br />
<br />
where<br />
<br />
<math>\tilde{\mathbf{\beta}} = \left( \begin{array}{c}\mathbf{\beta}_{1} \\ \vdots \\ \mathbf{\beta}_{d} \\ \mathbf{\beta}_{0} \end{array} \right)</math>.<br />
<br />
For convenience, <math>\mathbf{\beta}</math> and <math>\mathbf{\beta}_0</math> have been combined into a d+1 dimensional vector, and an extra entry equal to 1 is appended to each <math>\ x_i </math> to form <math>\ \tilde{x}_i </math>. Thus, the function to be minimized can now be expressed as:<br />
<br />
<math>\ \min_{\tilde{\beta}} \sum_{i=1}^{n} (y_i - \tilde{\beta}^T \tilde{x}_i )^2 </math><br />
<br />
<math>\ = \min_{\tilde{\beta}} \| y - X^T \tilde{\beta} \|^2 </math><br />
<br />
where <math>\ y </math> and <math>\tilde{\beta}</math> are vectors and <math>\ X </math> is a matrix.<br />
<br />
The solution for <math>\ \tilde{\beta} </math> is<br />
<br />
<math>\ {\tilde{\beta}} = (XX^T)^{-1}Xy </math><br />
<br />
Using regression to solve classification problems is not strictly correct mathematically, if we want to be true to classification. However, this method works well in practice if the problem is not complicated. When we have only two classes (encoded as <math>\ \frac{-n}{n_1} </math> and <math>\ \frac{n}{n_2} </math>), this method is identical to LDA.<br />
<br />
==== Matlab Example ====<br />
<br />
The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by adding a row of ones to the data (with data points as columns).<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the data points, coloured according to whether the fitted value is above or below the threshold of 0.5.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
===Logistic Regression===<br />
<br />
Logistic regression is a more advanced method for classification, and is<br />
more commonly used. <br />
<br />
We can define a function <br /><br />
<math>f_1(x)= P(Y=1| X=x) = (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math><br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
<br />
<br />
This is a valid conditional probability; it always lies between 0 and 1. It looks similar to a step function, but<br />
we have relaxed it so that we have a smooth curve, and can therefore take the<br />
derivative.<br />
<br />
The range of this function is (0,1) since<br /> <br />
<math>\lim_{x \to -\infty}f_1(\mathbf{x}) = 0</math> and<br />
<math>\lim_{x \to \infty}f_1(\mathbf{x}) = 1</math>.<br />
<br />
As shown on this graph:<br /><br />
http://www.wolframalpha.com/input/?i=Plot[E^x/%281+%2B+E^x%29,+{x,+-10,+10}]%29<br />
<br />
Then we compute the complement of f1(x), and get<br /><br />
<br />
<math>f_2(x)= P(Y=0| X=x) = 1-f_1(x) = (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math>, denoted f2. <br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
<br />
<br />
The function <math>f_2</math> is the complement of the logistic function <math>f_1</math>, and it behaves like <br /><br />
<math>\lim_{x \to -\infty}f_2(\mathbf{x}) = 1</math> and<br /><br />
<math>\lim_{x \to \infty}f_2(\mathbf{x}) = 0</math>.<br />
<br />
As shown on this graph:<br /><br />
http://www.wolframalpha.com/input/?i=Plot[1/%281+%2B+E^x%29,+{x,+-10,+10}]%29<br />
<br />
From here, we can form the conditional density function. To do this, we must combine<br /><br />
<math>f_1</math> and <math>f_2</math> <br /><br />
such that <br /><br />
the density equals <math>f_1</math> if <math>\ y=1</math> (which means it’s in class 1), <br /><br />
and equals <math>f_2</math> if <math>\ y=0</math> (which means it’s in class 0).<br />
<br />
Eventually, we have our conditional density function formula<br /><br />
<math>f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}</math><br />
<br />
The way to use this formula is, given the training data <math>\ (x_i, y_i)</math>, to fit <math>\ f(y|\mathbf{x})</math> to the data.<br />
<br />
In general, we can think of the problem as having a box with some knobs. Inside the box is our objective function which gives the form to classify our input (x<sub>i</sub>) to<br />
our output (y<sub>i</sub>). The knobs in the box function like the parameters of the objective function. Our job is to find the proper parameters that minimize the error between our output and the true value. So we have turned our machine learning problem into an optimization problem. <br />
<br />
Since we need to find the parameters that maximize the chance of having our observed data coming from the distribution of f(x|parameter), we need to introduce Maximum Likelihood Estimation.<br />
<br />
====Maximum Likelihood Estimation====<br />
<br />
Given iid data points <math>({\mathbf{x}_i})_{i=1}^n</math> and a density function <math>f(\mathbf{x}|\mathbf{\theta})</math>, where the form of f is known but the parameters <math>\theta</math> are unknown, the maximum likelihood estimate <math>\theta\,_{ML}</math> is the set of parameters that maximizes the probability of observing <math>({\mathbf{x}_i})_{i=1}^n</math> given <math>\theta\,_{ML}</math>.<br />
<br />
<math>\theta_\mathrm{ML} = \underset{\theta}{\operatorname{arg\,max}}\ f(\mathbf{x}|\theta)</math>.<br />
<br />
There was some discussion in class regarding the notation. In literature, Bayesians use <math>f(\mathbf{x}|\mu)</math> while Frequentists use <math>f(\mathbf{x};\mu)</math>. In practice, these two are equivalent.<br />
<br />
Our goal is to find <math>\theta</math> to maximize <br />
<math>\mathcal{L}(\theta\,) = f\left(({\mathbf{x}_i})_{i=1}^n \mid \theta \right) = \prod_{i=1}^n f(\mathbf{x_i}|\theta)</math>. (The second equality holds because data points are iid.)<br />
<br />
In many cases, it’s more convenient to work with the natural logarithm of the likelihood. (Recall that the logarithm is monotone increasing, so it preserves the locations of minima and maxima.)<br />
<math>\ell(\theta|x\mathbf)=\ln\mathcal{L}(\theta\,)</math> <br />
<br />
<math>\ell(\theta\,)=\sum_{i=1}^n \ln f(\mathbf{x_i}|\theta)</math><br />
<br />
Applying Maximum Likelihood Estimation to <math>f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}</math>, gives<br />
<br />
<math>\mathcal{L}(\mathbf{\beta\,})=\prod_{i=1}^n (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y_i} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y_i}</math><br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) + (1-y_i) (\ln{1} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}))\right) \\[10pt]&{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - (1-y_i) \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) + y_i \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
<math>\begin{align} {\frac{\partial \ell}{\partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}} \mathbf{x_i} \right) \\[8pt] & {}= \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(\mathbf{x_i} | \mathbf{\beta\,}) \mathbf{x_i}\right) \end{align}</math><br />
<br />
The equation <math>\frac{\partial \ell}{\partial \mathbf{\beta\,}} = 0</math> can be solved numerically by Newton’s Method.<br />
<br />
====Newton's Method====<br />
<br />
Newton's Method (or Newton-Raphson method) is a numerical method for finding successively better approximations to the roots of a real-valued function. The solution usually cannot be obtained in analytical form. <br />
<br />
The goal is to find <math>\mathbf{x}</math> such that <math><br />
f(\mathbf{x})<br />
= 0 </math>. The recursion can be implemented by<br />
<math>\mathbf{x_1} = \mathbf{x_0} - \frac{f(\mathbf{x_0})}{f'(\mathbf{x_0})}.\,\!<br />
</math>.<br />
<br />
It takes an initial guess <math>\mathbf{x_0}</math> and moves by the step "<math>-\mathbf{f(x_{0}) / f' (x_{0})}</math>" toward a better approximation <math>\mathbf{x_1}</math>. Taking this <math>\mathbf{x_1}</math> as <math>\mathbf{x_0}</math> in the second run, it finds a newer and better <math>\mathbf{x_1}</math> than the previous one. Repeating the same process, the iterates become sufficiently accurate approximations of the actual solution (provided the initial guess is close enough).<br />
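<br />
A minimal Matlab sketch of the iteration, using a made-up example function <math>\ f(x) = x^2 - 2</math> purely for illustration:<br />
<pre><br />
% minimal sketch of Newton's method for the made-up problem f(x) = x^2 - 2 = 0<br />
f  = @(x) x.^2 - 2;<br />
fp = @(x) 2*x;            % derivative of f<br />
x  = 1;                   % initial guess x_0<br />
for k = 1:10<br />
    x = x - f(x)/fp(x);   % Newton update<br />
end<br />
x                         % approximately sqrt(2)<br />
</pre><br />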
<br />
<br />
<br />
===Advantages of Logistic Regression===<br />
<br />
Logistic regression has several advantages over discriminant analysis: <br />
<br />
* it is more robust: the independent variables don't have to be normally distributed, or have equal variance in each group <br />
* It does not assume a linear relationship between the independent variables (IVs) and the dependent variable (DV) <br />
* It may handle nonlinear effects <br />
* You can add explicit interaction and power terms <br />
* The DV need not be normally distributed. <br />
* There is no homogeneity of variance assumption. <br />
* Normally distributed error terms are not assumed. <br />
* It does not require that the independents be interval. <br />
* It does not require that the independents be unbounded.<br />
<br />
==Newton-Raphson Method (Lecture: Oct 11, 2011)==<br />
Previously we derived the log likelihood function for the logistic regression model. <br />
<br />
<math>\begin{align} L(\beta\,) = \prod_{i=1}^n \left( (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y_i}(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y_i} \right) \end{align}</math><br />
<br />
After taking log, we can have<br />
<br />
<math>\begin{align} \ell(\beta\,) = \sum_{i=1}^n \left( y_i \log{\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}} + (1 - y_i) \log{\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}} \right) \end{align}</math><br />
<br />
which implies that<br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
Our goal is to find the <math>\beta\,</math> that maximizes <math>{\ell(\mathbf{\beta\,})}</math>. We use calculus to do this, i.e. we solve <math>{\frac{\partial \ell}{\partial \mathbf{\beta\,}}}=0</math>. To do this we use the famous numerical method of Newton-Raphson. This is an iterative method where we calculate the first and second derivative at each iteration.<br />
<br />
The first derivative is typically called the score vector.<br />
<br />
<math>\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}} \mathbf{x_i} \right) \\[8pt] \end{align}</math><br />
<br />
The negative of the second derivative is typically called the information matrix.<br />
<br />
<math>\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}) \right) \\[8pt] \end{align}</math><br />
<br />
We then use the following update formula to calculate successively better estimates of the optimal <math>\beta\,</math>. It is typically not important what you use as your initial estimate <math>\beta\,^{(1)}</math>.<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + I^{-1}(\beta\,^{(r)} )S(\beta\,^{(r)} )</math><br />
<br />
====Matrix Notation====<br />
<br />
let <math>\mathbf{y}</math> be a (n x 1) vector of all class labels. This is called the response in other contexts.<br />
<br />
let <math>\mathbb{X}</math> be a (n x (d+1)) matrix of all your features. Each row represents a data point. Each column represents a feature/covariate.<br />
<br />
let <math>\mathbf{p}^{(r)}</math> be a (n x 1) vector with values <math> P(\mathbf{x_i} |\beta\,^{(r)} ) </math><br />
<br />
let <math>\mathbb{W}^{(r)}</math> be a (n x n) diagonal matrix with <math>\mathbb{W}_{ii}^{(r)} {}= P(\mathbf{x_i} |\beta\,^{(r)} )(1 - P(\mathbf{x_i} |\beta\,^{(r)} ))</math><br />
<br />
We can rewrite our score vector, information matrix & update equation in terms of this new matrix notation, so the first derivative is<br />
<br />
<math>\begin{align} S(\beta\,^{(r)}) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})\end{align}</math><br />
<br />
and the information matrix (the negative of the second derivative) is<br />
<br />
<math>\begin{align} I(\beta\,^{(r)}) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X} \end{align}</math><br />
<br />
Therefore, we can fit the regression problem as follows<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + I^{-1}(\beta\,^{(r)} )S(\beta\,^{(r)} ) {}= \beta\,^{(r)} + (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})</math><br />
<br />
====Iteratively Re-weighted Least Squares====<br />
If we reorganize this updating formula, we can see that it is really iteratively solving a least squares problem, each time with a new weighting.<br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}(\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X}\beta\,^{(r)} + \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)}))</math><br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T\mathbb{W}^{(r)}\mathbf{z}^{(r)}</math><br />
<br />
where <math> \mathbf{z}^{(r)} = \mathbb{X}\beta\,^{(r)} + (\mathbb{W}^{(r)})^{-1}(\mathbf{y}-\mathbf{p}^{(r)}) </math><br />
<br />
<br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T \underline{\beta})^T(\underline{y}-X^T \underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{(r+1)} \leftarrow \arg \min_{\underline{\beta}}(Z-X \underline{\beta})^T W (Z-X \underline{\beta})</math><br />
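<br />
A minimal Matlab sketch of the Newton-Raphson / iteratively re-weighted least squares update is given below. The variable names (''Xmat'' for the n by (d+1) design matrix <math>\mathbb{X}</math> with a trailing column of ones, ''y'' for the 0/1 labels) and the fixed number of iterations are assumptions made for illustration only.<br />
<pre><br />
% minimal sketch of Newton-Raphson / IRLS for logistic regression<br />
% (Xmat is n by (d+1), rows are data points with a trailing 1; y is n by 1 with 0/1 labels)<br />
beta = zeros(size(Xmat,2), 1);                       % initial estimate beta^(1)<br />
for r = 1:20<br />
    p = 1 ./ (1 + exp(-Xmat*beta));                  % fitted probabilities p^(r)<br />
    W = diag(p .* (1 - p));                          % weight matrix W^(r)<br />
    beta = beta + (Xmat'*W*Xmat) \ (Xmat'*(y - p));  % update step<br />
end<br />
</pre><br />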
<br />
====Fisher Scoring Method==== <br />
<br />
Fisher Scoring is a method very similar to Newton-Raphson. It uses the expected information matrix as opposed to the observed information matrix. This distinction simplifies the problem and in particular the computational complexity. To learn more about this method & logistic regression in general you can take Stat431/831 at the University of Waterloo.<br />
<br />
===Multi-class Logistic Regression===<br />
<br />
In a multi-class logistic regression we have K classes. For 2 classes ''k'' and ''l''<br />
<br />
<math>\frac{P(Y=l|X=x)}{P(Y=k|X=x)} = e^{\beta_l^T x}</math><br />
<br />
We call <math>\log(\frac{P(Y=l|X=x)}{P(Y=k|X=x)}) = \beta_l^T x</math> the logit transformation. The decision boundary between the 2 classes is the set of points where the logit transformation is 0.<br />
<br />
For each class from 1 to K-1 we then have:<br />
<br />
<math>\log(\frac{P(Y=1|X=x)}{P(Y=K|X=x)}) = \beta_1^T x</math><br />
<br />
<math>\log(\frac{P(Y=2|X=x)}{P(Y=K|X=x)}) = \beta_2^T x</math><br />
<br />
<math>\log(\frac{P(Y=K-1|X=x)}{P(Y=K|X=x)}) = \beta_{K-1}^T x</math><br />
<br />
Note that choosing ''Y=K'' is arbitrary and any other choice is equally valid.<br />
<br />
Based on the above, the posterior probabilities are given by <math>P(Y=k|X=x) = \frac{e^{\beta_k^T x}}{1 + \sum_{i=1}^{K-1}{e^{\beta_i^T x}}}</math> for <math>\ k = 1, \dots, K-1</math>, and <math>P(Y=K|X=x) = \frac{1}{1 + \sum_{i=1}^{K-1}{e^{\beta_i^T x}}}</math> for the reference class.<br />
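<br />
As a minimal Matlab sketch of evaluating these posteriors (''B'' is assumed, for illustration only, to be a d by K-1 matrix whose columns are <math>\beta_1, \dots, \beta_{K-1}</math>, and ''x'' a d by 1 data point):<br />
<pre><br />
% minimal sketch of the multi-class logistic posteriors (B is d by K-1, x is d by 1)<br />
s = exp(B' * x);                      % e^{beta_k^T x} for k = 1, ..., K-1<br />
posterior = [s; 1] / (1 + sum(s));    % P(Y=k|X=x) for k = 1, ..., K-1, K<br />
</pre><br />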
<br />
===Sample Size Requirements===<br />
<br />
The number of adjustable parameters in linear discriminant analysis is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math>, where d is the dimension of the data (the number of features). Similarly, the number of adjustable parameters in logistic regression is <math>\, d+1</math>. The number of parameters also corresponds to the minimum number of observations needed to compute the coefficients of each function. Techniques do exist, though, for handling high dimensional problems where the number of parameters exceeds the number of observations.<br />
<br />
Linear discriminant analysis involves the inversion of a d x d covariance matrix. When d is bigger than n, the number of observations, this matrix has rank n < d and is thus singular. When this is the case, we can either use the pseudo-inverse or perform regularized discriminant analysis (RDA), which solves this problem. In RDA, we define a new covariance matrix <math>\, \Sigma(\gamma) = \gamma\Sigma + (1 - \gamma)diag(\Sigma)</math> with <math>\gamma \in [0,1]</math>. Cross validation can be used to calculate the best <math>\, \gamma</math>.<br />
<br />
===Comparison Between Logistic Regression And Linear Discriminant Analysis (LDA)===<br />
<br />
Logistic regression and linear discriminant analysis are widely used to analyze data with categorical outcome variables. Both models build linear boundaries to classify different groups, and the categorical outcome variables (i.e. the dependent variables) must be mutually exclusive. <br />
<br />
However, these two models differ in their basic idea. While logistic regression is more relaxed and flexible in its assumptions, linear discriminant analysis requires that its explanatory variables be normally distributed, linearly related and have equal covariance matrices within each class. Therefore, linear discriminant analysis can be expected to be more appropriate if the normality and equal covariance assumptions are fulfilled for the explanatory variables, while in all other situations logistic regression should be more appropriate. In addition, the total number of estimates to compute differs between the models. If the explanatory variables have d dimensions, we need to estimate <math>d+1</math> parameters in logistic regression, a number that grows linearly with the dimension, while we need to estimate <math>2d+\frac{d(d+1)}{2}+2</math> parameters in linear discriminant analysis, a number that grows quadratically with the dimension. <br />
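<br />
For a concrete (purely illustrative) comparison: with <math>\ d = 10</math> features, logistic regression estimates <math>\ d+1 = 11</math> parameters, while linear discriminant analysis estimates <math>\ 2(10)+\frac{10(11)}{2}+2 = 77</math> parameters.<br />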
<br />
== Perceptron (Lecture: Oct. 11, 2011) ==<br />
<br />
[[Image:Perceptron1.png|right|thumb|300px|Simple perceptron]]<br />
[[Image:Perceptron2.png|right|thumb|300px|Simple perceptron where <math>\beta_0</math> is defined as 1]]<br />
<br />
<br />
The perceptron is the building block for neural networks. It was invented by Frank Rosenblatt in 1957 at the Cornell Aeronautical Laboratory, and first described in the report "The Perceptron - a perceiving and recognizing automaton". The perceptron is used on linearly separable data sets.<br />
<br />
For a 2 class problem, and a set of inputs with ''d'' features, a perceptron will use a weighted sum and it will classify the information using the sign of the result. The figures on the right give an example of a perceptron. In these examples, <math>x^i</math> is the ''i''-th feature of a sample and <math>\beta_i</math> is the ''i''-th weight. <math>\beta_0</math> is defined as the bias. The bias alters the position of the decision boundary between the 2 classes.<br />
<br />
Perceptrons are generally trained using [http://en.wikipedia.org/wiki/Gradient_descent gradient descent]. This type of learning can have 2 side effects:<br />
* If the data sets are well separated, the training of the perceptron can lead to multiple valid solutions,<br />
* If the data sets are not linearly separable, the learning algorithm will never finish.<br />
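<br />
A minimal Matlab sketch of a perceptron training loop is given below. The variable names (''Xmat'' for an n by d matrix of data points, ''y'' for the +1/-1 labels), the learning rate and the fixed number of passes are assumptions made for illustration only; as noted above, the loop only stops misclassifying points when the data are linearly separable.<br />
<pre><br />
% minimal sketch of perceptron training (Xmat is n by d, y is n by 1 with +1/-1 labels)<br />
n    = size(Xmat,1);<br />
Xa   = [Xmat, ones(n,1)];                        % append a constant 1 for the bias term<br />
beta = zeros(size(Xa,2), 1);                     % weights, including the bias beta_0<br />
eta  = 0.1;                                      % learning rate<br />
for epoch = 1:100<br />
    for i = 1:n<br />
        if sign(Xa(i,:)*beta) ~= y(i)            % misclassified point<br />
            beta = beta + eta * y(i) * Xa(i,:)'; % move the decision boundary toward it<br />
        end<br />
    end<br />
end<br />
</pre><br />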
<br />
Perceptrons are the simplest kind of a feedforward neural network. A perceptron is the building block for other neural networks such as:<br />
* Multi-layer perceptron<br />
* ADALINE<br />
* MADALINE<br />
<br />
==References==<br />
<references /><br />
<br />
24. Balakrishnama, S., Ganapathiraju, A. LINEAR DISCRIMINANT ANALYSIS - A BRIEF TUTORIAL. http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf [[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf]]</div>S9huhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f11&diff=12530stat841f112011-10-13T07:50:26Z<p>S9hu: /* Sample Size Requirements */</p>
<hr />
<div>==[[f11Stat841proposal| Proposal for Final Project]]==<br />
<br />
==[[f11Stat841EditorSignUp| Editor Sign Up]]==<br />
<br />
= STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20 =<br />
== Wiki Course Notes ==<br />
Students will need to contribute to the wiki for 20% of their grade.<br />
Access via wikicoursenote.com<br />
Go to editor sign-up, and use your UW userid for your account name, and use your UW email.<br />
<br />
primary (10%)<br />
Post a draft of lecture notes within 48 hours. <br />
You will need to do this 1 or 2 times, depending on class size.<br />
<br />
secondary (10%)<br />
Make improvements to the notes for at least 60% of the lectures.<br />
More than half of your contributions should be technical rather than editorial.<br />
There will be a spreadsheet where students can indicate what they've done and when.<br />
The instructor will conduct random spot checks to ensure that students have contributed what they claim.<br />
<br />
<br />
== Classification (Lecture: Sep. 20, 2011) ==<br />
=== Definitions ===<br />
'''classification''': Predict a discrete random variable <math>Y</math> (a label) by using another random variable <math>X</math><br />
(new data point) picked iid from a distribution<br />
<br />
<math>X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d</math> (<math>d</math>-dimensional vector)<br />
<math>Y_i</math> in some finite set <math>\mathcal{Y}</math><br />
<br />
<br />
'''classification rule''':<br />
<math>h : \mathcal{X} \rightarrow \mathcal{Y}</math><br />
Take a new observation <math>X</math> and use a classification function <math>h(x)</math> to generate a label <math>Y</math>. In other words, if we apply the function <math>h(x)</math> to an observation <math>X</math>, it generates the label <math>Y</math>, which is the class to which we predict <math>X</math> belongs.<br />
<br />
Example: Let <math> \mathcal{X}</math> be a set of 2D images and <math>\mathcal{Y}</math> be a finite set of people. We want to learn a classification rule <math>h:\mathcal{X}\rightarrow\mathcal{Y}</math> that with small ''true'' error predicts the person who appears in the image. <br />
<br />
<br />
'''true error rate''' for classifier <math>h</math> is the error with respect to the underlying distribution (that we do not know).<br />
<br />
<math>L(h) = P(h(X) \neq Y )</math><br />
<br />
<br />
'''empirical error rate''' (or training error rate) is the amount of error that our classification function <math>h(x)</math> makes on the training data.<br />
<br />
<math>\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i)</math><br />
<br />
where <math>\mathbf{I}()</math> is an indicator function. Indicator function is defined by <br />
<br />
<math>\mathbf{I}(x) = \begin{cases} <br />
1 & \text{if } x \text{ is true} \\<br />
0 & \text{if } x \text{ is false}<br />
\end{cases}</math><br />
<br />
So in this case,<br />
<math>\mathbf{I}(h(X_i)\neq Y_i) = \begin{cases}<br />
1 & \text{if } h(X_i)\neq Y_i \text{ (i.e. when misclassification happens)} \\<br />
0 & \text{if } h(X_i)=Y_i \text{ (i.e. classified properly)}<br />
\end{cases}</math><br />
<br />
e.g., 100 new data points with known (true) labels<br />
<br />
<math>y_1 = h(x_1)</math><br />
<br />
...<br />
<br />
<math>y_{100} = h(x_{100})</math><br />
<br />
To calculate the empirical error we count how many labels our function <math>h(x)</math> assigned incorrectly and divide by n=100<br />
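<br />
A minimal Matlab sketch of this calculation (the names ''y_true'' for the known labels and ''y_pred'' for the labels assigned by <math>h(x)</math> are assumptions for illustration):<br />
<pre><br />
% minimal sketch of the empirical error rate (y_true and y_pred are n by 1 label vectors)<br />
n = length(y_true);<br />
empirical_error = sum(y_pred ~= y_true) / n;<br />
</pre><br />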
<br />
=== Bayes Classifier ===<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability<ref> http://www.wikicoursenote.com/wiki/Stat841#Bayes_Classifier </ref>.<br />
<br />
First recall Bayes' Rule, in the format<br />
<math>P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} </math> <br />
<br />
P(Y|X) : ''posterior'' , ''probability of <math>Y</math> given <math>X</math>''<br />
<br />
P(X|Y) : ''likelihood'', ''probability of <math>X</math> being generated by <math>Y</math>''<br />
<br />
P(Y) : ''prior'', ''probability of <math>Y</math> being selected''<br />
<br />
P(X) : ''marginal'', ''probability of obtaining <math>X</math>''<br />
<br />
<br />
We will start with the simplest case: <math>\mathcal{Y} = \{0,1\}</math><br />
<br />
<math> r(x) <br />
= P(Y=1|X=x) <br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)}<br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
Bayes' rule can be approached by computing either:<br />
<br />
1) '''The posterior''': <math>\ P(Y=1|X=x) </math> and <math>\ P(Y=0|X=x) </math> or <br />
<br />
2) '''The likelihood''': <math>\ P(X=x|Y=1) </math> and <math>\ P(X=x|Y=0) </math><br />
<br />
<br />
The former reflects a '''Bayesian''' approach. The Bayesian approach uses previous beliefs and observed data (e.g., the random variable <math>\ X </math>) to determine the probability distribution of the parameter of interest (e.g., the random variable <math>\ Y </math>). The probability, according to Bayesians, is a ''degree of belief'' in the parameter of interest taking on a particular value (e.g., <math>\ Y=1 </math>), given a particular observation (e.g., <math>\ X=x </math>). Historically, the difficulty in this approach lies with determining the posterior distribution, however, more recent methods such as '''Markov Chain Monte Carlo (MCMC)''' allow the Bayesian approach to be implemented <ref name="PCAustin">P. C. Austin, C. D. Naylor, and J. V. Tu, "A comparison of a Bayesian vs. a frequentist method for profiling hospital performance," ''Journal of Evaluation in Clinical Practice'', 2001</ref>.<br />
<br />
The latter reflects a '''Frequentist''' approach. The Frequentist approach assumes that the probability distribution, including the mean, variance, etc., is fixed for the parameter of interest (e.g., the variable <math>\ Y </math>, which is ''not'' random). The observed data (e.g., the random variable <math>\ X </math>) is simply a ''sampling'' of a far larger population of possible observations. Thus, a certain repeatability or ''frequency'' is expected in the observed data. If it were possible to make an infinite number of observations, then the true probability distribution of the parameter of interest can be found. In general, frequentists use a technique called '''hypothesis testing''' to compare a ''null hypothesis'' (e.g. an assumption that the mean of the probability distribution is <math>\ \mu_0 </math>) to an alternative hypothesis (e.g. assuming that the mean of the probability distribution is larger than <math>\ \mu_0 </math>) <ref name="PCAustin"/>. For more information on hypothesis testing see <ref>R. Levy, "Frequency hypothesis testing, and contingency tables" class notes for LING251, Department of Linguistics, University of California, 2007. Available: [http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf] </ref>. <br />
<br />
There was some class discussion on which approach should be used. Both the ease of computation and the validity of both approaches were discussed. A main point that was brought up in class is that Frequentists consider X to be a random variable, but they do not consider Y to be a random variable because it has to take on one of the values from a fixed set (in the above case it would be either 0 or 1 and there is only one ''correct'' label for a given value X=x). Thus, from a Frequentist's perspective it does not make sense to talk about the probability of Y. This is actually a grey area and sometimes ''Bayesians'' and ''Frequentists'' use each others' approaches. So using ''Bayes' rule'' doesn't necessarily mean you're a ''Bayesian''. Overall, the question remains unresolved.<br />
<br />
<br />
The '''Bayes Classifier''' uses <math>\ P(Y=1|X=x)</math><br />
<br />
<math> P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
P(Y=1) : the prior, based on belief/evidence beforehand<br />
<br />
denominator : marginalized by summation<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The set <math>\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x) \} </math><br />
<br />
which defines a ''decision boundary''.<br />
<br />
<math>h^*(x) = <br />
\begin{cases}<br />
1 \ \ if \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ \ \ \ \ otherwise<br />
\end{cases}<br />
</math><br />
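<br />
As a toy illustration (not from the lecture), the rule above can be evaluated directly when the class conditionals and priors are known. The sketch below assumes one-dimensional Gaussian class conditionals with made-up parameters:<br />
<br />
<pre><br />
% toy example: assumed 1-D Gaussian class conditionals and priors (made-up parameters)<br />
mu0 = 0; mu1 = 2; sigma = 1;<br />
pi0 = 0.5; pi1 = 0.5;<br />
f0 = @(x) exp(-(x-mu0).^2/(2*sigma^2)) / (sqrt(2*pi)*sigma);   % f_0(x) = P(X=x|Y=0)<br />
f1 = @(x) exp(-(x-mu1).^2/(2*sigma^2)) / (sqrt(2*pi)*sigma);   % f_1(x) = P(X=x|Y=1)<br />
r  = @(x) f1(x).*pi1 ./ (f1(x).*pi1 + f0(x).*pi0);             % posterior P(Y=1|X=x)<br />
h  = @(x) r(x) > 1/2;                                          % Bayes classifier h*(x)<br />
h([-1 0.5 1 1.5 3])                                            % classify a few test points<br />
</pre><br />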
<br />
''Theorem'': Bayes rule is optimal. I.e., if h is any other classification rule, <br />
then <math>L(h^*) \leq L(h)</math><br />
(This is to be proved in homework.)<br />
<br />
Why then do we need other classification methods? Because the class conditional densities are typically unknown, i.e., <math>f_k(x)</math> and/or <math>\pi_k</math> are unknown.<br />
<br />
<math>P(Y=k|X=x) = \frac{P(X=x|Y=k)P(Y=k)} {P(X=x)} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}</math><br />
<math>\ f_k(x) </math> is referred to as the class conditional distribution (i.e., the likelihood).<br />
<br />
Therefore, we rely on some data to estimate quantities.<br />
<br />
=== Three Main Approaches ===<br />
<br />
'''1. Empirical Risk Minimization''':<br />
Choose a set of classifiers H (e.g., line, neural network) and find <math>h^* \in H</math><br />
that minimizes (some estimate of) L(h).<br />
<br />
'''2. Regression''':<br />
Find an estimate (<math>\hat{r}</math>) of function <math>r</math> and define<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The <math> 1/2 </math> in the expression above is a threshold set for the regression prediction output. <br />
<br />
In general ''regression'' refers to finding a continuous, real valued y. The problem here is more difficult because the output is restricted to a discrete set of label values.<br />
<br />
'''3. Density Estimation''':<br />
Estimate <math>P(X=x|Y=0)</math> from <math>X_i</math>'s for which <math>Y_i = 0</math><br />
Estimate <math>P(X=x|Y=1)</math> from <math>X_i</math>'s for which <math>Y_i = 1</math><br />
and let <math>\hat{P}(Y=1) = (1/n) \sum_{i=1}^{n} Y_i</math> (since each <math>Y_i \in \{0,1\}</math>, this is the fraction of points labelled 1)<br />
<br />
Define <math>\hat{r}(x) = \hat{P}(Y=1|X=x)</math> and<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
It is possible that there may not be enough data to estimate from for ''density estimation''. But the main problem lies with high dimensional spaces, as the estimation results may not be good (high error rate) and sometimes even infeasible. The term ''curse of dimensionality'' was coined by Bellman <ref>R. E. Bellman, ''Dynamic Programming''. Princeton University Press,<br />
1957</ref> to describe this problem.<br />
<br />
As the dimension of the space goes up, the learning requirements go up exponentially.<br />
<br />
To learn more about methods for handling high-dimensional data, see <ref> https://docs.google.com/viewer?url=http%3A%2F%2Fwww.bios.unc.edu%2F~dzeng%2FBIOS740%2Flecture_notes.pdf</ref><br />
<br />
=== Multi-Class Classification ===<br />
We now generalize to the case where <math>Y</math> takes on <math>k>2</math> values.<br />
<br />
<br />
''Theorem'': For <math>Y \in \mathcal{Y} = \{1,2,..., k\} </math>, the optimal rule is<br />
<br />
<math>\ h^{*}(x) = argmax_k P(Y=k|X=x) </math> <br />
<br />
where <math>P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum_r f_r(x) \pi_r}</math><br />
<br />
===Examples of Classification===<br />
<br />
* Face detection in images.<br />
* Medical diagnosis.<br />
* Detecting credit card fraud (fraudulent or legitimate).<br />
* Speech recognition.<br />
* Handwriting recognition.<br />
<br />
== LDA and QDA ==<br />
<br />
'''Discriminant function analysis''' finds features that best allow discrimination between two or more classes. The approach is similar to '''analysis of Variance (ANOVA)''' in that discriminant function analysis looks at the mean values to determine if two or more classes are very different and should be separated. Once the discriminant functions (that separate two or more classes) have been determined, new data points can be classified (i.e. placed in one of the classes) based on the discriminant functions <ref> StatSoft, Inc. (2011). ''Electronic Statistics Textbook.'' [Online]. Available: [http://www.statsoft.com/textbook/discriminant-function-analysis/ http://www.statsoft.com/textbook/discriminant-function-analysis/.] </ref>. '''Linear discriminant analysis (LDA)''' and '''Quadratic discriminant analysis (QDA)''' are methods of discriminant analysis that are best applied to linearly and quadratically separable classes, respectively. '''Fisher discriminant analysis (FDA)''' is another method of discriminant analysis that is different from linear discriminant analysis, but oftentimes both terms are used interchangeably.<br />
<br />
=== LDA ===<br />
<br />
The simplest method is to use approach 3 (above) and assume a parametric model for the densities: assume each class conditional density is Gaussian.<br />
<br />
<math>\mathcal{Y} = \{ 0,1 \}</math> assumed (i.e., 2 labels)<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
<math>P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k(x) \pi_k} \ \ </math> (the denominator is <math>\ P(X=x)</math>)<br />
<br />
1) Assume Gaussian distributions<br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{\mu_k})^T \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) \right)</math><br />
<br />
We must compare<br />
<math>\frac{f_1(x) \pi_1} {p(x)}</math> with <math>\frac{f_0(x) \pi_0} {p(x)}</math>.<br />
Note that the denominator <math>\ p(x)</math> is common to both and can be ignored, so we compare<br />
<math>f_1(x) \pi_1</math> with <math>f_0(x) \pi_0 </math>.<br />
<br />
To find the decision boundary, set <br />
<math>f_1(x) \pi_1 = f_0(x) \pi_0 </math><br />
<br />
2) Assume <math>\Sigma_1 = \Sigma_0</math>, so we can write <math>\Sigma = \Sigma_0 = \Sigma_1</math>.<br />
<br />
Cancel <math>(2\pi)^{-d/2} |\Sigma_k|^{-1/2}</math> from both sides.<br />
<br />
Take log of both sides.<br />
<br />
Subtract one side from both sides, leaving zero on one side.<br />
<br />
<br />
<math>-(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)] = 0 </math><br />
<br />
<br />
<math>(1/2)[-\mathbf{x}^T \Sigma^{-1}\mathbf{x} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x}<br />
+ \mathbf{x}^T \Sigma^{-1}\mathbf{x} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ]<br />
+ log(\pi_1/\pi_0) = 0 </math><br />
<br />
<br />
Cancelling out the terms quadratic in <math>\mathbf{x}</math> and rearranging results in <br />
<br />
<math>(1/2)[-\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0}<br />
+ (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_0}^T \Sigma^{-1}) \mathbf{x}]<br />
+ log(\pi_1/\pi_0) = 0 </math><br />
<br />
<br />
We can see that the first pair of terms is constant and the second is linear in <math>\mathbf{x}</math>.<br />
Therefore, the decision boundary has the form <br />
<math>\mathbf{a}^T\mathbf{x} + b = 0</math>, i.e. it is a hyperplane.<br />
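<br />
To make this concrete, here is a hedged Matlab sketch (not from the lecture) that computes the boundary coefficients <math>\mathbf{a}</math> and <math>b</math>, assuming the means mu0, mu1, the shared covariance Sigma, and the priors pi0, pi1 have already been estimated:<br />
<br />
<pre><br />
% assumed inputs: mu0, mu1 (d-by-1), Sigma (d-by-d), priors pi0, pi1<br />
Sinv = inv(Sigma);<br />
a = Sinv * (mu1 - mu0);                                      % linear coefficient vector<br />
b = -0.5*(mu1'*Sinv*mu1 - mu0'*Sinv*mu0) + log(pi1/pi0);     % constant term<br />
% classify a new point x (d-by-1): label 1 if a'*x + b > 0, and 0 otherwise<br />
h = @(x) double(a'*x + b > 0);<br />
</pre><br />
<br />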
For more about LDA, see <ref>http://sites.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf</ref><br />
<br />
== LDA and QDA Continued (Lecture: Sep. 22, 2011) == <br />
<br />
If we relax assumption 2 (i.e. <math>\Sigma_1 \neq \Sigma_0</math>) then the decision boundary is quadratic in <math>\mathbf{x}</math> and can be written as<br />
<math>\mathbf{x}^T\mathbf{a}\mathbf{x}+\mathbf{b}^T\mathbf{x} + c = 0</math><br />
<br />
===Generalizing LDA and QDA===<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math><br />
<br />
Where<br />
<br />
<math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math><br />
<br />
When the Gaussian covariance matrices are equal, <math>\Sigma_1 = \Sigma_0</math> (i.e. LDA), then<br />
<br />
<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math><br />
<br />
(To compute this, we need to calculate the value of <math>\,\delta </math> for each class, and then take the one with the max. value).<br />
<br />
===In practice===<br />
We estimate the prior as the proportion of the data points that belong to class <math>\,k</math>, i.e.<br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
We estimate the mean of class <math>\,k</math> as the average of the data points in that class, i.e.<br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
and we estimate the covariance of each class as<br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
If we wish to use LDA we must calculate a common covariance matrix, which we obtain as the weighted average of the class covariances:<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{r=1}^{k}n_r} </math><br />
<br />
Where: <math>\,n_r</math> is the number of data points in class <math>\,r</math>, <math>\,\Sigma_r</math> is the covariance of class <math>\,r</math>, <math>\,n</math> is the total number of data points, and <math>\,k</math> is the number of classes.<br />
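<br />
A minimal Matlab sketch of these estimates (assuming a hypothetical d-by-n data matrix X with one point per column and a 1-by-n label vector y taking values 1,...,K) might look as follows:<br />
<br />
<pre><br />
% X: d-by-n data matrix (one point per column), y: 1-by-n labels in {1,...,K} (assumed inputs)<br />
[d, n] = size(X);<br />
K = max(y);<br />
Sigma_pooled = zeros(d, d);<br />
for k = 1:K<br />
    Xk = X(:, y == k);                        % points in class k<br />
    nk(k) = size(Xk, 2);<br />
    pi_hat(k) = nk(k) / n;                    % estimated prior<br />
    mu_hat(:, k) = mean(Xk, 2);               % estimated class mean<br />
    Xc = Xk - repmat(mu_hat(:, k), 1, nk(k)); % centred class data<br />
    Sigma_hat{k} = (Xc * Xc') / nk(k);        % estimated class covariance<br />
    Sigma_pooled = Sigma_pooled + nk(k) * Sigma_hat{k};<br />
end<br />
Sigma_pooled = Sigma_pooled / n;              % common covariance for LDA<br />
</pre><br />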
<br />
===Computation===<br />
<br />
For QDA we need to calculate: <math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math><br />
<br />
Let's first consider the case where <math>\, \Sigma_k = I, \forall k </math>. This is the case where each class distribution is spherical around its mean.<br />
<br />
====Case 1====<br />
When <math>\, \Sigma_k = I </math><br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
but <math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
and <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math> is the [http://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_Distance squared Euclidean distance] between two points <math>\,x</math> and <math>\,\mu_k</math><br />
<br />
Thus in this condition, a new point can be classified by its distance away from the center of a class, adjusted by some prior.<br />
<br />
Further, for a two-class problem with equal priors, the decision boundary is the perpendicular bisector of the line segment joining the two class means.<br />
<br />
====Case 2==== <br />
When <math>\, \Sigma_k \neq I </math><br />
<br />
Using the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\, \Sigma_k</math><br />
we get <math> \, \Sigma_k = U_kS_kV_k^\top</math>. In particular, <math>\, U_k</math> is a collection of eigenvectors of <math>\, \Sigma_k\Sigma_k^*</math>, and <math>\, V_k</math> is a collection of eigenvectors of <math>\,\Sigma_k^*\Sigma_k</math>.<br />
Since <math>\, \Sigma_k</math> is a symmetric matrix<ref> http://en.wikipedia.org/wiki/Covariance_matrix#Properties </ref>, <math>\, \Sigma_k = \Sigma_k^*</math>, so we have <math> \, \Sigma_k = U_kS_kU_k^\top </math>.<br />
<br />
For <math>\,\delta_k</math>, the second term becomes what is also known as the Mahalanobis distance <ref>P. C. Mahalanobis, "On The Generalised Distance in Statistics," ''Proceedings of the National Institute of Sciences of India'', 1936</ref> :<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^T(x-\mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
We can think of <math> \, S_k^{-\frac{1}{2}}U_k^\top </math> as a linear transformation that takes points in class <math>\,k</math> and distributes them spherically around a point, as in Case 1. Thus when we are given a new point, we can apply the modified <math>\,\delta_k</math> values to calculate <math>\ h^*(\,x)</math>. After applying this transformation, the covariance is effectively the identity matrix, such that<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}[(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k)] + log (\pi_k) </math><br />
<br />
and,<br />
<br />
<math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
For applying the above method with classes that have different covariance matrices (for example the covariance matrices <math>\ \Sigma_0 </math> and <math>\ \Sigma_1 </math> for the two class case), each of the covariance matrices has to be decomposed using SVD to find the according transformation. Then, each new data point has to be transformed using each transformation to compare its distance to the mean of each class (for example for the two class case, the new data point would have to be transformed by the class 1 transformation and then compared to <math>\ \mu_0 </math> and the new data point would also have to be transformed by the class 2 transformation and then compared to <math>\ \mu_1 </math>).<br />
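<br />
A hedged Matlab sketch of this two-class procedure (not from the lecture), assuming the estimates mu0, mu1, Sigma0, Sigma1 and the priors pi0, pi1 are already available, could be:<br />
<br />
<pre><br />
% assumed inputs: mu0, mu1 (d-by-1), Sigma0, Sigma1 (d-by-d), priors pi0, pi1<br />
% whitening transformations S_k^(-1/2) * U_k' from the SVD of each class covariance<br />
[U0, S0] = svd(Sigma0);  T0 = diag(1./sqrt(diag(S0))) * U0';<br />
[U1, S1] = svd(Sigma1);  T1 = diag(1./sqrt(diag(S1))) * U1';<br />
<br />
% delta_k for a new point x (d-by-1): squared distance measured in the whitened space<br />
delta0 = @(x) -0.5*log(det(Sigma0)) - 0.5*norm(T0*(x - mu0))^2 + log(pi0);<br />
delta1 = @(x) -0.5*log(det(Sigma1)) - 0.5*norm(T1*(x - mu1))^2 + log(pi1);<br />
h = @(x) double(delta1(x) > delta0(x));       % predicted label (0 or 1)<br />
</pre><br />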
<br />
<br />
The difference between [[#Case 1 | Case 1]] and [[#Case 2 | Case 2]] (i.e. the difference between using the Euclidean and Mahalanobis distance) can be seen in the illustration below. <br />
<br />
[[File:EuclideanVsMahalonobisDistance2.PNG|frame|center|Illustration of Euclidean distance (a) and Mahalanobis distance (b) where the contours represent equidistant points from the center using each distance metric. Source: <ref>R. De Maesschalck, D. Jouan-Rimbaud and D. L. Massart, "Tutorial - The Mahalanobis distance," ''Chemometrics and Intelligent Laboratory Systems'', 2000 </ref>]]<br />
<br />
As can be seen from the illustration above, the Mahalanobis distance takes into account the distribution of the data points, whereas the Euclidean distance would treat the data as though it has a spherical distribution. Thus, the Mahalanobis distance applies for the more general classification in [[#Case 2 | Case 2]], whereas the Euclidean distance applies to the special case in [[#Case 1 | Case 1]] where the data distribution is assumed to be spherical.<br />
<br />
Generally, QDA provides a more flexible classifier than LDA because LDA assumes that the covariance matrix is identical for each class, whereas QDA does not. QDA still assumes a Gaussian class conditional distribution; in practice the data may not follow a Gaussian distribution, in which case other distributions (or non-parametric methods) must be used instead.<br />
<br />
== Principal Component Analysis (PCA) (Lecture: Sep. 27, 2011) ==<br />
<br />
'''Principal Component Analysis (PCA)''' is a method of dimensionality reduction/feature extraction that transforms the data from a D dimensional space into a new coordinate system of dimension d, where <math>d \leq D</math> (the worst case would be to have d=D). The goal is to preserve as much of the variance in the original data as possible when switching the coordinate systems. Given data on D variables, the hope is that the data points will lie mainly in a linear subspace of dimension lower than D. In practice, the data will usually not lie precisely in some lower dimensional subspace.<br />
<br />
<br />
The new variables that form a new coordinate system are called '''principal components''' (PCs). PCs are denoted by <math>\ u_1, u_2, ... , u_D </math>. The principal components form a basis for the data. Since PCs are orthogonal linear transformations of the original variables, there are at most D PCs. Normally, not all of the D PCs are used but rather a subset of d PCs, <math>\ u_1, u_2, ... , u_d </math>, to approximate the space spanned by the original data points <math>\ x_1, x_2, ... , x_D </math>. We can choose d based on what percentage of the variance in the original data we would like to maintain. <br />
<br />
Let <math>\ PC_j</math> be a linear combination of <math>\ x_1, x_2, ... , x_D </math> defined by the coefficients <br />
<math>\ w^{(j)}</math> = <math> ( {w_1}^{(j)}, {w_2}^{(j)},...,{w_D}^{(j)} )^T </math><br />
<br />
Thus, <math> u_j = {w_1}^{(j)} x_1 + {w_2}^{(j)} x_2 + ... + {w_D}^{(j)} x_D = w^{(j)^T} X </math><br />
<br />
<br />
This is a unique configuration since it sets up the PCs in order from maximum to minimum variances. The first PC, <math>\ u_1 </math> is called '''first principal component''' and has the maximum variance, thus it accounts for the most significant variance in the data <math>\ x_1, x_2, ... , x_D </math>. The second PC, <math>\ u_2 </math> is called '''second principal component''' and has the second highest variance and so on until PC, <math>\ u_D </math> which has the minimum variance. <br />
<br />
<br />
To get the first principal component, we would like to use the following equation:<br />
<br />
<math>\ \max_{w} \, Var(w^T X) = \max_{w} \, (w^T S w) </math> <br />
<br />
Where <math>\ S </math> is the covariance matrix. And we solve for <math>\ w </math>.<br />
<br />
<br />
Note: we require the constraint <math>\ w^T w = 1 </math> because if there is no constraint on the length of <math>\ w </math> then there is no upper bound. With the constraint, the direction and not the length that maximizes the variance can be found. <br />
<br />
<br />
====Lagrange Multiplier====<br />
<br />
Before we proceed, we should review Lagrange multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
<br />
Lagrange multipliers are used to find the maximum or minimum of a function <math>\displaystyle f(x,y)</math> subject to constraint <math>\displaystyle g(x,y)=0</math> <br />
<br />
we define a new constant <math> \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle f(x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example :====<br />
Suppose we want to maximize the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to find the maximum value for the function <math>\displaystyle f </math>; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^{2}+y^{2}-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain two stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we just need to substitute them in <math>\displaystyle f(x,y)</math> and see which one gives the larger value. In this case the maximum is attained at <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
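<br />
As a quick numerical sanity check (not from the lecture), the constraint can be parametrized as <math>\displaystyle (\cos t, \sin t)</math> and the objective evaluated on a fine grid:<br />
<br />
<pre><br />
% parametrize the unit circle and maximize f(x,y) = x - y numerically<br />
t = linspace(0, 2*pi, 100000);<br />
f = cos(t) - sin(t);<br />
[fmax, idx] = max(f);<br />
[cos(t(idx)), sin(t(idx))]      % approximately ( sqrt(2)/2, -sqrt(2)/2 )<br />
fmax                            % approximately sqrt(2)<br />
</pre><br />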
<br />
===Determining w :===<br />
<br />
Use the Lagrange multiplier conversion to obtain:<br />
<math>\displaystyle L(w, \lambda) = w^T Sw - \lambda (w^T w - 1)</math> where <math>\displaystyle \lambda </math> is a constant <br />
<br />
Take the derivative and set it to zero:<br />
<math>\displaystyle{\partial L \over{\partial w}} = 0 </math><br />
<br />
<br />
To obtain: <br />
<math>\displaystyle 2Sw - 2 \lambda w = 0</math><br />
<br />
<br />
Rearrange to obtain:<br />
<math>\displaystyle Sw = \lambda w</math><br />
<br />
<br />
where <math>\displaystyle w</math> is eigenvector of <math>\displaystyle S </math> and <math>\ \lambda </math> is the eigenvalue of <math>\displaystyle S </math> as <math>\displaystyle Sw= \lambda w </math> , and <math>\displaystyle w^T w=1</math> , then we can write<br />
<br />
<math>\displaystyle w^T Sw= w^T\lambda w= \lambda w^T w =\lambda </math> <br />
<br />
Note that the PCs decompose the total variance in the data in the following way:<br />
<br />
<math> \sum_{i=1}^{D} Var(u_i) = \sum_{i=1}^{D} \lambda_i = Tr(S) = \sum_{i=1}^{D} Var(x_i)</math><br />
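<br />
The eigenvalue view above can be checked numerically; a minimal Matlab sketch (assuming a hypothetical d-by-n data matrix X with one observation per column) is:<br />
<br />
<pre><br />
% X: d-by-n data matrix, one observation per column (assumed input)<br />
Xc = X - repmat(mean(X, 2), 1, size(X, 2));    % centre the data<br />
S  = (Xc * Xc') / (size(X, 2) - 1);            % sample covariance matrix<br />
[W, L] = eig(S);                               % eigenvectors and eigenvalues of S<br />
[lambda, idx] = sort(diag(L), 'descend');      % order the eigenvalues<br />
w1 = W(:, idx(1));                             % first principal component direction<br />
var_u1 = w1' * S * w1;                         % equals lambda(1)<br />
</pre><br />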
<br />
== Principal Component Analysis (PCA) Continued (Lecture: Sep. 29, 2011) == <br />
As can be seen from the above expressions, <math>\ Var(W^\top X) = W^\top S W= \lambda </math> where lambda is an eigenvalue of the sample covariance matrix <math>\ S </math> and <math>\ W</math> is its corresponding eigenvector. So <math>\ Var(u_i) </math> is maximized if <math>\ \lambda_i </math> is the maximum eigenvalue of <math>\ S </math> and the first principal component (PC) is the corresponding eigenvector. Each successive PC can be generated in the above manner by taking the eigenvectors of <math>\ S</math><ref>www.wikipedia.org/wiki/Eigenvalues_and_eigenvectors</ref> that correspond to the eigenvalues:<br />
<br />
<math>\ \lambda_1 \geq ... \geq \lambda_D </math> <br />
<br />
such that <br />
<br />
<math>\ Var(u_1) \geq ... \geq Var(u_D) </math><br />
<br />
=== Alternative Derivation ===<br />
Another way of looking at PCA is to consider PCA as a projection from a higher D-dimension space to a lower d-dimensional subspace that minimizes the squared ''reconstruction error''. The squared reconstruction error is the difference between the original data set <math>\ X </math> and the new data set <math> \hat{X} </math> obtained by first projecting the original data set into a lower d-dimensional subspace and then projecting it back into the original higher D-dimension space. Since information is (normally) lost by compressing the original data into a lower d-dimensional subspace, the new data set will (normally) differ from the original data even though both are part of the higher D-dimension space. The reconstruction error is computed as shown below.<br />
<br />
====Reconstruction Error====<br />
<br />
<math> e = \sum_{i=1}^{n} || x_i - \hat{x}_i ||^2 </math><br />
<br />
====Minimize Reconstruction Error====<br />
<br />
Suppose the data are centred, i.e. <math> \bar{x} = 0 </math> (otherwise, replace each <math>\ x_i </math> by <math>\ x_i - \bar{x} </math>).<br />
<br />
Let <math>\ f(y) = U_d y </math> where <math>\ U_d </math> is a D by d matrix with d orthogonal unit vectors as columns.<br />
<br />
Fit the model to the data and minimize the reconstruction error:<br />
<br />
<math>\ min_{U_d, y_i} \sum_{i=1}^n || x_i - U_d y_i ||^2 </math><br />
<br />
Differentiate with respect to <math>\ y_i </math>:<br />
<br />
<math> \frac{\partial e}{\partial y_i} = 0 </math><br />
<br />
we can rewrite reconstruction-error as : <math>\ e = \sum_{i=1}^n(x_i - U_d y_i)^T(x_i - U_d y_i) </math><br />
<br />
<math>\ \frac{\partial e}{\partial y_i} = -2U_d^T(x_i - U_d y_i) = 0 </math><br />
<br />
<math>\ \Rightarrow \ U_d^T x_i - U_d^T U_d y_i = 0 </math><br />
<br />
Since the columns of <math>\ U_d </math> are orthonormal, <math>\ U_d^T U_d = I </math>, and we can conclude that:<br />
<br />
<math>\ y_i = U_d^T x_i </math><br />
<br />
Find the orthogonal matrix <math>\ U_d </math>:<br />
<br />
<math>\ min_{U_d} \sum_{i=1}^n || x_i - U_d U_d^T x_i||^2 </math><br />
<br />
====Using SVD====<br />
<br />
A unique solution can be obtained by finding the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\ X </math>:<br />
<br />
<math>\ X = U S V^T </math><br />
<br />
For each rank d, <math>\ U_d </math> consists of the first d columns of <math>\ U </math>. Also, the covariance matrix can be expressed as follows <math>\ S = \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T </math>.<br />
<br />
Simply put, by subtracting the mean of each of the data point features and then applying SVD, one can find the principal components:<br />
<br />
<math> \tilde{X} = X - \mu </math><br />
<br />
<math>\ \tilde{X} = U S V^T </math><br />
<br />
Where <math>\ X </math> is a D by n matrix of data points and the features of each data point form a column in <math>\ X </math>. Also, <math>\ \mu </math> is a D by n matrix with identical columns, each equal to the mean of the <math>\ x_i</math>'s, i.e. <math>\mu_{:,j}=\frac{1}{n}\sum_{i=1}^n x_i </math>. Note that the arrangement of data points is a convention and indeed in Matlab or conventional statistics, the transpose of the matrices in the above formulae is used.<br />
<br />
As the <math>\ S </math> matrix from the SVD has the eigenvalues arranged from largest to smallest, the corresponding eigenvectors in the <math>\ U </math> matrix from the SVD will be such that the first column of <math>\ U </math> is the first principal component and the second column is the second principal component and so on.<br />
<br />
=== Examples ===<br />
<br />
Note that in the Matlab code in the examples below, the mean was not subtracted from the datapoints before performing SVD. This is what was shown in class. However, to properly perform PCA, the mean should be subtracted from the datapoints.<br />
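<br />
For reference, subtracting the mean before performing the SVD could be done as follows (with X holding one data point per column, as in the examples):<br />
<br />
<pre><br />
% subtract the mean data point from every column of X before the SVD<br />
mu = mean(X, 2);<br />
Xc = X - repmat(mu, 1, size(X, 2));<br />
[U, S, V] = svd(Xc);<br />
</pre><br />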
<br />
==== Example 1 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 560 by 1965. 560 is the number of elements in each column. Each column is a vector representation of a 20x28 grayscale pixel image of a face (see image below) and there is a total of 1965 different images of faces. Each of the images is corrupted by noise, but the noise can be removed by projecting the data onto the first few principal components and then back into the original space, using as many dimensions as one likes (e.g. 2, 3, 4 or 5). The corresponding Matlab commands are shown below:<br />
[[File:FreyFaceExample.PNG|thumb|185px|An example of the face images used in [[#Example 1 | Example 1]] with noise removed. Source: <ref>S. Roweis (2011). ''Data for MATLAB.'' [Online]. Available: [http://cs.nyu.edu/~roweis/data.html http://cs.nyu.edu/~roweis/data.html.] |</ref>]]<br />
<pre style="align:left; width: 75%; padding: 2% 2%"><br />
>> % start with a 560 by 1965 matrix X that contains the data points<br />
>> load('noisy.mat');<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 10 by reshaping column 10 into a 20 by 28 matrix<br />
>> imagesc(reshape(X(:,10),20,28)')<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 560 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % project X onto the first ten principal components<br />
>> Y_pca = U(:, 1:10)'*X;<br />
>> <br />
>> % reconstruct X (project back into the original space) using only the first ten principal components<br />
>> X_hat = U(:, 1:10)*Y_pca;<br />
>> <br />
>> % show image in column 10 of X_hat which is now a 560 by 1965 matrix<br />
>> imagesc(reshape(X_hat(:,10),20,28)')<br />
</pre><br />
The reason why the noise is removed in the reconstructed image is because the noise does not create a major variation in a single direction in the original data. Hence, the first ten PCs taken from <math>\ U </math> matrix are not in the direction of the noise. Thus, reconstructing the image using the first ten PCs, will remove the noise.<br />
<br />
==== Example 2 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 64 by 400. 64 is the number of elements in each column. Each column is a vector representation of a 8x8 grayscale pixel image of either a handwritten number ''2'' or a handwritten number ''3'' (see image below) and there are a total of 400 different images, where the first 200 images show a handwritten number ''2'' and the last 200 images show a handwritten number ''3''. <br />
[[File:Handwritten23.PNG|frame|center|An example of the handwritten number images used in [[#Example 2 | Example 2]]. Source: <ref>A. Ghodsi, "PCA" class notes for STAT841, Department of Statistics and Actuarial Science, University of Waterloo, 2011. </ref>]]<br />
<br />
The corresponding Matlab commands for performing PCA on the data points are shown below:<br />
<pre><br />
>> % start with a 64 by 400 matrix X that contains the data points<br />
>> load 2_3.mat;<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 2 by reshaping column 2 into a 8 by 8 matrix<br />
>> imagesc(reshape(X(:,2),8,8))<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 64 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % project data down onto the first two PCs<br />
>> Y = U(:,1:2)'*X;<br />
>> <br />
>> % show Y as an image (can see the change in the first PC at column 200,<br />
>> % when the handwritten number changes from 2 to 3)<br />
>> imagesc(Y)<br />
>> <br />
>> % perform PCA using Matlab's built-in function (do not use for assignment)<br />
>> % also note that due to the Matlab convention, the transpose of X is used<br />
>> [COEFF, Y] = princomp(X');<br />
>> <br />
>> % again, use the first two PCs<br />
>> Y = Y(:,1:2);<br />
>> <br />
>> % use plot digits to show the distribution of images on the first two PCs<br />
>> images = reshape(X, 8, 8, 400);<br />
>> plotdigits(images, Y, .1, 1);<br />
</pre><br />
Using the ''plotdigits'' function in Matlab clearly illustrates that the first PC captured the differences between the numbers ''2'' and ''3'', as they are projected onto different regions of the axis for the first PC. Also, the second PC captured the ''tilt'' of the handwritten numbers, as numbers tilted to the left or right were projected onto different regions of the axis for the second PC.<br />
<br />
==== Example 3 ====<br />
(Not discussed in class) In the news recently was a story that captures some of the ideas behind PCA. Over the past two years, Scott Golder and Michael Macy, researchers from Cornell University, collected 509 million Twitter messages from 2.4 million users in 84 different countries. The data they used were words collected at various times of day and they classified the data into two different categories: positive emotion words and negative emotion words. Then, they were able to study this new data to evaluate subjects' moods at different times of day, while the subjects were in different parts of the world. They found that the subjects generally exhibited positive emotions in the mornings and late evenings, and negative emotions mid-day. They were able to "project their data onto a smaller dimensional space" using PCA. Their paper, "Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures," is available in the journal Science.<ref>http://www.pcworld.com/article/240831/twitter_analysis_reveals_global_human_moodiness.html</ref>.<br />
<br />
Assumptions Underlying Principal Component Analysis can be found here<ref>http://support.sas.com/publishing/pubcat/chaps/55129.pdf</ref><br />
<br />
==== Example 4 ====<br />
(Not discussed in class) A somewhat well known learning rule in the field of neural networks called Oja's rule can be used to train networks of neurons to compute the principal component directions of data sets. <ref>A Simplified Neuron Model as a Principal Component Analyzer. Erkki Oja. 1982. Journal of Mathematical Biology. 15: 267-273</ref> This rule is formulated as follows<br />
<br />
<math>\,\Delta w = \eta yx -\eta y^2w </math><br />
<br />
where <math>\,\Delta w </math> is the neuron weight change, <math>\,\eta</math> is the learning rate, <math>\,y</math> is the neuron output given the current input, <math>\,x</math> is the current input and <math>\,w</math> is the current neuron weight. This learning rule shares some similarities with another method for calculating principal components: power iteration. The basic algorithm for power iteration (taken from wikipedia: <ref>Wikipedia. http://en.wikipedia.org/wiki/Principal_component_analysis#Computing_principal_components_iteratively</ref>) is shown below <br />
<br />
<br />
<math>\mathbf{p} =</math> a random vector<br />
do ''c'' times:<br />
<math>\mathbf{t} = 0</math> (a vector of length ''m'')<br />
for each row <math>\mathbf{x} \in \mathbf{X^T}</math><br />
<math>\mathbf{t} = \mathbf{t} + (\mathbf{x} \cdot \mathbf{p})\mathbf{x}</math><br />
<math>\mathbf{p} = \frac{\mathbf{t}}{|\mathbf{t}|}</math><br />
return <math>\mathbf{p}</math><br />
<br />
Comparing this with the neuron learning rule we can see that the term <math>\, \eta y x </math> is very similar to the <math>\,\mathbf{t}</math> update equation in the power iteration method, and identical if the neuron model is assumed to be linear (<math>\,y(x)=x\mathbf{p}</math>) and the learning rate is set to 1. Additionally, the <math>\, -\eta y^2w </math> term performs the normalization, the same function as the <math>\,\mathbf{p}</math> update equation in the power iteration method.<br />
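<br />
A minimal Matlab sketch of the power iteration above (on a hypothetical centred data matrix X, d-by-n with one observation per column) might be:<br />
<br />
<pre><br />
% X: d-by-n centred data matrix, one observation per column (assumed input)<br />
p = randn(size(X, 1), 1);  p = p / norm(p);   % random starting vector<br />
for c = 1:100                                 % a fixed number of iterations<br />
    t = X * (X' * p);                         % accumulates (x . p) x over all points<br />
    p = t / norm(t);                          % normalize<br />
end<br />
% p now approximates the first principal component direction of X<br />
</pre><br />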
<br />
=== Observations ===<br />
Some observations about the PCA were brought up in class:<br />
<br />
* '''PCA''' assumes that data is on a ''linear subspace'' or close to a linear subspace. For non-linear dimensionality reduction, other techniques are used. Amongst the first proposed techniques for non-linear dimensionality reduction are '''Locally Linear Embedding (LLE)''' and '''Isomap'''. More recent techniques include '''Maximum Variance Unfolding (MVU)''' and '''t-Distributed Stochastic Neighbor Embedding (t-SNE)'''. '''Kernel PCAs''' may also be used, but they depend on the type of kernel used and generally do not work well in practice. (Kernels will be covered in more detail later in the course.)<br />
<br />
* Finding the number of PCs to use is not straightforward. It requires knowledge about the ''intrinsic dimensionality of the data''. In practice, oftentimes a heuristic approach is adopted by looking at the eigenvalues ordered from largest to smallest. If there is a "dip" in the magnitude of the eigenvalues, the "dip" is used as a cut off point and only the large eigenvalues before the "dip" are used. Otherwise, it is possible to add up the eigenvalues from largest to smallest until a certain percentage value is reached. This percentage value represents the percentage of variance that is preserved when projecting onto the PCs corresponding to the eigenvalues that have been added together to achieve the percentage. <br />
<br />
* It is a good idea to normalize the variance of the data before applying PCA. This will avoid PCA finding PCs in certain directions due to the scaling of the data, rather than the real variance of the data.<br />
<br />
* PCA can be considered as an unsupervised approach, since the main direction of variation is not known beforehand, i.e. it is not completely certain which dimension the first PC will capture. The PCs found may not correspond to the desired labels for the data set. There are, however, alternate methods for performing supervised dimensionality reduction.<br />
<br />
* (Not in class) The traditional PCA method does not work well on data sets that lie on a non-linear manifold. A revised PCA method, called c-PCA, has been introduced to improve the stability and convergence of intrinsic dimension estimation. The approach first finds a minimal cover (a cover of a set X is a collection of sets whose union contains X as a subset<ref>http://en.wikipedia.org/wiki/Cover_(topology)</ref>) of the data set. Since set covering is an NP-hard problem, the approach only finds an approximation of minimal cover to reduce the complexity of the run time. In each subset of the minimal cover, it applies PCA and filters out the noise in the data. Finally the global intrinsic dimension can be determined from the variance results from all the subsets. The algorithm produces robust results.<ref>Mingyu Fan, Nannan Gu, Hong Qiao, Bo Zhang, Intrinsic dimension estimation of data by principal component analysis, 2010. Available: http://arxiv.org/abs/1002.2050</ref><br />
<br />
*(Not in class) While PCA finds the mathematically optimal method (as in minimizing the squared error), it is sensitive to outliers in the data, which produce the large errors that PCA tries to avoid. It is therefore common practice to remove outliers before computing PCA. However, in some contexts, outliers can be difficult to identify. For example, in data mining algorithms like correlation clustering, the assignment of points to clusters and outliers is not known beforehand. A recently proposed generalization of PCA based on a '''Weighted PCA''' increases robustness by assigning different weights to data objects based on their estimated relevancy.<ref>http://en.wikipedia.org/wiki/Principal_component_analysis</ref><br />
<br />
* (Not in class) Comparison between PCA and LDA: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two commonly used techniques for data classification and dimensionality reduction. Linear Discriminant Analysis easily handles the case where the within-class frequencies are unequal and its performance has been examined on randomly generated test data. This method maximizes the ratio of between-class variance to the within-class variance in any particular data set, thereby guaranteeing maximal separability. ... The prime difference between LDA and PCA is that PCA does more of feature classification and LDA does data classification. In PCA, the shape and location of the original data sets change when transformed to a different space, whereas LDA does not change the location but only tries to provide more class separability and draw a decision region between the given classes. This method also helps to better understand the distribution of the feature data. [[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf]]<br />
<br />
=== Summary ===<br />
The PCA algorithm can be summarized into the following steps (a Matlab sketch of these steps is given after the list):<br />
<br />
# '''Recover basis'''<br />
#: <math>\ \text{ Calculate } XX^T=\Sigma_{i=1}^{t}x_ix_{i}^{T} \text{ and let } U=\text{ eigenvectors of } XX^T \text{ corresponding to the largest } d \text{ eigenvalues.} </math><br />
# '''Encode training data'''<br />
#: <math>\ \text{Let } Y=U^TX \text{, where } Y \text{ is a } d \times t \text{ matrix of encodings of the original data.} </math><br />
# '''Reconstruct training data'''<br />
#: <math> \hat{X}=UY=UU^TX </math>.<br />
# '''Encode test example'''<br />
#: <math>\ y = U^Tx \text{ where } y \text{ is a } d\text{-dimensional encoding of } x </math>.<br />
# '''Reconstruct test example'''<br />
#: <math> \hat{x}=Uy=UU^Tx </math>.<br />
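<br />
A compact, hedged Matlab sketch of the steps above (assuming centred training data X, D-by-t with one point per column, and a centred test point x) could be:<br />
<br />
<pre><br />
% X: centred training data (one point per column), x: a centred test point (assumed inputs)<br />
d = 10;                                  % number of components to keep (example value)<br />
[U, S, V] = svd(X * X');                 % eigenvectors of X*X' (recover basis)<br />
Ud = U(:, 1:d);                          % top d eigenvectors<br />
Y  = Ud' * X;                            % encode training data<br />
Xhat = Ud * Y;                           % reconstruct training data<br />
y  = Ud' * x;                            % encode test example<br />
xhat = Ud * y;                           % reconstruct test example<br />
</pre><br />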
<br />
== Fisher Discriminant Analysis (FDA) (Lecture: Sep. 29, 2011) ==<br />
<br />
'''Fisher Discriminant Analysis (FDA)''' is sometimes called ''Fisher Linear Discriminant Analysis (FLDA)'' or just ''Linear Discriminant Analysis (LDA)''. This causes confusion with the [[#LDA | ''Linear Discriminant Analysis (LDA)'']] technique covered earlier in the course. The LDA technique covered earlier in the course has a normality assumption and is a boundary finding technique. The FDA technique outlined here is a supervised feature extraction technique. FDA differs from PCA as well because PCA does not use the class labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math> while FDA organizes data into their ''classes'' by finding the direction of maximum separation between classes.<br />
<br />
== Fisher Discriminant Analysis (FDA) Continued (Lecture: Oct. 04, 2011) ==<br />
<br />
One main drawback of the PCA technique is that the direction of greatest variation may not be the classification we desire. For example, imagine if the [[#Example 2 | data set]] above had a lightening filter applied to a random subset of the images. Then the greatest variation would be the brightness and not the more important variations we wish to classify. FDA circumvents this problem by using the labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math> i.e. the FDA uses ''supervised learning''. An elementary way to see the algorithm is to imagine two classes of data projected onto a suitably chosen line that minimizes the within class variance, and maximizes the distance between the two classes i.e. group similar data together and spread different data apart. This way, newly acquired data can be transformed and then compared to these projections using some well-chosen metric.<br />
<br />
<br />
We first consider the two-class case. Denote the mean and covariance matrix of class <math>i=0,1</math> by <math>\mathbf{\mu}_i</math> and <math>\mathbf{\Sigma}_i</math> respectively. We transform the data so that it is projected into 1 dimension, i.e. a scalar value. To do this, we compute the inner product of our <math>d \times 1</math>-dimensional data, <math>\mathbf{x}</math>, with a to-be-determined <math>d \times 1</math>-dimensional vector <math>\mathbf{w}</math>. The new means and covariances of the transformed data are:<br />
<br />
::<math> \mu'_i:\rightarrow \mathbf{w}^{T}\mathbf{\mu}_i </math> <br/><br />
::<math> \Sigma'_i :\rightarrow \mathbf{w}^{T}\mathbf{\Sigma}_i \mathbf{w}</math><br />
<br />
The new means and variances are actually scalar values now, but we will use vector and matrix notation and arguments throughout the following derivation as the multi-class case is then just a simpler extension. <br />
<br />
===Goals of FDA===<br />
<br />
As will be shown in the objective function, the goal of FDA is to maximize the separation of the classes (between class variance) and minimize the scatter within each class (within class variance). That is, our ideal situation is that the individual classes are as far away from each other as possible and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case). An interesting note is that R. A. Fisher who FDA is named after, used the FDA technique for purposes of taxonomy, in particular for categorizing different species of iris flowers. <ref name="RAFisher">R. A. Fisher, "The Use of Multiple measurements in Taxonomic Problems," ''Annals of Eugenics'', 1936</ref>. It is very easy to visualize what is meant by within class variance (i.e. differences between the iris flowers of the same species) and between class variance (i.e. the differences between the iris flowers of different species) in that case.<br />
<br />
<br />
'''1)''' Our '''first''' goal is to minimize the individual classes' covariance. This will help to collapse the data together. <br />
We have two minimization problems<br />
<br />
::<math>\min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_0 \mathbf{w}^{T}</math> <br />
and <br />
::<math>\min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_1 \mathbf{w}^{T}</math>.<br />
<br />
But these can be combined:<br />
::<math> \min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_0 \mathbf{w}^{T} + \mathbf{w} \mathbf{\Sigma}_1 \mathbf{w}^{T}</math> <br />
:: <math> = \min_{\mathbf{w}} \mathbf{w} ( \mathbf{\Sigma_0} + \mathbf{\Sigma_1} ) \mathbf{w}^{T} </math><br />
<br />
Define <math> \mathbf{S}_W =\mathbf{\Sigma_0} + \mathbf{\Sigma_1} </math>, called the ''within class variance matrix''. <br />
<br />
'''2)''' Our '''second''' goal is to move the minimized classes as far away from each other as possible. One way to accomplish this is to maximize the distances between the means of the transformed data i.e.<br />
<br />
<math> \max_{\mathbf{w}} |\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1|^2 </math><br />
<br />
Simplifying:<br />
::<math> \max_{\mathbf{w}} \,(\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1)^T (\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1) </math> <br/><br />
::<math> = \max_{\mathbf{w}}\, (\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w} \mathbf{w}^{T} (\mathbf{\mu}_0-\mathbf{\mu}_1)</math> <br/><br />
::<math> = \max_{\mathbf{w}} \,\mathbf{w}^{T}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w}</math><br />
<br />
Recall that <math> \mathbf{\mu}_i </math> are known. Denote<br />
<br />
::<math> \mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math> <br />
<br />
This matrix, called the ''between class variance matrix'', is a rank 1 matrix, so an inverse does not exist. Altogether, we have two optimization problems we must solve simultaneously:<br />
<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} </math><br />
<br />
There are other metrics one can use to both minimize the data's variance and maximizes the distance between classes, and other goals we can try to accomplish (see metric learning, below...one day), but Fisher used this elegant method, hence his recognition in the name, and we will follow his method.<br />
<br />
We can combine the two optimization problems into one after noting that the negative of max is min:<br />
<br />
::<math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \alpha \mathbf{w} \mathbf{S_W} \mathbf{w}^{T}</math><br/><br />
<br />
The <math>\alpha</math> coefficient is a necessary scaling factor: if the scale of one of the terms is much larger than the other, the optimization problem will be dominated by the larger term. This means we have another unknown, <math>\alpha</math>, to solve for. Instead, we can circumvent the scaling problem by looking at the ratio of the quantities, the original solution Fisher proposed:<br />
<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w} \mathbf{S_B} \mathbf{w}^{T}}{\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}} </math><br />
<br />
This optimization problem can be shown<ref><br />
http://www.socher.org/uploads/Main/optimizationTutorial01.pdf<br />
</ref> to be equivalent to the following optimization problem:<br />
<br />
:: <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T}</math> <br />
<br />
subject to:<br />
<br />
:: <math> \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} = 1 </math><br />
<br />
A heuristic understanding of this equivalence is that we have two degrees of freedom: direction and scalar. The scalar value is irrelevant to our discussion. Thus, we can set one of the values to be a constant. We can use Lagrange multipliers to solve this optimization problem:<br />
<br />
::<math>L( \mathbf{w}, \lambda) = \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \lambda(\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}-1)</math><br />
:: <math> \Rightarrow \frac{\partial L}{\partial \mathbf{w}} = 2 \mathbf{S}_B \mathbf{w} - 2\lambda \mathbf{S}_W\mathbf{w} </math><br />
<br />
Setting the partial derivative to 0 gives us a ''generalized eigenvalue problem'':<br />
<br />
::<math> \mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w} </math><br />
:: <math> \Rightarrow \mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w} </math><br />
<br />
This is a generalized eigenvalue problem and <math>\ \mathbf{w} </math> can be computed as the eigenvector corresponding to the largest eigenvalue of <br />
:: <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math><br />
<br />
It is very likely that <math> \mathbf{S}_W </math> has an inverse. If not, the pseudo-inverse<ref><br />
http://en.wikipedia.org/wiki/Generalized_inverse<br />
</ref><ref><br />
http://www.mathworks.com/help/techdoc/ref/pinv.html<br />
</ref> can be used. In Matlab the pseudo-inverse function is named ''pinv''. Thus, we should choose <math>\mathbf{w}</math> to equal the eigenvector of the largest eigenvalue as our projection vector. <br />
<br />
In fact we can simplify the above expression further in the case of two classes. Recall the definition of <math>\mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math>. Substituting this into our expression:<br />
<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w} = \lambda \mathbf{w} </math><br />
::<math> (\mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) ) ((\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w}) = \lambda \mathbf{w} </math><br />
<br />
This second term is a scalar value, let's denote it <math>\beta</math>. Then<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) = \frac{\lambda}{\beta} \mathbf{w} </math><br />
::<math> \Rightarrow \, \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) \propto \mathbf{w} </math><br />
<br />
All we are interested in is the direction of <math>\mathbf{w}</math>, so computing this expression is sufficient to find our projection vector. This shortcut does not carry over to the multi-class case, where <math>\mathbf{w}</math> is a matrix rather than a vector.<br />
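<br />
A hedged Matlab sketch of the two-class computation (assuming hypothetical d-by-n0 and d-by-n1 data matrices X0 and X1, one point per column, for the two classes) could be:<br />
<br />
<pre><br />
% X0, X1: data matrices for class 0 and class 1, one point per column (assumed inputs)<br />
mu0 = mean(X0, 2);  mu1 = mean(X1, 2);<br />
C0 = X0 - repmat(mu0, 1, size(X0, 2));<br />
C1 = X1 - repmat(mu1, 1, size(X1, 2));<br />
SW = (C0 * C0') / size(X0, 2) + (C1 * C1') / size(X1, 2);   % within class variance matrix<br />
w  = pinv(SW) * (mu0 - mu1);            % FDA direction (defined up to scale)<br />
w  = w / norm(w);<br />
% project the data onto w:  w' * X0  and  w' * X1<br />
</pre><br />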
<br />
=== Extensions to Multiclass Case ===<br />
If we have <math>\ k</math> classes, we need <math>\ k-1</math> directions i.e. we need to project <math>\ k</math> 'points' onto a <math>\ k-1</math> dimensional hyperplane. What does this change in our above derivation? The most significant difference is that our projection vector,<math>\mathbf{w}</math>, is no longer a vector but instead is a matrix <math>\mathbf{W}</math>. We transform the data as:<br />
<br />
::<math> \mathbf{x}' :\rightarrow \mathbf{W}^{T} \mathbf{x}</math><br />
so our new mean and covariances for class k are:<br />
::<math> \mathbf{\mu_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\mu_k}</math><br />
::<math> \mathbf{\Sigma_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W}</math><br />
<br />
What are our new optimization sub-problems? As before, we wish to minimize the within class variance. This can be formulated as:<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{\Sigma_1} \mathbf{W} + \dots + \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W} </math><br />
<br />
Again, denoting <math>\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}</math>, we can simplify above expression:<br />
<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_W \mathbf{W} </math><br />
<br />
Similarly, the second optimization problem is:<br />
<br />
::<math>\max_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_B \mathbf{W} </math><br />
<br />
What is <math>\mathbf{S}_B</math> in this case? It can be shown that <math>\mathbf{S}_T = \mathbf{S}_B + \mathbf{S}_W </math> where <math> \mathbf{S}_T </math> is the covariance matrix of all the data. From this we can compute <math> \mathbf{S}_B </math>. <br />
<br />
Next, if we express <math> \mathbf{W} = ( \mathbf{w}_1 , \mathbf{w}_2 , \dots ,\mathbf{w}_k ) </math> observe that, for <math> \mathbf{A} = \mathbf{S}_B , \mathbf{S}_W </math>: <br />
<br />
::<math> Tr(\mathbf{W}^{T} \mathbf{A} \mathbf{W}) = \mathbf{w}_1^{T} \mathbf{A} \mathbf{w}_1 + \dots + \mathbf{w}_k^{T} \mathbf{A} \mathbf{w}_k </math><br />
<br />
where <math>\ Tr()</math> is the trace of a matrix. Thus, following the same steps as in the two-class case, we have the new optimization problem:<br />
<br />
::<math> \max_{\mathbf{W}} \frac{ Tr(\mathbf{W}^{T} \mathbf{S}_B \mathbf{W}) }{Tr(\mathbf{W}^{T} \mathbf{S}_W \mathbf{W})} </math> <br />
<br />
subject to:<br />
<br />
:: <math> \mathbf{W}^{T} \mathbf{S_W} \mathbf{W} = \mathbf{I} </math><br />
<br />
Again, in order to solve the above optimization problem, we can use the Lagrange multiplier <ref><br />
http://en.wikipedia.org/wiki/Lagrange_multiplier </ref>:<br />
<br />
:: <math>\begin{align}L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - Tr\left[\Lambda\left( \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} - \mathbf{I} \right)\right]\end{align}</math>.<br />
<br />
where <math>\ \Lambda</math> is a d by d diagonal matrix.<br />
<br />
Then, we differentiating with respect to <math>\mathbf{W}</math>:<br />
<br />
:: <math>\begin{align}\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\Lambda = 0\end{align}</math>.<br />
<br />
Thus:<br />
<br />
:: <math>\begin{align}\mathbf{S}_{B}\mathbf{W} = \mathbf{S}_{W}\mathbf{W}\Lambda\end{align}</math><br />
<br />
:: <math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{W} = \mathbf{W}\Lambda\end{align}</math><br />
<br />
where, <math> \mathbf{\Lambda} =\begin{pmatrix}\lambda_{1} & & 0\\&\ddots&\\0 & &\lambda_{d}\end{pmatrix}</math><br />
<br />
The above equation is of the form of an eigenvalue problem. Thus, for the solution the k-1 eigenvectors corresponding to the k-1 largest eigenvalues should be chosen as the projection matrix, <math>\mathbf{W}</math>. In fact, there should only be k-1 eigenvectors corresponding to k-1 non-zero eigenvalues using the above equation.<br />
<br />
=== Summary ===<br />
FDA has two optimization problems:<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} </math> <br />
<br />
where <math>\ S_W = \Sigma_0 + \Sigma_1</math> is called the within class variance and <math>\ S_B = (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T </math> is called the between class variance.<br />
<br />
The two optimization problems are combined as follows:<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w}^{T} \mathbf{S_B} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}} </math><br />
<br />
Equivalently, by adding a constraint, the problem can be written as:<br />
::<math> \max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w}</math><br />
<br />
subject to:<br />
:: <math> \mathbf{w}^{T} \mathbf{S_W} \mathbf{w} = 1 </math><br />
<br />
Lagrange multipliers can be used and essentially the problem becomes an eigenvalue problem:<br />
<br />
::<math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w} = \lambda\mathbf{w}\end{align}</math><br />
<br />
In the two-class case, <math>\ w </math> is the eigenvector of <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math> corresponding to the largest eigenvalue; in the k-class case, the projection matrix consists of the k-1 eigenvectors corresponding to the largest k-1 eigenvalues.<br />
<br />
=== Variations ===<br />
<br />
Some adaptations and extensions exist for the FDA technique (Source: <ref>R. Gutierrez-Osuna, "Linear Discriminant Analysis" class notes for Intro to Pattern Analysis, Texas A&M University. Available: [http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf]</ref>):<br />
<br />
1) ''Non-Parametric LDA (NPLDA)'' by Fukunaga<br />
<br />
This method does not assume that the Gaussian distribution is unimodal and it is actually possible to extract more than k-1 features (where k is the number of classes).<br />
<br />
2) ''Orthonormal LDA (OLDA)'' by Okada and Tomita<br />
<br />
This method finds projections that are orthonormal in addition to maximizing the FDA objective function. This method can also extract more than k-1 features (where k is the number of classes).<br />
<br />
3) ''Generalized LDA (GLDA)'' by Lowe<br />
<br />
This method incorporates additional cost functions into the FDA objective function. This causes classes with a higher cost to be placed further apart in the lower dimensional representation.<br />
<br />
== Linear and Logistic Regression (Lecture: Oct. 06, 2011) ==<br />
<br />
=== Linear Regression ===<br />
<br />
In regression, <math>\ y </math> is a continuous variable. In classification, <math>\ y </math> is a discrete variable. Regression problems are easier to formulate into functions (since <math>\ y </math> is continuous) and it is possible to solve classification problems by treating them like regression problems. In order to do so, the requirement in classification that <math>\ y </math> is discrete must first be relaxed. Once <math>\ y </math> has been found using regression techniques, it is possible to determine the discrete class corresponding to the <math>\ y </math> that has been found to solve the original classification problem. The discrete class is obtained by defining a threshold where <math>\ y </math> values below the threshold belong to one class and <math>\ y </math> values above the threshold belong to another class.<br />
<br />
<br />
More formally: a more direct approach to classification is to estimate the regression function <math>\ r(\mathbf{x}) = E[Y | X]</math> without bothering to estimate <math>\ f_k(\mathbf{x}) </math>.<br />
<br />
In two-class problems, if <math>\ Y = \{0,1\}</math>, then <math>\, h^*(\mathbf{x})= \left\{\begin{matrix}<br />
1 &\text{,if } \hat r(\mathbf{x})>\frac{1}{2} \\<br />
0 &\mathrm{,otherwise.} \end{matrix}\right.</math><br />
<br />
Basically, we can use a linear function<br />
<math>\ f(x, \beta) = \mathbf{\beta\,}^T \mathbf{x_{i}} + \mathbf{\beta\,_0} </math> and use the least squares approach to fit the function to the given data. This is done by minimizing the following expression:<br />
<br />
<math>\min_{\mathbf{\beta}} \sum_{i=1}^n (y_i - \mathbf{\beta}^T<br />
\mathbf{x_{i}} - \mathbf{\beta_0})^2</math><br />
<br />
where<br />
<br />
<math>\tilde{\mathbf{\beta}} = \left( \begin{array}{c}\mathbf{\beta}_{1} \\ \vdots \\ \mathbf{\beta}_{d} \\ \mathbf{\beta}_{0} \end{array} \right)</math>.<br />
<br />
For convenience, <math>\mathbf{\beta}</math> and <math>\mathbf{\beta}_0</math> have been combined into a single (d+1)-dimensional vector, and an extra entry equal to 1 is appended to each <math>\ x_i </math>. Thus, the function to be minimized can now be expressed as:<br />
<br />
<math>\ \min_{\tilde{\beta}} \sum_{i=1}^{n} (y_i - \tilde{\beta}^T \tilde{x}_i )^2 </math><br />
<br />
<math>\ = \min_{\tilde{\beta}} \| y - X^T \tilde{\beta} \|^2 </math><br />
<br />
where <math>\ y </math> and <math>\tilde{\beta}</math> are vectors and <math>\ X </math> is a matrix.<br />
<br />
The solution for <math>\ \tilde{\beta} </math> is<br />
<br />
<math>\ {\tilde{\beta}} = (XX^T)^{-1}Xy </math><br />
<br />
Using regression to solve classification problems is not mathematically rigorous, if we want to be true to classification. However, this method works well in practice if the problem is not too complicated. When we have only two classes (encoded as <math>\ \frac{-n}{n_1} </math> and <math>\ \frac{n}{n_2} </math>), this method is identical to LDA.<br />
<br />
==== Matlab Example ====<br />
<br />
The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample';ones(1,400)];<br />
Construct x by appending a row of ones to the transposed data, so that each column of x is a data point with a constant term.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
===Logistic Regression===<br />
<br />
Logistic regression is a more advanced method for classification, and is<br />
more commonly used. <br />
<br />
We can define a function <br /><br />
<math>f_1(x)= P(Y=1| X=x) = (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math><br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
<br />
<br />
This is a valid conditional probability. It looks similar to a step function, but we have relaxed it so that we have a smooth curve, and can therefore take the derivative.<br />
<br />
The range of this function is (0,1) since<br /> <br />
<math>\lim_{x \to -\infty}f_1(\mathbf{x}) = 0</math> and<br />
<math>\lim_{x \to \infty}f_1(\mathbf{x}) = 1</math>.<br />
<br />
As shown on this graph:<br /><br />
http://www.wolframalpha.com/input/?i=Plot[E^x/%281+%2B+E^x%29,+{x,+-10,+10}]%29<br />
<br />
Then we compute the complement of f1(x), and get<br /><br />
<br />
<math>f_2(x)= P(Y=0| X=x) = 1-f_1(x) = (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math>, denoted f2. <br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
<br />
<br />
The function <math>f_1</math> is commonly called the logistic function; its complement <math>f_2</math> behaves like <br />
<math>\lim_{x \to -\infty}f_2(\mathbf{x}) = 1</math> and<br />
<math>\lim_{x \to \infty}f_2(\mathbf{x}) = 0</math>.<br />
<br />
As shown on this graph:<br /><br />
http://www.wolframalpha.com/input/?i=Plot[1/%281+%2B+E^x%29,+{x,+-10,+10}]%29<br />
<br />
From here, we can form the conditional density function. To do this, we must combine<br /><br />
<math>f_1</math> and <math>f_2</math> <br /><br />
such that <br /><br />
the combined expression reduces to <math>f_1</math> when y=1 (which means the point is in class 1), <br />
and reduces to <math>f_2</math> when y=0 (which means the point is in class 0).<br />
<br />
Eventually, we have our conditional density function formula<br /><br />
<math>f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}</math><br />
<br />
The way to use this formula is, given the training data <math>(x_i, y_i)</math>, to fit <math>f(y|\mathbf{x})</math> to the data.<br />
<br />
In general, we can think of the problem as having a box with some knobs. Inside the box is our objective function, which gives the form used to classify our input <math>x_i</math> to our output <math>y_i</math>. The knobs on the box act like the parameters of the objective function. Our job is to find the proper parameters that minimize the error between our output and the true value. So we have turned our machine learning problem into an optimization problem. <br />
<br />
Since we need to find the parameters that maximize the chance that the observed data came from the distribution <math>f(x|\theta)</math>, we introduce Maximum Likelihood Estimation.<br />
<br />
====Maximum Likelihood Estimation====<br />
<br />
Suppose we are given iid data points <math>({\mathbf{x}_i})_{i=1}^n</math> and a density function <math>f(\mathbf{x}|\mathbf{\theta})</math>, where the form of f is known but the parameters <math>\theta</math> are unknown. The maximum likelihood estimate <math>\theta\,_{ML}</math> is the set of parameters that maximizes the probability of observing <math>({\mathbf{x}_i})_{i=1}^n</math> given <math>\theta</math>.<br />
<br />
<math>\theta_\mathrm{ML} = \underset{\theta}{\operatorname{arg\,max}}\ f(\mathbf{x}|\theta)</math>.<br />
<br />
There was some discussion in class regarding the notation. In literature, Bayesians use <math>f(\mathbf{x}|\mu)</math> while Frequentists use <math>f(\mathbf{x};\mu)</math>. In practice, these two are equivalent.<br />
<br />
Our goal is to find <math>\theta</math> that maximizes <br />
<math>\mathcal{L}(\theta\,) = f\left(({\mathbf{x}_i})_{i=1}^n \mid \theta\right) = \prod_{i=1}^n f(\mathbf{x_i}|\theta)</math>. (The second equality holds because the data points are iid.)<br />
<br />
In many cases, it is more convenient to work with the natural logarithm of the likelihood. (Recall that the logarithm is monotonically increasing, so it preserves the locations of minima and maxima.)<br />
<math>\ell(\theta)=\ln\mathcal{L}(\theta\,)</math> <br />
<br />
<math>\ell(\theta\,)=\sum_{i=1}^n \ln f(\mathbf{x_i}|\theta)</math><br />
<br />
Applying Maximum Likelihood Estimation to <math>f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}</math>, gives<br />
<br />
<math>\mathcal{L}(\mathbf{\beta\,})=\prod_{i=1}^n (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y_i} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y_i}</math><br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) + (1-y_i) (\ln{1} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}))\right) \\[10pt]&{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - (1-y_i) \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) + y_i \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
<math>\begin{align} {\frac{\partial \ell}{\partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}} \mathbf{x_i} \right) \\[8pt] & {}= \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(\mathbf{x_i} | \mathbf{\beta\,}) \mathbf{x_i}\right) \end{align}</math><br />
<br />
The equation <math>\frac{\partial \ell}{\partial \mathbf{\beta\,}} = 0</math> can be solved numerically by Newton's method.<br />
<br />
====Newton's Method====<br />
<br />
Newton's method (also called the Newton-Raphson method) is a numerical method for finding successively better approximations to a root of a real-valued function, which is useful when the root cannot be found analytically. <br />
<br />
The goal is to find <math>\mathbf{x}</math> such that <math><br />
f(\mathbf{x})<br />
= 0 </math>. The recursion can be implemented by<br />
<math>\mathbf{x_1} = \mathbf{x_0} - \frac{f(\mathbf{x_0})}{f'(\mathbf{x_0})}.\,\!<br />
</math>.<br />
<br />
It takes an initial guess <math>\mathbf{x_0}</math> and subtracts the step "<math>\mathbf{f(x_{0}) / f' (x_{0})}</math>" to move toward a better approximation <math>\mathbf{x_1}</math>. Taking this <math>\mathbf{x_1}</math> as the starting point of the next iteration produces a still better approximation, and repeating the process yields an approximation that is sufficiently close to the actual solution.<br />
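<br />
The following is a minimal MATLAB sketch of Newton's method; the function, its derivative and the initial guess are illustrative assumptions.<br />
<pre><br />
f      = @(x) x.^3 - 2;          % example function (for illustration only)<br />
fprime = @(x) 3*x.^2;            % its derivative<br />
x = 1;                           % initial guess x_0<br />
for iter = 1:20<br />
    step = f(x) / fprime(x);<br />
    x = x - step;                % Newton update<br />
    if abs(step) < 1e-10         % stop when the update is negligible<br />
        break<br />
    end<br />
end<br />
x                                % approximates the root 2^(1/3)<br />
</pre><br />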
<br />
<br />
<br />
===Advantages of Logistic Regression===<br />
<br />
Logistic regression has several advantages over discriminant analysis: <br />
<br />
* It is more robust: the independent variables don't have to be normally distributed or have equal variance in each group. <br />
* It does not assume a linear relationship between the independent variables and the dependent variable. <br />
* It may handle nonlinear effects. <br />
* You can add explicit interaction and power terms. <br />
* The dependent variable need not be normally distributed. <br />
* There is no homogeneity of variance assumption. <br />
* Normally distributed error terms are not assumed. <br />
* It does not require that the independent variables be measured on an interval scale. <br />
* It does not require that the independent variables be unbounded.<br />
<br />
==Newton-Raphson Method (Lecture: Oct 11, 2011)==<br />
Previously we derived the log-likelihood function for logistic regression. <br />
<br />
<math>\begin{align} L(\beta\,) = \prod_{i=1}^n \left( (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y_i}(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y_i} \right) \end{align}</math><br />
<br />
Taking the logarithm, we have<br />
<br />
<math>\begin{align} \ell(\beta\,) = \sum_{i=1}^n \left( y_i \log{\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}} + (1 - y_i) \log{\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}} \right) \end{align}</math><br />
<br />
which implies that<br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
Our goal is to find the <math>\beta\,</math> that maximizes <math>{\ell(\mathbf{\beta\,})}</math>. We use calculus to do this, i.e. we solve <math>{\frac{\partial \ell}{\partial \mathbf{\beta\,}}}=0</math>. To do this we use the well-known numerical method of Newton-Raphson, an iterative method in which we calculate the first and second derivatives at each iteration.<br />
<br />
The first derivative is typically called the score vector.<br />
<br />
<math>\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}} \mathbf{x_i} \right) \\[8pt] \end{align}</math><br />
<br />
The negative of the second derivative is typically called the information matrix.<br />
<br />
<math>\begin{align} I(\beta\,) {}= -{\frac{\partial \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}) \right) \\[8pt] \end{align}</math><br />
<br />
We then use the following update formula to calculate successively better estimates of the optimal <math>\beta\,</math>. The choice of the initial estimate <math>\beta\,^{(1)}</math> is typically not important.<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + I^{-1}(\beta\,^{(r)} )S(\beta\,^{(r)} )</math><br />
<br />
====Matrix Notation====<br />
<br />
Let <math>\mathbf{y}</math> be an (n x 1) vector of all class labels. This is called the response in other contexts.<br />
<br />
Let <math>\mathbb{X}</math> be an (n x (d+1)) matrix of all the features. Each row represents a data point. Each column represents a feature/covariate.<br />
<br />
Let <math>\mathbf{p}^{(r)}</math> be an (n x 1) vector with values <math> P(\mathbf{x_i} |\beta\,^{(r)} ) </math><br />
<br />
Let <math>\mathbb{W}^{(r)}</math> be an (n x n) diagonal matrix with <math>\mathbb{W}_{ii}^{(r)} {}= P(\mathbf{x_i} |\beta\,^{(r)} )(1 - P(\mathbf{x_i} |\beta\,^{(r)} ))</math><br />
<br />
We can rewrite our score vector, information matrix and update equation in terms of this new matrix notation. The first derivative is<br />
<br />
<math>\begin{align} S(\beta\,^{(r)}) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})\end{align}</math><br />
<br />
and the second derivative is<br />
<br />
<math>\begin{align} I(\beta\,^{(r)}) {}= -{\frac{\partial \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X} \end{align}</math><br />
<br />
Therefore, we can fit the regression problem as follows:<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + I^{-1}(\beta\,^{(r)} )S(\beta\,^{(r)} ) {}= \beta\,^{(r)} + (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})</math><br />
<br />
====Iteratively Re-weighted Least Squares====<br />
If we reorganize this updating formula, we can see that it is really solving a weighted least squares problem iteratively, with a new weighting at each step.<br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}(\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X}\beta\,^{(r)} + \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)}))</math><br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T\mathbb{W}^{(r)}\mathbf{z}^{(r)}</math><br />
<br />
where <math> \mathbf{z}^{(r)} = \mathbb{X}\beta\,^{(r)} + (\mathbb{W}^{(r)})^{-1}(\mathbf{y}-\mathbf{p}^{(r)}) </math><br />
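<br />
A minimal MATLAB sketch of this update is given below, assuming Xmat is the n x (d+1) design matrix (rows are data points, last column all ones) and y is the n x 1 vector of 0/1 labels; the variable names are illustrative only.<br />
<pre><br />
beta = zeros(size(Xmat, 2), 1);          % initial estimate<br />
for r = 1:20<br />
    p = 1 ./ (1 + exp(-Xmat * beta));    % p_i = P(Y=1 | x_i, beta)<br />
    W = diag(p .* (1 - p));              % n x n diagonal weight matrix<br />
    beta = beta + (Xmat' * W * Xmat) \ (Xmat' * (y - p));   % Newton-Raphson step<br />
end<br />
</pre><br />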
<br />
<br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T \underline{\beta})^T(\underline{y}-X^T \underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{(r+1)}</math> is the solution of a weighted least squares problem:<br />
<br />
<math>\underline{\beta}^{(r+1)} \leftarrow \arg \min_{\underline{\beta}}(\mathbf{z}^{(r)}-\mathbb{X} \underline{\beta})^T \mathbb{W}^{(r)} (\mathbf{z}^{(r)}-\mathbb{X} \underline{\beta})</math><br />
<br />
====Fisher Scoring Method==== <br />
<br />
Fisher Scoring is a method very similar to Newton-Raphson. It uses the expected information matrix as opposed to the observed information matrix. This distinction simplifies the problem and in particular the computational complexity. To learn more about this method and logistic regression in general you can take Stat431/831 at the University of Waterloo.<br />
<br />
===Multi-class Logistic Regression===<br />
<br />
In a multi-class logistic regression we have K classes. For 2 classes ''k'' and ''l''<br />
<br />
<math>\frac{P(Y=l|X=x)}{P(Y=k|X=x)} = e^{\beta_l^T x}</math><br />
<br />
We call <math>log(\frac{P(Y=l|X=x)}{P(Y=k|X=x)}) = \beta_l^T x</math> the logit transformation. The decision boundary between the 2 classes is the set of points where the logit transformation is 0.<br />
<br />
For each class from 1 to K-1 we then have:<br />
<br />
<math>log(\frac{P(Y=1|X=x)}{P(Y=K|X=x)}) = \beta_1^T x</math><br />
<br />
<math>log(\frac{P(Y=2|X=x)}{P(Y=K|X=x)}) = \beta_2^T x</math><br />
<br />
<math>log(\frac{P(Y=K-1|X=x)}{P(Y=K|X=x)}) = \beta_{K-1}^T x</math><br />
<br />
Note that choosing ''Y=K'' is arbitrary and any other choice is equally valid.<br />
<br />
Based on the above the posterior probabilities are given by: <math>P(Y=k|X=x) = \frac{e^{\beta_k^T x}}{1 + \sum_{i=1}^{K-1}{e^{\beta_i^T x}}}</math><br />
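<br />
A minimal MATLAB sketch of computing these posteriors for a single point x, assuming B is a d x (K-1) matrix whose columns are the fitted coefficient vectors (the names B and x are illustrative):<br />
<pre><br />
a = exp(B' * x);                 % (K-1) x 1 vector of e^{beta_k^T x}<br />
denom = 1 + sum(a);<br />
post = [a; 1] / denom;           % P(Y=k|X=x) for k = 1, ..., K (last entry is class K)<br />
</pre><br />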
<br />
===Sample Size Requirements===<br />
<br />
The number of adjustable parameters in linear discriminant analysis is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math> where d is the dimension of the data. Similarly, the number of adjustable parameters in logistic regression is <math>\, d+1</math>. The number of parameters also corresponds to the minimum number of observations needed to compute the coefficients of each function. Techniques do exist, though, for handling high-dimensional problems where the number of parameters exceeds the number of observations.<br />
<br />
Linear discriminant analysis involves the inversion of a d x d covariance matrix. When d is bigger than n, the number of observations, this matrix has rank at most n < d and is thus singular. When this is the case, we can either use the pseudo-inverse or perform regularized discriminant analysis (RDA), which solves this problem. In RDA, we define a new covariance matrix <math>\, \Sigma(\gamma) = \gamma\Sigma + (1 - \gamma)diag(\Sigma)</math> with <math>\gamma \in [0,1]</math>. Cross validation can be used to choose the best <math>\, \gamma</math>.<br />
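<br />
As a rough illustration, the regularized covariance used in RDA can be formed in MATLAB as shown below, assuming Sigma holds the estimated covariance matrix and gamma has been chosen (e.g. by cross validation):<br />
<pre><br />
Sigma_rda = gamma * Sigma + (1 - gamma) * diag(diag(Sigma));   % gamma in [0,1]<br />
</pre><br />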
<br />
===Comparison Between Logistic Regression And Linear Discriminant Analysis (LDA)===<br />
<br />
The logistic regression model and linear discriminant analysis are widely used to analyze data with categorical outcome variables. Both models build linear boundaries to classify observations into different groups. Also, the categorical outcome variables (i.e. the dependent variables) must be mutually exclusive. <br />
<br />
However, these two models differ in their basic assumptions. While Logistic Regression is more relaxed and flexible in its assumptions, Linear Discriminant Analysis requires that its explanatory variables be normally distributed, linearly related and have an equal covariance matrix within each class. Therefore, Linear Discriminant Analysis can be expected to be more appropriate when the normality and equal-covariance assumptions are fulfilled by the explanatory variables; in all other situations Logistic Regression is more appropriate. In addition, the total number of estimates to compute differs between these models. If the explanatory variables have d dimensions, we need to estimate <math>d+1</math> parameters in Logistic Regression, so the number of parameters grows linearly w.r.t. dimension, while we need to estimate <math>2d+\frac{d(d+1)}{2}+2</math> parameters in Linear Discriminant Analysis, so the number of parameters grows quadratically w.r.t. dimension. <br />
<br />
== Perceptron (Lecture: Oct. 11, 2011) ==<br />
<br />
[[Image:Perceptron1.png|right|thumb|300px|Simple perceptron]]<br />
[[Image:Perceptron2.png|right|thumb|300px|Simple perceptron where <math>\beta_0</math> is defined as 1]]<br />
<br />
<br />
The perceptron is the building block for neural networks. It was invented by Rosenblatt in 1957 at Cornell Labs, and first mentioned in the paper "The Perceptron - a perceiving and recognizing automaton". The perceptron is used on linearly separable data sets.<br />
<br />
For a 2 class problem, and a set of inputs with ''d'' features, a perceptron will use a weighted sum and it will classify the information using the sign of the result. The figures on the right give an example of a perceptron. In these examples, <math>x^i</math> is the ''i''-th feature of a sample and <math>\beta_i</math> is the ''i''-th weight. <math>\beta_0</math> is defined as the bias. The bias alters the position of the decision boundary between the 2 classes.<br />
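<br />
A minimal MATLAB sketch of this decision rule, assuming X is a d x n matrix whose columns are inputs, beta is the d x 1 weight vector and beta0 is the bias (the names are illustrative):<br />
<pre><br />
scores = beta' * X + beta0;       % weighted sums for all inputs<br />
yhat = sign(scores);              % classify using the sign of the result<br />
</pre><br />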
<br />
Perceptrons are generally trained using [http://en.wikipedia.org/wiki/Gradient_descent gradient descent]. This type of learning can have 2 side effects:<br />
* If the data sets are well separated, the training of the perceptron can lead to multiple valid solutions.<br />
* If the data sets are not linearly separable, the learning algorithm will never converge.<br />
<br />
Perceptrons are the simplest kind of a feedforward neural network. A perceptron is the building block for other neural networks such as:<br />
* Multi-layer perceptron<br />
* ADALINE<br />
* MADALINE<br />
<br />
==References==<br />
<references /><br />
<br />
24. Balakrishnama, S., Ganapathiraju, A. LINEAR DISCRIMINANT ANALYSIS - A BRIEF TUTORIAL. http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf [[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf]]</div>S9huhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f11&diff=12529stat841f112011-10-13T07:48:00Z<p>S9hu: </p>
<hr />
<div>==[[f11Stat841proposal| Proposal for Final Project]]==<br />
<br />
==[[f11Stat841EditorSignUp| Editor Sign Up]]==<br />
<br />
= STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20 =<br />
== Wiki Course Notes ==<br />
Students will need to contribute to the wiki for 20% of their grade.<br />
Access via wikicoursenote.com<br />
Go to editor sign-up, and use your UW userid for your account name, and use your UW email.<br />
<br />
primary (10%)<br />
Post a draft of lecture notes within 48 hours. <br />
You will need to do this 1 or 2 times, depending on class size.<br />
<br />
secondary (10%)<br />
Make improvements to the notes for at least 60% of the lectures.<br />
More than half of your contributions should be technical rather than editorial.<br />
There will be a spreadsheet where students can indicate what they've done and when.<br />
The instructor will conduct random spot checks to ensure that students have contributed what they claim.<br />
<br />
<br />
== Classification (Lecture: Sep. 20, 2011) ==<br />
=== Definitions ===<br />
'''classification''': Predict a discrete random variable <math>Y</math> (a label) by using another random variable <math>X</math><br />
(new data point) picked iid from a distribution<br />
<br />
<math>X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d</math> (<math>d</math>-dimensional vector)<br />
<math>Y_i</math> in some finite set <math>\mathcal{Y}</math><br />
<br />
<br />
'''classification rule''':<br />
<math>h : \mathcal{X} \rightarrow \mathcal{Y}</math><br />
Take a new observation <math>X</math> and use a classification function <math>h(x)</math> to generate a label <math>Y</math>. In other words, if we feed the function <math>h(x)</math> a random variable <math>X</math>, it generates the label <math>Y</math>, which is the class to which we predict <math>X</math> belongs.<br />
<br />
Example: Let <math> \mathcal{X}</math> be a set of 2D images and <math>\mathcal{Y}</math> be a finite set of people. We want to learn a classification rule <math>h:\mathcal{X}\rightarrow\mathcal{Y}</math> that with small ''true'' error predicts the person who appears in the image. <br />
<br />
<br />
'''true error rate''' for classifier <math>h</math> is the error with respect to the underlying distribution (that we do not know).<br />
<br />
<math>L(h) = P(h(X) \neq Y )</math><br />
<br />
<br />
'''empirical error rate''' (or training error rate) is the amount of error that our classification function <math>h(x)</math> makes on the training data.<br />
<br />
<math>\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i)</math><br />
<br />
where <math>\mathbf{I}()</math> is an indicator function. The indicator function is defined by <br />
<br />
<math>\mathbf{I}(x) = \begin{cases} <br />
1 & \text{if } x \text{ is true} \\<br />
0 & \text{if } x \text{ is false}<br />
\end{cases}</math><br />
<br />
So in this case,<br />
<math>\mathbf{I}(h(X_i)\neq Y_i) = \begin{cases}<br />
1 & \text{if } h(X_i)\neq Y_i \text{ (i.e. when misclassification happens)} \\<br />
0 & \text{if } h(X_i)=Y_i \text{ (i.e. classified properly)}<br />
\end{cases}</math><br />
<br />
e.g., 100 new data points with known (true) labels<br />
<br />
<math>y_1 = h(x_1)</math><br />
<br />
...<br />
<br />
<math>y_{100} = h(x_{100})</math><br />
<br />
To calculate the empirical error we count how many labels our function <math>h(x)</math> assigned incorrectly and divide by n=100<br />
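<br />
As a one-line illustration, assuming yhat holds the predicted labels <math>h(x_i)</math> and y the true labels, the empirical error rate can be computed in MATLAB as:<br />
<pre><br />
Lhat = mean(yhat ~= y);           % fraction of misclassified points<br />
</pre><br />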
<br />
=== Bayes Classifier ===<br />
The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability<ref> http://www.wikicoursenote.com/wiki/Stat841#Bayes_Classifier </ref>.<br />
<br />
First recall Bayes' Rule, in the format<br />
<math>P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} </math> <br />
<br />
P(Y|X) : ''posterior'' , ''probability of <math>Y</math> given <math>X</math>''<br />
<br />
P(X|Y) : ''likelihood'', ''probability of <math>X</math> being generated by <math>Y</math>''<br />
<br />
P(Y) : ''prior'', ''probability of <math>Y</math> being selected''<br />
<br />
P(X) : ''marginal'', ''probability of obtaining <math>X</math>''<br />
<br />
<br />
We will start with the simplest case: <math>\mathcal{Y} = \{0,1\}</math><br />
<br />
<math> r(x) <br />
= P(Y=1|X=x) <br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)}<br />
= \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
Bayes' rule can be approached by computing either:<br />
<br />
1) '''The posterior''': <math>\ P(Y=1|X=x) </math> and <math>\ P(Y=0|X=x) </math> or <br />
<br />
2) '''The likelihood''': <math>\ P(X=x|Y=1) </math> and <math>\ P(X=x|Y=0) </math><br />
<br />
<br />
The former reflects a '''Bayesian''' approach. The Bayesian approach uses previous beliefs and observed data (e.g., the random variable <math>\ X </math>) to determine the probability distribution of the parameter of interest (e.g., the random variable <math>\ Y </math>). The probability, according to Bayesians, is a ''degree of belief'' in the parameter of interest taking on a particular value (e.g., <math>\ Y=1 </math>), given a particular observation (e.g., <math>\ X=x </math>). Historically, the difficulty in this approach lies with determining the posterior distribution, however, more recent methods such as '''Markov Chain Monte Carlo (MCMC)''' allow the Bayesian approach to be implemented <ref name="PCAustin">P. C. Austin, C. D. Naylor, and J. V. Tu, "A comparison of a Bayesian vs. a frequentist method for profiling hospital performance," ''Journal of Evaluation in Clinical Practice'', 2001</ref>.<br />
<br />
The latter reflects a '''Frequentist''' approach. The Frequentist approach assumes that the probability distribution, including the mean, variance, etc., is fixed for the parameter of interest (e.g., the variable <math>\ Y </math>, which is ''not'' random). The observed data (e.g., the random variable <math>\ X </math>) is simply a ''sampling'' of a far larger population of possible observations. Thus, a certain repeatability or ''frequency'' is expected in the observed data. If it were possible to make an infinite number of observations, then the true probability distribution of the parameter of interest can be found. In general, frequentists use a technique called '''hypothesis testing''' to compare a ''null hypothesis'' (e.g. an assumption that the mean of the probability distribution is <math>\ \mu_0 </math>) to an alternative hypothesis (e.g. assuming that the mean of the probability distribution is larger than <math>\ \mu_0 </math>) <ref name="PCAustin"/>. For more information on hypothesis testing see <ref>R. Levy, "Frequency hypothesis testing, and contingency tables" class notes for LING251, Department of Linguistics, University of California, 2007. Available: [http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf] </ref>. <br />
<br />
There was some class discussion on which approach should be used. Both the ease of computation and the validity of both approaches were discussed. A main point that was brought up in class is that Frequentists consider X to be a random variable, but they do not consider Y to be a random variable because it has to take on one of the values from a fixed set (in the above case it would be either 0 or 1 and there is only one ''correct'' label for a given value X=x). Thus, from a Frequentist's perspective it does not make sense to talk about the probability of Y. This is actually a grey area and sometimes ''Bayesians'' and ''Frequentists'' use each others' approaches. So using ''Bayes' rule'' doesn't necessarily mean you're a ''Bayesian''. Overall, the question remains unresolved.<br />
<br />
<br />
The '''Bayes Classifier''' uses <math>\ P(Y=1|X=x)</math><br />
<br />
<math> P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
P(Y=1) : the prior, based on belief/evidence beforehand<br />
<br />
denominator : marginalized by summation<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The set <math>\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \} </math><br />
<br />
which defines a ''decision boundary''.<br />
<br />
<math>h^*(x) = <br />
\begin{cases}<br />
1 \ \ if \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ \ \ \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
''Theorem'': Bayes rule is optimal. I.e., if h is any other classification rule, <br />
then <math>L(h^*) \leq L(h)</math><br />
(This is to be proved in homework.)<br />
<br />
Why then do we need other classification methods?<br />
A: Because X densities are often/typically unknown. I.e., <math>f_k(x)</math> and/or <math>\pi_k</math> are unknown.<br />
<br />
<math>P(Y=k|X=x) = \frac{P(X=x|Y=k)P(Y=k)} {P(X=x)} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}</math><br />
<math>\ f_k(x) </math> is referred to as the class conditional distribution (~likelihood).<br />
<br />
Therefore, we rely on some data to estimate quantities.<br />
<br />
=== Three Main Approaches ===<br />
<br />
'''1. Empirical Risk Minimization''':<br />
Choose a set of classifiers H (e.g., line, neural network) and find <math>h^* \in H</math><br />
that minimizes (some estimate of) L(h).<br />
<br />
'''2. Regression''':<br />
Find an estimate (<math>\hat{r}</math>) of function <math>r</math> and define<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The <math> 1/2 </math> in the expression above is a threshold set for the regression prediction output. <br />
<br />
In general ''regression'' refers to finding a continuous, real valued y. The problem here is more difficult, because of the restricted domain (y is a set of discrete label values).<br />
<br />
'''3. Density Estimation''':<br />
Estimate <math>P(X=x|Y=0)</math> from <math>X_i</math>'s for which <math>Y_i = 0</math><br />
Estimate <math>P(X=x|Y=1)</math> from <math>X_i</math>'s for which <math>Y_i = 1</math><br />
and let <math>\hat{P}(Y=1) = (1/n) \sum_{i=1}^{n} Y_i</math><br />
<br />
Define <math>\hat{r}(x) = \hat{P}(Y=1|X=x)</math> and<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
It is possible that there may not be enough data to estimate from for ''density estimation''. But the main problem lies with high dimensional spaces, as the estimation results may not be good (high error rate) and sometimes even infeasible. The term ''curse of dimensionality'' was coined by Bellman <ref>R. E. Bellman, ''Dynamic Programming''. Princeton University Press,<br />
1957</ref> to describe this problem.<br />
<br />
As the dimension of the space goes up, the learning requirements go up exponentially.<br />
<br />
To Learn more about methods for handling high-dimensional data <ref> https://docs.google.com/viewer?url=http%3A%2F%2Fwww.bios.unc.edu%2F~dzeng%2FBIOS740%2Flecture_notes.pdf</ref><br />
<br />
=== Multi-Class Classification ===<br />
Generalize to case Y takes on k>2 values.<br />
<br />
<br />
''Theorem'': <math>Y \in \mathcal{Y} = \{1,2,..., k\} </math> optimal rule<br />
<br />
<math>\ h^{*}(x) = argmax_k P(Y=k|X=x) </math> <br />
<br />
where <math>P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum_r f_r(x) \pi_r}</math><br />
<br />
===Examples of Classification===<br />
<br />
* Face detection in images.<br />
* Medical diagnosis.<br />
* Detecting credit card fraud (fraudulent or legitimate).<br />
* Speech recognition.<br />
* Handwriting recognition.<br />
<br />
== LDA and QDA ==<br />
<br />
'''Discriminant function analysis''' finds features that best allow discrimination between two or more classes. The approach is similar to '''analysis of variance (ANOVA)''' in that discriminant function analysis looks at the mean values to determine if two or more classes are very different and should be separated. Once the discriminant functions (that separate two or more classes) have been determined, new data points can be classified (i.e. placed in one of the classes) based on the discriminant functions <ref> StatSoft, Inc. (2011). ''Electronic Statistics Textbook.'' [Online]. Available: [http://www.statsoft.com/textbook/discriminant-function-analysis/ http://www.statsoft.com/textbook/discriminant-function-analysis/.] </ref>. '''Linear discriminant analysis (LDA)''' and '''Quadratic discriminant analysis (QDA)''' are methods of discriminant analysis that are best applied to linearly and quadratically separable classes, respectively. '''Fisher discriminant analysis (FDA)''' is another method of discriminant analysis that is different from linear discriminant analysis, but oftentimes both terms are used interchangeably.<br />
<br />
=== LDA ===<br />
<br />
The simplest method is to use approach 3 (above) and assume a parametric model for densities. Assume class conditional is Gaussian.<br />
<br />
<math>\mathcal{Y} = \{ 0,1 \}</math> assumed (i.e., 2 labels)<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
<math>P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k(x) \pi_k} \ \ </math> (denom = P(x))<br />
<br />
1) Assume Gaussian distributions<br />
<br />
<math>f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} exp(-(1/2)(\mathbf{x} - \mathbf{\mu_k})^T \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) )</math><br />
<br />
must compare <br />
<math>\frac{f_1(x) \pi_1} {p(x)}</math> with <math>\frac{f_0(x) \pi_0} {p(x)}</math><br />
Note that the p(x) denom can be ignored:<br />
<math>f_1(x) \pi_1</math> with <math>f_0(x) \pi_0 </math><br />
<br />
To find the decision boundary, set <br />
<math>f_1(x) \pi_1 = f_0(x) \pi_0 </math><br />
<br />
2) Assume <math>\Sigma_1 = \Sigma_0</math>, we can use <math>\Sigma = \Sigma_0 = \Sigma_1</math>.<br />
<br />
Cancel <math>(2\pi)^{-d/2} |\Sigma_k|^{-1/2}</math> from both sides.<br />
<br />
Take log of both sides.<br />
<br />
Subtract one side from both sides, leaving zero on one side.<br />
<br />
<br />
<math>-(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)] = 0 </math><br />
<br />
<br />
<math>(1/2)[-\mathbf{x}^T \Sigma^{-1}\mathbf{x} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x}<br />
+ \mathbf{x}^T \Sigma^{-1}\mathbf{x} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ]<br />
+ log(\pi_1/\pi_0) = 0 </math><br />
<br />
<br />
Cancelling out the terms quadratic in <math>\mathbf{x}</math> and rearranging results in <br />
<br />
<math>(1/2)[-\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0}<br />
+ (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_0}^T \Sigma^{-1}) \mathbf{x}]<br />
+ log(\pi_1/\pi_0) = 0 </math><br />
<br />
<br />
We can see that the first pair of terms is constant, and the second pair is linear in x.<br />
Therefore, we end up with something of the form <br />
<math>ax + b = 0</math>.<br />
For more about LDA <ref>http://sites.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf</ref><br />
<br />
== LDA and QDA Continued (Lecture: Sep. 22, 2011) == <br />
<br />
If we relax assumption 2 (i.e. <math>\Sigma_1 \neq \Sigma_0</math>) then we get a quadratic equation that can be written as<br />
<math>{x}^Ta{x}+b{x} + c = 0</math><br />
<br />
===Generalizing LDA and QDA===<br />
<br />
'''Theorem''': <br />
<br />
<br />
Suppose that <math>\,Y \in \{1,\dots,K\}</math>, if <math>\,f_k(x) = Pr(X=x|Y=k)</math> is Gaussian, the Bayes Classifier rule is<br />
:<math>\,h^*(x) = \arg\max_{k} \delta_k(x)</math><br />
<br />
Where<br />
<br />
<math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math><br />
<br />
When the Gaussian variances are equal <math>\Sigma_1 = \Sigma_0</math> (e.g. LDA), then<br />
<br />
<math> \,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k) </math><br />
<br />
(To compute this, we need to calculate the value of <math>\,\delta </math> for each class, and then take the one with the max. value).<br />
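<br />
A minimal MATLAB sketch of this rule for the equal-covariance (LDA) case, assuming mu is a d x K matrix of class means, Sigma the common covariance, ppi a vector of priors and x the point to classify (all names are illustrative):<br />
<pre><br />
K = size(mu, 2);<br />
delta = zeros(1, K);<br />
for k = 1:K<br />
    delta(k) = x' * (Sigma \ mu(:, k)) ...<br />
             - 0.5 * mu(:, k)' * (Sigma \ mu(:, k)) + log(ppi(k));<br />
end<br />
[maxval, khat] = max(delta);      % khat is the predicted class<br />
</pre><br />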
<br />
===In practice===<br />
We estimate the prior to be the chance that a random item from the collection belongs to class k, e.g.<br />
<br />
<math>\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}</math><br />
<br />
The mean to be the average item in set k, e.g.<br />
<br />
<math>\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i</math><br />
<br />
and calculate the covariance of each class e.g.<br />
<br />
<math>\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top</math><br />
<br />
If we wish to use LDA we must calculate a common covariance, so we average all the covariances e.g.<br />
<br />
<math>\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{r=1}^{k}n_r} </math><br />
<br />
Where: <math>\,n_r</math> is the number of data points in class <math>\,r</math>, <math>\,\Sigma_r</math> is the covariance of class <math>\,r</math>, <math>\,n</math> is the total number of data points, and <math>\,k</math> is the number of classes.<br />
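<br />
A minimal MATLAB sketch of these estimates, assuming X is a d x n matrix whose columns are data points and y is a 1 x n vector of labels in {1, ..., K}:<br />
<pre><br />
n = size(X, 2);  K = max(y);<br />
Sigma = zeros(size(X, 1));                 % pooled covariance for LDA<br />
for k = 1:K<br />
    Xk = X(:, y == k);                     % points in class k<br />
    nk(k) = size(Xk, 2);<br />
    ppi(k) = nk(k) / n;                    % estimated prior<br />
    mu(:, k) = mean(Xk, 2);                % estimated class mean<br />
    Xc = Xk - repmat(mu(:, k), 1, nk(k));<br />
    Sig{k} = Xc * Xc' / nk(k);             % estimated class covariance<br />
    Sigma = Sigma + nk(k) * Sig{k};<br />
end<br />
Sigma = Sigma / n;                         % weighted average of the class covariances<br />
</pre><br />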
<br />
===Computation===<br />
<br />
For QDA we need to calculate: <math> \,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k) </math><br />
<br />
Let's first consider the case when <math>\, \Sigma_k = I, \forall k </math>. This is the case where each distribution is spherical around the mean point.<br />
<br />
====Case 1====<br />
When <math>\, \Sigma_k = I </math><br />
<br />
We have:<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k) </math><br />
<br />
but <math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
and <math>\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k) </math> is the [http://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_Distance squared Euclidean distance] between two points <math>\,x</math> and <math>\,\mu_k</math><br />
<br />
Thus in this condition, a new point can be classified by its distance away from the center of a class, adjusted by some prior.<br />
<br />
Further, for two-class problem with equal prior, the discriminating function would be the bisector of the 2-class's means.<br />
<br />
====Case 2==== <br />
When <math>\, \Sigma_k \neq I </math><br />
<br />
Using the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\, \Sigma_k</math><br />
we get <math> \, \Sigma_k = U_kS_kV_k^\top</math>. In particular, <math>\, U_k</math> is a collection of eigenvectors of <math>\, \Sigma_k\Sigma_k^*</math>, and <math>\, V_k</math> is a collection of eigenvectors of <math>\,\Sigma_k^*\Sigma_k</math>.<br />
Since <math>\, \Sigma_k</math> is a symmetric matrix<ref> http://en.wikipedia.org/wiki/Covariance_matrix#Properties </ref>, <math>\, \Sigma_k = \Sigma_k^*</math>, so we have <math> \, \Sigma_k = U_kS_kU_k^\top </math>.<br />
<br />
For <math>\,\delta_k</math>, the second term becomes what is also known as the Mahalanobis distance <ref>P. C. Mahalanobis, "On The Generalised Distance in Statistics," ''Proceedings of the National Institute of Sciences of India'', 1936</ref> :<br />
<br />
:<math>\begin{align}<br />
(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^T(x-\mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\<br />
& = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
& = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\<br />
\end{align}<br />
</math><br />
<br />
We can think of <math> \, S_k^{-\frac{1}{2}}U_k^\top </math> as a linear transformation that takes points in class <math>\,k</math> and distributes them spherically around a point, like in case 1. Thus, when we are given a new point, we can apply the modified <math>\,\delta_k</math> values to calculate <math>\ h^*(\,x)</math>. After applying the singular value decomposition, <math>\,\Sigma_k^{-1}</math> is considered to be an identity matrix such that<br />
<br />
<math> \,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}[(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k)] + log (\pi_k) </math><br />
<br />
and,<br />
<br />
<math>\ \log(|I|)=\log(1)=0 </math><br />
<br />
For applying the above method with classes that have different covariance matrices (for example the covariance matrices <math>\ \Sigma_0 </math> and <math>\ \Sigma_1 </math> for the two class case), each of the covariance matrices has to be decomposed using SVD to find the according transformation. Then, each new data point has to be transformed using each transformation to compare its distance to the mean of each class (for example for the two class case, the new data point would have to be transformed by the class 1 transformation and then compared to <math>\ \mu_0 </math> and the new data point would also have to be transformed by the class 2 transformation and then compared to <math>\ \mu_1 </math>).<br />
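<br />
A minimal MATLAB sketch of this transformation for one class, assuming Sigma_k and mu_k hold that class's covariance and mean and x is the new point (the names are illustrative):<br />
<pre><br />
[U, S, V] = svd(Sigma_k);              % Sigma_k = U S U' since Sigma_k is symmetric<br />
T = diag(1 ./ sqrt(diag(S))) * U';     % the transformation S_k^(-1/2) U_k'<br />
d2 = norm(T * (x - mu_k))^2;           % squared Mahalanobis distance from x to mu_k<br />
</pre><br />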
<br />
<br />
The difference between [[#Case 1 | Case 1]] and [[#Case 2 | Case 2]] (i.e. the difference between using the Euclidean and Mahalanobis distance) can be seen in the illustration below. <br />
<br />
[[File:EuclideanVsMahalonobisDistance2.PNG|frame|center|Illustration of Euclidean distance (a) and Mahalanobis distance (b) where the contours represent equidistant points from the center using each distance metric. Source: <ref>R. De Maesschalck, D. Jouan-Rimbaud and D. L. Massart, "Tutorial - The Mahalanobis distance," ''Chemometrics and Intelligent Laboratory Systems'', 2000 </ref>]]<br />
<br />
As can be seen from the illustration above, the Mahalanobis distance takes into account the distribution of the data points, whereas the Euclidean distance would treat the data as though it has a spherical distribution. Thus, the Mahalanobis distance applies for the more general classification in [[#Case 2 | Case 2]], whereas the Euclidean distance applies to the special case in [[#Case 1 | Case 1]] where the data distribution is assumed to be spherical.<br />
<br />
Generally, we can conclude that QDA provides a better classifier for the data than LDA because LDA assumes that the covariance matrix is identical for each class, but QDA does not. QDA still uses a Gaussian distribution as the class conditional distribution. In practice, the data may not actually follow a Gaussian distribution, in which case other class conditional distributions have to be considered.<br />
<br />
== Principal Component Analysis (PCA) (Lecture: Sep. 27, 2011) ==<br />
<br />
'''Principal Component Analysis (PCA)''' is a method of dimensionality reduction/feature extraction that transforms the data from a D dimensional space into a new coordinate system of dimension d, where d <= D (the worst case would be d=D). The goal is to preserve as much of the variance in the original data as possible when switching coordinate systems. Given data on D variables, the hope is that the data points will lie mainly in a linear subspace of dimension lower than D. In practice, the data will usually not lie precisely in some lower dimensional subspace.<br />
<br />
<br />
The new variables that form a new coordinate system are called '''principal components''' (PCs). PCs are denoted by <math>\ u_1, u_2, ... , u_D </math>. The principal components form a basis for the data. Since PCs are orthogonal linear transformations of the original variables, there are at most D PCs. Normally, not all of the D PCs are used but rather a subset of d PCs, <math>\ u_1, u_2, ... , u_d </math>, to approximate the space spanned by the original data points <math>\ x_1, x_2, ... , x_D </math>. We can choose d based on what percentage of the original data we would like to maintain. <br />
<br />
Let <math>\ PC_j</math> be a linear combination of <math>\ x_1, x_2, ... , x_D </math> defined by the coefficients <br />
<math>\ w^{(j)}</math> = <math> ( {w_1}^{(j)}, {w_2}^{(j)},...,{w_D}^{(j)} )^T </math><br />
<br />
Thus, <math> u_j = {w_1}^{(j)} x_1 + {w_2}^{(j)} x_2 + ... + {w_D}^{(j)} x_D = w^{(j)^T} X </math><br />
<br />
<br />
This is a unique configuration since it sets up the PCs in order from maximum to minimum variances. The first PC, <math>\ u_1 </math> is called '''first principal component''' and has the maximum variance, thus it accounts for the most significant variance in the data <math>\ x_1, x_2, ... , x_D </math>. The second PC, <math>\ u_2 </math> is called '''second principal component''' and has the second highest variance and so on until PC, <math>\ u_D </math> which has the minimum variance. <br />
<br />
<br />
To get the first principal component, we would like to solve the following optimization problem:<br />
<br />
<math>\ \max_{w} (Var(w^T X)) = \max_{w} (w^T S w) </math> <br />
<br />
where <math>\ S </math> is the covariance matrix, and we solve for <math>\ w </math>.<br />
<br />
<br />
Note: we require the constraint <math>\ w^T w = 1 </math> because if there is no constraint on the length of <math>\ w </math> then there is no upper bound. With the constraint, the direction and not the length that maximizes the variance can be found. <br />
<br />
<br />
====Lagrange Multiplier====<br />
<br />
Before we proceed, we should review Lagrange multipliers.<br />
<br />
[[Image:LagrangeMultipliers2D.svg.png|right|thumb|200px|"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]]]<br />
<br />
<br />
Lagrange multipliers are used to find the maximum or minimum of a function <math>\displaystyle f(x,y)</math> subject to constraint <math>\displaystyle g(x,y)=0</math> <br />
<br />
we define a new constant <math> \lambda</math> called a [http://en.wikipedia.org/wiki/Lagrange_multipliers Lagrange Multiplier] and we form the Lagrangian,<br /><br /><br />
<math>\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)</math><br />
<br /><br /><br />
If <math>\displaystyle f(x^*,y^*)</math> is the max of <math>\displaystyle f(x,y)</math>, there exists <math>\displaystyle \lambda^*</math> such that <math>\displaystyle (x^*,y^*,\lambda^*) </math> is a stationary point of <math>\displaystyle L</math> (partial derivatives are 0).<br />
<br>In addition <math>\displaystyle (x^*,y^*)</math> is a point in which functions <math>\displaystyle f</math> and <math>\displaystyle g</math> touch but do not cross. At this point, the tangents of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel or gradients of <math>\displaystyle f</math> and <math>\displaystyle g</math> are parallel, such that:<br />
<br /><br /><br />
<math>\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g</math><br />
<br><br />
<br><br />
where,<br /> <math>\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow</math> the gradient of <math>\, f</math> <br />
<br><br />
<math>\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow</math> the gradient of <math>\, g </math> <br />
<br><br /><br />
<br />
====Example :====<br />
Suppose we want to maximize the function <math>\displaystyle f(x,y)=x-y</math> subject to the constraint <math>\displaystyle x^{2}+y^{2}=1</math>. We can apply the Lagrange multiplier method to find the maximum value for the function <math>\displaystyle f </math>; the Lagrangian is:<br />
<br />
<math>\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)</math><br />
<br />
We want the partial derivatives equal to zero:<br />
<br />
<br /><br />
<math>\displaystyle \frac{\partial L}{\partial x}=1-2 \lambda x=0 </math> <br /><br />
<br /> <math>\displaystyle \frac{\partial L}{\partial y}=-1-2\lambda y=0</math><br />
<br> <br /><br />
<math>\displaystyle \frac{\partial L}{\partial \lambda}=-(x^2+y^2-1)=0</math><br />
<br><br /><br />
<br />
Solving the system we obtain two stationary points: <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math> and <math>\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)</math>. In order to determine which one is the maximum, we just need to substitute each into <math>\displaystyle f(x,y)</math> and see which gives the bigger value. In this case the maximum is <math>\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)</math>.<br />
<br />
===Determining w :===<br />
<br />
Use the Lagrange multiplier conversion to obtain:<br />
<math>\displaystyle L(w, \lambda) = w^T Sw - \lambda (w^T w - 1)</math> where <math>\displaystyle \lambda </math> is a constant <br />
<br />
Take the derivative and set it to zero:<br />
<math>\displaystyle{\partial L \over{\partial w}} = 0 </math><br />
<br />
<br />
To obtain: <br />
<math>\displaystyle 2Sw - 2 \lambda w = 0</math><br />
<br />
<br />
Rearrange to obtain:<br />
<math>\displaystyle Sw = \lambda w</math><br />
<br />
<br />
where <math>\displaystyle w</math> is an eigenvector of <math>\displaystyle S </math> and <math>\ \lambda </math> is the corresponding eigenvalue. Since <math>\displaystyle Sw= \lambda w </math> and <math>\displaystyle w^T w=1</math>, we can write<br />
<br />
<math>\displaystyle w^T Sw= w^T\lambda w= \lambda w^T w =\lambda </math> <br />
<br />
Note that the PCs decompose the total variance in the data in the following way :<br />
<br />
<math> \sum_{i=1}^{D} Var(u_i) </math><br />
<br />
<math>= \sum_{i=1}^{D} (\lambda_i) </math> <br />
<br />
<math>\ = Tr(S) </math><br />
<br />
<math>= \sum_{i=1}^{D} Var(x_i)</math><br />
<br />
== Principal Component Analysis (PCA) Continued (Lecture: Sep. 29, 2011) == <br />
As can be seen from the above expressions, <math>\ Var(W^\top X) = W^\top S W= \lambda </math> where lambda is an eigenvalue of the sample covariance matrix <math>\ S </math> and <math>\ W</math> is its corresponding eigenvector. So <math>\ Var(u_i) </math> is maximized if <math>\ \lambda_i </math> is the maximum eigenvalue of <math>\ S </math> and the first principal component (PC) is the corresponding eigenvector. Each successive PC can be generated in the above manner by taking the eigenvectors of <math>\ S</math><ref>www.wikipedia.org/wiki/Eigenvalues_and_eigenvectors</ref> that correspond to the eigenvalues:<br />
<br />
<math>\ \lambda_1 \geq ... \geq \lambda_D </math> <br />
<br />
such that <br />
<br />
<math>\ Var(u_1) \geq ... \geq Var(u_D) </math><br />
<br />
=== Alternative Derivation ===<br />
Another way of looking at PCA is to consider PCA as a projection from a higher D-dimension space to a lower d-dimensional subspace that minimizes the squared ''reconstruction error''. The squared reconstruction error is the difference between the original data set <math>\ X </math> and the new data set <math> \hat{X} </math> obtained by first projecting the original data set into a lower d-dimensional subspace and then projecting it back into the original higher D-dimension space. Since information is (normally) lost by compressing the original data into a lower d-dimensional subspace, the new data set will (normally) differ from the original data even though both are part of the higher D-dimension space. The reconstruction error is computed as shown below.<br />
<br />
====Reconstruction Error====<br />
<br />
<math> e = \sum_{i=1}^{n} || x_i - \hat{x}_i ||^2 </math><br />
<br />
====Minimize Reconstruction Error====<br />
<br />
Suppose <math> \bar{x} = 0 </math>, i.e. the data has been centred by replacing each <math>\ x_i </math> with <math>\ x_i - \bar{x} </math>.<br />
<br />
Let <math>\ f(y) = U_d y </math> where <math>\ U_d </math> is a D by d matrix with d orthogonal unit vectors as columns.<br />
<br />
Fit the model to the data and minimize the reconstruction error:<br />
<br />
<math>\ min_{U_d, y_i} \sum_{i=1}^n || x_i - U_d y_i ||^2 </math><br />
<br />
Differentiate with respect to <math>\ y_i </math>:<br />
<br />
<math> \frac{\partial e}{\partial y_i} = 0 </math><br />
<br />
we can rewrite the reconstruction error as <math>\ e = \sum_{i=1}^n(x_i - U_d y_i)^T(x_i - U_d y_i) </math><br />
<br />
<math>\ \frac{\partial e}{\partial y_i} = -2U_d^T(x_i - U_d y_i) = 0 </math><br />
<br />
Since the columns of <math>\ U_d </math> are orthonormal, <math>\ U_d^T U_d = I </math>, so the condition above becomes<br />
<br />
<math>\ U_d^T x_i - y_i = 0 </math> or equivalently,<br />
<br />
<math>\ y_i = U_d^T x_i </math><br />
<br />
Find the orthogonal matrix <math>\ U_d </math>:<br />
<br />
<math>\ min_{U_d} \sum_{i=1}^n || x_i - U_d U_d^T x_i||^2 </math><br />
<br />
====Using SVD====<br />
<br />
A unique solution can be obtained by finding the [[Singular Value Decomposition(SVD) | Singular Value Decomposition (SVD)]] of <math>\ X </math>:<br />
<br />
<math>\ X = U S V^T </math><br />
<br />
For each rank d, <math>\ U_d </math> consists of the first d columns of <math>\ U </math>. Also, the covariance matrix can be expressed as follows <math>\ S = \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T </math>.<br />
<br />
Simply put, by subtracting the mean of each of the data point features and then applying SVD, one can find the principal components:<br />
<br />
<math> \tilde{X} = X - \mu </math><br />
<br />
<math>\ \tilde{X} = U S V^T </math><br />
<br />
Here <math>\ X </math> is a D by n matrix of data points, and the features of each data point form a column in <math>\ X </math>. Also, <math>\ \mu </math> is a D by n matrix with identical columns, each equal to the mean of the <math>\ x_i</math>'s, i.e. <math>\mu_{:,j}=\frac{1}{n}\sum_{i=1}^n x_i </math>. Note that this arrangement of data points is a convention; in Matlab and in conventional statistics, the transpose of the matrices in the above formulae is used.<br />
<br />
As the <math>\ S </math> matrix from the SVD has the eigenvalues arranged from largest to smallest, the corresponding eigenvectors in the <math>\ U </math> matrix from the SVD will be such that the first column of <math>\ U </math> is the first principal component and the second column is the second principal component and so on.<br />
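<br />
A minimal Matlab sketch of this recipe (the toy data and variable names are hypothetical, not from the lecture): subtract the mean, take the SVD, and keep the first d columns of <math>\ U </math>.<br />
<pre><br />
% hypothetical toy data: D = 10 features, n = 500 points stored as columns<br />
X = randn(10, 500);<br />
mu = mean(X, 2);                          % feature means<br />
Xtilde = X - repmat(mu, 1, size(X, 2));   % subtract the mean from every column<br />
[U, S, V] = svd(Xtilde, 'econ');          % singular values in S are sorted largest first<br />
d = 2;<br />
Ud = U(:, 1:d);                           % first d principal components<br />
Y = Ud' * Xtilde;                         % d by n matrix of projections<br />
</pre><br />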
<br />
=== Examples ===<br />
<br />
Note that in the Matlab code in the examples below, the mean was not subtracted from the datapoints before performing SVD. This is what was shown in class. However, to properly perform PCA, the mean should be subtracted from the datapoints.<br />
<br />
==== Example 1 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 560 by 1965. 560 is the number of elements in each column. Each column is a vector representation of a 20x28 grayscale pixel image of a face (see image below) and there is a total of 1965 different images of faces. Each of the images is corrupted by noise, but the noise can be removed by projecting the data back to the original space using as many dimensions as one likes (e.g. 2, 3, 4 or 5). The corresponding Matlab commands are shown below:<br />
[[File:FreyFaceExample.PNG|thumb|185px|An example of the face images used in [[#Example 1 | Example 1]] with noise removed. Source: <ref>S. Roweis (2011). ''Data for MATLAB.'' [Online]. Available: [http://cs.nyu.edu/~roweis/data.html http://cs.nyu.edu/~roweis/data.html.] |</ref>]]<br />
<pre style="align:left; width: 75%; padding: 2% 2%"><br />
>> % start with a 560 by 1965 matrix X that contains the data points<br />
>> load noisy.mat;<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 10 by reshaping column 10 into a 20 by 28 matrix<br />
>> imagesc(reshape(X(:,10),20,28)')<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 560 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % project X onto the first ten principal components<br />
>> Y_pca = U(:, 1:10)'*X;<br />
>> <br />
>> % reconstruct X (project back onto the original space) using only the first ten principal components<br />
>> X_hat = U(:, 1:10)*Y_pca;<br />
>> <br />
>> % show image in column 10 of X_hat which is a 560 by 1965 matrix<br />
>> imagesc(reshape(X_hat(:,10),20,28)')<br />
</pre><br />
The reason why the noise is removed in the reconstructed image is because the noise does not create a major variation in a single direction in the original data. Hence, the first ten PCs taken from <math>\ U </math> matrix are not in the direction of the noise. Thus, reconstructing the image using the first ten PCs, will remove the noise.<br />
<br />
==== Example 2 ====<br />
Consider a matrix of data points <math>\ X </math> with the dimensions 64 by 400. 64 is the number of elements in each column. Each column is a vector representation of an 8x8 grayscale pixel image of either a handwritten number ''2'' or a handwritten number ''3'' (see image below) and there are a total of 400 different images, where the first 200 images show a handwritten number ''2'' and the last 200 images show a handwritten number ''3''. <br />
[[File:Handwritten23.PNG|frame|center|An example of the handwritten number images used in [[#Example 2 | Example 2]]. Source: <ref>A. Ghodsi, "PCA" class notes for STAT841, Department of Statistics and Actuarial Science, University of Waterloo, 2011. </ref>]]<br />
<br />
The corresponding Matlab commands for performing PCA on the data points are shown below:<br />
<pre><br />
>> % start with a 64 by 400 matrix X that contains the data points<br />
>> load 2_3.mat;<br />
>> <br />
>> % set the colors to grayscale <br />
>> colormap gray<br />
>> <br />
>> % show image in column 2 by reshaping column 2 into a 8 by 8 matrix<br />
>> imagesc(reshape(X(:,2),8,8))<br />
>> <br />
>> % perform SVD; if the X matrix is full rank, we will obtain 64 PCs<br />
>> [U S V] = svd(X);<br />
>> <br />
>> % project data down onto the first two PCs<br />
>> Y = U(:,1:2)'*X;<br />
>> <br />
>> % show Y as an image (can see the change in the first PC at column 200,<br />
>> % when the handwritten number changes from 2 to 3)<br />
>> imagesc(Y)<br />
>> <br />
>> % perform PCA using the Matlab built-in function (do not use for assignment)<br />
>> % also note that due to the Matlab convention, the transpose of X is used<br />
>> [COEFF, Y] = princomp(X');<br />
>> <br />
>> % again, use the first two PCs<br />
>> Y = Y(:,1:2);<br />
>> <br />
>> % use plot digits to show the distribution of images on the first two PCs<br />
>> images = reshape(X, 8, 8, 400);<br />
>> plotdigits(images, Y, .1, 1);<br />
</pre><br />
Using the ''plotdigits'' function in Matlab clearly illustrates that the first PC captured the differences between the numbers ''2'' and ''3'', as they are projected onto different regions of the axis for the first PC. Also, the second PC captured the ''tilt'' of the handwritten numbers, as numbers tilted to the left or right were projected onto different regions of the axis for the second PC.<br />
<br />
==== Example 3 ====<br />
(Not discussed in class) In the news recently was a story that captures some of the ideas behind PCA. Over the past two years, Scott Golder and Michael Macy, researchers from Cornell University, collected 509 million Twitter messages from 2.4 million users in 84 different countries. The data they used were words collected at various times of day and they classified the data into two different categories: positive emotion words and negative emotion words. Then, they were able to study this new data to evaluate subjects' moods at different times of day, while the subjects were in different parts of the world. They found that the subjects generally exhibited positive emotions in the mornings and late evenings, and negative emotions mid-day. They were able to "project their data onto a smaller dimensional space" using PCA. Their paper, "Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures," is available in the journal Science.<ref>http://www.pcworld.com/article/240831/twitter_analysis_reveals_global_human_moodiness.html</ref>.<br />
<br />
Assumptions Underlying Principal Component Analysis can be found here<ref>http://support.sas.com/publishing/pubcat/chaps/55129.pdf</ref><br />
<br />
==== Example 4 ====<br />
(Not discussed in class) A somewhat well known learning rule in the field of neural networks called Oja's rule can be used to train networks of neurons to compute the principal component directions of data sets. <ref>A Simplified Neuron Model as a Principal Component Analyzer. Erkki Oja. 1982. Journal of Mathematical Biology. 15: 267-273</ref> This rule is formulated as follows<br />
<br />
<math>\,\Delta w = \eta yx -\eta y^2w </math><br />
<br />
where <math>\,\Delta w </math> is the neuron weight change, <math>\,\eta</math> is the learning rate, <math>\,y</math> is the neuron output given the current input, <math>\,x</math> is the current input and <math>\,w</math> is the current neuron weight. This learning rule shares some similarities with another method for calculating principal components: power iteration. The basic algorithm for power iteration (taken from wikipedia: <ref>Wikipedia. http://en.wikipedia.org/wiki/Principal_component_analysis#Computing_principal_components_iteratively</ref>) is shown below <br />
<br />
<br />
<math>\mathbf{p} =</math> a random vector<br />
do ''c'' times:<br />
<math>\mathbf{t} = 0</math> (a vector of length ''m'')<br />
for each row <math>\mathbf{x} \in \mathbf{X^T}</math><br />
<math>\mathbf{t} = \mathbf{t} + (\mathbf{x} \cdot \mathbf{p})\mathbf{x}</math><br />
<math>\mathbf{p} = \frac{\mathbf{t}}{|\mathbf{t}|}</math><br />
return <math>\mathbf{p}</math><br />
<br />
Comparing this with the neuron learning rule we can see that the term <math>\, \eta y x </math> is very similar to the <math>\,\mathbf{t}</math> update equation in the power iteration method, and identical if the neuron model is assumed to be linear (<math>\,y(x)=x\mathbf{p}</math>) and the learning rate is set to 1. Additionally, the <math>\, -\eta y^2w </math> term performs the normalization, the same function as the <math>\,\mathbf{p}</math> update equation in the power iteration method.<br />
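<br />
A minimal Matlab translation of the power iteration pseudocode above might look as follows (hypothetical toy data; the data matrix is centred first, as PCA assumes):<br />
<pre><br />
% hypothetical toy data, columns are data points<br />
X = randn(10, 500);<br />
X = X - repmat(mean(X, 2), 1, size(X, 2));   % centre the data<br />
p = randn(size(X, 1), 1);                    % random initial direction<br />
for c = 1:100<br />
    t = X * (X' * p);    % accumulates (x . p) x over all data points<br />
    p = t / norm(t);     % normalize, as in the pseudocode<br />
end<br />
% p now approximates the first principal component (up to sign)<br />
</pre><br />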
<br />
=== Observations ===<br />
Some observations about the PCA were brought up in class:<br />
<br />
* '''PCA''' assumes that data is on a ''linear subspace'' or close to a linear subspace. For non-linear dimensionality reduction, other techniques are used. Amongst the first proposed techniques for non-linear dimensionality reduction are '''Locally Linear Embedding (LLE)''' and '''Isomap'''. More recent techniques include '''Maximum Variance Unfolding (MVU)''' and '''t-Distributed Stochastic Neighbor Embedding (t-SNE)'''. '''Kernel PCAs''' may also be used, but they depend on the type of kernel used and generally do not work well in practice. (Kernels will be covered in more detail later in the course.)<br />
<br />
* Finding the number of PCs to use is not straightforward. It requires knowledge about the ''intrinsic dimensionality of the data''. In practice, oftentimes a heuristic approach is adopted by looking at the eigenvalues ordered from largest to smallest. If there is a "dip" in the magnitude of the eigenvalues, the "dip" is used as a cut off point and only the large eigenvalues before the "dip" are used. Otherwise, it is possible to add up the eigenvalues from largest to smallest until a certain percentage value is reached. This percentage value represents the percentage of variance that is preserved when projecting onto the PCs corresponding to the eigenvalues that have been added together to achieve the percentage (a short Matlab sketch of this heuristic is given after this list). <br />
<br />
* It is a good idea to normalize the variance of the data before applying PCA. This will avoid PCA finding PCs in certain directions due to the scaling of the data, rather than the real variance of the data.<br />
<br />
* PCA can be considered as an unsupervised approach, since the main direction of variation is not known beforehand, i.e. it is not completely certain which dimension the first PC will capture. The PCs found may not correspond to the desired labels for the data set. There are, however, alternate methods for performing supervised dimensionality reduction.<br />
<br />
* (Not in class) The traditional PCA method does not work well on data sets that lie on a non-linear manifold. A revised PCA method, called c-PCA, has been introduced to improve the stability and convergence of intrinsic dimension estimation. The approach first finds a minimal cover (a cover of a set X is a collection of sets whose union contains X as a subset<ref>http://en.wikipedia.org/wiki/Cover_(topology)</ref>) of the data set. Since set covering is an NP-hard problem, the approach only finds an approximation of the minimal cover to reduce the run-time complexity. In each subset of the minimal cover, it applies PCA and filters out the noise in the data. Finally, the global intrinsic dimension can be determined from the variance results from all the subsets. The algorithm produces robust results.<ref>Mingyu Fan, Nannan Gu, Hong Qiao, Bo Zhang, Intrinsic dimension estimation of data by principal component analysis, 2010. Available: http://arxiv.org/abs/1002.2050</ref><br />
<br />
*(Not in class) While PCA finds the mathematically optimal method (as in minimizing the squared error), it is sensitive to outliers in the data, which produce the large errors PCA tries to avoid. It is therefore common practice to remove outliers before computing PCA. However, in some contexts, outliers can be difficult to identify. For example, in data mining algorithms like correlation clustering, the assignment of points to clusters and outliers is not known beforehand. A recently proposed generalization of PCA based on a '''Weighted PCA''' increases robustness by assigning different weights to data objects based on their estimated relevancy.<ref>http://en.wikipedia.org/wiki/Principal_component_analysis</ref><br />
<br />
* (Not in class) Comparison between PCA and LDA: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two commonly used techniques for data classification and dimensionality reduction. "Linear Discriminant Analysis easily handles the case where the within-class frequencies are unequal and their performances has been examined on randomly generated test data. This method maximizes the ratio of between-class variance to the within-class variance in any particular data set thereby guaranteeing maximal separability. ... The prime difference between LDA and PCA is that PCA does more of feature classification and LDA does data classification. In PCA, the shape and location of the original data sets changes when transformed to a different space whereas LDA doesn’t change the location but only tries to provide more class separability and draw a decision region between the given classes. This method also helps to better understand the distribution of the feature data." [24] [[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf]]<br />
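<br />
The following minimal Matlab sketch (hypothetical toy data, not from the lecture) illustrates the cumulative-variance heuristic for choosing the number of PCs mentioned in the list above, keeping enough PCs to preserve 90% of the total variance:<br />
<pre><br />
% hypothetical toy data, columns are data points<br />
X = randn(20, 300);<br />
Xc = X - repmat(mean(X, 2), 1, size(X, 2));<br />
lambda = sort(eig(cov(Xc')), 'descend');    % eigenvalues, largest first<br />
explained = cumsum(lambda) / sum(lambda);   % fraction of variance preserved<br />
d = find(explained >= 0.9, 1)               % smallest d preserving 90% of the variance<br />
</pre><br />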
<br />
=== Summary ===<br />
The PCA algorithm can be summarized into the following steps:<br />
<br />
# '''Recover basis'''<br />
#: <math>\ \text{ Calculate } XX^T=\Sigma_{i=1}^{t}x_ix_{i}^{T} \text{ and let } U=\text{ eigenvectors of } XX^T \text{ corresponding to the largest } d \text{ eigenvalues.} </math><br />
# '''Encode training data'''<br />
#: <math>\ \text{Let } Y=U^TX \text{, where } Y \text{ is a } d \times t \text{ matrix of encodings of the original data.} </math><br />
# '''Reconstruct training data'''<br />
#: <math> \hat{X}=UY=UU^TX </math>.<br />
# '''Encode test example'''<br />
#: <math>\ y = U^Tx \text{ where } y \text{ is a } d\text{-dimensional encoding of } x </math>.<br />
# '''Reconstruct test example'''<br />
#: <math> \hat{x}=Uy=UU^Tx </math>.<br />
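<br />
A minimal Matlab sketch of these five steps on hypothetical toy data (the mean is assumed to have been subtracted already):<br />
<pre><br />
X = randn(10, 200);            % training data, columns are (centred) data points<br />
[U, S, V] = svd(X, 'econ');    % columns of U are eigenvectors of XX^T<br />
d = 3;<br />
Ud = U(:, 1:d);                % recover basis: top d eigenvectors<br />
Y = Ud' * X;                   % encode training data (d by t)<br />
Xhat = Ud * Y;                 % reconstruct training data<br />
x = randn(10, 1);              % a test example<br />
y = Ud' * x;                   % encode test example<br />
xhat = Ud * y;                 % reconstruct test example<br />
</pre><br />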
<br />
== Fisher Discriminant Analysis (FDA) (Lecture: Sep. 29, 2011) ==<br />
<br />
'''Fisher Discriminant Analysis (FDA)''' is sometimes called ''Fisher Linear Discriminant Analysis (FLDA)'' or just ''Linear Discriminant Analysis (LDA)''. This causes confusion with the [[#LDA | ''Linear Discriminant Analysis (LDA)'']] technique covered earlier in the course. The LDA technique covered earlier in the course has a normality assumption and is a boundary finding technique. The FDA technique outlined here is a supervised feature extraction technique. FDA differs from PCA as well because PCA does not use the class labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math> while FDA organizes data into their ''classes'' by finding the direction of maximum separation between classes.<br />
<br />
== Fisher Discriminant Analysis (FDA) Continued (Lecture: Oct. 04, 2011) ==<br />
<br />
One main drawback of the PCA technique is that the direction of greatest variation may not be the classification we desire. For example, imagine if the [[#Example 2 | data set]] above had a lightening filter applied to a random subset of the images. Then the greatest variation would be the brightness and not the more important variations we wish to classify. FDA circumvents this problem by using the labels, <math>\ y_i</math>, of the data <math>\ (x_i,y_i)</math> i.e. the FDA uses ''supervised learning''. An elementary way to see the algorithm is to imagine two classes of data projected onto a suitably chosen line that minimizes the within-class variance and maximizes the distance between the two classes, i.e. group similar data together and spread different data apart. This way, newly acquired data can be compared, after the same transformation, to these projections using some well-chosen metric.<br />
<br />
<br />
We first consider the case of two classes. Denote the mean and covariance matrix of class <math>i=0,1</math> by <math>\mathbf{\mu}_i</math> and <math>\mathbf{\Sigma}_i</math> respectively. We transform the data so that it is projected into 1 dimension i.e. a scalar value. To do this, we compute the inner product of our <math>d \times 1</math>-dimensional data, <math>\mathbf{x}</math>, with a to-be-determined <math>d \times 1</math>-dimensional vector <math>\mathbf{w}</math>. The new means and covariances of the transformed data:<br />
<br />
::<math> \mu'_i:\rightarrow \mathbf{w}^{T}\mathbf{\mu}_i </math> <br/><br />
::<math> \Sigma'_i :\rightarrow \mathbf{w}^{T}\mathbf{\Sigma}_i \mathbf{w}</math><br />
<br />
The new means and variances are actually scalar values now, but we will use vector and matrix notation and arguments throughout the following derivation as the multi-class case is then just a simpler extension. <br />
<br />
===Goals of FDA===<br />
<br />
As will be shown in the objective function, the goal of FDA is to maximize the separation of the classes (between class variance) and minimize the scatter within each class (within class variance). That is, our ideal situation is that the individual classes are as far away from each other as possible and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case). An interesting note is that R. A. Fisher, after whom FDA is named, used the FDA technique for purposes of taxonomy, in particular for categorizing different species of iris flowers. <ref name="RAFisher">R. A. Fisher, "The Use of Multiple measurements in Taxonomic Problems," ''Annals of Eugenics'', 1936</ref>. It is very easy to visualize what is meant by within class variance (i.e. differences between the iris flowers of the same species) and between class variance (i.e. the differences between the iris flowers of different species) in that case.<br />
<br />
<br />
'''1)''' Our '''first''' goal is to minimize the individual classes' covariance. This will help to collapse the data together. <br />
We have two minimization problems<br />
<br />
::<math>\min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_0 \mathbf{w}^{T}</math> <br />
and <br />
::<math>\min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_1 \mathbf{w}^{T}</math>.<br />
<br />
But these can be combined:<br />
::<math> \min_{\mathbf{w}} \mathbf{w} \mathbf{\Sigma}_0 \mathbf{w}^{T} + \mathbf{w} \mathbf{\Sigma}_1 \mathbf{w}^{T}</math> <br />
:: <math> = \min_{\mathbf{w}} \mathbf{w} ( \mathbf{\Sigma_0} + \mathbf{\Sigma_1} ) \mathbf{w}^{T} </math><br />
<br />
Define <math> \mathbf{S}_W =\mathbf{\Sigma_0} + \mathbf{\Sigma_1} </math>, called the ''within class variance matrix''. <br />
<br />
'''2)''' Our '''second''' goal is to move the minimized classes as far away from each other as possible. One way to accomplish this is to maximize the distances between the means of the transformed data i.e.<br />
<br />
<math> \max_{\mathbf{w}} |\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1|^2 </math><br />
<br />
Simplifying:<br />
::<math> \max_{\mathbf{w}} \,(\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1)^T (\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1) </math> <br/><br />
::<math> = \max_{\mathbf{w}}\, (\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w} \mathbf{w}^{T} (\mathbf{\mu}_0-\mathbf{\mu}_1)</math> <br/><br />
::<math> = \max_{\mathbf{w}} \,\mathbf{w}^{T}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w}</math><br />
<br />
Recall that <math> \mathbf{\mu}_i </math> are known. Denote<br />
<br />
::<math> \mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math> <br />
<br />
This matrix, called the ''between class variance matrix'', is a rank 1 matrix, so an inverse does not exist. Altogether, we have two optimization problems we must solve simultaneously:<br />
<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} </math><br />
<br />
There are other metrics one can use to both minimize the data's variance and maximizes the distance between classes, and other goals we can try to accomplish (see metric learning, below...one day), but Fisher used this elegant method, hence his recognition in the name, and we will follow his method.<br />
<br />
We can combine the two optimization problems into one after noting that minimizing <math> \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} </math> is the same as maximizing its negative:<br />
<br />
::<math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \alpha \mathbf{w} \mathbf{S_W} \mathbf{w}^{T}</math><br/><br />
<br />
The <math>\alpha</math> coefficient is a necessary scaling factor: if the scale of one of the terms is much larger than the other, the optimization problem will be dominated by the larger term. This means we have another unknown, <math>\alpha</math>, to solve for. Instead, we can circumvent the scaling problem by looking at the ratio of the quantities, the original solution Fisher proposed:<br />
<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w} \mathbf{S_B} \mathbf{w}^{T}}{\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}} </math><br />
<br />
This optimization problem can be shown<ref><br />
http://www.socher.org/uploads/Main/optimizationTutorial01.pdf<br />
</ref> to be equivalent to the following optimization problem:<br />
<br />
:: <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T}</math> <br />
<br />
subject to:<br />
<br />
:: <math> \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} = 1 </math><br />
<br />
A heuristic understanding of this equivalence is that we have two degrees of freedom: direction and scalar. The scalar value is irrelevant to our discussion. Thus, we can set one of the values to be a constant. We can use Lagrange multipliers to solve this optimization problem:<br />
<br />
::<math>L( \mathbf{w}, \lambda) = \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} - \lambda(\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}-1)</math><br />
:: <math> \Rightarrow \frac{\partial L}{\partial \mathbf{w}} = 2 \mathbf{S}_B \mathbf{w} - 2\lambda \mathbf{S}_W\mathbf{w} </math><br />
<br />
Setting the partial derivative to 0 gives us a ''generalized eigenvalue problem'':<br />
<br />
::<math> \mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w} </math><br />
:: <math> \Rightarrow \mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w} </math><br />
<br />
This is a generalized eigenvalue problem and <math>\ \mathbf{w} </math> can be computed as the eigenvector corresponding to the largest eigenvalue of <br />
:: <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math><br />
<br />
It is very likely that <math> \mathbf{S}_W </math> has an inverse. If not, the pseudo-inverse<ref><br />
http://en.wikipedia.org/wiki/Generalized_inverse<br />
</ref><ref><br />
http://www.mathworks.com/help/techdoc/ref/pinv.html<br />
</ref> can be used. In Matlab the pseudo-inverse function is named ''pinv''. Thus, we should choose <math>\mathbf{w}</math> to equal the eigenvector of the largest eigenvalue as our projection vector. <br />
<br />
In fact we can simplify the above expression further in the case of two classes. Recall the definition of <math>\mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}</math>. Substituting this into our expression:<br />
<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w} = \lambda \mathbf{w} </math><br />
::<math> (\mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) ) ((\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w}) = \lambda \mathbf{w} </math><br />
<br />
This second term is a scalar value, let's denote it <math>\beta</math>. Then<br />
::<math> \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) = \frac{\lambda}{\beta} \mathbf{w} </math><br />
::<math> \Rightarrow \, \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) \propto \mathbf{w} </math><br />
<br />
All we are interested in is the direction of <math>\mathbf{w}</math>, so computing this expression is sufficient for finding our projection vector. Note, though, that this shortcut does not carry over to the multi-class case, where <math>\mathbf{w}</math> is a matrix rather than a vector.<br />
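<br />
A minimal Matlab sketch of this two-class rule on hypothetical toy data, using ''pinv'' in case <math> \mathbf{S}_W </math> is singular:<br />
<pre><br />
% hypothetical toy data, columns are data points<br />
X0 = randn(2, 100) + repmat([2; 0], 1, 100);   % class 0<br />
X1 = randn(2, 100) + repmat([0; 2], 1, 100);   % class 1<br />
mu0 = mean(X0, 2);  mu1 = mean(X1, 2);<br />
Sw = cov(X0') + cov(X1');                      % within class variance matrix<br />
w = pinv(Sw) * (mu0 - mu1);                    % direction of projection<br />
y0 = w' * X0;  y1 = w' * X1;                   % projected (1-dimensional) data<br />
</pre><br />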
<br />
=== Extensions to Multiclass Case ===<br />
If we have <math>\ k</math> classes, we need <math>\ k-1</math> directions i.e. we need to project <math>\ k</math> 'points' onto a <math>\ k-1</math> dimensional hyperplane. What does this change in our above derivation? The most significant difference is that our projection vector,<math>\mathbf{w}</math>, is no longer a vector but instead is a matrix <math>\mathbf{W}</math>. We transform the data as:<br />
<br />
::<math> \mathbf{x}' :\rightarrow \mathbf{W}^{T} \mathbf{x}</math><br />
so our new mean and covariances for class k are:<br />
::<math> \mathbf{\mu_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\mu_k}</math><br />
::<math> \mathbf{\Sigma_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W}</math><br />
<br />
What are our new optimization sub-problems? As before, we wish to minimize the within class variance. This can be formulated as:<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{\Sigma_1} \mathbf{W} + \dots + \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W} </math><br />
<br />
Again, denoting <math>\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}</math>, we can simplify above expression:<br />
<br />
::<math>\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_W \mathbf{W} </math><br />
<br />
Similarly, the second optimization problem is:<br />
<br />
::<math>\max_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_B \mathbf{W} </math><br />
<br />
What is <math>\mathbf{S}_B</math> in this case? It can be shown that <math>\mathbf{S}_T = \mathbf{S}_B + \mathbf{S}_W </math> where <math> \mathbf{S}_T </math> is the covariance matrix of all the data. From this we can compute <math> \mathbf{S}_B </math>. <br />
<br />
Next, if we express <math> \mathbf{W} = ( \mathbf{w}_1 , \mathbf{w}_2 , \dots ,\mathbf{w}_k ) </math> observe that, for <math> \mathbf{A} = \mathbf{S}_B , \mathbf{S}_W </math>: <br />
<br />
::<math> Tr(\mathbf{W}^{T} \mathbf{A} \mathbf{W}) = \mathbf{w}_1^{T} \mathbf{A} \mathbf{w}_1 + \dots + \mathbf{w}_k^{T} \mathbf{A} \mathbf{w}_k </math><br />
<br />
where <math>\ Tr()</math> is the trace of a matrix. Thus, following the same steps as in the two-class case, we have the new optimization problem:<br />
<br />
::<math> \max_{\mathbf{W}} \frac{ Tr(\mathbf{W}^{T} \mathbf{S}_B \mathbf{W}) }{Tr(\mathbf{W}^{T} \mathbf{S}_W \mathbf{W})} </math> <br />
<br />
subject to:<br />
<br />
:: <math> \mathbf{W}^{T} \mathbf{S_W} \mathbf{W} = \mathbf{I} </math><br />
<br />
Again, in order to solve the above optimization problem, we can use the Lagrange multiplier <ref><br />
http://en.wikipedia.org/wiki/Lagrange_multiplier </ref>:<br />
<br />
:: <math>\begin{align}L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - I \right\}\end{align}</math>.<br />
<br />
where <math>\ \Lambda</math> is a d by d diagonal matrix.<br />
<br />
Then, differentiating with respect to <math>\mathbf{W}</math>:<br />
<br />
:: <math>\begin{align}\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\end{align} = 0</math>.<br />
<br />
Thus:<br />
<br />
:: <math>\begin{align}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}\end{align}</math><br />
<br />
:: <math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{W}\end{align}</math><br />
<br />
where, <math> \mathbf{\Lambda} =\begin{pmatrix}\lambda_{1} & & 0\\&\ddots&\\0 & &\lambda_{d}\end{pmatrix}</math><br />
<br />
The above equation is of the form of an eigenvalue problem. Thus, for the solution the k-1 eigenvectors corresponding to the k-1 largest eigenvalues should be chosen as the projection matrix, <math>\mathbf{W}</math>. In fact, there should only be k-1 eigenvectors corresponding to k-1 non-zero eigenvalues using the above equation.<br />
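<br />
A minimal Matlab sketch of the multi-class recipe on hypothetical toy data, computing <math>\mathbf{S}_B</math> from <math>\mathbf{S}_T = \mathbf{S}_B + \mathbf{S}_W </math> (using scatter matrices, i.e. sums of outer products, for which this identity is exact) and keeping the k-1 leading eigenvectors:<br />
<pre><br />
k = 3;  d = 5;  n = 60;                           % hypothetical sizes<br />
X = randn(d, k*n);                                % columns are data points<br />
labels = kron(1:k, ones(1, n));                   % n points per class<br />
Z = X - repmat(mean(X, 2), 1, size(X, 2));<br />
St = Z * Z';                                      % total scatter<br />
Sw = zeros(d);<br />
for c = 1:k<br />
    Xc = X(:, labels == c);<br />
    Zc = Xc - repmat(mean(Xc, 2), 1, size(Xc, 2));<br />
    Sw = Sw + Zc * Zc';                           % within-class scatter<br />
end<br />
Sb = St - Sw;                                     % between-class scatter<br />
[V, D] = eig(pinv(Sw) * Sb);<br />
[vals, idx] = sort(real(diag(D)), 'descend');<br />
W = real(V(:, idx(1:k-1)));                       % projection matrix (d by k-1)<br />
</pre><br />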
<br />
=== Summary ===<br />
FDA has two optimization problems:<br />
::1) <math> \min_{\mathbf{w}} \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} </math><br/><br />
::2) <math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T} </math> <br />
<br />
where <math>\ S_W = \Sigma_0 + \Sigma_1</math> is called the within class variance and <math>\ S_B = (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T </math> is called the between class variance.<br />
<br />
The two optimization problems are combined as follows:<br />
::<math> \max_{\mathbf{w}} \frac{\mathbf{w} \mathbf{S_B} \mathbf{w}^{T}}{\mathbf{w} \mathbf{S_W} \mathbf{w}^{T}} </math><br />
<br />
By adding a constraint as shown:<br />
::<math> \max_{\mathbf{w}} \mathbf{w} \mathbf{S_B} \mathbf{w}^{T}</math><br />
<br />
subject to:<br />
:: <math> \mathbf{w} \mathbf{S_W} \mathbf{w}^{T} = 1 </math><br />
<br />
Lagrange multipliers can be used and essentially the problem becomes an eigenvalue problem:<br />
<br />
::<math>\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w} = \lambda\mathbf{w}\end{align}</math><br />
<br />
And <math>\ w </math> can be computed as the k-1 eigenvectors corresponding to the largest k-1 eigenvalues of <math> \mathbf{S}_W^{-1} \mathbf{S}_B </math>.<br />
<br />
=== Variations ===<br />
<br />
Some adaptations and extensions exist for the FDA technique (Source: <ref>R. Gutierrez-Osuna, "Linear Discriminant Analysis" class notes for Intro to Pattern Analysis, Texas A&M University. Available: [http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf]</ref>):<br />
<br />
1) ''Non-Parametric LDA (NPLDA)'' by Fukunaga<br />
<br />
This method does not assume that the Gaussian distribution is unimodal and it is actually possible to extract more than k-1 features (where k is the number of classes).<br />
<br />
2) ''Orthonormal LDA (OLDA)'' by Okada and Tomita<br />
<br />
This method finds projections that are orthonormal in addition to maximizing the FDA objective function. This method can also extract more than k-1 features (where k is the number of classes).<br />
<br />
3) ''Generalized LDA (GLDA)'' by Lowe<br />
<br />
This method incorporates additional cost functions into the FDA objective function. This causes classes with a higher cost to be placed further apart in the lower dimensional representation.<br />
<br />
== Linear and Logistic Regression (Lecture: Oct. 06, 2011) ==<br />
<br />
=== Linear Regression ===<br />
<br />
In regression, <math>\ y </math> is a continuous variable. In classification, <math>\ y </math> is a discrete variable. Regression problems are easier to formulate into functions (since <math>\ y </math> is continuous) and it is possible to solve classification problems by treating them like regression problems. In order to do so, the requirement in classification that <math>\ y </math> is discrete must first be relaxed. Once <math>\ y </math> has been found using regression techniques, it is possible to determine the discrete class corresponding to the <math>\ y </math> that has been found to solve the original classification problem. The discrete class is obtained by defining a threshold where <math>\ y </math> values below the threshold belong to one class and <math>\ y </math> values above the threshold belong to another class.<br />
<br />
<br />
More formally: a more direct approach to classification is to estimate the regression function <math>\ r(\mathbf{x}) = E[Y | X]</math> without bothering to estimate <math>\ f_k(\mathbf{x}) </math>.<br />
<br />
In two-class problems, if <math>\ Y = \{0,1\}</math>, then <math>\, h^*(\mathbf{x})= \left\{\begin{matrix}<br />
1 &\text{,if } \hat r(\mathbf{x})>\frac{1}{2} \\<br />
0 &\mathrm{,otherwise.} \end{matrix}\right.</math><br />
<br />
Basically, we can use a linear function<br />
<math>\ f(x, \beta) = \mathbf{\beta\,}^T \mathbf{x_{i}} + \mathbf{\beta\,_0} </math> and use the least squares approach to fit the function to the given data. This is done by minimizing the following expression:<br />
<br />
<math>\min_{\mathbf{\beta}} \sum_{i=1}^n (y_i - \mathbf{\beta}^T<br />
\mathbf{x_{i}} - \mathbf{\beta_0})^2</math><br />
<br />
where<br />
<br />
<math>\tilde{\mathbf{\beta}} = \left( \begin{array}{c}\mathbf{\beta}_{1} \\ \vdots \\ \mathbf{\beta}_{d} \\ \mathbf{\beta}_{0} \end{array} \right)</math>.<br />
<br />
For convenience, <math>\mathbf{\beta}</math> and <math>\mathbf{\beta}_0</math> have been combined into a d+1 dimensional vector, and an extra entry equal to 1 is appended to each <math>\ x_i </math> to form <math>\tilde{x_i}</math>. Thus, the function to be minimized can now be expressed as:<br />
<br />
<math>\ \min_{\tilde{\beta}} \sum_{i=1}^{n} (y_i - \tilde{\beta}^T \tilde{x_i} )^2 </math><br />
<br />
<math>\ = \min_{\tilde{\beta}} \| y - X^T \tilde{\beta} \|^2 </math><br />
<br />
where <math>\ y </math> and <math>\tilde{\beta}</math> are vectors and <math>\ X </math> is a matrix.<br />
<br />
The solution for <math>\ \tilde{\beta} </math> is<br />
<br />
<math>\ {\tilde{\beta}} = (XX^T)^{-1}Xy </math><br />
<br />
Using regression to solve classification problems is not mathematically correct, if we want to be true to classification. However, this method works well in practice, if the problem is not complicated. When we have only two classes (encoded as <math>\ \frac{-n}{n_1} </math> and <math>\ \frac{n}{n_2} </math>), this method is identical to LDA.<br />
<br />
==== Matlab Example ====<br />
<br />
The following is the code and the explanation for each step.<br />
<br />
Again, we use the data in 2_3.m. <br />
>>load 2_3;<br />
>>[U, sample] = princomp(X');<br />
>>sample = sample(:,1:2);<br />
We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.<br />
<br />
>>y = zeros(400,1);<br />
>>y(201:400) = 1;<br />
We let y represent the set of labels coded as 0 and 1.<br />
<br />
>>x=[sample;ones(1,400)];<br />
Construct x by adding a row of ones to the data.<br />
<br />
>>b=inv(x*x')*x*y;<br />
Calculate b, which represents <math>\beta</math> in the linear regression model.<br />
<br />
>>x1=x';<br />
>>for i=1:400<br />
if x1(i,:)*b>0.5<br />
plot(x1(i,1),x1(i,2),'.')<br />
hold on<br />
elseif x1(i,:)*b < 0.5<br />
plot(x1(i,1),x1(i,2),'r.')<br />
end <br />
end<br />
Plot the fitted y values.<br />
<br />
[[File: linearregression.png|center|frame| the figure shows that the classification of the data points in 2_3.m by the linear regression model]]<br />
<br />
===Logistic Regression===<br />
<br />
Logistic regression is a more advanced method for classification, and is<br />
more commonly used. <br />
<br />
We can define a function <br /><br />
<math>f_1(x)= P(Y=1| X=x) = (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math><br />
[[File:Picture1.png|150px|thumb|right|<math>P(Y=1 | X=x)</math>]]<br />
<br />
<br />
This is a valid density function. It looks similar to a step function, but<br />
we have relaxed it so that we have a smooth curve, and can therefore take the<br />
derivative.<br />
<br />
The range of this function is (0,1) since<br /> <br />
<math>\lim_{x \to -\infty}f_1(\mathbf{x}) = 0</math> and<br />
<math>\lim_{x \to \infty}f_1(\mathbf{x}) = 1</math>.<br />
<br />
As shown on this graph:<br /><br />
http://www.wolframalpha.com/input/?i=Plot[E^x/%281+%2B+E^x%29,+{x,+-10,+10}]%29<br />
<br />
Then we compute the complement of f1(x), and get<br /><br />
<br />
<math>f_2(x)= P(Y=0| X=x) = 1-f_1(x) = (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})</math>, denoted f2. <br />
[[File:Picture2.png |150px|thumb|right|<math>P(Y=0 | X=x)</math>]]<br />
<br />
<br />
The function <math>f_2</math> is commonly called the logistic function, and it behaves like <br /><br />
<math>\lim_{x \to -\infty}f_2(\mathbf{x}) = 1</math> and<br /><br />
<math>\lim_{x \to \infty}f_2(\mathbf{x}) = 0</math>.<br />
<br />
As shown on this graph:<br /><br />
http://www.wolframalpha.com/input/?i=Plot[1/%281+%2B+E^x%29,+{x,+-10,+10}]%29<br />
<br />
From here, we can form the conditional density function. To do this, we must combine<br /><br />
<math>f_1</math> and <math>f_2</math> <br /><br />
such that the combined density equals <br /><br />
<math>f_1</math> if y=1 (which means it’s in class 1), <br /><br />
and <math>f_2</math> if y=0 (which means it’s in class 0).<br />
<br />
Eventually, we have our conditional density function formula<br /><br />
<math>f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}</math><br />
<br />
The way to use this formula is, given the training data <math>(x_i, y_i)</math>, to fit the parameters so that <math>f(y|\mathbf{x})</math> fits the data.<br />
<br />
In general, we can think of the problem as having a box with some knobs. Inside the box is our objective function which gives the form to classify our input (xi) to<br />
our output (yi). The knobs in the box function like the parameters of the objective function. Our job is to find the proper parameters that minimize the error between our output and the true value. So we have turned our machine learning problem into an optimization problem. <br />
<br />
Since we need to find the parameters that maximize the chance of having our observed data coming from the distribution of f(x|parameter), we need to introduce Maximum Likelihood Estimation.<br />
<br />
====Maximum Likelihood Estimation====<br />
<br />
Suppose we are given iid data points <math>({\mathbf{x}_i})_{i=1}^n</math> and a density function <math>f(\mathbf{x}|\mathbf{\theta})</math>, where the form of <math>f</math> is known but the parameters <math>\theta</math> are unknown. The maximum likelihood estimate <math>\theta\,_{ML}</math> is the set of parameters that maximizes the probability of observing <math>({\mathbf{x}_i})_{i=1}^n</math> given <math>\theta\,_{ML}</math>.<br />
<br />
<math>\theta_\mathrm{ML} = \underset{\theta}{\operatorname{arg\,max}}\ f(\mathbf{x}|\theta)</math>.<br />
<br />
There was some discussion in class regarding the notation. In literature, Bayesians use <math>f(\mathbf{x}|\mu)</math> while Frequentists use <math>f(\mathbf{x};\mu)</math>. In practice, these two are equivalent.<br />
<br />
Our goal is to find <math>\theta</math> to maximize <br />
<math>\mathcal{L}(\theta\,) = f(\mathbf{x}_1, \dots, \mathbf{x}_n|\;\theta) = \prod_{i=1}^n f(\mathbf{x_i}|\theta)</math>. (The second equality holds because the data points are iid.)<br />
<br />
In many cases, it’s more convenient to work with the natural logarithm of the likelihood. (Recall that the logarithm is monotone increasing, so it preserves maxima and minima.)<br />
<math>\ell(\theta|\mathbf{x})=\ln\mathcal{L}(\theta\,)</math> <br />
<br />
<math>\ell(\theta\,)=\sum_{i=1}^n \ln f(\mathbf{x_i}|\theta)</math><br />
<br />
Applying Maximum Likelihood Estimation to <math>f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}</math>, gives<br />
<br />
<math>\mathcal{L}(\mathbf{\beta\,})=\prod_{i=1}^n (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y_i} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y_i}</math><br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) + (1-y_i) (\ln{1} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}))\right) \\[10pt]&{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - (1-y_i) \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) + y_i \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
<math>\begin{align} {\frac{\partial \ell}{\partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}} \mathbf{x_i} \right) \\[8pt] & {}= \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(\mathbf{x_i} | \mathbf{\beta\,}) \mathbf{x_i}\right) \end{align}</math><br />
<br />
<math>\frac{\partial \ell}{\partial \mathbf{\beta\,}} = 0</math> can be solved numerically by Newton’s Method.<br />
<br />
====Newton's Method====<br />
<br />
Newton's Method (or the Newton-Raphson method) is a numerical method for finding successively better approximations to the roots of a real-valued function, which usually cannot be found in closed form. <br />
<br />
The goal is to find <math>\mathbf{x}</math> such that <math>f(\mathbf{x}) = 0 </math>. The recursion can be implemented by<br />
<math>\mathbf{x}_{n+1} = \mathbf{x}_n - \frac{f(\mathbf{x}_n)}{f'(\mathbf{x}_n)}.\,\!</math><br />
<br />
It takes an initial guess <math>\mathbf{x}_0</math> and moves in the direction <math>-f(\mathbf{x}_{0}) / f' (\mathbf{x}_{0})</math> toward a better approximation <math>\mathbf{x}_1</math>. Taking <math>\mathbf{x}_1</math> as the new starting point and repeating the same step gives a still better approximation <math>\mathbf{x}_2</math>, and after enough iterations the result is sufficiently close to the actual solution.<br />
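<br />
A minimal Matlab sketch of Newton's method on a toy root-finding problem (<math>f(x)=x^2-2</math>, whose positive root is <math>\sqrt{2}</math>):<br />
<pre><br />
f      = @(x) x.^2 - 2;<br />
fprime = @(x) 2*x;<br />
x = 1;                            % initial guess x_0<br />
for iter = 1:10<br />
    x = x - f(x) / fprime(x);     % x_{n+1} = x_n - f(x_n)/f'(x_n)<br />
end<br />
x                                 % approximately 1.4142<br />
</pre><br />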
<br />
<br />
<br />
===Advantages of Logistic Regression===<br />
<br />
Logistic regression has several advantages over discriminant analysis: <br />
<br />
* it is more robust: the independent variables don't have to be normally distributed, or have equal variance in each group <br />
* It does not assume a linear relationship between the IV and DV <br />
* It may handle nonlinear effects <br />
* You can add explicit interaction and power terms <br />
* The DV need not be normally distributed. <br />
* There is no homogeneity of variance assumption. <br />
* Normally distributed error terms are not assumed. <br />
* It does not require that the independents be interval. <br />
* It does not require that the independents be unbounded.<br />
<br />
==Newton-Raphson Method (Lecture: Oct 11, 2011)==<br />
Previously we derived the log likelihood function for logistic regression. <br />
<br />
<math>\begin{align} L(\beta\,) = \prod_{i=1}^n \left( (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y_i}(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y_i} \right) \end{align}</math><br />
<br />
After taking log, we can have<br />
<br />
<math>\begin{align} \ell(\beta\,) = \sum_{i=1}^n \left( y_i \log{\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}} + (1 - y_i) \log{\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}} \right) \end{align}</math><br />
<br />
which implies that<br />
<br />
<math>\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}</math><br />
<br />
Our goal is to find the <math>\beta\,</math> that maximizes <math>{\ell(\mathbf{\beta\,})}</math>. We use calculus to do this, i.e. we solve <math>{\frac{\partial \ell}{\partial \mathbf{\beta\,}}}=0</math>. To do this we use the famous numerical method of Newton-Raphson. This is an iterative method where we calculate the first and second derivative at each iteration.<br />
<br />
The first derivative is typically called the score vector.<br />
<br />
<math>\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}} \mathbf{x_i} \right) \\[8pt] \end{align}</math><br />
<br />
The negative of the second derivative is typically called the information matrix.<br />
<br />
<math>\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}}) \right) \\[8pt] \end{align}</math><br />
<br />
We then use the following update formula to calculate successively better estimates of the optimal <math>\beta\,</math>. The choice of the initial estimate <math>\beta\,^{(1)}</math> is typically not important.<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + I^{-1}(\beta\,^{(r)} )S(\beta\,^{(r)} )</math><br />
<br />
====Matrix Notation====<br />
<br />
let <math>\mathbf{y}</math> be a (n x 1) vector of all class labels. This is called the response in other contexts.<br />
<br />
let <math>\mathbb{X}</math> be a (n x (d+1)) matrix of all your features. Each row represents a data point. Each column represents a feature/covariate.<br />
<br />
let <math>\mathbf{p}^{(r)}</math> be a (n x 1) vector with values <math> P(\mathbf{x_i} |\beta\,^{(r)} ) </math><br />
<br />
let <math>\mathbb{W}^{(r)}</math> be a (n x n) diagonal matrix with <math>\mathbb{W}_{ii}^{(r)} {}= P(\mathbf{x_i} |\beta\,^{(r)} )(1 - P(\mathbf{x_i} |\beta\,^{(r)} ))</math><br />
<br />
We can rewrite our score vector, information matrix and update equation in terms of this new matrix notation. The first derivative is<br />
<br />
<math>\begin{align} S(\beta\,^{(r)}) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})\end{align}</math><br />
<br />
and the information matrix (the negative of the second derivative) is<br />
<br />
<math>\begin{align} I(\beta\,^{(r)}) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X} \end{align}</math><br />
<br />
Therefore, we can fit the logistic regression model with the following update:<br />
<br />
<math> \beta\,^{(r+1)} {}= \beta\,^{(r)} + I^{-1}(\beta\,^{(r)} )S(\beta\,^{(r)} ) {}= \beta\,^{(r)} + (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})</math><br />
<br />
====Iteratively Re-weighted Least Squares====<br />
If we reorganize this updating formula, we can see that it is really iteratively solving a weighted least squares problem, each time with a new weighting.<br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}(\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X}\beta\,^{(r)} + \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)}))</math><br />
<br />
<math>\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T\mathbb{W}^{(r)}\mathbf{z}^{(r)}</math><br />
<br />
where <math> \mathbf{z}^{(r)} = \mathbb{X}\beta\,^{(r)} + (\mathbb{W}^{(r)})^{-1}(\mathbf{y}-\mathbf{p}^{(r)}) </math><br />
<br />
<br />
<br />
Recall that linear regression by least squares finds the following minimum: <math>\min_{\underline{\beta}}(\underline{y}-X^T \underline{\beta})^T(\underline{y}-X^T \underline{\beta})</math><br />
<br />
we have <math>\hat{\underline{\beta}}=(XX^T)^{-1}X\underline{y}</math><br />
<br />
Similarly, we can say that <math>\underline{\beta}^{new}</math> is the solution of a weighted least square problem:<br />
<br />
<math>\underline{\beta}^{(r+1)} \leftarrow \underset{\underline{\beta}}{\operatorname{arg\,min}}\,(Z-X \underline{\beta})^T W (Z-X \underline{\beta})</math><br />
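<br />
A minimal Matlab sketch of this iteratively re-weighted least squares update on hypothetical simulated data (rows of the design matrix are data points, with a trailing column of ones for the intercept):<br />
<pre><br />
n = 200;<br />
Xmat = [randn(n, 2), ones(n, 1)];                            % n by (d+1) design matrix<br />
beta_true = [1; -2; 0.5];                                    % hypothetical true coefficients<br />
y = double(rand(n, 1) < 1 ./ (1 + exp(-Xmat*beta_true)));    % simulated 0/1 labels<br />
beta = zeros(3, 1);<br />
for r = 1:20<br />
    p = 1 ./ (1 + exp(-Xmat * beta));              % P(x_i | beta)<br />
    W = diag(p .* (1 - p));                        % weighting matrix<br />
    z = Xmat * beta + W \ (y - p);                 % adjusted response z<br />
    beta = (Xmat' * W * Xmat) \ (Xmat' * W * z);   % weighted least squares step<br />
end<br />
beta                                               % estimated coefficients<br />
</pre><br />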
<br />
====Fisher Scoring Method==== <br />
<br />
Fisher Scoring is a method very similar to Newton-Raphson. It uses the expected information matrix as opposed to the observed information matrix. This distinction simplifies the problem and in particular the computational complexity. To learn more about this method and logistic regression in general, you can take Stat431/831 at the University of Waterloo.<br />
<br />
===Multi-class Logistic Regression===<br />
<br />
In a multi-class logistic regression we have K classes. For 2 classes ''k'' and ''l''<br />
<br />
<math>\frac{P(Y=l|X=x)}{P(Y=k|X=x)} = e^{\beta_l^T x}</math><br />
<br />
We call <math>log(\frac{P(Y=l|X=x)}{P(Y=k|X=x)}) = \beta_l^T x</math> the logit transformation. The decision boundary between the 2 classes is the set of points where the logit transformation is 0.<br />
<br />
For each class from 1 to K-1 we then have:<br />
<br />
<math>log(\frac{P(Y=1|X=x)}{P(Y=K|X=x)}) = \beta_1^T x</math><br />
<br />
<math>log(\frac{P(Y=2|X=x)}{P(Y=K|X=x)}) = \beta_2^T x</math><br />
<br />
<math>log(\frac{P(Y=K-1|X=x)}{P(Y=K|X=x)}) = \beta_{K-1}^T x</math><br />
<br />
Note that choosing ''Y=K'' is arbitrary and any other choice is equally valid.<br />
<br />
Based on the above the posterior probabilities are given by: <math>P(Y=k|X=x) = \frac{e^{\beta_k^T x}}{1 + \sum_{i=1}^{K-1}{e^{\beta_i^T x}}}</math><br />
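<br />
A minimal Matlab sketch (with hypothetical coefficients) of these posterior probabilities, taking class K as the reference class:<br />
<pre><br />
K = 3;  d = 2;                       % hypothetical sizes<br />
B = randn(d, K-1);                   % columns are beta_1, ..., beta_{K-1}<br />
x = randn(d, 1);<br />
e = exp(B' * x);                     % e^{beta_k^T x} for k = 1..K-1<br />
post = [e; 1] / (1 + sum(e));        % P(Y=k|X=x) for k = 1..K<br />
sum(post)                            % equals 1<br />
</pre><br />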
<br />
===Sample Size Requirements===<br />
<br />
The number of adjustable components in linear discriminant analysis is <math>\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2</math> where d is the dimension of the data. Similarly, the number of adjustable components in logistic regression is <math>\, d+1</math>. The number of components also corresponds to the minimum number of observations needed to compute the coefficients of each function. Techniques do exist though for handling high dimensional problems where the number of parameters exceeds the number of observations.<br />
<br />
Linear discriminant analysis involves the inversion of a d x d covariance matrix. When d is bigger than n, the number of observations, this matrix is large but has rank n < d and is thus singular. When this is the case, we can either use the pseudo-inverse or perform regularized discriminant analysis (RDA), which solves this problem. In RDA, we define a new covariance matrix <math>\, \Sigma(\gamma) = \gamma\Sigma + (1 - \gamma)\,\mathrm{diag}(\Sigma)</math> with <math>\, \gamma \in [0,1]</math>. Cross validation can be used to calculate the best <math>\, \gamma</math>.<br />
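<br />
A minimal Matlab sketch of this regularized covariance (hypothetical <math>\, \Sigma</math> and <math>\, \gamma</math>):<br />
<pre><br />
Sigma = cov(randn(50, 4));                                  % some d by d covariance matrix<br />
gamma = 0.7;                                                % hypothetical regularization weight<br />
Sigma_rda = gamma*Sigma + (1-gamma)*diag(diag(Sigma));      % shrink the off-diagonal entries<br />
</pre><br />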
<br />
<br />
===Comparison Between Logistic Regression And Linear Discriminant Analysis (LDA)===<br />
<br />
The logistic regression model and linear discriminant analysis are widely used to analyze data with categorical outcome variables. Both models build linear boundaries to classify different groups. Also, the categorical outcome variables (i.e. the dependent variables) must be mutually exclusive. <br />
<br />
However, these two models differ in their basic idea. While Logistic Regression is more relaxed and flexible in its assumptions, Linear Discriminant Analysis has the requirement that its explanatory variables must be normally distributed, linearly related and have equal covariance matrix within each class. Therefore, it can be expected that linear Discriminant Analysis should be more appropriate if the normality assumptions and equal covariance assumption are fulfilled in its explanatory variables. But in all other situations Logistic Regression should be appropriate. Besides, the total number of estimates to compute between these models is different. If the explanatory variables have d dimensions, we need to estimate <math>d+1</math> parameters in Logistic Regression and the number of parameters grows linearly w.r.t. dimension, while we need to estimate <math>2d+\frac{d*(d+1)}{2}+2</math> parameters in Linear Discriminant Analysis and the number of parameters grows quadratically w.r.t. dimension. <br />
<br />
== Perceptron (Lecture: Oct. 11, 2011) ==<br />
<br />
[[Image:Perceptron1.png|right|thumb|300px|Simple perceptron]]<br />
[[Image:Perceptron2.png|right|thumb|300px|Simple perceptron where <math>\beta_0</math> is defined as 1]]<br />
<br />
<br />
The perceptron is the building block for neural networks. It was invented by Rosenblatt in 1957 at Cornell Labs, and first mentioned in the paper "The Perceptron - a perceiving and recognizing automaton". The perceptron is used on linearly separable data sets.<br />
<br />
For a 2 class problem, and a set of inputs with ''d'' features, a perceptron will use a weighted sum and it will classify the information using the sign of the result. The figures on the right give an example of a perceptron. In these examples, <math>x^i</math> is the ''i''-th feature of a sample and <math>\beta_i</math> is the ''i''-th weight. <math>\beta_0</math> is defined as the bias. The bias alters the position of the decision boundary between the 2 classes.<br />
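<br />
A minimal Matlab sketch of the classic perceptron update rule on hypothetical linearly separable toy data (labels in {+1, -1}, with the bias folded in as an extra constant feature); this is one simple instance of the training procedure described above:<br />
<pre><br />
n = 100;<br />
X = [randn(2, n) + repmat([2; 2], 1, n), randn(2, n) - repmat([2; 2], 1, n)];<br />
X = [X; ones(1, 2*n)];                        % append a constant 1 feature for the bias<br />
y = [ones(1, n), -ones(1, n)];                % class labels in {+1, -1}<br />
beta = zeros(3, 1);<br />
eta = 0.1;                                    % learning rate<br />
for pass = 1:50<br />
    for i = 1:2*n<br />
        if sign(beta' * X(:, i)) ~= y(i)      % misclassified point<br />
            beta = beta + eta * y(i) * X(:, i);<br />
        end<br />
    end<br />
end<br />
</pre><br />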
<br />
Perceptrons are generally trained using [http://en.wikipedia.org/wiki/Gradient_descent gradient descent]. This type of learning can have 2 side effects:<br />
* If the data sets are well separated, the training of the perceptron can lead to multiple valid solutions,<br />
* If the data sets are not linearly separable, the learning algorithm will never finish.<br />
<br />
Perceptrons are the simplest kind of a feedforward neural network. A perceptron is the building block for other neural networks such as:<br />
* Multi-layer perceptron<br />
* ADALINE<br />
* MADALINE<br />
<br />
==References==<br />
<references /><br />
<br />
24. Balakrishnama, S., Ganapathiraju, A. LINEAR DISCRIMINANT ANALYSIS - A BRIEF TUTORIAL. http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf [[http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf]]</div>S9huhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=f11Stat841EditorSignUp&diff=11602f11Stat841EditorSignUp2011-09-23T18:42:19Z<p>S9hu: </p>
<hr />
<div>{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Date<br />
|width="200pt"|Name (1)<br />
|width="200pt"|Name (2)<br />
|-<br />
|Sep 20 ||Greg Pitt || <br />
|-<br />
|Sep 22 ||Gobaan Raveendran || <br />
|-<br />
|Sep 27 ||Mikhail Targonski || <br />
|-<br />
|Sep 29 ||Guoting (Jane) Chang || Mohamed El Massad<br />
|-<br />
|Oct 4 || Cameron Davidson-Pilon ||<br />
|-<br />
|Oct 6 ||Johnny Chow || Jennifer Smith<br />
|-<br />
|Oct 11 ||Daniel Nicoara || Samson Hu<br />
|-<br />
|Oct 13 ||Zhikang Huang ||<br />
|-<br />
|Oct 18 || Mahmoud Faraj ||<br />
|-<br />
|Oct 20 || Chunwei Lai ||<br />
|-<br />
|Oct 25 || Jeff Glaister || Steven Leigh<br />
|-<br />
|Oct 27 ||Nika Haghtalab || <br />
|-<br />
|Nov 3 || Robert Amelard || <br />
|-<br />
|Nov 10|| ||<br />
|-</div>S9huhttp://wiki.math.uwaterloo.ca/statwiki/index.php?title=signupformStat341F11&diff=11601signupformStat341F112011-09-23T18:41:44Z<p>S9hu: </p>
<hr />
<div>{| class="wikitable"<br />
<br />
{| border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Date<br />
|width="200pt"|Name (1)<br />
|width="200pt"|Name (2)<br />
|-<br />
|Sep 20 || acodd || <br />
|-<br />
|Sep 22 ||Samantha Rahman || <br />
|-<br />
|Sep 27 || Pu Zhao || <br />
|-<br />
|Sep 29 ||Adam Prins || <br />
|-<br />
|Oct 4 || Zhou Xiaojie || <br />
|-<br />
|Oct 6 || Joel Smith || <br />
|-<br />
|Oct 11 ||Choi Chek Hin || <br />
|-<br />
|Oct 13 ||Matthew Tacchino || <br />
|-<br />
|Oct 18 ||Yin Jie Xu || <br />
|-<br />
|Oct 20 ||Li Fangzhou|| <br />
|-<br />
|Oct 25 || Valentin Cardinale || <br />
|-<br />
|Oct 27 || George Li || <br />
|-<br />
|Nov 1 ||Marie-Sarah Lacharité || <br />
|-<br />
|Nov 3 || Samson Hu || <br />
|-<br />
|Nov 8 ||Patrick Dornian || <br />
|-<br />
|Nov 10 || || <br />
|-<br />
|Nov 15 || Lindsay Millard || <br />
|-<br />
|Nov 17 || || <br />
|-<br />
|Nov 22 || || <br />
|-<br />
|Nov 24 || || <br />
|-<br />
|Nov 29 ||Han Li || <br />
|-<br />
|Dec 1 || || <br />
|-</div>S9hu