=== Final ===
Saturday, August 10, 2013 from 7:30pm-10:00pm
  
 
=== TA(s): ===
 
if y = ax + b, then <math>b:=y \mod a</math>. <br />

'''Example 1:'''<br />

<math>30 = 4 \cdot 7 + 2</math><br />
<br />
'''Example 2:'''<br />

If <math>23 = 3 \cdot 6 + 5</math> <br />
Then equivalently, <math>3 := -37\mod 40</math><br />

'''Example 3:'''<br />

<math>77 = 3 \cdot 25 + 2</math><br />

<math>2 := 77\mod 3</math><br />
<br />
<math>25 = 25 \cdot 1 + 0</math><br />

<math>0 := 25\mod 25</math><br />
<br />
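
These identities can be checked directly in MATLAB with the built-in mod function (a minimal sketch; the calls below are ours, not from the original notes):
<pre style="font-size:16px">
>> mod(30, 7)     % returns 2, since 30 = 4*7 + 2
>> mod(23, 6)     % returns 5, since 23 = 3*6 + 5
>> mod(-37, 40)   % returns 3; MATLAB's mod is non-negative for a positive divisor
>> mod(77, 3)     % returns 2, since 77 = 3*25 + 2
>> mod(25, 25)    % returns 0
</pre>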
  
  
==== Mixed Congruential Algorithm ====
We define the Linear Congruential Method to be <math>x_{k+1}=(ax_k + b) \mod m</math>, where <math>x_k, a, b, m \in \N, \;\text{with}\; a, m \neq 0</math>. Given a '''seed''' (i.e. an initial value <math>x_0 \in \N</math>), we can obtain values for <math>x_1, \, x_2, \, \cdots, x_n</math> inductively. The Multiplicative Congruential Method, invented by Berkeley professor D. H. Lehmer, may also refer to the special case where <math>b=0</math>, and the Mixed Congruential Method is the case where <math>b \neq 0</math>. Its title as "mixed" arises from the fact that it has both a multiplicative and an additive term.<br />

An interesting fact about the '''Linear Congruential Method''' is that it is one of the oldest and best-known pseudorandom number generator algorithms. It is very fast and requires minimal memory to retain state. However, this method should not be used for applications that require high randomness, such as Monte Carlo simulation and cryptographic applications. (Monte Carlo simulation will consider possibilities for every choice of consideration, and it shows the extreme possibilities. This method is not precise enough.)<br />

'''First consider the following algorithm'''<br />
<math>x_{k+1}=x_{k} \mod m</math> <br />

such that: if <math>x_{0}=5 \mod 150</math> and <math>x_{n}=3x_{n-1} \mod 150</math>, find <math>x_{1},x_{8},x_{9}</math>. <br />
<math>x_{n}=(3^n \cdot 5) \mod 150</math> <br />
<math>x_{1}=15,x_{8}=105,x_{9}=15</math> <br />
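
A quick MATLAB check of this recurrence (a minimal sketch; the loop and variable names are ours):
<pre style="font-size:16px">
x = zeros(1,9);
x(1) = mod(3*5, 150);           % x_1 from the seed x_0 = 5
for n = 2:9
    x(n) = mod(3*x(n-1), 150);  % x_n = 3*x_{n-1} mod 150
end
x([1 8 9])                      % displays x_1, x_8, x_9: 15, 105, 15
</pre>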
 
  
  
2. close all: closes all figures.<br />
3. who: displays all defined variables.<br />
4. clc: clears screen.<br />
5. ; : prevents the results from printing.<br />
6. disttool: displays a graphing tool.<br /><br />
  
 
  
 
'''Comments:'''<br />

Matlab code:
<pre style="font-size:16px">
a=5;
b=7;
m=200;
x(1)=3;
for ii=2:1000
    x(ii)=mod(a*x(ii-1)+b,m);
end
size(x);
hist(x)
</pre>
 
Typically, it is good to choose <math>m</math> such that <math>m</math> is large and <math>m</math> is prime. Careful selection of parameters '<math>a</math>' and '<math>b</math>' also helps generate relatively "random" output values, where it is harder to identify patterns. For example, when we used a composite (non-prime) number such as 40 for <math>m</math>, our results were not satisfactory in producing an output resembling a uniform distribution.<br />

Another algorithm for generating pseudorandom numbers is the multiply-with-carry (MWC) method. Its simplest form is similar to the linear congruential generator. They differ in that the parameter b changes in the MWC algorithm. It is as follows: <br>

1.) x<sub>k+1</sub> = (ax<sub>k</sub> + b<sub>k</sub>) mod m <br>
2.) b<sub>k+1</sub> = floor((ax<sub>k</sub> + b<sub>k</sub>)/m) <br>
3.) set k to k + 1 and go to step 1

[http://www.javamex.com/tutorials/random_numbers/multiply_with_carry.shtml Source]
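
As an illustration, here is a minimal MATLAB sketch of the multiply-with-carry recursion above (the parameter values a, m and the seeds are arbitrary choices of ours, not from the source):
<pre style="font-size:16px">
a = 6;  m = 2^16;       % multiplier and modulus (illustrative values)
x = 3;  b = 1;          % seed x_0 and initial carry b_0
for k = 1:10
    t = a*x + b;        % common quantity used by both updates
    x = mod(t, m);      % step 1: x_{k+1} = (a*x_k + b_k) mod m
    b = floor(t/m);     % step 2: b_{k+1} = floor((a*x_k + b_k)/m)
    disp(x)
end
</pre>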
  
 
=== Inverse Transform Method ===

'''Proof of the theorem:'''<br />
The generalized inverse satisfies the following: <br />
<math>F^{-1}\left(u\right) \leq x \Leftrightarrow u \leq F\left(x\right)</math><br />

:<math>P(X\leq x)</math> <br />
<math>= P(F^{-1}(U)\leq x)</math> (since <math>X= F^{-1}(U)</math> by the inverse method)<br />
<math>= P(F(F^{-1}(U))\leq F(x))</math> (since <math>F</math> is monotonically increasing) <br />
<math>= P(U\leq F(x)) </math> (since <math> P(U\leq a)= a</math> for <math>U \sim U(0,1), a \in [0,1]</math>)<br />
<math>= F(x), \text{ where } 0 \leq F(x) \leq 1 </math> <br />

This is the c.d.f. of X. <br />
<br />
<pre style="font-size:16px">
>>u=rand(1,1000);
>>hist(u)       % this will generate a fairly uniform diagram
</pre>
[[File:ITM_example_hist(u).jpg|300px]]
Sol:
Let <math>y=x^5</math>, solve for x: <math>x=y^\frac {1}{5}</math>. Therefore, <math>F^{-1} (x) = x^\frac {1}{5}</math><br />
Hence, to obtain a value of x from F(x), we first draw 'u' from a uniform distribution, then obtain the inverse function of F(x), and set
<math>x= u^\frac{1}{5}</math><br /><br />
  
== Class 3 - Tuesday, May 14 ==
=== Recall the Inverse Transform Method ===
Let U ~ Unif(0,1), then the random variable X = F<sup>-1</sup>(U) has distribution F. <br />
To sample X with CDF F(x), <br />

'''1) Draw <math>U \sim Unif [0,1]</math>'''<br />
'''2) X = F<sup>-1</sup>(u) '''<br />
  
  
'''Proof''' <br />
First note that <math>P(U\leq a)=a, \forall a\in[0,1]</math> <br />

:<math>P(X\leq x)</math> <br />
<math>= P(F^{-1}(U)\leq x)</math> (since <math>X= F^{-1}(U)</math> by the inverse method)<br />
<math>= P(F(F^{-1}(U))\leq F(x))</math> (since <math>F</math> is monotonically increasing) <br />
<math>= P(U\leq F(x)) </math> (since <math> P(U\leq a)= a</math> for <math>U \sim U(0,1), a \in [0,1]</math>, as noted above)<br />
<math>= F(x), \text{ where } 0 \leq F(x) \leq 1 </math> <br />

This is the c.d.f. of X. <br />
<br />
  
Note that after generating a random U, the value of X can be determined by finding the interval <math>[F(x_{j-1}),F(x_{j})]</math> in which U lies. <br />

In summary, to generate a discrete r.v. X that has pmf:<br />
  P(X=x<sub>i</sub>)=P<sub>i</sub>,    x<sub>0</sub> < x<sub>1</sub> < x<sub>2</sub> < ... <br />
1. Draw U~U(0,1);<br />
2. If F(x<sub>i-1</sub>) < U <= F(x<sub>i</sub>), set x=x<sub>i</sub>.<br />
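
A minimal MATLAB sketch of this summary procedure (the pmf p = [0.2 0.5 0.3] on the points 1, 2, 3 is our own illustrative choice):
<pre style="font-size:16px">
p = [0.2 0.5 0.3];            % pmf on the points x = 1, 2, 3
F = cumsum(p);                % CDF values F(x_1), F(x_2), F(x_3)
for ii = 1:1000
    u = rand;
    x(ii) = find(u <= F, 1);  % first j with U <= F(x_j)
end
hist(x)
</pre>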
  
  
'''Problems'''<br />
Though this method is very easy to use and apply, it does have a major disadvantage/limitation:

We need to find the inverse cdf <math>F^{-1}(\cdot)</math>. In some cases the inverse function does not exist, or is difficult to find because it requires a closed form expression for F(x).
For example, it is too difficult to find the inverse cdf of the Gaussian distribution, so we must find another method to sample from the Gaussian distribution.
In conclusion, we need to find another way of sampling from more complicated distributions.
 
Flipping a coin is a discrete case of uniform distribution, and the code below shows an example of flipping a coin 1000 times; the result is close to the expected value 0.5.<br>
Example 2, as another discrete distribution, shows that we can sample from parts like 0, 1 and 2, and the probability of each part or each trial is the same.<br>

3. Mixed continuous and discrete

'''Problems with Inverse-Transform Approach'''

1. Must invert the CDF, which may be difficult (requiring numerical methods).

2. May not be the fastest or simplest approach for a given distribution.

'''Advantages of Inverse-Transform Method'''
== Class 4 - Thursday, May 16 ==

'''Goals'''<br>
*When we want to sample from a target distribution <math>f(x)</math>, we need to first find a proposal distribution <math>g(x)</math> that is easy to sample from. <br>
*Relationship between the proposal distribution and target distribution is: <math> c \cdot g(x) \geq f(x) </math>, where c is a constant. This means that the area of f(x) is under the area of <math> c \cdot g(x)</math>. <br>
*Chance of acceptance is less if the distance between <math>f(x)</math> and <math> c \cdot g(x)</math> is big, and vice-versa; we use <math> c </math> to keep <math> \frac {f(x)}{c \cdot g(x)} </math> below 1 (so <math>f(x) \leq c \cdot g(x)</math>). Therefore, we must find the constant <math> c </math> to achieve this.<br />
*In other words, <math>c</math> is chosen to make sure <math> c \cdot g(x) \geq f(x) </math>. However, it will not make sense if <math>c</math> is simply chosen to be arbitrarily large. We need to choose <math>c</math> such that <math>c \cdot g(x)</math> fits <math>f(x)</math> as tightly as possible. This means that we must find the minimum c such that the area of f(x) is under the area of <math>c \cdot g(x)</math>. <br />
*The constant c cannot be a negative number.<br />
  
 
'''How to find C''':<br />

<math>\begin{align}
&c \cdot g(x) \geq f(x)\\
&\Rightarrow c \geq \frac{f(x)}{g(x)}\\
&c= \max \left(\frac{f(x)}{g(x)}\right)
\end{align}</math><br>

If <math>f</math> and <math> g </math> are continuous, we can find the extremum by taking the derivative and solving for <math>x_0</math> such that:<br/>
<math> 0=\frac{d}{dx}\frac{f(x)}{g(x)}|_{x=x_0}</math> <br/>

Thus <math> c = \frac{f(x_0)}{g(x_0)} </math><br/>
  
Note: This procedure is called the Acceptance-Rejection Method.<br>

'''The Acceptance-Rejection method''' involves finding a distribution that we know how to sample from, g(x), and multiplying g(x) by a constant c so that <math>c \cdot g(x)</math> is always greater than or equal to f(x). Mathematically, we want <math> c \cdot g(x) \geq f(x) </math>.
This means c has to be greater than or equal to <math>\frac{f(x)}{g(x)}</math>, so the smallest possible c that satisfies the condition is the maximum value of <math>\frac{f(x)}{g(x)}</math>.<br/>
If c is too large, the chance of acceptance of generated values will be small, thereby losing efficiency of the algorithm. Therefore, it is best to get the smallest possible c such that <math> c \cdot g(x) \geq f(x)</math>. <br>

'''Important points:'''<br>

*For this method to be efficient, the constant c must be selected so that the rejection rate is low. (The efficiency for this method is <math>\left ( \frac{1}{c} \right )</math>)<br>
*It is easy to show that the expected number of acceptances in a given number of trials is <math> \frac{\text{Total Number of Trials}}{c} </math>. <br>
*Recall the '''acceptance rate is 1/c'''. (Not the rejection rate)
:Let <math>X</math> be the number of trials for an acceptance, <math> X \sim~ Geo(\frac{1}{c})</math><br>
:<math>\mathbb{E}[X] = \frac{1}{\frac{1}{c}} = c </math>
*The number of trials needed to generate a sample size of <math>N</math> follows a negative binomial distribution. The expected number of trials needed is then <math>cN</math>.<br>
*So far, the only distribution we know how to sample from is the '''UNIFORM''' distribution. <br>

'''Procedure''': <br>
1. Choose <math>g(x)</math> (a simple density function that we know how to sample from, i.e. Uniform so far) <br>
The easiest case is <math>U \sim Unif [0,1]</math>. However, in other cases we need to generate Unif(a,b). We may need to perform a linear transformation on the <math>U \sim Unif [0,1]</math> variable. <br>
2. Find a constant c such that <math> c \cdot g(x) \geq f(x) </math>, otherwise return to step 1.

#Generate <math>Y \sim~ g</math> and <math>U \sim~ Unif(0,1)</math>
#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else return to step 1 (This is not the way to find c. This is the general procedure.)
  

<hr><b>Example:</b><br>
Generate a random variable from the pdf<br>
<math> f(x) =
\begin{cases}
 
[[File:Beta(2,1)_example.jpg|750x750px]]

'''Note:''' g follows a uniform distribution; it only covers half of the graph, which runs from 0 to 1 on the y-axis. Thus we need to multiply by c to ensure that <math>c\cdot g</math> can cover the entire f(x) area. In this case, c=2, so that makes g run from 0 to 2 on the y-axis, which covers f(x).

'''Comment:'''<br>
From the picture above, we could observe that the area under f(x)=2x is half of the area under the pdf of Unif(0,1). This is why in order to sample 1000 points of f(x), we need to sample approximately 2000 points in Unif(0,1).
And in general, if we want to sample n points from a distribution with pdf f(x), we need to sample approximately <math>n\cdot c</math> points from the proposal distribution (g(x)) in total. <br>
</ol>

'''Note:''' In the above example, we sample 2 numbers. If the second number (u) is less than or equal to the first number (y), then accept x=y; if not, start over.

<span style="font-weight:bold;color:green;">Matlab Code</span>
=====Example of Acceptance-Rejection Method=====

<math>f(x) = 3x^2, \; 0<x<1</math><br />

<math>g(x)=1, \; 0<x<1</math><br />

<math>c = \max \frac{f(x)}{g(x)} = \max \frac{3x^2}{1} = 3 </math><br>

1. Generate two uniform numbers in the unit interval <math>U_1, U_2 \sim~ U(0,1)</math><br>
2. If <math>U_2 \leqslant {U_1}^2</math>, accept <math>U_1</math> as the random variable with pdf <math>f</math>; if not, return to Step 1

We can also use <math>g(x)=2x</math> for a more efficient algorithm:

<math>c = \max \frac{f(x)}{g(x)} = \max \frac {3x^2}{2x} = \max \frac {3x}{2} = \frac{3}{2}</math>.
Use the inverse method to sample from <math>g(x)</math>:
<math>G(x)=x^2</math>.
Generate <math>U</math> from <math>U(0,1)</math> and set <math>x=\sqrt{u}</math>

1. Generate two uniform numbers in the unit interval <math>U_1, U_2 \sim~ U(0,1)</math><br>
2. If <math>U_2 \leq \sqrt{U_1}</math> (that is, <math>U_2 \leq \frac{f(\sqrt{U_1})}{c \cdot g(\sqrt{U_1})}</math>), accept <math>X=\sqrt{U_1}</math> as the random variable with pdf <math>f</math>; if not, return to Step 1
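
A minimal MATLAB sketch of the first version (proposal g = U(0,1), c = 3); the loop mirrors the style of the other Matlab code in these notes:
<pre style="font-size:16px">
ii = 1;
while ii <= 1000
    u1 = rand;          % candidate from g = U(0,1)
    u2 = rand;          % uniform draw for the accept/reject test
    if u2 <= u1^2       % f(u1)/(c*g(u1)) = 3*u1^2/3 = u1^2
        x(ii) = u1;
        ii = ii + 1;
    end
end
hist(x)
</pre>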
  
*Note: the function <math>q(x) = c \cdot g(x)</math> is called an envelope or majorizing function.<br>
To obtain a better proposal function <math>g(x)</math>, we can first assume a new <math>q(x)</math> and then solve for the normalizing constant by integrating.<br>
In the previous example, we first assume <math>q(x) = 3x</math>. To find the normalizing constant, we need to solve <math>k \int_0^1 3x \,dx = 1</math>, which gives us k = 2/3. So, <math>g(x) = k \cdot q(x) = 2x</math>.

*Source: http://www.cs.bgu.ac.il/~mps042/acceptance.htm
  
 
'''Possible Limitations'''

3) A constant c where <math>f(x)\leq c\cdot g(x)</math><br/>
4) A uniform draw<br/>

==== Interpretation of 'C' ====
  
 
In order to ensure the algorithm is as efficient as possible, the 'C' value should be as close to one as possible, such that <math>\tfrac{1}{c}</math> approaches 1 => 100% acceptance rate.

<pre style="font-size:16px">
>> close all
>> clear all
>> ii = 1;
>> while ii < 1000
      y = rand;          % candidate from g = U(0,1)
      u = rand;          % uniform draw for the accept/reject test
      if u <= y          % accept with probability f(y)/(c*g(y)) = y
          x(ii) = y;
          ii = ii + 1;
      end
   end
</pre>
  
 
== Class 5 - Tuesday, May 21 ==

<pre style="font-size:16px">
>>hist(x,30)                %30 is the number of bars
</pre>

Calculation process:
<math>u_{1} \leq \sqrt{1-(2u-1)^2} </math> <br>
<math>u_{1}^2 \leq 1-(2u-1)^2 </math> <br>
<math>u_{1}^2 - 1 \leq -(2u-1)^2 </math> <br>
<math>1-u_{1}^2 \geq (2u-1)^2 </math> <br>

MATLAB tips: hist(x,y) plots a histogram of variable x, where y is the number of bars in the graph.
>>close all
>>clear all
>>p=[.1 .3 .6];      %This is a vector holding the probabilities
>>ii=1;
>>while ii < 1000
     y=unidrnd(3);    %generates a random number from the discrete uniform distribution on {1,2,3}
     u=rand;
     if u<= p(y)/0.6  %accept y with probability p(y)/(c*g(y)), where c*g(y) = 1.8*(1/3) = 0.6
       x(ii)=y;
       ii=ii+1;
     end
   end

* '''Example 3'''<br>

Suppose <math>p_{x} = e^{-3}3^{x}/x! ,\; x\geq 0</math> (Poisson distribution)

'''First:''' Try the first few <math>p_{x}</math>'s: 0.0498, 0.149, 0.224, 0.224, 0.168, 0.101, 0.0504, 0.0216, 0.0081, 0.0027 for <math>x = 0,1,2,3,4,5,6,7,8,9</math><br>

'''Proposed distribution:''' Use the geometric distribution for <math>g(x)</math>;<br>

<math>g(x)=p(1-p)^{x}</math>, choose <math>p=0.25</math><br>

Look at <math>p_{x}/g(x)</math> for the first few numbers: 0.199, 0.797, 1.59, 2.12, 2.12, 1.70, 1.13, 0.647, 0.324, 0.144 for <math>x = 0,1,2,3,4,5,6,7,8,9</math><br>

We want <math>c=\max(p_{x}/g(x))</math>, which is approximately 2.12<br>

'''The general procedure to generate <math>p(x)</math> is as follows:'''

1. Generate <math>U_{1} \sim~ U(0,1); U_{2} \sim~ U(0,1)</math><br>

2. <math>j = \lfloor \frac{\ln(U_{1})}{\ln(0.75)} \rfloor+1</math><br>

3. If <math>U_{2} < \frac{p_{j}}{c \cdot g(j)}</math>, set <math>X = x_{j}</math>; else go to step 1.

Note: In this case, <math>f(x)/g(x)</math> is extremely difficult to differentiate, so we were required to test points. If the function is easy to differentiate, we can calculate the max as if it were a continuous function, then check the two surrounding points for which is the highest discrete value.

Source: http://www.math.wsu.edu/faculty/genz/416/lect/l04-46.pdf
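
A rough MATLAB sketch of this example, assuming the geometric candidate starts at 0 so that x_j = j (this indexing, and the loop itself, are our own illustration):
<pre style="font-size:16px">
c = 2.12;  p = 0.25;
ii = 1;
while ii <= 1000
    u1 = rand;  u2 = rand;
    j = floor(log(u1)/log(1-p));       % geometric candidate on {0,1,2,...}
    pj = exp(-3)*3^j/factorial(j);     % target pmf p_j (Poisson(3))
    gj = p*(1-p)^j;                    % proposal pmf g(j)
    if u2 < pj/(c*gj)
        x(ii) = j;
        ii = ii + 1;
    end
end
hist(x)
</pre>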
  
 
*'''Example 4''' (Hypergeometric & Binomial)<br>

The CDF of the Gamma distribution <math>Gamma(t,\lambda)</math> is (t denotes the shape, <math>\lambda</math> denotes the scale): <br>
<math> F(x) = \int_0^{x} \frac{e^{-y}y^{t-1}}{(t-1)!} \mathrm{d}y, \; \forall x \in (0,+\infty)</math>, where <math>t \in \N^+ \text{ and } \lambda \in (0,+\infty)</math>.<br>

Note that the CDF of the Gamma distribution does not have a closed form.

The gamma distribution is often used to model waiting times between a certain number of events. It can also be expressed as a sum of independent and identically distributed exponential distributions. This distribution has two parameters: the number of exponential terms n, and the rate parameter <math>\lambda</math>. In this distribution there is the Gamma function, <math>\Gamma</math>, which has some very useful properties. "Source: STAT 340 Spring 2010 Course Notes" <br/>
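
Although the CDF has no closed form, for integer t a Gamma(t, λ) variate can be generated as the sum of t independent Exp(λ) variates (a minimal sketch; t = 5 and λ = 2 are arbitrary illustrative values, with λ treated as the rate):
<pre style="font-size:16px">
t = 5;  lambda = 2;          % shape and rate (illustrative values)
u = rand(t, 1000);           % t x 1000 matrix of uniform draws
x = -log(prod(u))/lambda;    % each column sums t Exp(lambda) draws
hist(x, 30)
</pre>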
:<math>f(x) = \frac{1}{\sqrt{2\pi}}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} x^2}</math>

*Warning: the General Normal distribution is:

<table>
<tr>

Let <math> \theta </math> and R denote the polar coordinates of the vector (X, Y),
where <math> X = R \cdot \sin\theta </math> and <math> Y = R \cdot \cos \theta </math>

[[File:rtheta.jpg]]
We know that
<math>R^{2}= X^{2}+Y^{2}</math> and <math> \tan(\theta) = \frac{y}{x} </math>, where X and Y are two independent standard normals:
:<math>f(x) = \frac{1}{\sqrt{2\pi}}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} x^2}</math>
:<math>f(y) = \frac{1}{\sqrt{2\pi}}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} y^2}</math>
:<math>f(x,y) = \frac{1}{\sqrt{2\pi}}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} x^2} * \frac{1}{\sqrt{2\pi}}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} y^2}=\frac{1}{2\pi}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} (x^2+y^2)} </math><br /> - since for independent distributions, their joint probability function is the product of the two individual probability functions. It can also be shown using a 1-1 transformation that the joint distribution of R and θ is given by the following.

'''Let <math>d=R^2</math>'''<br />
 +
 
 
  <math>x= \sqrt {d}\cos \theta </math>
  <math>y= \sqrt {d}\sin \theta </math>
then
<math>\left| J\right| = \left| \dfrac {1} {2}d^{-\frac {1} {2}}\cos \theta d^{\frac{1}{2}}\cos \theta +\sqrt {d}\sin \theta \dfrac {1} {2}d^{-\frac{1}{2}}\sin \theta \right| = \dfrac {1} {2}</math>
It can be shown that the joint density of <math>d=R^2</math> and <math> \theta </math> is:
:<math>\begin{matrix} f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi},\quad d = R^2 \end{matrix},\quad for\quad 0\leq d<\infty\ and\quad 0\leq \theta\leq 2\pi </math>
  
Note that <math> \begin{matrix}f(r,\theta)\end{matrix}</math> consists of two density functions, Exponential and Uniform, so assuming that r and <math>\theta</math> are independent
<math> \begin{matrix} \Rightarrow d \sim~ Exp(1/2), \theta \sim~ Unif[0,2\pi] \end{matrix} </math>
::* <math> \begin{align} R^2 = d = x^2 + y^2 \end{align} </math>
::* <math> \tan(\theta) = \frac{y}{x} </math>
<math>\begin{align} f(d) = Exp(1/2)=\frac{1}{2}e^{-\frac{d}{2}}\ \end{align}</math>
<math>\begin{align} f(\theta) =\frac{1}{2\pi}\ \end{align}</math>
<br>

To sample from the normal distribution, we can generate a pair of independent standard normal X and Y by:<br />

1) Generating their polar coordinates<br />
2) Transforming back to rectangular (Cartesian) coordinates.<br />
  

'''Alternative Method of Generating Standard Normal Random Variables'''<br />

Step 1: Generate <math>u_{1}</math> ~ <math>Unif(0,1)</math><br />
Step 2: Generate <math>Y_{1}</math> ~ <math>Exp(1)</math>, <math>Y_{2}</math> ~ <math>Exp(1)</math><br />
Step 3: If <math>Y_{2} \geq (Y_{1}-1)^2/2</math>, set <math>V=Y_{1}</math>; otherwise, go to step 1<br />
Step 4: If <math>u_{1} \leq 1/2</math>, then <math>X=-V</math>; otherwise, <math>X=V</math><br />
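
A minimal MATLAB sketch of this alternative method for one draw (the code is ours; -log(rand) is used to produce the Exp(1) variates):
<pre style="font-size:16px">
while true
    y1 = -log(rand);            % Y1 ~ Exp(1)
    y2 = -log(rand);            % Y2 ~ Exp(1)
    if y2 >= (y1 - 1)^2/2       % accept |X| = Y1
        v = y1;
        break
    end
end
u1 = rand;
if u1 <= 1/2                    % attach a random sign
    x = -v;
else
    x = v;
end
</pre>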

===Expectation of a Standard Normal distribution===
The expectation of a standard normal distribution is 0<br />

'''Proof:''' <br />
  
 
:<math>\operatorname{E}[X]= \;\int_{-\infty}^{\infty} x \frac{1}{\sqrt{2\pi}}  e^{-x^2/2} \, dx.</math>
 
:<math>\operatorname{E}[X]= \;\int_{-\infty}^{\infty} x \frac{1}{\sqrt{2\pi}}  e^{-x^2/2} \, dx.</math>
Line 1,951: Line 2,040:
 
:<math>= - \left[\phi(x)\right]_{-\infty}^{\infty}</math>
 
:<math>= - \left[\phi(x)\right]_{-\infty}^{\infty}</math>
 
:<math>= 0</math><br />
 
:<math>= 0</math><br />

'''Note:''' More intuitively, the integrand <math>x e^{-x^2/2}</math> is an odd function, so integrating it over the symmetric support <math>(-\infty,\infty)</math> returns 0. Its antiderivative involves the even function <math>e^{-x^2/2}</math> (f(x)=f(-x)), which takes the same value at both limits. This is in relation to the symmetrical properties of the standard normal distribution.<br />


'''Procedure (Box-Muller Transformation Method):''' <br />
 +
 
 
Pseudorandom approaches to generating normal random variables used to be limited. Inefficient methods such as inverse Gaussian function, sum of uniform random variables, and acceptance-rejection were used. In 1958, a new method was proposed by George Box and Mervin Muller of Princeton University. This new technique was easy to use and also had the accuracy to the inverse transform sampling method that it grew more valuable as computers became more computationally astute. <br>
 
Pseudorandom approaches to generating normal random variables used to be limited. Inefficient methods such as inverse Gaussian function, sum of uniform random variables, and acceptance-rejection were used. In 1958, a new method was proposed by George Box and Mervin Muller of Princeton University. This new technique was easy to use and also had the accuracy to the inverse transform sampling method that it grew more valuable as computers became more computationally astute. <br>
 
The Box-Muller method takes a sample from a bivariate independent standard normal distribution, each component of which is thus a univariate standard normal. The algorithm is based on the following two properties of the bivariate independent standard normal distribution: <br>
 
The Box-Muller method takes a sample from a bivariate independent standard normal distribution, each component of which is thus a univariate standard normal. The algorithm is based on the following two properties of the bivariate independent standard normal distribution: <br>
 
if <math>Z = (Z_{1}, Z_{2}</math>) has this distribution, then <br>
 
if <math>Z = (Z_{1}, Z_{2}</math>) has this distribution, then <br>
 +
 
1.<math>R^2=Z_{1}^2+Z_{2}^2</math> is exponentially distributed with mean 2, i.e. <br>
 
1.<math>R^2=Z_{1}^2+Z_{2}^2</math> is exponentially distributed with mean 2, i.e. <br>
 
<math>P(R^2 \leq x) = 1-e^{-x/2}</math>. <br>
 
<math>P(R^2 \leq x) = 1-e^{-x/2}</math>. <br>
 
2.Given <math>R^2</math>, the point <math>(Z_{1},Z_{2}</math>) is uniformly distributed on the circle of radius R centered at the origin. <br>
 
2.Given <math>R^2</math>, the point <math>(Z_{1},Z_{2}</math>) is uniformly distributed on the circle of radius R centered at the origin. <br>
 
We can use these properties to build the algorithm: <br>
 
We can use these properties to build the algorithm: <br>
 +
  
 
1) Generate random number <math> \begin{align} U_1,U_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
 
1) Generate random number <math> \begin{align} U_1,U_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />

'''Note:''' In steps 2 and 3, we are using a similar technique as that used in the inverse transform method. <br />
The Box-Muller Transformation Method generates a pair of independent standard normal distributions, X and Y (using the transformation to polar coordinates). <br />

If you want to generate a number of independent standard normal distributed numbers (more than two), you can run the Box-Muller method several times.<br/>
For example: <br />

'''Matlab Code'''<br />

<pre style="font-size:16px">
>>close all
>>hist(y)
</pre>
<br>
'''Remember''': For the above code to work, the "." needs to be after the d to ensure that each element of d is raised to the power of 0.5.<br /> Otherwise MATLAB will raise the entire matrix to the power of 0.5.<br>

'''Note:'''<br>the first graph is hist(tet) and it is a uniform distribution.<br>The second one is hist(d) and it is an exponential distribution.<br>The third one is hist(x) and it is a normal distribution.<br>The last one is hist(y) and it is also a normal distribution.

Attention: There is a "dot" between sqrt(d) and "*". It is because d and tet are vectors. <br>
>>hist(x)
>>hist(x+2)
>>hist(x*2+2)
</pre>

<br>
'''Note:'''<br>
1. randn draws a random sample from a standard normal distribution.<br />
2. hist(x+2) will be centered at 2 instead of at 0. <br />
3. hist(x*2+2) is also centered at 2. The mean doesn't change, but the variance of x*2+2 becomes four times (2^2) the variance of x.<br />
[[File:Normal_x.jpg|300x300px]][[File:Normal_x+2.jpg|300x300px]][[File:Normal(2x+2).jpg|300px]]
<br />
  
<b>Comment</b>:<br />
Box-Muller transformations are not computationally efficient. The reason for this is the need to compute sine and cosine functions. A way to get around this time-consuming difficulty is by an indirect computation of the sine and cosine of a random angle (as opposed to a direct computation which generates U and then computes the sine and cosine of 2πU). <br />
 
'''Alternative Methods of generating normal distribution'''<br />

1. Even though we cannot use the inverse transform method, we can approximate this inverse using different functions. One method would be '''rational approximation'''.<br />
2. '''Central limit theorem''': If we sum 12 independent U(0,1) random variables and subtract 6 (which is E(u<sub>i</sub>)*12), we will approximately get a standard normal distribution.<br />
 
=== Proof of Box Muller Transformation ===

'''Definition:'''<br />
A transformation which transforms from a '''two-dimensional continuous uniform''' distribution to a '''two-dimensional bivariate normal''' distribution (or complex normal distribution).
  
       u<sub>2</sub> = g<sub>2</sub><sup>-1</sup>(x1,x2)

Inverting the above transformation, we have
     u<sub>1</sub> = exp{-(x<sub>1</sub><sup>2</sup> + x<sub>2</sub><sup>2</sup>)/2}
     u<sub>2</sub> = (1/2pi)*tan<sup>-1</sup>(x<sub>2</sub>/x<sub>1</sub>)
Procedure:

1) Generate U~Unif (0, 1)<br>
2) Set <math>x=F^{-1}(u)</math><br>
3) X~f(x)<br>
  
 
'''Remark'''<br>
 
'''Remark'''<br>
1) The preceding can be written algorithmically as
+
1) The preceding can be written algorithmically for discrete random variables as <br>
Generate a random number U
+
Generate a random number U ~ U(0,1] <br>
If U<<sub>p0</sub> set X=<sub>x0</sub> and stop
+
If U < p<sub>0</sub> set X = x<sub>0</sub> and stop <br>
If U<<sub>p0</sub>+<sub>p1</sub> set X=x1 and stop
+
If U < p<sub>0</sub> + p<sub>1</sub> set X = x<sub>1</sub> and stop <br>
...
+
... <br>
2) If the <sub>xi</sub>, i>=0, are ordered so that <sub>x0</sub><<sub>x1</sub><<sub>x2</sub><... and if we let F denote the distribution function of X, then X will equal <sub>xj</sub> if F(<sub>x(j-1)</sub>)<=U<F(<sub>xj</sub>)
+
2) If the x<sub>i</sub>, i>=0, are ordered so that x<sub>0</sub> < x<sub>1</sub> < x<sub>2</sub> <... and if we let F denote the distribution function of X, then X will equal x<sub>j</sub> if F(x<sub>j-1</sub>) <= U < F(x<sub>j</sub>)
  
 
'''Example 1'''<br>

Step 1: Generate U ~ U(0, 1)<br>
Step 2: set <math>y=\, {-\frac {1}{{\lambda_1 +\lambda_2}}} \ln(1-u)</math><br>
    or set <math>y=\, {-\frac {1} {{\lambda_1 +\lambda_2}}} \ln(u)</math><br>
Since U is uniform on (0,1), 1-U is also uniform on (0,1), so either form generates the same distribution.

* '''Matlab Code'''<br />
<pre style="font-size:16px">
>> lambda1 = 1;
>> lambda2 = 2;
>> u = rand;
>> y = -log(u)/(lambda1 + lambda2)
</pre>
  
 
If we generalize this example from two independent particles to n independent particles we will have:<br>

=== Example of Decomposition Method ===
  
<math>F_x(x) = \frac {1}{3} x+\frac {1}{3} x^2+\frac {1}{3} x^3, 0\leq x\leq 1</math>

Let <math>U =F_x(x) = \frac {1}{3} x+\frac {1}{3} x^2+\frac {1}{3} x^3</math>, solve for x.

<math>P_1=\frac{1}{3}, F_{x1} (x)= x, P_2=\frac{1}{3},F_{x2} (x)= x^2,
P_3=\frac{1}{3},F_{x3} (x)= x^3</math>

'''Algorithm:'''

Generate <math>\,U \sim Unif [0,1)</math>

Generate <math>\,V \sim Unif [0,1)</math>

if <math>0\leq u \leq \frac{1}{3}, x = v</math>

else if <math>u \leq \frac{2}{3}, x = v^{\frac{1}{2}}</math>

else <math>x=v^{\frac{1}{3}}</math> <br>
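
A minimal MATLAB sketch of this decomposition algorithm (the loop is ours):
<pre style="font-size:16px">
for ii = 1:1000
    u = rand;
    v = rand;
    if u <= 1/3
        x(ii) = v;            % inverse of F_{x1}(x) = x
    elseif u <= 2/3
        x(ii) = v^(1/2);      % inverse of F_{x2}(x) = x^2
    else
        x(ii) = v^(1/3);      % inverse of F_{x3}(x) = x^3
    end
end
hist(x)
</pre>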
  
  
For More Details, please refer to http://www.stanford.edu/class/ee364b/notes/decomposition_notes.pdf

===Fundamental Theorem of Simulation===
Invert each part of the partial CDF; each partial CDF is divided by its weight in the original CDF, and each partial range is sampled via a uniform distribution.<br />
More specific definition of the theorem can be found here.<ref>http://www.bus.emory.edu/breno/teaching/MCMC_GibbsHandouts.pdf</ref>

Matlab code:

<pre style="font-size:16px">
close all
clear all
ii = 1;
R = 1;                         % radius (assumed value)
while ii < 1000
    u1 = rand;                 % uniform draw for the accept/reject test
    u2 = rand;                 % uniform draw for the candidate point
    y = R*(2*u2 - 1);          % candidate on (-R, R)
    if (1 - (2*u2 - 1)^2) >= u1^2   % accept when u1 <= sqrt(1-(2*u2-1)^2)
        x(ii) = y;
        ii = ii + 1;
    end
end
</pre>
  
 
===Question 2===

===The Bernoulli distribution===
  
The Bernoulli distribution is a special case of the binomial distribution, where n = 1. X ~ Bin(1, p) has the same meaning as X ~ Ber(p), where p is the probability of success and 1-p is the probability of failure (we usually define a variate q, q = 1-p). The mean of Bernoulli is p and the variance is p(1-p). Bin(n, p) is the distribution of the sum of n independent Bernoulli trials, Bernoulli(p), each with the same probability p, where 0<p<1. <br>
For example, let X be the event that a coin toss results in a "head" with probability ''p'', then ''X~Bernoulli(p)''. <br>
P(X=1) = p <br>
P(X=0) = q = 1-p <br>
Therefore, P(X=0) + P(X=1) = p + q = 1
  
 
'''Algorithm: '''

1) Generate <math>U \sim~ Unif(0,1)</math><br>
2) If <math>U < p, x=1</math>;<br>
when <math>U \geq p, x=0</math><br>
3) Repeat as necessary

* '''Matlab Code'''<br />
<pre style="font-size:16px">
>> p = 0.8    % an arbitrary probability for example
>> for ii = 1:100
>>   u = rand;
>>   if u < p
>>       x(ii) = 1;
>>   else
>>       x(ii) = 0;
>>   end
>> end
>> hist(x)
</pre>
  
 
===The Binomial Distribution===

P (X > x) = (1-p)<sup>x</sup> (because the first x trials are not successful) <br/>

NB: An advantage of using this method is that nothing is rejected: we accept all the points, so the method is more efficient. In this respect it is essentially an inverse transform method. <br />
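
A minimal MATLAB sketch of sampling a geometric variate via this tail formula, with support {1, 2, ...} (p = 0.3 is an arbitrary choice of ours):
<pre style="font-size:16px">
p = 0.3;
for ii = 1:1000
    x(ii) = floor(log(rand)/log(1-p)) + 1;   % smallest x with U <= 1-(1-p)^x
end
hist(x, 20)
</pre>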
  
 
'''Proof''' <br/>

=== Beta Distribution ===
The beta distribution is a continuous probability distribution. <br>
PDF: <math>\displaystyle \text{ } f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1} </math><br> where <math>0 \leq x \leq 1</math> and <math>\alpha</math>>0, <math>\beta</math>>0<br/>
<div style = "align:left; background:#F5F5DC; font-size: 120%">
Definition:
In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1], parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution.<br/>
More can be found in the link: <ref>http://en.wikipedia.org/wiki/Beta_distribution</ref>
</div>

There are two positive shape parameters in this distribution defined as alpha and beta: <br>
-Both parameters are greater than 0, and X is within the interval [0,1]. <br>
-Alpha is used as an exponent of the random variable. <br>
-Beta is used to control the shape of this distribution. We use the beta distribution to build the model of the behavior of random variables, which are limited to intervals of finite length. <br>
:<math>\displaystyle \text{f}(x) = \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)\Gamma(1)}x^{\alpha-1}(1-x)^{1-1}=\alpha x^{\alpha-1}</math><br>

By integrating <math>f(x)</math>, we find the CDF of X is <math>F(x) = x^{\alpha}</math>.
As <math>F^{-1}(u) = u^\frac {1}{\alpha}</math>, using the inverse transform method, <math> X = U^\frac {1}{\alpha} </math> with <math>U \sim U(0,1)</math>.
 
'''Algorithm'''
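The algorithm and code block for this case are elided in this revision; the following is a minimal Matlab sketch of the inverse-transform sampler derived above (α = 2 is an assumed value for illustration, not from the lecture):
<pre style="font-size:16px">
% Minimal sketch: sample from Beta(alpha, 1) via inverse transform
alpha = 2;           % assumed shape parameter for illustration
n = 1000;            % number of samples
u = rand(n, 1);      % U ~ U(0,1)
x = u.^(1/alpha);    % X = U^(1/alpha) has CDF F(x) = x^alpha
hist(x)
</pre>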
  
'''Case 3:'''<br\> To sample from beta in general, we use the property that <br\>

:if <math>Y_1</math> follows gamma <math>(\alpha,1)</math><br\>
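The rest of this property is cut off in this revision; the standard fact is that if <math>Y_1 \sim</math> Gamma<math>(\alpha,1)</math> and <math>Y_2 \sim</math> Gamma<math>(\beta,1)</math> are independent, then <math>Y_1/(Y_1+Y_2) \sim</math> Beta<math>(\alpha,\beta)</math>. A minimal Matlab sketch (randg is the base Matlab gamma generator; α = 2, β = 3 are assumed values):
<pre style="font-size:16px">
% Minimal sketch: sample Beta(alpha, beta) from two independent gammas
alpha = 2; beta = 3;        % assumed shape parameters
n = 1000;
y1 = randg(alpha, n, 1);    % Y1 ~ Gamma(alpha, 1)
y2 = randg(beta, n, 1);     % Y2 ~ Gamma(beta, 1)
x = y1 ./ (y1 + y2);        % X ~ Beta(alpha, beta)
hist(x)
</pre>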
'''Definition:''' In probability theory, a stochastic process /stoʊˈkæstɪk/, or sometimes random process (widely used) is a collection of random variables; this is often used to represent the evolution of some random value, or system, over time. This is the probabilistic counterpart to a deterministic process (or deterministic system). Instead of describing a process which can only evolve in one way (as in the case, for example, of solutions of an ordinary differential equation), in a stochastic or random process there is some indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely many) directions in which the process may evolve. (from Wikipedia)

A stochastic process is non-deterministic: even if we know the initial condition (state) and the possible states that can follow, the exact final state remains uncertain.

We can illustrate this with an example of speech: if "I" is the first word in a sentence, the set of words that could follow would be limited (e.g. like, want, am), and the same happens for the third word and so on. The words then have some probabilities among them such that each of them is a random variable, and the sentence would be a collection of random variables. <br>

2. Markov Process- This is a stochastic process that satisfies the Markov property which can be understood as the memory-less property. The property states that the jump to a future state only depends on the current state of the process, and not on the process's history. This model is used to model random walks exhibited by particles, the health state of a life insurance policyholder, decision making by a memory-less mouse in a maze, etc. <br>

=====Example=====
A stochastic process always has a state space and an index set that limit its range.

The state space is the set of cars, while <math>x_t</math> are sport cars.

Births in a hospital occur randomly at an average rate
==== Poisson Process ====
[[File:Possionprocessidiagram.png‎]]

The Poisson process is a discrete counting process which counts the number of<br\>
events and the times that these occur in a given time interval.<br\>

e.g. traffic accidents, arrival of emails. Emails arrive at random times <math>T_1, T_2, \ldots, T_n</math>; for example, (2, 7, 3) is the number of emails received on day 1, day 2, day 3. This is a stochastic process, and a Poisson process with condition.
the rate parameter may change over time; such a process is called a non-homogeneous Poisson process

==== Examples ====
<br />
'''How to generate a multivariate normal with the built-in function "randn": (example)'''<br />
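The original code block is mostly elided in this revision (only a trailing comment survived); the following is a minimal sketch of the usual randn-based approach, not the lecture's exact code (the covariance matrix sigma and sample count n below are assumptions):
<pre style="font-size:16px">
% Minimal sketch: n draws from N(mu, sigma) using randn
mu = [2 5];                           % assumed mean (row vector)
sigma = [1 0.5; 0.5 1];               % assumed covariance matrix
n = 2;                                % number of samples
z = randn(n, length(mu));             % rows are iid N(0, I) draws
x = z*chol(sigma) + repmat(mu, n, 1)  % each row ~ N(mu, sigma)
</pre>
Here chol(sigma) returns the upper-triangular R with R'R = sigma, so the rows of z*R have covariance sigma.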
For example, if we use mu = [2 5], we would get <br/>
<math> x = \left[ \begin{array}{cc}
3.8214 & 0.3447 \\
6.3097 & 5.6157 \end{array} \right]</math>

If we want to use Box-Muller to generate a multivariate normal, we could use the code in lecture 6:
<pre style='font-size:16px'>
d = length(mu);
...
</pre>
Line 3,570: Line 3,722:
  
 
===Poisson Process===
 
===Poisson Process===
 +
A Poisson Process is a stochastic approach to count number of events in a certain time period. <s>Strike-through text</s>
 
A discrete stochastic variable ''X'' is said to have a Poisson distribution with parameter ''λ'' > 0 if
 
A discrete stochastic variable ''X'' is said to have a Poisson distribution with parameter ''λ'' > 0 if
 
:<math>\!f(n)= \frac{\lambda^n e^{-\lambda}}{n!}  \qquad n= 0,1,2,3,4,5,\ldots,</math>.
 
:<math>\!f(n)= \frac{\lambda^n e^{-\lambda}}{n!}  \qquad n= 0,1,2,3,4,5,\ldots,</math>.
'''Generate a Poisson Process'''<br />
<math>U_{n} \sim U(0,1)</math><br>
<math>T_n-T_{n-1}=-\frac {1}{\lambda} \log(U_n)</math><br>

1. set <math>T_{0}=0</math> and n=1<br/>
<br>
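The remaining algorithm steps and the lecture's code block are elided in this revision; the following is a minimal Matlab sketch of the exponential-spacing simulation above (lambda = 2 is an assumed rate; TT is the time horizon used below):
<pre style="font-size:16px">
% Minimal sketch: simulate a Poisson process with rate lambda on [0, TT]
lambda = 2; TT = 50;
T = 0; t = [];
while true
    T = T - (1/lambda)*log(rand);   % add an Exp(lambda) inter-arrival time
    if T > TT, break; end
    t(end+1) = T;                   % record the arrival time
end
plot(t, 1:length(t), '.')           % the counting process N(t)
</pre>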
  
 
The following plot is using TT = 50.<br>
The number of points generated every time on average should be <math>\lambda</math> * TT. <br>
The maximum value of the points should be TT. <br>
[[File:Poisson.jpg]]<br>
When TT is large, the plot of the counting process looks almost linear; when TT is small (e.g. 5), the plot looks like a discrete step function.
  
=== Examples of Transition Matrix ===

[[File:Mark13.png]]<br>
The picture is from http://academic.uprm.edu/wrolke/esma6789/mark1.htm
  
</div>

<math>\begin{align}x_{k+1}= (ax_k+c) \mod m\end{align}</math><br />

Where a, c, m and x<sub>1</sub> (the seed) are values we must choose before running the algorithm. While there is no set value for each, it is best for m to be large and prime. For example, Matlab uses a = 7<sup>5</sup>, c = 0, m = 2<sup>31</sup> − 1.
+
'''Examples:'''<br>
1. <math>\begin{align}X_{0} = 10 ,\ a = 2 ,\ c = 1 ,\ m = 13 \end{align}</math><br>

<math>\begin{align}X_{1} = (2 \times 10 + 1)\mod 13 = 8\end{align}</math><br>
<math>\begin{align}X_{2} = (2 \times 8  + 1)\mod 13 = 4\end{align}</math> ... and so on<br>


2. <math>\begin{align}X_{0} = 44 ,\ a = 13 ,\ c = 17 ,\ m = 211\end{align}</math><br>

<math>\begin{align}X_{1} = (13 \times 44 + 17)\mod 211 = 167\end{align}</math><br>
<math>\begin{align}X_{2} = (13 \times 167 + 17)\mod 211 = 78\end{align}</math><br>
<math>\begin{align}X_{3} = (13 \times 78  + 17)\mod 211 = 187\end{align}</math> ... and so on<br>
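A minimal Matlab sketch of this recurrence, using the parameters of Example 2 (the sequence length of 5 is an arbitrary choice):
<pre style="font-size:16px">
% Minimal sketch: mixed congruential generator x_{k+1} = (a*x_k + c) mod m
a = 13; c = 17; m = 211;
x = 44;                 % the seed X_0
seq = zeros(1,5);
for k = 1:5
    x = mod(a*x + c, m);
    seq(k) = x;
end
seq                     % returns 167  78  187  127  191
</pre>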
 
  
 
=== Inverse Transformation Method ===

<br>N-Step Transition Matrix: a matrix <math> P_n </math> whose elements are the probability of moving from state i to state j in n steps. <br/>
<math>P_n (i,j)=Pr(X_{m+n}=j|X_m=i)</math> <br/>

Explanation (with an example): suppose there are 10 states {1, 2, ..., 10} and you are in state 2; then P<sub>8</sub>(2, 5) represents the probability of moving from state 2 to state 5 in 8 steps.

One-step transition probability:<br/>
Note: <math>P_2 = P_1\times P_1; P_n = P^n</math><br />
The equation above is a special case of the Chapman-Kolmogorov equations.<br />
It is true because of the Markov property, or the memoryless property of Markov chains: the probability of going forward to the next state only depends on the current state, not on previous states. By intuition, we can multiply the 1-step transition matrix n times to get an n-step transition matrix.<br />
The vector <math>\underline{\mu_0}</math> is called the initial distribution. <br/>

<math> P^2 = P\cdot P </math> (as verified above)

In general,
<math> P^n = \underbrace{P \cdot P \cdots P}_{n} </math> (P multiplied n times)<br/>
<math>\mu_n = \mu_0 P^n</math><br/>
where <math>\mu_0</math> is the initial distribution,
and <math>\mu_{m+n} = \mu_m P^n</math><br/>
n can be negative, if P is invertible.
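A minimal Matlab sketch of these identities (the 2-state chain P and initial distribution mu0 below are arbitrary assumed values, not from the lecture):
<pre style="font-size:16px">
% Minimal sketch: mu_n = mu_0 * P^n for an assumed 2-state chain
P = [0.7 0.3; 0.4 0.6];   % assumed transition matrix
mu0 = [1 0];              % start in state 1
mu3 = mu0 * P^3           % distribution after 3 steps
% the same thing, step by step:
mu = mu0;
for k = 1:3
    mu = mu * P;
end
mu                        % equals mu3
</pre>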
  
Line 4,252: Line 4,405:
  
  
<math>\pi</math> is stationary distribution of the chain if <math>\pi</math>P = <math>\pi</math>
+
<math>\pi</math> is stationary distribution of the chain if <math>\pi</math>P = <math>\pi</math> In other words, a stationary distribution is when the markov process that have equal probability of moving to other states as its previous move.
  
 
where <math>\pi</math> is a probability vector <math>\pi</math>=(<math>\pi</math><sub>i</sub> | <math>i \in X</math>) such that all the entries are nonnegative and sum to 1. It is the eigenvector in this case.
 
where <math>\pi</math> is a probability vector <math>\pi</math>=(<math>\pi</math><sub>i</sub> | <math>i \in X</math>) such that all the entries are nonnegative and sum to 1. It is the eigenvector in this case.
Line 4,259: Line 4,412:
  
 
The above conditions are used to find the stationary distribution
 
The above conditions are used to find the stationary distribution
 +
In matlab, we could use <math>P^n</math> to find the stationary distribution.(n is usually larger than 100)<br/>
 +
  
 
'''Comments:'''<br/>
 
'''Comments:'''<br/>
<math>\displaystyle \pi=(\frac{1}{3},\frac{4}{9}, \frac{2}{9})</math>

Note that <math>\displaystyle \pi=\pi P</math> looks similar to the eigenvector/eigenvalue equation <math>\displaystyle A\vec{u}=\lambda \vec{u}</math>, with <math>\lambda = 1</math>.

<math>\pi</math> can be considered as an eigenvector of P with eigenvalue = 1. But note that the vector <math>\vec{u}</math> is a column vector, so we need to transform our <math>\pi</math> into a column vector:

<math>\Rightarrow \pi</math><sup>T</sup>= P<sup>T</sup><math>\pi</math><sup>T</sup><br/>
Then <math>\pi</math><sup>T</sup> is an eigenvector of P<sup>T</sup> with eigenvalue = 1. <br />
MatLab tips: [V D]=eig(A), where D is a diagonal matrix of eigenvalues and V is a matrix of eigenvectors of matrix A<br />
 
==== MatLab Code ====
<pre style='font-size:14px'>
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];
pii = [1/3 4/9 2/9];

[vec val] = eig(P')       %% P' is the transpose of matrix P

%% the first column of vec is the eigenvector for eigenvalue 1:
%% vec(:,1) = [-0.5571; -0.7428; -0.3714]

a = -vec(:,1)             %% eigenvectors can be multiplied by -1

%% a = [0.5571; 0.7428; 0.3714]   (column form)

%% Since we want this vector to sum to 1, we have to scale it:

b = a/sum(a)

%% b = [0.3333; 0.4444; 0.2222]   (column form)

%% Observe that b' = pii
</pre>
 
==== Limiting distribution ====
A Markov chain has limiting distribution <math>\pi</math> if

If the limiting distribution <math>\pi</math> exists, it must be equal to the stationary distribution.<br/>

This convergence means that, in the long run (as n goes to infinity), the probability of finding the Markov chain in state j is approximately <math>\pi_j</math>, no matter in which state the chain began at time 0. <br/>
  
 
'''Example:''' Given <math> P= \left [ \begin{matrix}
0 & 1 & 0 \\[6pt]
0 & 0 & 1 \\[6pt]
1 & 0 & 0 \\[6pt]
\end{matrix} \right] </math>, find stationary distribution.<br/>
We have:<br/>
<math>0\times \pi_0+0\times \pi_1+1\times \pi_2=\pi_0</math><br/>
<math>1\times \pi_0+0\times \pi_1+0\times \pi_2=\pi_1</math><br/>
<math>0\times \pi_0+1\times \pi_1+0\times \pi_2=\pi_2</math><br/>
<math>\,\pi_0+\pi_1+\pi_2=1</math><br/>
this gives <math>\pi = \left [ \begin{matrix}
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\[6pt]
\end{matrix} \right] </math>
In general, there are chains with stationary distributions that don't converge; this means that they have a stationary distribution but are not limiting.<br/>

=== MatLab Code ===
<pre style='font-size:14px'>
MATLAB
>> P=[0, 1, 0; 0, 0, 1; 1, 0, 0]

P =

     0     1     0
     0     0     1
     1     0     0

>> pii=[1/3, 1/3, 1/3]

pii =

    0.3333    0.3333    0.3333

>> pii*P

ans =

    0.3333    0.3333    0.3333

>> P^1000

ans =

     0     1     0
     0     0     1
     1     0     0

>> P^10000

ans =

     0     1     0
     0     0     1
     1     0     0

>> P^10002

ans =

     1     0     0
     0     1     0
     0     0     1

>> P^10003

ans =

     0     1     0
     0     0     1
     1     0     0

>> % P^10000 = P^10003
>> % This chain does not have a limiting distribution; it has a stationary distribution.

This chain does not converge; it has a cycle.
</pre>
The first condition of limiting distribution is satisfied; however, the second condition, where <math>\pi</math><sub>j</sub> has to be independent of i (i.e. all rows of the matrix are the same), is not met.<br>

This example shows the distinction between having a stationary distribution and convergence (having a limiting distribution). Note: <math>\pi=(1/3,1/3,1/3)</math> is the stationary distribution as <math>\pi=\pi P</math>. However, upon repeatedly multiplying P by itself (computing <math>P^n</math> as n goes to infinity) one will note that the results become a cycle (of period 3) of the same sequence of matrices. The chain has a stationary distribution, but does not converge to it. Thus, there is no limiting distribution.<br>

'''Example:'''

<math> P= \left [ \begin{matrix}
\frac{4}{5} & \frac{1}{5} & 0 & 0 \\[6pt]
\frac{1}{5} & \frac{4}{5} & 0 & 0 \\[6pt]
0 & 0 & \frac{4}{5} & \frac{1}{5} \\[6pt]
0 & 0 & \frac{1}{10} & \frac{9}{10} \\[6pt]
\end{matrix} \right] </math>

This chain converges (the powers <math>P^n</math> approach a fixed matrix), but it has no limiting distribution: the rows of the limit are not all the same, so the long-run probabilities depend on the starting state.<br />
<br />
Doubly Stochastic Matrix: a doubly stochastic matrix is a matrix in which all columns sum to 1 and all rows sum to 1.<br />
If a given transition matrix is a doubly stochastic matrix with n columns and n rows, then the stationary distribution has all<br/>
elements equal to 1/n.<br/>
<br/>
Example:<br/>
For a transition matrix <math> P= \left [ \begin{matrix}
0 & \frac{1}{2} & \frac{1}{2} \\[6pt]
\frac{1}{2} & 0 & \frac{1}{2} \\[6pt]
\frac{1}{2} & \frac{1}{2} & 0 \\[6pt]
\end{matrix} \right] </math>,<br/>
We have:<br/>
<math>0\times \pi_0+\frac{1}{2}\times \pi_1+\frac{1}{2}\times \pi_2=\pi_0</math><br/>
<math>\frac{1}{2}\times \pi_0+0\times \pi_1+\frac{1}{2}\times \pi_2=\pi_1</math><br/>
<math>\frac{1}{2}\times \pi_0+\frac{1}{2}\times \pi_1+0\times \pi_2=\pi_2</math><br/>
<math>\pi_0+\pi_1+\pi_2=1</math><br/>
The stationary distribution is <math>\pi = \left [ \begin{matrix}
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\[6pt]
\end{matrix} \right] </math> <br/>
<span style="font-size:20px;color:red">The following contents are problematic. Please correct it if possible.</span><br />
Suppose we're given that the limiting distribution <math> \pi </math> exists for stochastic matrix P, that is, <math> \pi = \pi \times P </math> <br>

WLOG assume P is diagonalizable (if not, we can always consider the Jordan form, and the computation below is exactly the same). <br>

Let <math> P = U  \Sigma  U^{-1} </math> be the eigenvalue decomposition of <math> P </math>, where <math>\Sigma = diag(\lambda_1,\ldots,\lambda_n) ; |\lambda_i| > |\lambda_j|, \forall i < j </math><br>

Suppose <math> \pi^T = \sum a_i u_i </math> where <math> a_i \in \mathcal{R} </math> and <math> u_i </math> are eigenvectors of <math> P </math> for <math> i = 1\ldots n </math> <br>

By definition: <math> \pi^k = \pi P = \pi P^k \implies \pi = \pi(U  \Sigma  U^{-1}) (U  \Sigma  U^{-1} ) \ldots (U  \Sigma  U^{-1}) </math> <br>

Therefore <math> \pi^k = \sum a_i  \lambda_i^k  u_i </math> since <math> <u_i , u_j> = 0, \forall i\neq j </math>. <br>

Therefore <math> \lim_{k \rightarrow \infty} \pi^k = \lim_{k \rightarrow \infty}  \lambda_i^k  a_1  u_1 = u_1 </math>
=== MatLab Code ===
<pre style='font-size:14px'>
>> P=[1/3, 1/3, 1/3; 1/4, 3/4, 0; 1/2, 0, 1/2]      % We input a matrix P. This is the same matrix as last class.

P =

    0.3333    0.3333    0.3333
    0.2500    0.7500         0
    0.5000         0    0.5000

>> P^2

ans =

    0.3611    0.3611    0.2778
    0.2708    0.6458    0.0833
    0.4167    0.1667    0.4167

>> P^3

ans =

    0.3495    0.3912    0.2593
    0.2934    0.5747    0.1319
    0.3889    0.2639    0.3472

>> P^10                                   % As n grows, all rows of P^n approach the same vector.

ans =

    0.3341    0.4419    0.2240
    0.3314    0.4507    0.2179
    0.3360    0.4358    0.2282

>> P^100                                  % The stationary distribution is [0.3333 0.4444 0.2222] since the values keep unchanged.

ans =

    0.3333    0.4444    0.2222
    0.3333    0.4444    0.2222
    0.3333    0.4444    0.2222

>> [vec val]=eigs(P')                     % We can find the eigenvalues and eigenvectors from the transpose of matrix P.

vec =

   -0.5571    0.2447    0.8121
   -0.7428   -0.7969   -0.3324
   -0.3714    0.5523   -0.4797

val =

    1.0000         0         0
         0    0.6477         0
         0         0   -0.0643

>> a=-vec(:,1)                            % The eigenvectors can be multiplied by (-1) since λV=AV can be written as λ(-V)=A(-V)

a =

    0.5571
    0.7428
    0.3714

>> sum(a)

ans =

    1.6713

>> a/sum(a)

ans =

    0.3333
    0.4444
    0.2222
</pre>

This shows that <math>\pi_j = \lim_{n\to\infty}[P^n]_{ij}</math> exists and is independent of i.

Another example:
'''Note:''' if there's a finite number N then every other state can be reached in N steps.
'''Note:''' Also note that an ergodic chain is irreducible (all states communicate) and aperiodic (d = 1). An ergodic chain is guaranteed to have a stationary and limiting distribution.<br/>
'''Ergodicity:''' A state i is said to be ergodic if it is aperiodic and positive recurrent. In other words, a state i is ergodic if it is recurrent, has a period of 1 and it has finite mean recurrence time. If all states in an irreducible Markov chain are ergodic, then the chain is said to be ergodic.<br/>
'''Some more:''' It can be shown that a finite state irreducible Markov chain is ergodic if it has an aperiodic state. A model has the ergodic property if there's a finite number N such that any state can be reached from any other state in exactly N steps. In case of a fully connected transition matrix, where all transitions have a non-zero probability, this condition is fulfilled with N=1.<br/>
  
  
<math> \pi_0 = \frac{4}{19} </math> <br>
<math> \pi = [\frac{4}{19}, \frac{15}{19}] </math> <br>
<math> \pi </math> is the long run distribution, and this is also a limiting distribution.

We can use the stationary distribution to compute the expected waiting time to return to <br/>
state 'a' given that we start at state 'a': it is 19/4.<br/>

Definition of limiting distribution: when the stationary distribution is convergent (i.e. the chain converges to it), it is a limiting distribution.<br/>

Remark: if detailed balance <math>\pi_i P_{ij} = P_{ji} \pi_j</math> is satisfied, there is another way to calculate the stationary probabilities.
<math>\pi_2 P_{2,3} = 4/9 \times 0 = 0,\, P_{3,2} \pi_3 = 0 \times 2/9 = 0 \Rightarrow \pi_2 P_{2,3} = P_{3,2} \pi_3</math><br>
Remark: Detailed balance, <math> \pi_i \times P_{ij} = P_{ji} \times \pi_j</math>, gives another way to calculate the stationary probabilities.<br />
<math>\pi</math> is stationary but is not limiting.
Detailed balance implies that <math>\pi = \pi P</math>, as shown in the proof, and so guarantees that <math>\pi</math> is a stationary distribution.
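A minimal Matlab sketch that checks detailed balance and stationarity numerically for this example's P and <math>\pi</math> (the tolerance 1e-10 is an arbitrary choice):
<pre style="font-size:16px">
% Minimal sketch: verify detailed balance pii(i)*P(i,j) == pii(j)*P(j,i)
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];
pii = [1/3 4/9 2/9];
B = diag(pii)*P;                             % B(i,j) = pii(i)*P(i,j)
detailed_balance = all(all(abs(B - B') < 1e-10))
stationary = all(abs(pii*P - pii) < 1e-10)   % detailed balance => stationary
</pre>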
  
 
== Class 15 - Tuesday June 25th 2013 ==

=== PageRank (http://en.wikipedia.org/wiki/PageRank) ===
*PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size.
*PageRank is a link-analysis algorithm developed by and named after Larry Page from Google; used for measuring a website's importance, relevance and popularity.
*PageRank is a graph containing web pages and their links to each other.
  
 
<br />'''The order of importance'''<br />
1. A web page is more important if many other pages point to it<br />
2. The more important a web page is, the more weight should be assigned to its outgoing links<br />
3. If a webpage has many outgoing links, then its links have less value (ex: if a page links to everyone, like 411, it is not as important as pages that have incoming links)<br />

Page 2 comes after page 4 since it has the third most links pointing to it<br/>
Page 1 and page 5 are the least important since no links point to them<br/>
As page 1 and page 2 have the most outgoing links, their links have less value compared to those of the other pages. <br/>
  
 
<br />
<math>C_j=</math> The number of outgoing links of page <math>j</math>:
<math>C_j=\sum_i L_{ij}</math>
(i.e. sum of entries in column j)<br />

<math>P_i=\sum_j L_{ij}</math> <br />(i.e. sum of entries in row i)

For each row of <math>L</math>, if there is a 1 in the third column, it means page three points to that page.

However, we should not define the rank of a page this way, because links shouldn't all be treated the same. The weight of a link depends on different factors; one of them is the importance of the page the link is coming from. For example, in this case, there are two links going to Page 4: one from Page 2 and one from Page 5. So far, both links have been treated equally with the same weight 1. But we must re-weight the two links based on the importance of the pages they are coming from.
  
 
A PageRank results from a mathematical algorithm based on the webgraph, created by all World Wide Web pages as nodes and hyperlinks as edges, taking into consideration authority hubs such as cnn.com or usa.gov. The rank value indicates an importance of a particular page. A hyperlink to a page counts as a vote of support. (This would be represented in our diagram as an arrow pointing towards the page. Hence in our example, Page 3 is the most important, since it has the most votes of support.) The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page, then there is no support for that page (in our example, this would be Page 1 and Page 5).

=== Page Rank ===
*<math>
L_{ij} = \begin{cases}
1, & \text{if j has a link to i }  \\
0, & \text{otherwise} \end{cases} </math> <br/>

*<math>C_j</math>: number of outgoing links for page j, where <math>c_j=\sum_i L_{ij}</math>

P is an N by 1 vector that contains the rank of all N pages; for page i, the rank is <math>P_i</math>

<math>P_i= (1-d) + d\cdot \sum_j \frac {L_{ji}P_j}{c_j}</math>

Note: for a newly created page that no page links to, <math>L_{ji}=0</math> for all j, so the summation term vanishes and the page's rank comes only from the constant term (1-d).

where 0 < d < 1 is constant (in the original page rank algorithm d = 0.8), and <math>L_{ij}</math> is 1 if j has a link to i, 0 otherwise.
Note that the rank of a page grows with the number of its incoming links, while the weight each of its outgoing links carries is inversely proportional to the total number of its outgoing links.

Interpretation of the formula:<br/>
4) finally, we take a linear combination of the page rank obtained from above and a constant 1. This ensures that every page has a rank greater than zero.<br/>
5) d is the damping factor. It represents the probability a user, at any page, will continue clicking to another page.<br/>
If there is no damping (i.e. d=1), then there are no assumed outgoing links for nodes with no links. However, if there is damping (e.g. d=0.8), then these nodes are assumed to have links to all pages in the web.

Note that this is a system of N equations with N unknowns.<br/>
0 & 0 & ... & c_N \end{matrix} } \right]</math>

Then <math>P=(1-d)e+dLD^{-1}P</math>, where e = [1 1 ....]<sup>T</sup>, i.e. an N by 1 vector. (We will see below that P is an eigenvector, with eigenvalue 1, of a matrix A built from this equation.)<br/>
We assume that the ranks of all N pages sum to N. The sum of the ranks can be any number, as long as the ranks keep a certain proportion. <br/>
i.e. e<sup>T</sup> P = N, then <math>~\frac{e^{T}P}{N} = 1</math>
<math>P=[(1-d)~\frac{ee^T}{N}+dLD^{-1}]P</math>

<math>\Rightarrow P=AP</math>
  
 
'''Explanation of an eigenvector'''

An eigenvector of a matrix A is a nonzero vector v that, when multiplied by A, is only scaled, not rotated. That is, A*v = c*v, where c is the eigenvalue of A corresponding to the eigenvector v. In our case of Page Rank, the eigenvalue c=1. <br>
  
We obtain that <math>P=AP</math> where <math>A=(1-d)~\frac{ee^T}{N}+dLD^{-1}</math><br/>
Thus, <math>P</math> is an eigenvector of <math>A</math> corresponding to an eigenvalue equal to 1.<br/>
  
 
Since,
 
D<sup>-1</sup> is an N*N matrix,
P is an N*1 matrix <br/>
Then as a result, <math>LD^{-1}P</math> is an N*1 matrix. <br/>
<math>\frac{ee^T}{N}</math> is an N*N matrix, and d is a constant between 0 and 1.
N is a N*N matrix, d is a constant between 0 and 1.
Line 5,295: Line 5,480:
 
Consider: 1 -> ,<-2 ->3
 
Consider: 1 -> ,<-2 ->3
  
L= [0 1 0; 1 0 0; 0 1 0]; c=[1,1,1]; D= [1 0 0; 0 1 0; 0 0 1]
+
<math>L=  
 +
\left[ {\begin{matrix}
 +
0 & 1 & 0 \\
 +
1 & 0 & 0 \\
 +
0 & 1 & 0 \end{matrix} } \right]\;
 +
c=  
 +
\left[ {\begin{matrix}
 +
1 & 1 & 1 \end{matrix} } \right]\;
 +
D=  
 +
\left[ {\begin{matrix}
 +
1 & 0 & 0 \\
 +
0 & 1 & 0 \\
 +
0 & 0 & 1 \end{matrix} } \right]</math>
  
 
==== Example 4 ====
 
==== Example 4 ====
  
 
<math>1 \leftrightarrow 2 \rightarrow 3 \leftrightarrow 4 </math>
 
<math>1 \leftrightarrow 2 \rightarrow 3 \leftrightarrow 4 </math>
<br />
 
 
<br />
 
<br />
 
<br />
 
<br />
Line 5,308: Line 5,504:
 
1 & 0 & 0 & 0 \\
 
1 & 0 & 0 & 0 \\
 
0 & 1 & 0 & 1 \\
 
0 & 1 & 0 & 1 \\
0 & 0 & 1 & 0 \end{matrix} } \right]\;
+
0 & 0 & 1 & 0 \end{matrix} } \right]\;</math><br />
c=
 
\left[ {\begin{matrix}
 
1 & 2 & 1 & 1 \end{matrix} } \right]\;
 
D=
 
\left[ {\begin{matrix}
 
1 & 0 & 0 & 0 \\
 
0 & 2 & 0 & 0 \\
 
0 & 0 & 1 & 0  \\
 
0 & 0 & 0 & 1 \end{matrix} } \right]</math><br />
 
 
 
Matlab code
 
<pre style='font-size:14px'>
 
  
 +
'''Matlab Code:'''<br>
 +
<pre style='font-size:16px'>
 
>> L=L= [0 1 0 0;1 0 0 0;0 1 0 1;0 0 1 0];
 
>> L=L= [0 1 0 0;1 0 0 0;0 1 0 1;0 0 1 0];
 
>> C=sum(L);
 
>> C=sum(L);
Line 5,327: Line 5,513:
 
>> d=0.8;
 
>> d=0.8;
 
>> N=4;
 
>> N=4;
>> A=(1-d)*ones(N)/N+d*L*pinv(D)
+
>> A=(1-d)*ones(N)/N+d*L*pinv(D);
 +
>> [vec val]=eigs(A);
 +
>> a=vec(:,1);
 +
>> a=a/sum(a)
 +
    a =
 +
        0.1029 <- Page 1
 +
        0.1324 <- Page 2
 +
        0.3971 <- Page 3
 +
        0.3676 <- Page 4
  
A =
+
        % Therefore the PageRank for this matrix is: 3,4,2,1
 +
</pre>
 +
<br>
  
    0.0500    0.4500    0.0500    0.0500
+
==== Example 5 ====
    0.8500    0.0500    0.0500    0.0500
 
    0.0500    0.4500    0.0500    0.8500
 
    0.0500    0.0500    0.8500    0.0500
 
 
 
>> [vec val]=eigs(A)
 
 
 
vec =
 
 
 
    0.1817  -0.0000  -0.4082    0.4082
 
    0.2336    0.0000    0.5774    0.5774
 
    0.7009  -0.7071    0.4082  -0.4082
 
    0.6490    0.7071  -0.5774  -0.5774
 
 
 
 
 
val =
 
 
 
    1.0000        0        0        0
 
        0  -0.8000        0        0
 
        0        0  -0.5657        0
 
        0        0        0    0.5657
 
 
 
>> a=vec(:,1)
 
 
 
>> a=vec(:,1)
 
 
 
a =
 
 
 
    0.1817
 
    0.2336
 
    0.7009
 
    0.6490
 
 
 
>> a=a/sum(a)
 
 
 
a =
 
 
 
    0.1029
 
    0.1324
 
    0.3971
 
    0.3676
 
</pre>
 
'''NOTE:''' The ranking of each page is as follows: Page 3, Page 4, Page 2 and Page 1. Page 3 is the highest since it has the most incoming links. All of the other pages only have one incoming link but since Page 3, highest ranked page, links to Page 4, Page 4 is the second highest ranked. Lastly, since Page 2 links into Page 3 it is the next highest rank.
 
 
 
Page 2 has 2 outgoing links. Pages with the same incoming links can be ranked closest to the highest ranked page. If the highest page P1 is incoming into a page P2,  then P2 is ranked second, and so on.
 
 
 
==== Example 5 ====
 
  
 
<math>L=  
 
<math>L=  
Line 5,429: Line 5,579:
 
<br />
 
<br />
  
Matlab Code<br />
+
'''Matlab Code:'''<br />
 
<pre style="font-size:16px">
 
<pre style="font-size:16px">
>> d=0.8
+
>> d=0.8;
 +
>> L=[0 1 0 0 1;1 0 0 0 0;0 1 0 0 0;0 1 1 0 1;0 0 0 1 0];
 +
>> c=sum(L);
 +
>> D=diag(c);
 +
>> N=5;
 +
>> A=(1-d)*ones(N)/N+d*L*pinv(D);
 +
>> [vec val]=eigs(A);
 +
>> a=-vec(:,1);
 +
>> a=a/sum(a) 
 +
    a =
 +
        0.1933 <- Page 1
 +
        0.1946 <- Page 2
 +
        0.0919 <- Page 3
 +
        0.2668 <- Page 4
 +
        0.2534 <- Page 5
  
d =
+
        % Therefore the PageRank for this matrix is: 4,5,2,1,3
 +
</pre>
 +
<br>
  
    0.8000
+
== Class 17 - Tuesday July 2nd 2013 ==
 +
=== Markov Chain Monte Carlo (MCMC) ===
  
>> L=[0 1 0 0 1;1 0 0 0 0;0 1 0 0 0;0 1 1 0 1;0 0 0 1 0]
+
===Introduction===
 +
It is, in general, very difficult to simulate the value of a random vector X whose component random variables are dependent. We will present a powerful approach for generating a vector whose distribution is approximately that of X. This approach, called the Markov Chain Monte Carlo Methods, has the added significance of only requiring that the mass(or density) function of X be specified up to a multiplicative constant, and this, we will see, is of great importance in applications.
 +
(referenced by Sheldon M.Ross,Simulation)
 +
The basic idea used here is to generate a Markov Chain whose stationary distribution is the same as the target distribution.
  
L =
+
====Definition:====
 +
Markov Chain
 +
A Markov Chain is a special form of stochastic process in which <math>\displaystyle X_t</math> depends only on <math> \displaystyle X_{t-1}</math>.
  
    0    1    0    0    1
+
For example,
    1     0    0    0    0
+
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math>
    0    1    0    0    0
+
A random Walk is the best example  of a Markov process
    0    1    1    0    1
 
    0    0    0    1    0
 
  
>> c=sum(L)
+
<br>'''Transition Probability:'''<br>
 +
The probability of going from one state to another state.
 +
:<math>p_{ij} = \Pr(X_{n}=j\mid X_{n-1}= i). \,</math>
  
c =
+
<br>'''Transition Matrix:'''<br>
 +
For n states, transition matrix P is an <math>N \times N</math> matrix with entries <math>\displaystyle P_{ij}</math> as below:
 +
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps. (http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo)</span>
  
    1    3    1    1    2
+
<a style="color:red" href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-165.pdf">some notes form UCb</a>
  
>> D=diag(c)
+
'''One of the main purposes of MCMC''' : to simulate samples from a joint distribution where the joint random variables are dependent. In general, this is not easily sampled from. Other methods learned in class allow us to simulate i.i.d random variables, but not dependent variables . In this case, we could sample non-independent random variables using a Markov Chain. Its Markov properties help to simplify the simulation process.
  
D =
 
  
    1    0    0    0    0
+
<b>Basic idea:</b>  Given a probability distribution <math>\pi</math> on a set <math>\Omega</math>, we want to generate random elements of <math>\Omega</math> with distribution <math>\pi</math>. MCMC does that by constructing a Markov Chain with stationary distribution <math>\pi</math> and simulating the chain. After a large number of iterations, the Markov Chain will reach its stationary distribution. By sampling from the Markov chain for large amount of iterations, we are effectively sampling from the desired distribution as the Markov Chain would converge to its stationary distribution <br/>
    0    3    0    0    0
 
    0    0    1    0    0
 
    0    0    0    1    0
 
    0    0    0    0    2
 
  
>> N=5
+
Idea: generate a Markov chain whose stationary distribution is the same as target distribution. <br/>
  
N =
 
  
    5
+
'''Notes'''
  
>> A=(1-d)*ones(N)/N+d*L*pinv(D)
+
A =

    0.0400    0.3067    0.0400    0.0400    0.4400
    0.8400    0.0400    0.0400    0.0400    0.0400
    0.0400    0.3067    0.0400    0.0400    0.0400
    0.0400    0.3067    0.8400    0.0400    0.4400
    0.0400    0.0400    0.0400    0.8400    0.0400

>> [vec val]=eigs(A)

vec =

  Columns 1 through 4

  -0.4129            0.4845 + 0.1032i   0.4845 - 0.1032i  -0.0089 + 0.2973i
  -0.4158           -0.6586            -0.6586            -0.5005 + 0.2232i
  -0.1963            0.2854 - 0.0608i   0.2854 + 0.0608i  -0.2570 - 0.2173i
  -0.5700            0.1302 + 0.2612i   0.1302 - 0.2612i   0.1462 - 0.3032i
  -0.5415           -0.2416 - 0.3036i  -0.2416 + 0.3036i   0.6202

  Column 5

  -0.0089 - 0.2973i
  -0.5005 - 0.2232i
  -0.2570 + 0.2173i
   0.1462 + 0.3032i
   0.6202

val =

  Columns 1 through 4

   1.0000                  0                  0                  0
        0            -0.5886 - 0.1253i        0                  0
        0                  0            -0.5886 + 0.1253i        0
        0                  0                  0             0.1886 - 0.3911i
        0                  0                  0                  0

  Column 5

        0
        0
        0
        0
   0.1886 + 0.3911i

>> a=-vec(:,1)

a =

    0.4129
    0.4158
    0.1963
    0.5700
    0.5415

>> a=a/sum(a)

a =

    0.1933
    0.1946
    0.0919
    0.2668    % (the largest, so page 4 is the most important)
    0.2534

</pre>

For this matrix, the rank is: page 4, page 5, page 2, page 1, page 3.<br />
== Class 17 - Tuesday July 2nd 2013 ==
=== Markov Chain Monte Carlo (MCMC) ===

===Introduction===

It is, in general, very difficult to simulate the value of a random vector X whose component random variables are dependent. We will present a powerful approach for generating a vector whose distribution is approximately that of X. This approach, called Markov Chain Monte Carlo, has the added significance of only requiring that the mass (or density) function of X be specified up to a multiplicative constant, and this, we will see, is of great importance in applications.
(Reference: Sheldon M. Ross, ''Simulation'')

====Definition:====
'''Markov Chain'''<br />
A Markov Chain is a special form of stochastic process in which <math>\displaystyle X_t</math> depends only on <math> \displaystyle X_{t-1}</math>.

For example,
:<math>\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})</math>
A random walk is the classic example of a Markov process.

<br>'''Transition Probability:'''<br>
The probability of going from one state to another state:
:<math>p_{ij} = \Pr(X_{n}=j\mid X_{n-1}= i). \,</math>

<br>'''Transition Matrix:'''<br>
For n states, the transition matrix P is an <math>N \times N</math> matrix whose entries are the <math>\displaystyle P_{ij}</math> above.

Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps. (http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo)

Some further notes on MCMC from UC Berkeley: http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-165.pdf

'''One of the main purposes of MCMC''': to simulate samples from a joint distribution in which the random variables are dependent. In general, such a distribution is not easily sampled from; the other methods learned in class let us simulate i.i.d. random variables, but not dependent ones. With a Markov Chain we can sample non-independent random variables, and the Markov property helps to simplify the simulation process.

<b>Basic idea:</b> Given a probability distribution <math>\pi</math> on a set <math>\Omega</math>, we want to generate random elements of <math>\Omega</math> with distribution <math>\pi</math>. MCMC does this by constructing a Markov Chain with stationary distribution <math>\pi</math> and simulating the chain. After a large number of iterations the Markov Chain reaches its stationary distribution, so by then sampling from the chain we are effectively sampling from the desired distribution. <br/>

Idea: generate a Markov chain whose stationary distribution is the same as the target distribution. <br/>

'''Notes'''
# Regardless of the chosen starting point, the Markov Chain will converge to its stationary distribution (if one exists). However, the time taken for the chain to converge depends on the starting point; typically, the burn-in period is longer if the chain is initialized at a value of low probability density.
# Markov Chain Monte Carlo can be used for sampling from a distribution, estimating the distribution, computing the mean, and optimization (e.g. simulated annealing; more on that later).
# Markov Chain Monte Carlo samples using "local" information. It is a generic "problem solving technique" for decision/optimization/value problems, but it is not necessarily very efficient.
# MCMC methods do not suffer as badly from the "curse of dimensionality" that hurts the efficiency of the acceptance-rejection method, because a point is generated at each time-step of the Markov Chain regardless of how many dimensions are introduced.
# The goal when simulating with a Markov Chain is to create a chain with the same stationary distribution as the target distribution.
# The MCMC method is usually used in continuous cases, but a discrete example is given below.

'''Some properties of the stationary distribution <math>\pi</math>'''

<math>\pi</math> indicates the proportion of time the process spends in each of the states 1,2,...,n, so <math>\pi</math> satisfies the following two equations (a numerical check of both is sketched right after this list): <br>

# <math>\pi_j = \sum_{i=1}^{n}\pi_i P_{ij}</math> <br /> This is because <math>\pi_i</math> is the proportion of time the process spends in state i, and <math>P_{ij}</math> is the probability that the process transitions out of state i into state j, so <math>\pi_i P_{ij}</math> is the proportion of transitions that take the process from state i into state j. Summing this over all states i gives <math>\pi_j</math>.
# <math> \sum_{i=1}^{n}\pi_i= 1 </math>, since <math>\pi</math> gives the proportion of time the chain spends in each state. If we view <math>\pi_i</math> as the probability of the chain being in state i at time t for t sufficiently large, then the entries must sum to one because the chain must be in one of the states.
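Both properties can be verified numerically in Matlab. This is a small sketch we add for illustration; the two-state transition matrix is our own choice here (the same matrix reappears in the weather example of the next class):

<pre style="font-size:14px">
P = [5/7 2/7; 3/8 5/8];              % a small two-state transition matrix
[vec, val] = eig(P');                % left eigenvectors of P are eigenvectors of P'
[~, idx] = max(diag(val));           % pick the eigenvalue equal to 1
pii = vec(:,idx) / sum(vec(:,idx));  % normalize so the entries sum to 1
pii' * P - pii'                      % property 1: pi*P = pi, so this is (numerically) zero
sum(pii)                             % property 2: the entries sum to 1
</pre>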
  
====Motivation example====

- Suppose we want to generate a random variable X according to distribution <math>\pi=(\pi_1, \pi_2, \cdots, \pi_m)</math> <br/>
X can take m possible different values from <math>\{1,2,3,\cdots, m\}</math><br />
- We want to generate <math>\{X_t: t=0, 1, \cdots\}</math> according to <math>\pi</math><br />

Suppose our example is of a biased die. <br/>
Then we have m=6, <math>\pi=[0.1,0.1,0.1,0.2,0.3,0.2]</math>, <math>X \in \{1,2,3,4,5,6\}</math><br/>

Suppose <math>X_t=i</math>. Consider an arbitrary probability transition matrix Q with entry <math>q_{ij}</math> being the probability of moving to state j from state i. (<math>q_{ij}</math> cannot be zero.) <br/>

<math> \mathbf{Q} =
\begin{bmatrix}
q_{11} & q_{12} & \cdots & q_{1m} \\
q_{21} & q_{22} & \cdots & q_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
q_{m1} & q_{m2} & \cdots & q_{mm}
\end{bmatrix}
</math> <br/>

We generate Y = j according to the i-th row of Q. Note that the i-th row of Q is a probability vector giving the probability of moving to each state j from the current state i, i.e. <math>P(Y=j)=q_{ij}</math><br />

In the following algorithm: <br>
<math>q_{ij}</math> is the <math>ij^{th}</math> entry of matrix Q. It is the probability of Y=j given that <math>x_t = i</math>. <br/>
<math>r_{ij}</math> is the probability of accepting Y as <math>x_{t+1}</math>. <br/>
'''How to get the acceptance probability?'''

If <math>\pi </math> is to be the stationary distribution, it suffices that the chain satisfy the detailed balance condition:<br/>
if <math>\pi_i P_{ij} = \pi_j P_{ji}</math> for all i and j, then <math>\pi </math> is the stationary distribution of the chain.

Since <math>P_{ij} = q_{ij} r_{ij}</math>, we need <math>\pi_i q_{ij} r_{ij} = \pi_j q_{ji} r_{ji}</math>.<br/>
We look for a solution of the form <math>r_{ij} = a(i,j) \pi_j q_{ji}</math>, where a(i,j) = a(j,i); any such choice satisfies detailed balance.<br/>

'''Recall'''
<math>r_{ij}</math> is a probability of acceptance, thus it must satisfy <br/>

1. <math>r_{ij} = a(i,j) \pi_j q_{ji} \leq 1</math>, which gives <math>a(i,j) \leq 1/(\pi_j q_{ji})</math>

2. <math>r_{ji} = a(j,i) \pi_i q_{ij} \leq 1</math>, which gives <math>a(j,i) \leq 1/(\pi_i q_{ij})</math>

So we choose a(i,j) as large as possible subject to the two conditions above:<br/>

<math>a(i,j) = \min \{\frac{1}{\pi_j q_{ji}},\frac{1}{\pi_i q_{ij}}\} </math><br/>

Thus, <math> r_{ij} = \min \{\frac{\pi_j q_{ji}}{\pi_i q_{ij}}, 1\} </math><br/>

'''Note''':
1 is the upper bound that makes r<sub>ij</sub> a valid probability.
'''Algorithm:''' <br/>
* (*) Generate Y from the current row of Q, so that <math> P(Y=j) = q_{ij} </math>
* Compute <math> r_{ij} = \min \{\frac{\pi_j q_{ji}}{\pi_i q_{ij}}, 1\} </math> (note that <math>\frac{\pi_j q_{ji}}{\pi_i q_{ij}}</math> is a positive ratio) <br/>
*<math>
x_{t+1} = \begin{cases}
Y, & \text{with probability } r_{ij} \\
x_t, & \text{otherwise} \end{cases} </math> <br/>
* go back to the first step (*) <br/>

We can compare this with the Acceptance-Rejection method we learned before: <br/>
* <math>U \sim Uniform(0,1)</math> <br/>
* If <math>U < r_{ij}</math>, then accept. <br/>
EXCEPT that here a point is always generated at each time-step. <br>

The algorithm generates a stochastic sequence in which each value depends only on the last state, i.e. a Markov Chain.<br>
====Metropolis Algorithm====

'''Proposition:''' Metropolis works:

The <math>P_{ij}</math>'s from the Metropolis Algorithm satisfy the detailed balance property w.r.t <math>\pi</math>, i.e. <math>\pi_i P_{ij} = \pi_j P_{ji}</math>, so the new Markov Chain has stationary distribution <math>\pi</math>. <br/>
'''Remarks:''' <br/>
1) We only need to know the ratios of the values of the <math>\pi_i</math>'s.<br/>
2) The Markov Chain might converge to <math>\pi</math> at varying speeds, depending on the proposal distribution and the value with which the chain is initialized.<br/>

This algorithm generates <math>\{x_t: t=0,...,m\}</math>. <br/>
In the long run, the marginal distribution of <math> x_t </math> is the stationary distribution <math>\underline{\Pi} </math><br>
<math>\{x_t: t = 0, 1,...,m\}</math> is a Markov chain with probability transition matrix (PTM), P.<br>
This is a Markov Chain since <math> x_{t+1} </math> only depends on <math> x_t </math>, where <br>
<math> P_{ij}= \begin{cases}
q_{ij} r_{ij}, & \text{if }i \neq j \\[6pt]
1 - \displaystyle\sum_{k \neq i} q_{ik} r_{ik}, & \text{if }i = j \end{cases} </math><br />

<math>q_{ij}</math> is the probability of generating (proposing) state j from state i; <br/>
<math> r_{ij}</math> is the probability of accepting state j as the next state. <br/>

Therefore, the final probability of moving from state i to state j, when i does not equal j, is <math>q_{ij} r_{ij}</math>. <br/>
For the probability of staying in state i, we subtract from 1 the probabilities of moving from state i to every state <math>j \neq i</math>, which gives the second case.
 
  
===Proof of the proposition:===

A good way to think of the detailed balance equations is that they balance the probability flow from state i to state j with the flow from state j to state i.

We need to show that the stationary distribution of the Markov Chain is <math>\underline{\Pi}</math>, i.e. <math>\displaystyle \underline{\Pi} = \underline{\Pi}P</math><br />

Recall<br/>
If a Markov chain satisfies the detailed balance property, i.e. <math>\displaystyle \pi_i P_{ij} = \pi_j P_{ji} \, \forall i,j</math>, then <math>\underline{\Pi}</math> is the stationary distribution of the chain.<br /><br />
 
  
'''Proof:'''

Case 1: WLOG, assume <math>\frac{\pi_j q_{ji}}{\pi_i q_{ij}}<1</math><br/>

LHS:<br />
<math>\pi_i P_{ij} = \pi_i q_{ij} r_{ij} = \pi_i q_{ij} \cdot \min(\frac{\pi_j q_{ji}}{\pi_i q_{ij}},1) = \cancel{\pi_i q_{ij}} \cdot \frac{\pi_j q_{ji}}{\cancel{\pi_i q_{ij}}} = \pi_j q_{ji}</math><br />

RHS:<br />
Note that by our assumption, since <math>\frac{\pi_j q_{ji}}{\pi_i q_{ij}}<1</math>, its reciprocal <math>\frac{\pi_i q_{ij}}{\pi_j q_{ji}} \geq 1</math><br />
So <math>\displaystyle \pi_j P_{ji} = \pi_j q_{ji} r_{ji} = \pi_j q_{ji} \cdot \min(\frac{\pi_i q_{ij}}{\pi_j q_{ji}},1) = \pi_j q_{ji} \cdot 1 = \pi_j q_{ji}</math><br />

Hence LHS=RHS in this case.

Case 2: Now assume <math>\frac{\pi_j q_{ji}}{\pi_i q_{ij}} \geq 1</math><br/>

LHS:<br />
<math>\pi_i P_{ij} = \pi_i q_{ij} r_{ij} = \pi_i q_{ij} \cdot \min(\frac{\pi_j q_{ji}}{\pi_i q_{ij}},1) = \pi_i q_{ij} \cdot 1 = \pi_i q_{ij}</math><br />

RHS:<br />
By our assumption, since <math>\frac{\pi_j q_{ji}}{\pi_i q_{ij}}\geq 1</math>, its reciprocal <math>\frac{\pi_i q_{ij}}{\pi_j q_{ji}} \leq 1 </math> <br />
So <math>\displaystyle \pi_j P_{ji} = \pi_j q_{ji} r_{ji} = \pi_j q_{ji} \cdot \min(\frac{\pi_i q_{ij}}{\pi_j q_{ji}},1) = \cancel{\pi_j q_{ji}} \cdot \frac{\pi_i q_{ij}}{\cancel{\pi_j q_{ji}}} = \pi_i q_{ij}</math><br />

Hence LHS=RHS in this case as well, which establishes <math>\pi_i P_{ij} = \pi_j P_{ji}</math> <math>\square</math><br /><br />

'''Note'''<br />
1) The two cases together cover every pair (i,j) with <math>i \neq j</math>, since the ratio <math>\frac{\pi_j q_{ji}}{\pi_i q_{ij}}</math> is either less than 1 or at least 1.<br />
2) If <math>\displaystyle i = j</math>, then detailed balance holds trivially.<br />
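To make the proposition concrete, here is a small numerical check we add for illustration (the biased-die <math>\pi</math> is the one used in the next class, and the uniform proposal is an assumption made for simplicity): build P from q and r, then confirm detailed balance and stationarity.

<pre style="font-size:14px">
pii = [0.1 0.1 0.2 0.4 0.1 0.1];   % target distribution (biased die)
m = 6;
q = ones(m)/m;                     % symmetric uniform proposal, q_ij = 1/6
P = zeros(m);
for i = 1:m
  for j = 1:m
    if i ~= j
      P(i,j) = q(i,j) * min(pii(j)*q(j,i)/(pii(i)*q(i,j)), 1);  % P_ij = q_ij * r_ij
    end
  end
  P(i,i) = 1 - sum(P(i,:));        % stay put with the leftover probability
end
M = diag(pii)*P;                   % M(i,j) = pi_i * P_ij
max(max(abs(M - M')))              % detailed balance: pi_i P_ij = pi_j P_ji, so ~0
max(abs(pii*P - pii))              % stationarity: pi*P = pi, so ~0
</pre>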
== Class 18 - Thursday July 4th 2013 ==
=== Last class ===

Recall the acceptance probability,
<math>r_{ij}=\min(\frac {{\pi_j}q_{ji}}{{\pi_i}q_{ij}},1)</math> <br />

1)  <math>r_{ij}=\frac {{\pi_j}q_{ji}}{{\pi_i}q_{ij}}</math> and <math>r_{ji}=1 </math>, when <math>\frac {{\pi_j}q_{ji}}{{\pi_i}q_{ij}} < 1</math> <br />

2)  <math>r_{ji}=\frac {{\pi_i}q_{ij}}{{\pi_j}q_{ji}}</math> and <math>r_{ij}=1 </math>, when <math>\frac {{\pi_j}q_{ji}}{{\pi_i}q_{ij}} \geq 1</math> <br />
 
  
===Example: Discrete Case===

Consider a biased die with
<math>\pi</math> = [0.1, 0.1, 0.2, 0.4, 0.1, 0.1]

We could use any <math>6 \times 6 </math> matrix <math> \mathbf{Q} </math> as the proposal distribution. <br>
For the sake of simplicity, a discrete uniform distribution is the simplest choice: all its entries are equal, so during the calculation of r the factors <math>q_{xy}</math> and <math>q_{yx}</math> cancel each other out.

<math> \mathbf{Q} =
\begin{bmatrix}
1/6 & 1/6 & \cdots & 1/6 \\
1/6 & 1/6 & \cdots & 1/6 \\
\vdots & \vdots & \ddots & \vdots \\
1/6 & 1/6 & \cdots & 1/6
\end{bmatrix}
</math> <br/>
 
  
  
  
'''Algorithm''' <br>
1. <math>x_t=5</math> (start the chain in state 5, so proposals are drawn from the 5th row of Q; we can initialize the chain anywhere within the support)<br />
2. Y~Unif[1,2,...,6]<br />
3. <math> r_{ij} = \min \{\frac{\pi_j q_{ji}}{\pi_i q_{ij}}, 1\} = \min \{\frac{\pi_j \cdot 1/6}{\pi_i \cdot 1/6}, 1\} = \min \{\frac{\pi_j}{\pi_i}, 1\}</math><br>
Note: the current state <math>i</math> is <math>X_t</math>, and the candidate state <math>j</math> is <math>Y</math>. <br>
Note: since <math>q_{ij}= q_{ji}</math> for all i and j, i.e. the proposal distribution is symmetric, we have <math> r_{ij} = \min \{\frac{\pi_j}{\pi_i }, 1\} </math><br/>
4. U~Unif(0,1)<br/>
if <math>u \leq r_{ij}</math>, X<sub>t+1</sub>=Y<br />
else X<sub>t+1</sub>=X<sub>t</sub><br />
go back to 2<br>

Notice how a point is always generated for X<sub>t+1</sub>, regardless of whether the candidate state Y is accepted. <br>
  
'''Matlab'''
<pre style="font-size:14px">
pii=[.1,.1,.2,.4,.1,.1];
x(1)=5;
for ii=2:1000
  Y=unidrnd(6);                      % unidrnd(k) generates an integer uniformly between 1 and k
  r = min(pii(Y)/pii(x(ii-1)), 1);   % q cancels since the proposal is symmetric
  u=rand;
  if u<r
    x(ii)=Y;
  else
    x(ii)=x(ii-1);
  end
end
hist(x,6)           % histogram of all 1000 points
xx = x(501:end);    % discard the first 500 points; after that the chain has mixed well
hist(xx,6)          % this histogram should match pi more closely
</pre>
[[File:MH_example1.jpg|300px]]
 
  
'''NOTE:''' Generally, we generate a large number of points (say, 1500) and throw away the points that were generated first (say, 500). Those first points belong to the [[burn-in period]]. A chain converges to the limiting distribution eventually, but not immediately; the burn-in period is the beginning portion of the run before the chain has converged to the desired distribution. By discarding those 500 points, our data set becomes more representative of the desired limiting distribution; once the burn-in period is over, we say that the chain "mixes well".
  
===Alternate Example: Discrete Case===

Consider the weather. If it is sunny one day, there is a 5/7 chance it will be sunny the next. If it is rainy, there is a 5/8 chance it will be rainy the next. Write the target distribution as
<math>\pi= [\pi_1 \ \pi_2] </math>

These weather dynamics give the transition matrix

<math> \mathbf{Q} =
\begin{bmatrix}
5/7 & 2/7 \\
3/8 & 5/8\\
\end{bmatrix}
</math> <br/>

As the proposal distribution in the algorithm below we use a discrete uniform distribution, because it is the simplest.
 
  
'''Algorithm''' <br>
1. Set the initial chain state: <math>x_t=1</math> (i.e. start in state 1, although we could also choose state 2)<br />
2. Sample from the proposal distribution: Y~q(y|x) = Unif[1,2]<br />
3. <math> r_{ij} = \min \{\frac{\pi_j q_{ji}}{\pi_i q_{ij}}, 1\} = \min \{\frac{\pi_j \cdot 1/2}{\pi_i \cdot 1/2}, 1\} = \min \{\frac{\pi_j}{\pi_i}, 1\}</math><br>
'''Note:''' The current state <math>i</math> is <math>X_t</math>, and the candidate state <math>j</math> is <math>Y</math>. Since <math>q_{ij}= q_{ji}</math> for all i and j, i.e. the proposal distribution is symmetric, we have <math> r_{ij} = \min \{\frac{\pi_j}{\pi_i }, 1\} </math>

4. U~Unif(0,1)<br>
  If  <math>U \leq r_{ij}</math>, then<br>
        <math>X_{t+1}=Y</math><br>
  else<br />
        <math>X_{t+1}=X_t</math><br>
  end if<br />
5. Go back to step 2 (a Matlab sketch of this example is given right below)<br>
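For completeness, here is a Matlab sketch of this example in the same style as the biased-die code above. It is our own illustration; in particular we assume the target <math>\pi</math> is the stationary distribution of the weather matrix <math>\mathbf{Q}</math>, which works out to <math>\pi = [21/37, \ 16/37]</math>.

<pre style="font-size:14px">
pii = [21/37, 16/37];   % assumed target: stationary distribution of the weather matrix
x(1) = 1;
for ii = 2:1000
  Y = unidrnd(2);                    % uniform proposal over the two states
  r = min(pii(Y)/pii(x(ii-1)), 1);   % q cancels since the proposal is symmetric
  u = rand;
  if u <= r
    x(ii) = Y;
  else
    x(ii) = x(ii-1);
  end
end
mean(x==1)   % should be close to 21/37 = 0.5676 once the chain has mixed
</pre>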
  
'''Generalization of the above framework to the continuous case'''<br>

In place of <math>\pi</math> use <math>f(x)</math><br>
In place of q<sub>ij</sub> use <math>q(y|x)</math> <br>
In place of r<sub>ij</sub> use <math>r(x,y)</math> <br>
Here, q(y|x) is a friendly distribution that is easy to sample from; usually a symmetric distribution is preferable, so that <math>q(y|x) = q(x|y)</math> simplifies the computation of <math>r(x,y)</math>.
'''Remarks'''<br>
1. The chain may not reach the stationary distribution if the number of steps generated is small; it can take a very large number of steps to move through the whole support.<br>
2. The algorithm can be performed with a <math>\pi</math> that is not even a probability mass function; it merely needs to be proportional to the probability mass function we wish to sample from. This is useful because we do not need to calculate the normalization factor. <br>

For example, if we are given the unnormalized vector <math>\pi' = [5,10,11,2,100,1]</math>, we could normalize it by dividing by the sum <math>s</math> of all its entries. However, we notice that when calculating <math>r_{ij}</math>, <br>
<math>\frac{\pi'_j/s}{\pi'_i/s}\times\frac{q_{ji}}{q_{ij}}=\frac{\pi'_j}{\pi'_i}\times\frac{q_{ji}}{q_{ij}}</math> <br>
so <math>s</math> cancels out. Therefore it is not necessary to calculate the sum and normalize the vector.<br>

This also applies to the continuous case, where we merely need <math> f(x) </math> to be proportional to the pdf of the distribution we wish to sample from. <br>
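As a quick illustration of remark 2 (a sketch we add here, reusing the unnormalized vector <math>\pi'</math> above), the discrete sampler runs unchanged on unnormalized weights:

<pre style="font-size:14px">
w = [5 10 11 2 100 1];          % unnormalized weights pi'
x(1) = 1;
for ii = 2:5000
  Y = unidrnd(6);
  r = min(w(Y)/w(x(ii-1)), 1);  % the normalizing constant s cancels in the ratio
  if rand <= r
    x(ii) = Y;
  else
    x(ii) = x(ii-1);
  end
end
hist(x(501:end),6)   % matches w/sum(w) although sum(w) is never used in the sampler
</pre>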
===Metropolis–Hasting Algorithm===

'''Definition''': <br>
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. It can draw samples from any probability distribution P(x), provided you can compute the value of a function f(x) that is proportional to the density of P. <br>

'''Purpose''': <br>
"The purpose of the Metropolis-Hastings Algorithm is to <b>generate a collection of states according to a desired distribution</b> <math>P(x)</math>. <math>P(x)</math> is chosen to be the stationary distribution of a Markov process, <math>\pi(x)</math>." <br>
Source: (http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm)<br>
 
  
Metropolis-Hastings is an algorithm for constructing a Markov chain with a given limiting probability distribution. In particular, we consider what happens if we apply the Metropolis-Hastings algorithm repeatedly to a "proposal" distribution that has already been updated.<br>

The algorithm is named after Nicholas Metropolis and W. K. Hastings, who extended it to the more general case in 1970.<br>

<math>q(y|x)</math> is used instead of <math>q_{ij}</math>: in the continuous case this notation means, given the current state x, the probability density of proposing y.<br>

Note that the Metropolis-Hastings algorithm possesses some advantageous properties. One is that it "can be used when <math>\pi(x)</math> is known up to the constant of proportionality". The second is that it does not require the conditional distributions, which, in contrast, the Gibbs sampler does require.
Source: https://www.msu.edu/~blackj/Scan_2003_02_12/Chapter_11_Markov_Chain_Monte_Carlo_Methods.pdf

'''Differences between the discrete and continuous case of the Markov Chain''':<br/>
1. <math>q(y|x)</math> is used in the continuous case, instead of <math>q_{ij}</math> in the discrete case <br/>
2. <math>r(x,y)</math> is used in the continuous case, instead of <math>r_{ij}</math> in the discrete case <br/>
3. <math>f</math> is used instead of <math>\pi</math> <br/>
 
+
clc
 
+
b=2;
'''Build the Acceptance Ratio'''<br/>
+
x(1)=0;
Before we consider the algorithm there are a couple general steps to follow to build the acceptance ratio:<br/>
+
for i=2:10000
 +
    y=b*randn+x(i-1);
 +
    r=min((1+x(i-1)^2)/(1+y^2),1);
 +
    u=rand;
 +
    if u<r
 +
        x(i)=y;
 +
    else
 +
        x(i)=x(i-1);
 +
    end
 +
   
 +
end
 +
hist(x,100);
 +
%The Markov Chain usually takes some time to converge and this is known as the "burning time".
 +
</pre>
 +
[[File:MH_example2.jpg|300px]]
  
a) Find the distribution you wish to use to generate samples from<br/>
+
However, while the data does approximately fit the desired distribution, it takes some time until the chain gets to the stationary distribution. To generate a more accurate graph, we modify the code to ignore the initial points.<br>
b) Find a candidate distribution that fits the desired distribution, q(y|x). (the proposed moves are independent of the current state)<br/>
 
c) Build the acceptance ratio <math>\displaystyle \frac{f(y)q(x|y)}{f(x)q(y|x)}</math>
 
  
Assume that f(y) is the target distribution; choose q(y|x) such that it is a friendly distribution that is easy to sample from.<br />
'''Algorithm:''' (a Matlab sketch with a genuinely asymmetric proposal follows these steps)<br />

# Set <math>\displaystyle i = 0</math> and initialize the chain, i.e. <math>\displaystyle x_0 = s</math> where <math>\displaystyle s</math> is some state of the Markov Chain.
# Sample <math>\displaystyle Y \sim q(y|x)</math>
# Set <math>\displaystyle r(x,y) = \min(\frac{f(y)q(x|y)}{f(x)q(y|x)},1)</math>
# Sample <math>\displaystyle u \sim \text{UNIF}(0,1)</math>
# If <math>\displaystyle u \leq r(x,y), x_{i+1} = Y</math><br /> Else <math>\displaystyle x_{i+1} = x_i</math>
# Increment i by 1 and go to Step 2, i.e. <math>\displaystyle i=i+1</math>
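All the examples in these notes use symmetric proposals, for which the q terms cancel in step 3. As our own illustration of the full ratio with an asymmetric proposal, take the target <math>f(x) \propto x(1-x)</math> on (0,1) (i.e. Beta(2,2)) and an independence proposal with density <math>q(y)=2y</math>, sampled as <math>Y=\sqrt{U}</math>; the proposal ratio then no longer cancels:

<pre style="font-size:14px">
x(1)=0.5;
for ii=2:10000
    y = sqrt(rand);              % proposal with density q(y) = 2y on (0,1)
    % r = [f(y)q(x|y)] / [f(x)q(y|x)] with f(t) proportional to t(1-t) and q(t)=2t:
    r = min( (y*(1-y)*2*x(ii-1)) / (x(ii-1)*(1-x(ii-1))*2*y), 1 );
    if rand <= r
        x(ii)=y;
    else
        x(ii)=x(ii-1);
    end
end
hist(x(501:end),50)   % should resemble the Beta(2,2) density
</pre>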
<br> '''Note''': q(x|y) is the density of moving from y to x, and q(y|x) that of moving from x to y.
<br>We choose q(y|x) so that it is simple to sample from.
<br>Usually, we choose a normal distribution.

'''NOTE2''': The proposal q(y|x) depends on (is conditional on) x, the current state, and its support should match the support of the target; e.g. q(y|x) ~ N(x, b<sup>2</sup>) is a proposal that depends on x.
If the next state is INDEPENDENT of the current state, then the proposal will not depend on x; e.g. in A4 Q2 we sampled from Beta(2,2) with the proposal UNIF(0,1), which is independent of the current state.
However, it is important to remember that even if generating the proposed/candidate state does not depend on the current state, the resulting chain is still a Markov chain.

<br />
Compared with the previous sampling methods we have learned, the samples generated by the M-H algorithm are not independent of each other, since we accept or reject the future sample based on the current sample. Furthermore, unlike the acceptance-rejection method, Metropolis-Hastings never discards a step: in the equivalent of the "reject" case we simply leave the state unchanged. In other words, if we need a sample of 1000 points, we only need to run the generation step 1000 times.<br/>

<p style="font-size:20px;color:red;">
Remarks
</p>
 
===='''Remark 1'''====
 
<span style="text-shadow: 0px 2px 3px 3399CC;margin-right:1em;font-family: 'Nobile', Helvetica, Arial, sans-serif;font-size:16px;line-height:25px;color:3399CC">
 
A common choice for <math>q(y|x)</math> is a normal distribution centered at x with standard deviation b. <math>q(y|x)=N(x,b^2)</math>
 
  
In this case, <math> q(y|x)</math> is symmetric.
 
  
i.e.
 
<math>q(y|x)=q(x|y)</math><br>
 
(we want to sample q centered at the current state.)<br>
 
<math>q(y|x)=\frac{1}{\sqrt{2\pi}b}\,e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2b^2} (y-x)^2}</math>, (centered at x)<br>
 
<math>q(x|y)=\frac{1}{\sqrt{2\pi}b}\,e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2b^2} (x-y)^2}</math>,(centered at y)<br>
 
<math>\Rightarrow (y-x)^2=(x-y)^2</math><br>
 
so <math>~q(y \mid x)=q(x \mid y)</math> <br>
 
In this case <math>\frac{q(x \mid y)}{q(y \mid x)}=1</math> and therefore <math> r(x,y)=\min \{\frac{f(y)}{f(x)}, 1\} </math> <br/><br />
 
This is true for any symmetric q. In general if q(y|x) is symmetric, then this algorithm is called Metropolis.<br/>
 
When choosing the function q, it makes sense to choose a distribution with the same support as the distribution you want to simulate; e.g. if the target is a Beta distribution, we can choose q ~ Uniform(0,1)<br>
 
The chosen q is not necessarily symmetric. Depending on different target distribution, q can be uniform.</span>
 
  
===='''Remark 2'''====
<span style="text-shadow: 0px 2px 3px 3399CC;margin-right:1em;font-family: 'Nobile', Helvetica, Arial, sans-serif;font-size:16px;line-height:25px;color:3399CC">
The value y is accepted if <math>u \leq \min\{\frac{f(y)}{f(x)},1\}</math>, so it is accepted with probability <math>\min\{\frac{f(y)}{f(x)},1\}</math>.<br/>
Thus, if <math>f(y) \geq f(x)</math>, then y is always accepted.<br/>
The higher the value of the pdf in the vicinity of a point <math>y_1</math>, the more likely it is that a random variable will take on values around <math>y_1</math>.<br/>
Therefore, we want a high probability of acceptance for points generated near <math>y_1</math>.<br>
 
[[File:Diag1.png‎]]<br>
 
  
'''Note''':<br/>
 
If the proposal comes from a region with low density, we may or may not accept; however, we accept for sure if the proposal comes from a region with higher density than the current state.<br></span>
 
  
===='''Remark 3'''====

One strength of the Metropolis-Hastings algorithm is that normalizing constants, which are often quite difficult to determine, cancel out in the ratio <math> r </math>. (Also notice that the Metropolis algorithm is just a special case of the Metropolis-Hastings algorithm, obtained when the proposal is symmetric.) For example, consider the case where we want to sample from the beta distribution, which has the pdf:<br>

<math>
\begin{align}
f(x;\alpha,\beta)& = \frac{1}{\mathrm{B}(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\end{align}
</math>

The beta function, ''B'', appears as a normalizing constant, but it cancels by construction of the method.
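A minimal sketch of this remark (our own illustration, assuming <math>\alpha=2</math>, <math>\beta=5</math> and an independent Unif(0,1) proposal, for which the q terms also cancel): the sampler only ever evaluates the unnormalized kernel and never computes <math>\mathrm{B}(\alpha,\beta)</math>.

<pre style="font-size:14px">
a=2; b=5;
k = @(t) t.^(a-1).*(1-t).^(b-1);   % unnormalized beta kernel; B(a,b) never computed
x(1)=0.5;
for ii=2:10000
    y=rand;                        % independent Unif(0,1) proposal, q(y|x)=q(x|y)=1
    r=min(k(y)/k(x(ii-1)),1);
    if rand<=r
        x(ii)=y;
    else
        x(ii)=x(ii-1);
    end
end
hist(x(501:end),50)   % should resemble the Beta(2,5) density
</pre>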
====='''Example'''=====

<math>\,f(x)=\frac{1}{\pi}\frac{1}{1+x^{2}}</math>, where <math>\frac{1}{\pi} </math> is the normalization factor and <math>\frac{1}{1+x^{2}} </math> gives the shape of the target distribution. <br>
Then, we have <math>\,f(x)\propto\frac{1}{1+x^{2}}</math>.<br>
And let us take <math>\,q(y|x)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math>.<br>
Then <math>\,q(y|x)</math> is symmetric since <math>\,(y-x)^{2} = (x-y)^{2}</math>.<br>
Therefore the acceptance ratio <math>r(x,y)</math> simplifies.
 
  
We get:

<math>\,\begin{align}
\displaystyle r(x,y)
& =\min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} \\
& =\min\left\{\frac{f(y)}{f(x)},1\right\} \\
& =\min\left\{ \frac{ \frac{1}{1+y^{2}} }{ \frac{1}{1+x^{2}} },1\right\}\\
& =\min\left\{ \frac{1+x^{2}}{1+y^{2}},1\right\}\\
\end{align}
</math>
 
  
<br/>
For instance, <math>\pi=[0.1\,0.1\,...] </math> is a probability vector;<br/>
<math>\pi \propto [3\,2\, 10\, 100\, 1.5] </math> is not a probability vector, so we take<br/>
<math>\Rightarrow \pi=1/c \times [3\, 2\, 10\, 100\, 1.5]</math>, which is a probability vector, where<br/>
<math>\Rightarrow c=3+2+10+100+1.5 </math><br/>

In practice, if the elements of <math>\pi</math> are functions or random variables, we need c to be the normalization factor, i.e. the summation/integration over all members of <math>\pi</math>. This is usually very difficult to compute. Since the Metropolis-Hastings algorithm only takes ratios, it is not necessary to do this.

<br>
For example, to model the relationship between weather temperature and humidity, we may only have a function proportional to a density rather than a probability function. To turn it into a probability function we would need to compute c, which is really difficult; but we don't need to, as c cancels out during the calculation of r.<br>
 
  
======'''MATLAB'''======
The Matlab code of the algorithm is the following:
<pre style="font-size:12px">
clear all
close all
clc
b=2;
x(1)=0;
for i=2:10000
    y=b*randn+x(i-1);                % proposal Y ~ N(x(i-1), b^2)
    r=min((1+x(i-1)^2)/(1+y^2),1);   % acceptance ratio for f(x) proportional to 1/(1+x^2)
    u=rand;
    if u<r
        x(i)=y;
    else
        x(i)=x(i-1);
    end
end
hist(x,100);
% The Markov Chain usually takes some time to converge; this is known as the burn-in period.
</pre>
[[File:MH_example2.jpg|300px]]
 
  
However, while the data does approximately fit the desired distribution, it takes some time until the chain reaches the stationary distribution. To generate a more accurate graph, we modify the code to ignore the initial points.<br>

'''MATLAB'''
<pre style="font-size:16px">
b=2;
x(1)=0;
for ii=2:10500
    y=b*randn+x(ii-1);
    r=min((1+x(ii-1)^2)/(1+y^2),1);
    u=rand;
    if u<=r
        x(ii)=y;
    else
        x(ii)=x(ii-1);
    end
end
xx=x(501:end);  % discard the first 500 points: they don't show the limiting behaviour of the Markov Chain
hist(xx,100)
</pre>
[[File:MH_Ex.jpg|300px]]
 
<br>
'''If a function f(x) can only take values in <math>[0,\infty)</math>, but we want to use a normal distribution as the candidate distribution, we can propose <math>y=|z|</math> with <math>z \sim N(x,b^2)</math>. The resulting proposal density <math>q(y|x)=\phi_b(y-x)+\phi_b(y+x)</math> (the pdf of the absolute value of a normal distribution centered around x) is symmetric in x and y, so it still cancels in the ratio.'''<br><br>

Example:<br>
We want to sample from <math>Exp(2)</math>, using <math>q(y|x)~\sim~N(x,b^2)</math> folded to <math>[0,\infty)</math> as above.<br>
<math>r=\frac{f(y)}{f(x)}=\frac{2e^{-2y}}{2e^{-2x}}=e^{2(x-y)}</math><br>
<math>r=min(e^{2(x-y)},1)</math><br>
 
  
'''MATLAB'''
<pre style="font-size:16px">
b=1;                                % proposal standard deviation
x(1)=0;
for ii=2:100
    y=abs(b*randn+x(ii-1));         % folded-normal proposal, y >= 0
    r=min(exp(2*(x(ii-1)-y)),1);    % f(y)/f(x) for f(x)=2*exp(-2x)
    u=rand;
    if u<=r
        x(ii)=y;
    else
        x(ii)=x(ii-1);
    end
end
</pre>
<br>
 
<br>
 
  
'''Definition of Burn-in:'''

Typically in a MH algorithm, a set of values generated at the beginning of the sequence is "burned" (discarded), after which the chain is assumed to have converged to its target distribution. In the first example above, we "burned" the first 500 observations because we believed the chain had not quite reached our target distribution within the first 500 observations. 500 is not a set threshold; there is no right or wrong answer as to the exact number required for burn-in. Theoretical calculation of the burn-in is rather difficult; in the above-mentioned example, we chose 500 based on experience and quite arbitrarily.

Burn-in time can also be thought of as the time it takes for the chain to reach its stationary distribution. Therefore, we disregard everything up to the burn-in period, because the chain has not stabilized yet.

The Metropolis–Hastings algorithm is started from an arbitrary initial value <math>x_0</math> and run for many iterations until this initial state is "forgotten". These discarded samples are known as ''burn-in''. The remaining set of accepted values of <math>x</math> represents a sample from the distribution f(x). (http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm)<br/>

For instance, suppose it takes 5 samples to reach the stationary distribution. You should disregard the first five samples and consider the remaining samples as representing your target distribution f(x). <br>

Several extensions have been proposed in the literature to speed up convergence and reduce the so-called "burn-in" period. One common suggestion is to match the first few moments of q(y|x) to f(x).

'''Aside''': The algorithm works best if the candidate density q(y|x) matches the shape of the target distribution f(x). If a normal distribution is used as the candidate distribution, the variance parameter b<sup>2</sup> has to be tuned during the burn-in period. <br/>

1. If b is chosen to be too small, the chain will mix slowly (the proposed moves are small, so the acceptance rate is high but the chain explores the support slowly and converges to f(x) only slowly).

2. If b is chosen to be too large, the acceptance rate will be low (the proposed moves often land in regions of much lower density and are rejected, so the chain again converges to f(x) only slowly).
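Since the burn-in length is usually judged empirically, a trace plot is a common diagnostic. Below is a small sketch that regenerates the Cauchy-target chain from the earlier MATLAB code and plots its trace and running mean; the running mean should flatten once the chain has converged:

<pre style="font-size:12px">
% Eyeballing the burn-in period with a trace plot (sketch).
b=2; x(1)=0;
for i=2:10000
    y=b*randn+x(i-1);
    if rand < min((1+x(i-1)^2)/(1+y^2),1)
        x(i)=y;
    else
        x(i)=x(i-1);
    end
end
plot(x); xlabel('iteration'); ylabel('state')  % trace of the chain
figure
plot(cumsum(x)./(1:length(x)))                 % running mean flattens after burn-in
</pre>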
'''Note''':
The histogram looks much nicer if we discard the points within the burn-in time.<br>

Example: Use the M-H method to generate samples from f(x)=2x, 0<x<1, 0 otherwise.

1) Initialize the chain with <math>x_0</math> and set <math>i=0</math>

2) <math>Y~\sim~q(y|x_i)</math>, where our proposal distribution is uniform on [0,1], since it matches the support of the target.

=><math>Y~\sim~Unif[0,1]</math>

3) Consider <math>\frac{f(y)}{f(x)}=\frac{y}{x}</math>, so
<math>r(x,y)=min (\frac{y}{x},1)</math>, since q(y|x<sub>i</sub>) and q(x<sub>i</sub>|y) cancel each other.

4) <math>X_{i+1}=Y</math> with probability <math>r(x,y)</math>,
<math>X_{i+1}=X_i</math>, otherwise

5) <math>i=i+1</math>; go to 2

A short MATLAB sketch of these steps is given below.

<br>
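A minimal MATLAB sketch of steps 1)-5) above; the run length of 10000 and the starting point 0.5 are arbitrary choices:

<pre style="font-size:12px">
% M-H sampler for f(x)=2x on (0,1), following steps 1)-5) above (sketch).
x(1) = 0.5;
for i = 2:10000
    y = rand;                 % step 2: Y ~ Unif[0,1]
    r = min(y/x(i-1), 1);     % step 3: f(y)/f(x) = y/x; q cancels
    if rand < r               % step 4: accept with probability r
        x(i) = y;
    else
        x(i) = x(i-1);
    end
end
hist(x,50)                    % histogram should increase linearly in x
</pre>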
  
Example from Wikipedia

===Step-by-step instructions===

Suppose the most recent value sampled is <math>x_t\,</math>. To follow the Metropolis–Hastings algorithm, we next draw a new proposal state <math>x'\,</math> with probability density <math>Q(x'\mid x_t)\,</math>, and calculate a value

:<math>
a = a_1 a_2\,
</math>

where

:<math>
a_1 = \frac{P(x')}{P(x_t)} \,\!
</math>

is the likelihood ratio between the proposed sample <math>x'\,</math> and the previous sample <math>x_t\,</math>, and

:<math>
a_2 = \frac{Q(x_t \mid x')}{Q(x'\mid x_t)}
</math>

is the ratio of the proposal density in two directions (from <math>x_t\,</math> to <math>x'\,</math> and ''vice versa''). This is equal to 1 if the proposal density is symmetric. The new state <math>\displaystyle x_{t+1}</math> is then chosen according to the following rules.

:<math>
\begin{matrix}
\mbox{If } a \geq 1: &  \\
& x_{t+1} = x',
\end{matrix}
</math>
:<math>
\begin{matrix}
\mbox{else} & \\
& x_{t+1} = \left\{
                  \begin{array}{lr}
                      x' & \mbox{ with probability }a \\
                      x_t & \mbox{ with probability }1-a.
                  \end{array}
            \right.
\end{matrix}
</math>

The Markov chain is started from an arbitrary initial value <math>\displaystyle x_0</math> and the algorithm is run for many iterations until this initial state is "forgotten". These samples, which are discarded, are known as ''burn-in''. The remaining set of accepted values of <math>x</math> represents a sample from the distribution <math>P(x)</math>.

The algorithm works best if the proposal density matches the shape of the target distribution <math>\displaystyle P(x)</math> from which direct sampling is difficult, that is <math>Q(x'\mid x_t) \approx P(x') \,\!</math>. If a Gaussian proposal density <math>\displaystyle Q</math> is used, the variance parameter <math>\displaystyle \sigma^2</math> has to be tuned during the burn-in period. This is usually done by calculating the ''acceptance rate'', which is the fraction of proposed samples that is accepted in a window of the last <math>\displaystyle N</math> samples. The desired acceptance rate depends on the target distribution; however, it has been shown theoretically that the ideal acceptance rate for a one-dimensional Gaussian distribution is approximately 50%, decreasing to approximately 23% for an <math>\displaystyle N</math>-dimensional Gaussian target distribution.<ref name=Roberts/>

If <math>\displaystyle \sigma^2</math> is too small, the chain will ''mix slowly'' (i.e., the acceptance rate will be high but successive samples will move around the space slowly, and the chain will converge only slowly to <math>\displaystyle P(x)</math>). On the other hand, if <math>\displaystyle \sigma^2</math> is too large, the acceptance rate will be very low because the proposals are likely to land in regions of much lower probability density, so <math>\displaystyle a_1</math> will be very small and again the chain will converge very slowly.
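The acceptance rate described above can be monitored directly from the chain. Below is a small sketch, reusing the Cauchy-target chain from the earlier MATLAB examples; with a continuous proposal, a repeated state almost surely indicates a rejection, so the fraction of distinct consecutive states estimates the acceptance rate:

<pre style="font-size:12px">
% Estimate the acceptance rate of a continuous-proposal M-H chain (sketch).
b = 2;
x(1) = 0;
for i = 2:10000
    y = b*randn + x(i-1);
    r = min((1+x(i-1)^2)/(1+y^2), 1);
    if rand < r
        x(i) = y;
    else
        x(i) = x(i-1);
    end
end
acc_rate = mean(x(2:end) ~= x(1:end-1))   % fraction of accepted proposals
</pre>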
 
  
== Class 19 - Tuesday July 9th 2013 ==
'''Recall: Metropolis–Hastings Algorithm'''

1) <math>X_i</math> = state of chain at time i. Set <math>X_0</math> = 0<br>
2) Generate a proposal: Y ~ q(y|x) <br>
3) Set <math>\,r=min[\frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)}\,,1]</math><br>
4) Generate U ~ U(0,1)<br>
  If <math>U<r</math>, then<br>
        <math>X_{i+1} = Y</math> % i.e. we accept Y as the next point in the Markov Chain <br>
  else <br>
        <math>X_{i+1}</math> = <math>X_i</math><br>
  End if<br>
5) Set i = i + 1. Return to Step 2. <br>
  
Why can we use this algorithm to generate a Markov Chain?<br>

<math>\,Y</math>~<math>\,q(y|x)</math> depends only on the current state, so the chain satisfies the Markov (memoryless) property: the next state does not depend on earlier trials. Note that Y does not '''''have''''' to depend on X<sub>t-1</sub>; the Markov property is satisfied as long as Y is not dependent on X<sub>0</sub>, X<sub>1</sub>,..., X<sub>t-2</sub>. Thus, time t will not affect the choice of state.<br>

==='''Choosing b: 3 cases'''===
If y and x have the same domain, say R, we could use a normal distribution to model <math>q(y|x)</math>: <math>q(x|y)~\sim~N(y,b^2)</math> and <math>q(y|x)~\sim~N(x,b^2)</math>.
In the continuous case of MCMC, <math>q(y|x)</math> is the density of proposing y, given that we are at x. A reasonable choice of b is important to ensure the MC does indeed converge to the target distribution f. If b is too small, it is not possible to explore the whole support because the jumps are small. If b is too large, the probability of accepting the proposed state y is small, so it is very likely that the proposal is rejected and the chain keeps producing the same value for long stretches.

To be precise, we are discussing the choice of variance for the proposal distribution. A large b simply implies a larger variance for our choice of proposal distribution (Gaussian, in this case); many points will then be rejected, and we will repeat the same points many times.<br>

In this example, <math>q(y|x)=N(x, b^2)</math><br>

As demonstrated below, the choice of b is significant in determining the quality of the Metropolis algorithm. <br>

This parameter affects the probability of accepting the candidate states, and the algorithm will not perform well if the acceptance probability is too large or too small. It also affects the size of the "jump" between the sampled <math>Y</math> and the previous state x<sub>i</sub>, as a larger variance implies a larger such "jump".<br>

If the jump is too large, we will often have to keep the previous state; thus, we will repeat the same point many times.<br>

'''MATLAB b=2, b=0.2, b=20 '''
<pre style="font-size:12px">
clear all
close all
clc
b=2;  % also try b=0.2 and b=20
x(1)=0;
for i=2:10000
    y=b*randn+x(i-1);
    r=min((1+x(i-1)^2)/(1+y^2),1);
    u=rand;
    if u<r
        x(i)=y;
    else
        x(i)=x(i-1);
    end
end
figure(1);
hist(x(5000:end),100);
figure(2);
plot(x(5000:end));
%The Markov Chain usually takes some time to converge; this is known as the "burn-in" time,
%so we don't display the first 5000 points: they don't show the limiting behaviour of the chain.
%This generates the Markov Chain with 10000 random variables, using a large b and a small b.
</pre>

b tells where the next point is going to be. An appropriate b is supposed to explore all of the support.

f(x) is the stationary distribution of the chain in MH. We generate y using q(y|x) and accept it with probability r.

===='''b too small or too large'''====
If <math>b = 0.02</math>, the chain takes small steps, so it doesn't explore enough of the sample space.

If <math>b = 20</math>, jumps are very unlikely to be accepted; i.e. <math> y </math> is rejected since <math> u> r </math> and <math> X_{t+1} = X_t</math>: <math>\frac {f(y)}{f(x)}</math>, and consequently <math> r </math>, is very small, so it is very unlikely that <math> u < r </math>, and the current value will be repeated.
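A small sketch that ties the choice of b back to the acceptance rate; the three values of b are the ones used above:

<pre style="font-size:12px">
% Compare acceptance rates for several proposal scales b (sketch).
for b = [0.2 2 20]
    x = 0;  acc = 0;
    for i = 2:10000
        y = b*randn + x;
        r = min((1+x^2)/(1+y^2), 1);
        if rand < r
            x = y;  acc = acc + 1;
        end
    end
    fprintf('b = %4.1f   acceptance rate = %.2f\n', b, acc/9999);
end
</pre>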
  
==='''Detailed Balance Holds for Metropolis-Hastings'''===

In Metropolis-Hastings, we generate y using q(y|x) and accept it with probability r, where <br>

<math>r(x,y) = min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}</math><br>

Without loss of generality we assume <math>\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)} < 1</math><br>

Then r(x,y) (the probability of accepting y given that we are currently at x) is <br>

<math>r(x,y) = min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} = \frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)}</math><br>

Now suppose that the current state is y and we are generating x; the probability of accepting x given that we are currently at y is <br>

<math>r(y,x) = min\left\{\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)},1\right\} = 1 </math><br>

This is because <math>\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)} < 1 </math> and its reverse <math>\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)} > 1 </math>. Then <math>r(y,x) = 1</math>.<br>

We are interested in the probability of moving from x to y in the Markov Chain generated by the MH algorithm. P(y|x) depends on two probabilities:
1. the probability of generating y, and<br>
2. the probability of accepting y. <br>

<math>P(y|x) = q(y|x)\cdot r(x,y) = q(y|x)\cdot {\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)}} = \frac{f(y)\cdot q(x|y)}{f(x)} </math> <br>

The probability of moving to x given that the current state is y:

<math>P(x|y) = q(x|y)\cdot r(y,x) = q(x|y)</math><br>

So does detailed balance hold for MH? <br>

If it holds, we should have <math>f(x)\cdot P(y|x) = f(y)\cdot P(x|y)</math>.<br>

Left-hand side: <br>

<math>f(x)\cdot P(y|x) = f(x)\cdot {\frac{f(y)\cdot q(x|y)}{f(x)}} = f(y)\cdot q(x|y)</math><br>

Right-hand side: <br>

<math>f(y)\cdot P(x|y) = f(y)\cdot q(x|y)</math><br>

Thus LHS and RHS are equal, and detailed balance holds for the MH algorithm. <br>
Therefore, f(x) is the stationary distribution of the chain.<br>
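The detailed-balance identity can be verified numerically on a toy discrete chain. Below is a sketch for a 3-state M-H chain; the target f and the uniform proposal are made up for the illustration:

<pre style="font-size:12px">
% Numerical check of detailed balance for a 3-state M-H chain (sketch).
f = [0.2 0.3 0.5];            % target distribution (assumed for the example)
q = ones(3)/3;                % symmetric proposal: pick any state uniformly
P = zeros(3);
for x = 1:3
    for y = 1:3
        if y ~= x
            P(x,y) = q(x,y) * min(f(y)*q(y,x)/(f(x)*q(x,y)), 1);
        end
    end
    P(x,x) = 1 - sum(P(x,:)); % rejection mass stays at x
end
% f(x)P(x,y) should equal f(y)P(y,x) for all pairs:
max(max(abs(diag(f)*P - (diag(f)*P)')))   % ~0 up to rounding error
</pre>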
  
== Class 20 - Thursday July 11th 2013 ==
=== Simulated annealing ===
<br />
'''Definition:''' Simulated annealing (SA) is a generic probabilistic metaheuristic for the global optimization problem of locating a good approximation to the global optimum of a given function in a large search space. It is often used when the search space is discrete (e.g., all tours that visit a given set of cities). <br />
(http://en.wikipedia.org/wiki/Simulated_annealing) <br />
"Simulated annealing is a popular algorithm in simulation for minimizing functions." (from the textbook)<br />

Simulated annealing was developed to solve, among others, the travelling salesman problem: finding the optimal path that travels through all the required cities.<br/>

It is called "simulated annealing" because it mimics the process undergone by misplaced atoms in a metal when it is heated and then slowly cooled.<br />
(http://mathworld.wolfram.com/SimulatedAnnealing.html)<br />

It is a probabilistic method proposed by Kirkpatrick, Gelatt and Vecchi (1983) and Cerny (1985) for finding the global minimum of a function that may have multiple local minima.<br />
(http://www.mit.edu/~dbertsim/papers/Optimization/Simulated%20annealing.pdf)<br />

Simulated annealing was developed as an approach for finding the minimum of complex functions with multiple peaks, where standard hill-climbing approaches may trap the algorithm at a less than optimal peak.<br />

Suppose we generated a point <math> x </math> by an existing algorithm, and we would like to get a "better" point. <br>
(e.g. if we have found a local minimum of a function and we want the global minimum) <br>
Then we would use simulated annealing as a method to "perturb" <math> x </math> to obtain a better solution. <br>

Suppose we would like to minimize <math> h(x)</math>. For any arbitrary constant <math> T > 0</math>, this problem is equivalent to maximizing <math>e^{-h(x)/T}</math>, since the exponential function is monotonic. <br />
Consider <math>f(x) \propto e^{-h(x)/T}</math>; a sample from this distribution when T is small is close to the optimal point of h(x). Based on this observation, the SA algorithm is introduced as follows:<br />
<b>1.</b> Set T to be a large number<br />
<b>2.</b> Initialize the chain: set <math>\,X_{t}\; (i.e.\; t=0, x_0=s)</math><br />
<b>3.</b> <math>\,y</math>~<math>\,q(y|x)</math> (q should be symmetric)<br />
<b>4.</b> <math>r = \min\{\frac{f(y)}{f(x)},1\}</math><br />
<b>5.</b> U ~ U(0,1)<br />
<b>6.</b> If U < r, <math>X_{t+1}=y</math> <br/>
else, <math>X_{t+1}=X_t</math><br/>
<b>7.</b> Decrease T, let t=t+1, and go back to 3. This is where the difference lies between SA and MH. (Repeat the procedure until T is very small.)<br/>
<br/>
<b>Note</b>: q(y|x) does not have to be symmetric. If q is non-symmetric, then the original MH formula for r is used.<br />

The significance of T: <br />
Initially we set T to be large when initializing the chain, so as to explore the entire sample space and avoid getting stuck/trapped in one region of it. Then we gradually decrease T, so as to get closer and closer to the actual solution.

Notice that we have:
    <math> r = \min\{\frac{f(y)}{f(x)},1\} </math><br/>
    <math> = \min\{\frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}},1\} </math>  <br/>
    <math> = \min\{e^{\frac{h(x)-h(y)}{T}},1\} </math><br/>

Reasons we start with a large T rather than a small T at the beginning:<br />

<ul><li>A point in the tail when T is small would be rejected <br />
</li><li>The chance that we reject points gets larger as we move from large T to small T <br />
</li><li>A large T helps the chain get to the mode of maximum value<br />
</li></ul>

Assume T is large: <br />
1. h(y) < h(x): e<sup>(h(x)-h(y))/T </sup> > 1, so r = 1 and y will always be accepted.<br />
2. h(y) > h(x): e<sup>(h(x)-h(y))/T </sup> < 1, so r < 1 and y will be accepted with probability r.  '''Remark:''' this helps the chain escape from a local minimum, because the algorithm is prevented from reaching and staying in a local minimum forever. <br />
Assume T is small: <br />
1. h(y) < h(x): r = 1, so y will always be accepted.<br />
2. h(y) > h(x): e<sup>(h(x)-h(y))/T </sup> approaches 0, so r goes to 0 and y will almost never be accepted.

<p><br /> All in all, choose a large T to start off with, in order for a higher chance that the points can explore. <br />

'''Note''': The variable T is known in practice as the "temperature"; the higher T is, the more variability there is, in terms of the expansion and contraction of materials. The term "annealing" follows from here, as annealing is the process of heating materials and allowing them to cool slowly.<br />

Asymptotically this algorithm is guaranteed to generate the global optimal answer; however, in practice we never sample forever, and this may not happen.

</p><p><br />
</p><p>Example: Consider <math>h(x)=3x^2</math>, 0<x<1
</p><p><br />1) Set T to be large, for example, T=100<br />
<br />2) Initialize the chain<br />
<br />3) Set <math>q(y|x)~\sim~Unif[0,1]</math><br />
<br />4) <math>r=min(exp(\frac{(3x^2-3y^2)}{100}),1)</math><br />
<br />5) <math>U~\sim~U[0,1]</math><br />
<br />6) If <i>U</i> < <i>r</i>, then <i>X</i><sub><i>t</i> + 1</sub> = <i>y</i>; <br>
else <i>X</i><sub><i>t</i> + 1</sub> = <i>x</i><sub><i>t</i></sub><br />
<br />7) Decrease T, go back to 3<br />
</p>
<div style="border:1px red solid">
<p><b>MATLAB </b>
</p>
<pre style="font-size:12px">
syms x
ezplot('(x-3)^2',[-6,12])
ezplot('exp(-((x-3)^2))', [-6, 12])
</pre>

[[File:Snip2013.png|350px]]

[[File:Snip20131.png|350px]]

[[File:STAT_340.JPG]]
http://www.wolframalpha.com/input/?i=graph+exp%28-%28x-3%29%5E2%2F10%29

Note that when T is small, the graph consists of a much higher bump; when T is large, the graph is flatter.

<b>MATLAB </b>
<pre style="font-size:14px">
clear all
close all
T=100;
x(1)=randn;
ii=1;
b=1;
while T>0.001
  y=b*randn+x(ii);
  r=min(exp((H(x(ii))-H(y))/T),1);
  u=rand;
  if u<r
      x(ii+1)=y;
  else
      x(ii+1)=x(ii);
  end
  T=0.99*T;    % cool down gradually
  ii=ii+1;
end
plot(x)
</pre>

[[File:SA_example.jpg|350px]]
</div>
<p>Helper function:
</p><p>an example is for H(x)=(x-3)^2
</p>
<pre style="font-size:12px">
function c=H(x)
c=(x-3)^2;
end
</pre>
<p><b>Another Example:</b>
<span class="texhtml"><i>h</i>(<i>x</i>) = ((<i>x</i> &minus; 2)<sup>2</sup> &minus; 4)((<i>x</i> &minus; 4)<sup>2</sup> &minus; 8)</span>
</p>
<pre style="font-size:12px">
>> syms x
>> ezplot(((x-2)^2-4)*((x-4)^2-8),[-1,8])
</pre>
<pre style="font-size:12px">
function c=H(x)
c=((x-2)^2-4)*((x-4)^2-8);
end
</pre>
[[File:SA_example2.jpg|350px]]
<p>Run the earlier code with the new H(x) function.
</p>
 
=== Motivation: Simulated Annealing and the Travelling Salesman Problem ===

<p>The Travelling Salesman Problem asks:  <br />
Given n cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the original city? By calling two tours neighbours if one results from interchanging two of the cities of the other, we can use simulated annealing to approximate the best path; a rough sketch is given after this section.
<p>[[File:Salesman n5.png|350px]]
</p>
<ul><li>An example of a solution of a travelling salesman problem on n=5. This is only one of many solutions, but we want to ensure we find the optimal solution.
</li></ul>

<ul><li>Given n=5 cities, we search for the best route with the minimum distance to visit all cities and return to the starting city.
</li></ul>
<p><b>The idea of using the Simulated Annealing algorithm</b>&nbsp;:
Let Y (all possible combinations of routes, in terms of city indices) be generated by permutation of all cities. Let the target or objective function (f(x)) be the distance of the route given Y. Then use the Simulated Annealing algorithm to find the minimum value of f(x).<br />
</p><p><b>Note</b>: in this case, Q is the permutation of the numbers. There will be many possible paths, especially when n is large; if n is very large, then it will take forever to check all the combinations of routes.
</p>
<ul><li>This sort of knowledge would be very useful for those in a situation where they are on a limited budget or must visit many points in a short period of time. For example, a truck driver may have to visit multiple cities in southern Ontario and make it back to his original starting point within a 6-hour period. <br />
 
</li></ul>
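A rough MATLAB sketch of this idea, assuming a small random instance with n=8 cities and an arbitrary geometric cooling rate of 0.999; the proposal interchanges two cities of the current tour:

<pre style="font-size:12px">
% Simulated annealing for a small TSP instance (illustrative sketch).
n = 8;
C = rand(n,2);                               % random city coordinates (assumed)
D = zeros(n);                                % pairwise distance matrix
for i=1:n, for j=1:n, D(i,j)=norm(C(i,:)-C(j,:)); end, end
tourlen = @(p) sum(D(sub2ind([n n], p, [p(2:end) p(1)])));  % closed-tour length
p = randperm(n);                             % initial tour
T = 1;
while T > 1e-3
    ij = randi(n,1,2);                       % propose swapping two cities
    pnew = p;  pnew(ij) = pnew(fliplr(ij));
    r = min(exp((tourlen(p)-tourlen(pnew))/T), 1);  % SA acceptance ratio
    if rand < r, p = pnew; end
    T = 0.999*T;                             % cool down slowly
end
p, tourlen(p)                                % approximate best tour and its length
</pre>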
 
  
'''Disadvantages of Simulated Annealing:'''<br/>
1. This method converges very slowly and is therefore very expensive.<br/>
2. This algorithm cannot tell whether it has found the global minimum.<br/>
<ref>
Reference: http://cs.adelaide.edu.au/~paulc/teaching/montecarlo/node140.html
</ref>
  
== Class 21 - Tuesday July 16, 2013 ==
=== Gibbs Sampling===
'''Definition'''<br>
In statistics and statistical physics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations which are approximately from a specified multivariate probability distribution (i.e. from the joint probability distribution of two or more random variables), when direct sampling is difficult.<br/>
(http://en.wikipedia.org/wiki/Gibbs_sampling)<br/>

The Gibbs sampling method was originally developed by Geman and Geman [1984]. It was later brought into mainstream statistics by Gelfand and Smith [1990] and Gelfand, et al. [1990].<br/>
Source:  https://www.msu.edu/~blackj/Scan_2003_02_12/Chapter_11_Markov_Chain_Monte_Carlo_Methods.pdf<br/>

Gibbs sampling is a general method for probabilistic inference which is often used when dealing with incomplete information. However, generality comes at some computational cost, and for many applications, including those involving missing information, there are often alternative methods that have been proven to be more efficient in practice. For example, say we want to sample from a joint distribution <math>p(x_1,...,x_k)</math> (i.e. a posterior distribution). If we know the full conditional distributions for each parameter (i.e. <math>p(x_i|x_1,x_2,...,x_{i-1},x_{i+1},...,x_k)</math>), we can use the Gibbs sampler to sample from the joint distribution by drawing from these conditional distributions. <br>

When utilizing the Gibbs sampler, the candidate state is always accepted as the next state of the chain. (from the textbook)<br/>

*Another Markov Chain Monte Carlo (MCMC) method (the first MCMC method introduced in this course was the MH algorithm) <br/>
*A special case of Metropolis-Hastings sampling where the proposed value is always accepted, i.e. as long as a point is proposed, it is accepted. <br/>
*Useful because it makes sampling a d-dimensional random vector <math>\vec{x} = (x_1, x_2,...,x_d)</math> simpler and easier<br />
*The observations of d-dimensional random vectors <math>{\vec{x_1}, \vec{x_2}, ... , \vec{x_n}}</math> form a d-dimensional Markov Chain, and the joint density <math>f(x_1, x_2, ... , x_d)</math> is an invariant distribution for the chain, i.e. it works for sampling multivariate distributions.<br />
*Useful if sampling from the conditional pdfs is easier than sampling from the joint distribution.<br/>
*Definition of the univariate conditional distribution: all the random variables are fixed except for one; we use d such univariate conditional distributions to simulate the d components.

'''Difference between Gibbs Sampling & MH'''<br>
Gibbs Sampling generates new values based on the conditional distributions of the other components (unlike MH, which does not require conditional distributions).<br/>
e.g. We are given <math> f(x_1,x_2), f(x_1|x_2), f(x_2|x_1) </math><br/>
1. let <math>x^*_1 \sim f(x_1|x_2)</math><br/>
2. <math>x^*_2 \sim f(x_2|x^*_1)</math><br/>
3. substitute <math>x^*_2</math> back into the first step and repeat the process. <br/>

Also, in Gibbs sampling we "always accept a candidate point", unlike MH.<br/>
Source:  https://www.msu.edu/~blackj/Scan_2003_02_12/Chapter_11_Markov_Chain_Monte_Carlo_Methods.pdf<br/>

<div style = "align:left; background:#F5F5DC; font-size: 120%">
'''Gibbs Sampling as a special form of the Metropolis-Hastings algorithm'''<br>

The Gibbs Sampler is simply a special case of the Metropolis-Hastings algorithm.<br>

Here, the proposal distribution is <math>q(Y|X)=f(Y_j|X_i, i\neq j)=\frac{f(Y)}{f(X_i, i\neq j)}</math> for <math>X=(X_1,...,X_d)</math>, <br>
which is simply the conditional distribution of one element conditional on all the other elements of the vector. <br>
Similarly <math>q(X|Y)=f(X_j|Y_i, i\neq j)=\frac{f(X)}{f(Y_i, i\neq j)}</math>;<br>
notice that <math>(Y_i, i\neq j)</math> and <math>(X_i, i\neq j)</math> are identical, since only component j is updated. <br>

The distribution we wish to simulate from is <math>p(X) = f(X) </math>; also, <math>p(Y) = f(Y) </math>.

Hence, the acceptance ratio in the Metropolis-Hastings algorithm is: <br>
<math>r(x,y) = min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} = min\left\{\frac{f(y)}{f(x)}\frac{f(x)}{f(y)},1\right\} = 1 </math><br>
so the new point will always be accepted; no points are rejected, and the Gibbs Sampler is an efficient algorithm in that aspect. <br>
</div>

<b>Advantages </b><ref>
http://wikicoursenote.com/wiki/Stat341#Gibbs_Sampling_-_June_30.2C_2009
</ref>

*The algorithm has an acceptance rate of 1. Thus, it is efficient because we keep all the points that we sample.
*It is simple and straightforward if and only if we know the conditional pdfs.
*It is useful for high-dimensional distributions (i.e. for sampling multivariate pdfs).
*It is useful if sampling from the conditional pdfs is easier than sampling from the joint.

<br />
<b>Disadvantages</b><ref>
http://wikicoursenote.com/wiki/Stat341#Gibbs_Sampling_-_June_30.2C_2009
</ref>

*We rarely know how to sample from the conditional distributions.
*The conditional densities are usually unknown or hard to sample from.
*The algorithm can be extremely slow to converge.
*It is often difficult to know when convergence has occurred.
*The method is not practical when there are relatively small correlations between the random variables.

'''Gibbs Sampler Steps:'''<br>
<ref>
http://www.people.fas.harvard.edu/~plam/teaching/methods/mcmc/mcmc.pdf
</ref>
Suppose we are interested in sampling from the posterior p(x|y), where x is a vector of three parameters, x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>. <br>
The steps of a Gibbs Sampler are (see the skeleton below):<br>
1. Pick a vector of starting values x<sup>(0)</sup>. Any x<sup>(0)</sup> will converge eventually, but it can be chosen to take fewer iterations.<br>
2. Start with any component (the order does not matter, but we start with x<sub>1</sub> for convenience). Draw a value x<sub>1</sub><sup>(1)</sup> from the full conditional p(x<sub>1</sub>|x<sub>2</sub><sup>(0)</sup>,x<sub>3</sub><sup>(0)</sup>,y).<br>
3. Draw a value x<sub>2</sub><sup>(1)</sup> from the full conditional p(x<sub>2</sub>|x<sub>1</sub><sup>(1)</sup>,x<sub>3</sub><sup>(0)</sup>,y). Note that we must use the updated value x<sub>1</sub><sup>(1)</sup>.<br>
4. Draw a value x<sub>3</sub><sup>(1)</sup> from the full conditional p(x<sub>3</sub>|x<sub>1</sub><sup>(1)</sup>,x<sub>2</sub><sup>(1)</sup>,y), using both updated values.<br>
5. Continue drawing each component, always using the most recently updated values of the others. <br>
6. Repeat until we have M draws, each draw being a vector x<sup>(t)</sup>.<br>
7. Optionally apply burn-in and/or thinning.<br>
Our result is a Markov chain with a sequence of draws of x that are approximately from our posterior.

'''The Basic idea:''' <br>
The distinguishing feature of Gibbs sampling is that the underlying Markov chain is constructed from a sequence of conditional distributions. The essential idea is updating one part of the previous element while keeping the other parts fixed; it is useful in many instances where the state variable is a random variable taking values in a general space, not just in R<sup>n</sup>. (Simulation and the Monte Carlo Method, Reuven Y. Rubinstein)

'''Note:''' <br>
1. Other optimizing algorithms introduced previously, such as Simulated Annealing, settle on a minimum eventually, which means that if we generate enough observations and plot them in a time series plot, the plot will eventually flatten at the optimal value.<br>
2. For Gibbs Sampling, however, when convergence is achieved, instead of staying at the optimal value the Gibbs Sampler continues to wander through the target distribution (i.e. it does not stay at the optimal point) forever.<br>
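A runnable skeleton of the three-parameter steps above. Since the notes do not fix a particular posterior, we use a stand-in: a trivariate normal with unit variances and pairwise correlation &rho;=0.5, whose full conditionals are normal in closed form:

<pre style="font-size:12px">
% Skeleton of the 3-parameter Gibbs sampler above (sketch, stand-in target).
% For a trivariate normal with unit variances and pairwise correlation rho,
% x_j | rest ~ N( (rho/(1+rho))*(sum of other two), 1 - 2*rho^2/(1+rho) ).
rho = 0.5;
cm  = rho/(1+rho);                    % conditional-mean coefficient
cs  = sqrt(1 - 2*rho^2/(1+rho));      % conditional standard deviation
M = 5000;
draws = zeros(M,3);
x = [0 0 0];                          % step 1: starting values
for t = 1:M
    x(1) = cm*(x(2)+x(3)) + cs*randn; % step 2: draw from p(x1|x2,x3)
    x(2) = cm*(x(1)+x(3)) + cs*randn; % step 3: use the updated x1
    x(3) = cm*(x(1)+x(2)) + cs*randn; % step 4: use both updated values
    draws(t,:) = x;                   % steps 5-6: store each draw
end
draws = draws(501:end,:);             % step 7: optional burn-in
</pre>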
'''Special Example'''<br>
(Note: this snippet is written in Julia, not MATLAB.)
<pre>
function gibbs2(n, thin)
  x_samp = zeros(n,1)
  y_samp = zeros(n,1)
  x=0.0
  y=0.0
  for i=1:n
      for j=1:thin
        x=(y^2+4)*randg(3)
        y=1/(1+x)+randn()/sqrt(2*x+2)
      end
      x_samp[i] = x
      y_samp[i] = y
  end
  return x_samp, y_samp
end

julia> @elapsed gibbs2(50000,1000)
7.6084020137786865
</pre>

'''Example1''' <br/>
We want to sample from a target joint distribution f(x<sub>1</sub>, x<sub>2</sub>), which is not easy to sample from, but the conditional pdfs f(x<sub>1</sub>|x<sub>2</sub>) & f(x<sub>2</sub>|x<sub>1</sub>) are very easy to sample from. We can find the stationary distribution (target distribution) using Gibbs sampling: <br/>
1. x<sub>1</sub>* ~ f(x<sub>1</sub>|x<sub>2</sub>) (here x<sub>2</sub> is given) => x = [x<sub>1</sub>* x<sub>2</sub>] <br/>
2. x<sub>2</sub>* ~ f(x<sub>2</sub>|x<sub>1</sub>*) (here x<sub>1</sub>* is generated from above) => x = [x<sub>1</sub>* x<sub>2</sub>*] <br/>
3. x<sub>1</sub>* ~ f(x<sub>1</sub>|x<sub>2</sub>*) (here x<sub>2</sub>* is generated from above)  => x = [x<sub>1</sub>* x<sub>2</sub>*] <br/>
4. x<sub>2</sub>* ~ f(x<sub>2</sub>|x<sub>1</sub>*) <br/>
5. Repeat steps 3 and 4 until the chain reaches its stationary distribution [x<sub>1</sub>* x<sub>2</sub>*]. <br/>

'''Theoretical Example''' <br/>

Gibbs Sampler Application (inspired by Example 10b in the Ross Simulation (4th Edition) textbook)

Suppose we are a truck driver who randomly puts n basketballs into a 3D storage cube sized so that each edge of the cube is 300cm in length. The basketballs are spherical and have a radius of 25cm each.

Because the basketballs have a radius of 25cm, the centre of each basketball must be at least 50cm away from the centre of any other basketball. That is to say, if two basketballs are touching (as close together as possible), their centres will be 50cm apart.

Clearly the distribution of the n basketballs will need to be conditioned on the fact that no basketball is placed so that its centre is closer than 50cm to another basketball's centre.

This gives:

Beta = P{no two basketball centres are within 50cm of each other}

That is to say, the placement of basketballs is conditioned on the fact that two balls cannot overlap.

This distribution of n balls can be modelled using the Gibbs sampler:

1. Start with n basketballs positioned in the cube so that no two centres are within 50cm of each other.<br />
2. Generate a random number U and let I = floor(n*U) + 1.<br />
3. Generate another random point <math>X_k</math> in the storage box.<br />
4. If <math>X_k</math> is not within 50cm of any other point, excluding point <math>X_I</math>: <br />
replace <math>X_I</math> by this new point. <br />
Otherwise: return to step 3.<br />

After many iterations, the set of n points will approximate the distribution.
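A MATLAB sketch of the basketball sampler above, assuming n=10 balls (a made-up value); the row-wise subtraction relies on implicit expansion, so a reasonably recent MATLAB release is assumed:

<pre style="font-size:12px">
% Gibbs-style sampler for the basketball-packing example (sketch).
% Cube side 300, ball radius 25: centres lie in [25,275]^3 and must be >= 50 apart.
n = 10;                                  % number of balls (assumed)
X = zeros(n,3);
for k = 1:n                              % crude feasible initial configuration
    ok = false;
    while ~ok
        p = 25 + 250*rand(1,3);
        ok = all(sqrt(sum((X(1:k-1,:)-p).^2,2)) >= 50);
    end
    X(k,:) = p;
end
for it = 1:10000                         % Gibbs updates
    I = randi(n);                        % step 2: pick a ball at random
    p = 25 + 250*rand(1,3);              % step 3: propose a new centre
    others = X([1:I-1 I+1:n],:);
    if all(sqrt(sum((others-p).^2,2)) >= 50)
        X(I,:) = p;                      % step 4: replace if feasible
    end
end
</pre>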
 
  
Suppose we want to sample from a multivariate pdf f(x), where <math>\vec{x} = (x_1, x_2,...,x_d)</math> is a d-dimensional vector.<br/>
Suppose <math>\vec{x}_t = (x_{t,1}, x_{t,2},...,x_{t,d})</math> is the current value, and <math>\vec{y} = (y_1, y_2,...,y_d)</math> is the proposed point; then <math>\vec{x}_{t+1} = \vec{y} </math>.<br /><br/>

Let <math>\displaystyle f(x_i|x_1, x_2,...,x_{i-1},x_{i+1},...,x_d)</math> represent the conditional pdf of component x<sub>i</sub>, given the other components. <br/>
Then the Gibbs sampler is as follows:<br/>
# <math>\displaystyle y_1 \sim  f(x_1 | x_{t,2}, x_{t,3}, ..., x_{t,d})</math>
# <math>\displaystyle y_i \sim  f(x_i | y_1, ...., y_{i-1}, x_{t,i+1} , ..., x_{t,d})</math>
# <math>\displaystyle y_d \sim f(x_d | y_1, ... , y_{d-1})</math>
# <math>\displaystyle \vec{Y} = (y_1,y_2, ...,y_d)</math><br>

'''A simpler illustration of the above'''
Consider four variables (w,x,y,z); the sampler becomes<br/>
# <math>\displaystyle w_i \sim  p(w | x = x_{i - 1}, y = y_{i - 1},z = z_{i - 1} )</math>
# <math>\displaystyle x_i \sim  p(x | w = w_i, y = y_{i - 1},z = z_{i - 1} )</math>
# <math>\displaystyle y_i \sim  p(y | w = w_i, x = x_i,z = z_{i - 1} )</math>
# <math>\displaystyle z_i \sim  p(z | w = w_i, x = x_i,y = y_i)</math>
The reference is here:<br/>
http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf
'''Example2'''<br>
 
Suppose we want to sample from a bivariate normal distribution. <br/> <math>\mu =
 
\left [ \begin{matrix}
 
1 \\
 
2 \end{matrix} \right] </math>
 
 
 
<math>\Sigma=
 
\left [ \begin{matrix}
 
1 & \rho \\
 
\rho & 1 \end{matrix} \right] </math>  (the covariance matrix)
 
 
 
where <math>\rho</math>= 0.9. Then it can be shown that all conditionals are normal of this form: <br/>
 
f(x<sub>1</sub>|x<sub>2</sub>) = N (u<sub>1</sub> + r(x<sub>2</sub>-u<sub>2</sub>), 1-r<math>^2</math>) <br/>
 
f(x<sub>2</sub>|x<sub>1</sub>) = N (u<sub>2</sub> + r(x<sub>1</sub>-u<sub>1</sub>), 1-r<math>^2</math>) <br/><br/>
 
 
'''Matlab Code'''
 
<pre style="font-size:16px">
 
close all
 
clear all
 
mu = [1;2];
 
x(:,1) = [1;1]; % covariance matrix
 
r = 0.9; % covariance matrix
 
for ii = 1:1000
 
x(1, ii+1) = sqrt(1-r^2)*randn + (mu(1) + r*(x(2,ii) - mu(2))); % N (u1 + r(x2-u2), 1-r2)
 
x(2, ii+1) = sqrt(1-r^2)*randn + (mu(2) + r*(x(1,ii+1) - mu(1))); % N (u2 + r(x1-u1), 1-r2)
 
end
 
plot(x(1,:),x(2,:),'.')
 
</pre><br>
 
 
 
'''Example3''' <br>
 
Consider the flowing bivariate normal distribution. <br/>
 
<math>\mu = \left[\begin{matrix}0\\0 \end{matrix}\right] \qquad \Sigma=\left [ \begin{matrix}1 & \rho \\ \rho & 1 \end{matrix} \right] </math>  (the covariance matrix)
 
 
 
where <math>\rho</math>= 0.5. Then it can be shown that all conditionals are normal of this form: <br/>
 
<math> x_{1,t+1}|x_{2,t} \sim N(\rho x_{2,t},1-\rho^2</math>) <br/>
 
<math> x_{2,t+1}|x_{1,t} \sim N(\rho x_{1,t},1-\rho^2</math>) <br/><br/>
 
 
 
 
 
'''Matlab Code:'''
 
<pre style="font-size:16px">
 
close all
 
clear all
 
mu = [0;0];
 
x(:,1) = [1;1];
 
r = 0.5;
 
for ii = 1:1000
 
 
x(1, ii+1) = sqrt(1-r^2)*randn + r*(x(2,ii));
 
x(1, ii+1) = sqrt(1-r^2)*randn + r*(x(2,ii));
 
x(2, ii+1) = sqrt(1-r^2)*randn + r*(x(1,ii+1));
 
x(2, ii+1) = sqrt(1-r^2)*randn + r*(x(1,ii+1));
Line 6,633: Line 6,811:
 
ezsurf(exp(-x given  ~ 2 ...)/i) this gives a n dimensional plot  
 
ezsurf(exp(-x given  ~ 2 ...)/i) this gives a n dimensional plot  
 
</pre>
 
</pre>
 +
 +
ezsurf(fun) creates a graph of fun(x,y) using the surf function. fun is plotted over the default domain: -2π < x < 2π, -2π < y < 2π.
 +
http://www.mathworks.com/help/matlab/ref/ezsurf.html<br>
  
 
'''Example:''' ezsurf((x+y)^2+(x-y)^3)<br/>
 
'''Example:''' ezsurf((x+y)^2+(x-y)^3)<br/>
Line 6,641: Line 6,822:
 
'''Example:''' Face recognition <br />
 
'''Example:''' Face recognition <br />
  
X is a greyscale image of the perso and Y is the person.<br />
+
X is a greyscale image of the person and Y is the person.<br />
 
Here,We have a 100 x 100 grid where each cell is a number from 0 to 255 representing the darkness of the cell (from white to black).<br>
 
Here,We have a 100 x 100 grid where each cell is a number from 0 to 255 representing the darkness of the cell (from white to black).<br>
 
Let x be a vector of length 100*100=10,000 and y be a vector with each element being a picture of a person's face.<br>
 
Let x be a vector of length 100*100=10,000 and y be a vector with each element being a picture of a person's face.<br>
Line 6,647: Line 6,828:
  
 
<br />[[Frequentist approach]]<br />
 
<br />[[Frequentist approach]]<br />
*A frequentist would say X is a random variable and Y is not, so they would use Pr{x|y} (given that y is Tom, how likely is it that x is an image of Tom?).
+
*A frequentist would say X is a random variable and Y is not, so they would use Pr{x|y} (given that y is Tom, how likely is it that x is an image of Tom?).
  
 
<math>\displaystyle P(X|Y)</math>, y is person and x is how likely the picture is of this person. Here, y is known. <br/>  
 
<math>\displaystyle P(X|Y)</math>, y is person and x is how likely the picture is of this person. Here, y is known. <br/>  
Line 6,657: Line 6,838:
  
 
<math>P(Y|X) = \frac {P(x|y)P(y)}{\int P(x|y)P(y)dy}</math> Here, everything is a random variable.<br>  
 
<math>P(Y|X) = \frac {P(x|y)P(y)}{\int P(x|y)P(y)dy}</math> Here, everything is a random variable.<br>  
 +
Proof:<br/>
 +
<math>P(y|x)P(x) = P(x,y)= P(x|y)P(y)  P(x) = \int P(x|y)P(y)dy</math>
  
 
*Bayesian: Probability is subjective, which states someone's belief. <br>
 
*Bayesian: Probability is subjective, which states someone's belief. <br>
Line 6,692: Line 6,875:
 
'''Definition'''<br/>
 
'''Definition'''<br/>
 
*Variance reduction is a procedure used to increase the precision of the estimates that can be obtained for a given number of iterations. Every output random variable from the simulation is associated with a variance which limits the precision of the simulation results. <br/>
 
*Variance reduction is a procedure used to increase the precision of the estimates that can be obtained for a given number of iterations. Every output random variable from the simulation is associated with a variance which limits the precision of the simulation results. <br/>
*In order to make a simulation statistically efficient,(i.e. to obtain a greater precision and smaller confidence intervals for the output random variable of interest),variance reduction techniques can be used. The main ones are: Common random numbers, antithetic variates, control variates, importance sampling and stratified sampling), We will only be learning one of the methods - importance sampling. http://en.wikipedia.org/wiki/Variance_reduction
+
*In order to make a simulation statistically efficient,(i.e. to obtain a greater precision and smaller confidence intervals for the output random variable of interest),variance reduction techniques can be used. The main ones are: Common random numbers, antithetic variates, control variates, importance sampling and stratified sampling), We will only be learning one of the methods - importance sampling. Importance sampling is used to generate more statistically significant points rather than generating those points that do not have any value, such as generating in the middle of the bell curve rather than at the tail end of the bell curve. http://en.wikipedia.org/wiki/Variance_reduction
 
*<br />It can be seen that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(h(x)) \rightarrow</math>the expectation of h(x) with respect to g(x), where <math>\displaystyle \frac{f(x)}{g(x)} </math> is a weight <math>\displaystyle\beta(x)</math>. In the case where <math>\displaystyle f > g</math>, a greater weight for <math>\displaystyle\beta(x)</math> will be assigned. <br /><ref>
 
*<br />It can be seen that the integral <math>\displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx</math> is just <math> \displaystyle E_g(h(x)) \rightarrow</math>the expectation of h(x) with respect to g(x), where <math>\displaystyle \frac{f(x)}{g(x)} </math> is a weight <math>\displaystyle\beta(x)</math>. In the case where <math>\displaystyle f > g</math>, a greater weight for <math>\displaystyle\beta(x)</math> will be assigned. <br /><ref>
 
http://wikicoursenote.com/wiki/Stat341#Importance_Sampling_2
 
http://wikicoursenote.com/wiki/Stat341#Importance_Sampling_2
 
</ref>
 
</ref>
 +
*Variance reduction uses the fact that the variance of a finite integral is zero. <br/>
  
We wish to use simulation for this algorithm. We can utilize Monte Carlo Integration framework from previous classes.
+
We would like to use simulation for this algorithm. We can use Monte Carlo Integration framework from previous classes.
 
<math>E_f [h(x)] = \int h(x)f(x) dx</math>. The motivation is that a lot of integrals need to be calculated. <br/><br/>
 
<math>E_f [h(x)] = \int h(x)f(x) dx</math>. The motivation is that a lot of integrals need to be calculated. <br/><br/>
 
+
'''Some addition knowledge:''' <br/>
 +
Common Random Numbers: The common random numbers variance reduction technique is a popular and useful variance reduction technique which applies when we are comparing at least two alternative configurations (of a system) instead of investigating a single configuration.  <br/> <br/>
 
'''Case 1 Basic Monte Carlo Integration''' <br/>
 
'''Case 1 Basic Monte Carlo Integration''' <br/>
 
'''Idea:'''Evaluating an integral means calculating the area under the desired curve f(x).<br/>
 
'''Idea:'''Evaluating an integral means calculating the area under the desired curve f(x).<br/>
The Monte Carlo Integration method evaluates the area under the curve by computing the area randomly many times and then take average of the results. <ref>
+
The Monte Carlo Integration method evaluates the area under the curve by computing the area randomly many times and then take average of the results. <ref>
 
http://www.cs.dartmouth.edu/~fabio/teaching/graphics08/lectures/15_MonteCarloIntegration_Web.pdf
 
http://www.cs.dartmouth.edu/~fabio/teaching/graphics08/lectures/15_MonteCarloIntegration_Web.pdf
 
</ref>
 
</ref>
Line 6,716: Line 6,901:
 
This is referred to as '''Monte Carlo Integration.''' http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf
 
This is referred to as '''Monte Carlo Integration.''' http://web.mit.edu/~wingated/www/introductions/mcmc-gibbs-intro.pdf
  
Suppose we have integral of this form<br />
+
Suppose we have an integral of this form<br />
 
<math>I = \int_a ^b h(x)dx =\int_a ^b h(x) (\frac {b-a}{b-a} )dx =\int_a ^b h(x)(b-a) (\frac {1}{b-a} )dx  =\int_a ^b w(x) f(x)dx </math><br />
 
<math>I = \int_a ^b h(x)dx =\int_a ^b h(x) (\frac {b-a}{b-a} )dx =\int_a ^b h(x)(b-a) (\frac {1}{b-a} )dx  =\int_a ^b w(x) f(x)dx </math><br />
  
where <math> w(x) = h(x)(b-a)</math> , and <math>f(x) =(\frac {1}{b-a})</math><br/>
+
where <math>w(x) = h(x)(b-a)</math> , and <math>f(x) =(\frac {1}{b-a})</math><br/>
  
 
'''Note:''' <math>f(x)</math> is the pdf of a uniform distribution <math>~\sim U(a, b)</math>.
 
'''Note:''' <math>f(x)</math> is the pdf of a uniform distribution <math>~\sim U(a, b)</math>.
Line 6,774: Line 6,959:
 
'''Example'''<br />
 
'''Example'''<br />
 
Consider <math>I = \int_0 ^1 x^2+2x dx</math><br />
 
Consider <math>I = \int_0 ^1 x^2+2x dx</math><br />
 
+
<math>\hat{I} = \frac{1}{n} \sum_{i = 1}^{n} w({x_{i})} = \frac{1}{n} \sum_{i = 1}^{n} (x_{i})^2+2(x_{i})  </math>  where <math> x_i ~\sim UNIF[0,1]</math><br/>
 
It evaluates to 4/3, now to simulate this, here is the code:
 
It evaluates to 4/3, now to simulate this, here is the code:
  
Line 6,794: Line 6,979:
 
Consider <math>I = \int_0 ^1 e^x dx</math><br />
 
Consider <math>I = \int_0 ^1 e^x dx</math><br />
  
The exact answer is e - 1.  
+
The exact answer is (e^1 - e^0) = 2.718281828 - 1 = 1.718281828
 
Comparing to the simulation, the matlab code is as follows:
 
Comparing to the simulation, the matlab code is as follows:
  
Line 6,808: Line 6,993:
 
     1.7178
 
     1.7178
 
</pre>
 
</pre>
The answer 1.7178 is really closed enough to the exact answer e - 1 = 1.71828182846.
+
The answer 1.7178 is really close enough to the exact answer e - 1 = 1.71828182846. The accuracy will increase if n is larger, for example n=100000000.
 
<br/>
 
<br/>
  
  
 +
'''Multiple Variables Example'''<br />
 +
Consider <math>I = \iint e^(x+y) dx</math><br />
 +
 +
The exact answer is (e - 1)^2.
 +
The matlab code is similar to the above example, with an additional variable:
  
'''Case 2'''<br />
+
Matlab Code:<br/>
We can generalize this idea. Suppose we wish to compute  
+
<pre style="font-size:16px">
 +
>> n = 100000;
 +
>> x = rand(1, n);
 +
>> y = rand(1, n);
 +
>> w = exp(x+y);
 +
>> sum (w)/n
 +
 
 +
ans =
 +
 
 +
    2.9438
 +
</pre>
 +
Note that this is close to the exact answer (e - 1)^2 = 2.95249.
 +
<br/>
 +
 
 +
 
 +
'''Case 2'''<br />
 +
We can generalize this idea. Suppose we wish to compute  
 
<math>I = \int h(x)f(x)dx </math><br />
 
<math>I = \int h(x)f(x)dx </math><br />
 
If f(x) is uniform, this will be same as case 1 for general f<br />
 
If f(x) is uniform, this will be same as case 1 for general f<br />
Line 6,833: Line 7,039:
 
x = -log(u) %% Inverse transform method for generating exp(1);
 
x = -log(u) %% Inverse transform method for generating exp(1);
 
sum(sqrt(x))/n;
 
sum(sqrt(x))/n;
 +
</pre>
 +
<br/>
 +
 +
'''Tips:''' <br />
 +
It is important to know when Case 2 is appropriate to be used when evaluating a integral using simulation. Normally case 2 can be distinguished from case 1 if the bounds of the integral are improper i.e either the lower, upper or both the bounds approach infinity. <br />
 +
Now, when it is identified that Case 2 should be used, understand that f(x) must be a pdf. That is integral of f(x) should equal 1, when evaluating along the bounds of the integral. If this is not true we cannot use the summation formula and need to modify the integral to make sure we have a pdf inside the integral. <br />
 +
 +
'''Example'''. Use simulation to approximate the following integral <math> \int_{-2}^{2} e^{x+x^2}dx </math>. The exact value for this integral is around 93.163.<br/>
 +
'''Solution'''<br/>
 +
<math> I = 4E[e^{x+x^2}] = 4 \int_{-2}^{2} \frac{1}{4}e^{x+x^2}dx</math> where <math>x~\sim U[-2,2]</math> <br/>
 +
<math>\widehat{I} = \frac{1}{n} \sum_{i = 1}^{n} e^{x_i+x_i^2}</math> where <math>x_i~\sim U[-2,2]</math>
 +
 +
Matlab Code:<br/>
 +
<pre style="font-size:16px">
 +
close all
 +
clear all
 +
n=10000;
 +
u=rand(1,n);
 +
%xi~U[-2,2]
 +
x=4*u-2;
 +
s=exp(x+x.^2);
 +
4*sum(s)/n
 +
>>93.2680
 
</pre>
 
</pre>
 
<br/>
 
<br/>
Line 6,860: Line 7,089:
 
'''Note:'''If we want to compute cdf when x has a small value, for example -3, the probability that h(x) equals 1 is small, <br/>
 
'''Note:'''If we want to compute cdf when x has a small value, for example -3, the probability that h(x) equals 1 is small, <br/>
 
so the variance can be large. As x gets smaller, we can increase the sample size to make our simulation more accurate.<br/>
 
so the variance can be large. As x gets smaller, we can increase the sample size to make our simulation more accurate.<br/>
<div>
 
<pre>
 
  
 
special example from:http://www.math.wsu.edu/faculty/genz/416/lect/l08-6.pdf
 
special example from:http://www.math.wsu.edu/faculty/genz/416/lect/l08-6.pdf
  
</pre>
 
 
https://fbcdn-sphotos-e-a.akamaihd.net/hphotos-ak-ash4/q71/s720x720/999682_413300238783048_1464937675_n.jpg
 
https://fbcdn-sphotos-e-a.akamaihd.net/hphotos-ak-ash4/q71/s720x720/999682_413300238783048_1464937675_n.jpg
<pre>
 
  
 +
'''What is the variance of estimation?'''<br\>
 +
<math>\begin{align}
 +
& Var(x)= E(x^2)-[E(x)]^2 \\
 +
& =E(w^2)-[E(w)]^2 \\
 +
\end{align}</math><br\>
 +
Suppose that f(x) is the function that we want to estimate and <math>\widehat{f(x)} = \frac{1}{n} \sum_{i = 1}^{n} w(x_i)</math><br\>
 +
The range for f(x) is from 0 to <math>\infty</math> (e.g. if we take <math>x_i</math>=-N to N where i from 1 to 2N)<br\>
 +
The variance of our estimate would be:<br\>
 +
<math>\begin{align}
 +
& Var(f)= E(w^2)-[E(w)]^2 \\
 +
& = \sum_{i = 1}^{2N} x_i^2*\widehat{f(x_i)} - (\sum_{i = 1}^{2N} x_i*\widehat{f(x_i)})^2 \\
 +
\end{align}</math><br\>
  
 +
== Class 23, Tuesday July 23 ==
 +
===Importance Sampling===
 +
Start with
 +
<math>I = \int^{b}_{a} f(x)\,dx </math><br/> = <math>\int f(x)*(b-a) * \frac{1}{(b-a)}\,dx </math><br/>
 +
<math>\widehat{I} = \frac{1}{n} \sum_{i = 1}^{n} w({x_{i})}</math> where <math>w_{i} ~\sim Unif(a,b)</math>
  
 +
Recall the definition of crude Monte Carlo Integration: <br/>
 +
<math>E[h(X)]=\int f(x)h(x)\,dx</math><br/>
 +
If <math>x~\sim U(0,1)</math> and hence <math>\,f(x)=1</math>, then we have the basis of other variance reduction techniques. Now we consider what happens if X is not uniformly distributed.
  
</pre>
+
In the control variate case, we change the formula b adding and subtracting a known function h(x): basically, by adding zero to the integral, keeping it unbiased and allowing us to have an easier time of solving it. In importance sampling, we will instead multiply by 1. The known function in this case will be g(x), which is selected under a few assumptions.
</div>
 
  
== Class 23, Tuesday July 23 ==
+
There are cases where another distribution gives a better fit to integral to approximate, and results in a more accurate estimate;  importance sampling is useful here.
===Importance Sampling===
 
 
Motivation:<br/>
 
Motivation:<br/>
 
- Consider <math>I = \int h(x)f(x)\,dx </math><br/>
 
- Consider <math>I = \int h(x)f(x)\,dx </math><br/>
 
- There are cases in which we do not know how to sample from f(x) because the distribution of f(x) is complex; or it's very difficult to sample from f.<br/>
 
- There are cases in which we do not know how to sample from f(x) because the distribution of f(x) is complex; or it's very difficult to sample from f.<br/>
 
- There are cases in which h(x) is a rare event with respect to f.<br/>
 
- There are cases in which h(x) is a rare event with respect to f.<br/>
- Importance sampling is useful to overcome these cases.<br/>
+
Importance sampling is useful to overcome these cases.<br/>
 
- rare event is the event when you sample from its distribution, you rarely get an satisfied sample.<br/>
 
- rare event is the event when you sample from its distribution, you rarely get an satisfied sample.<br/>
 
<br/>
 
<br/>
 +
*Importance sampling can solve the cases listed above. It makes use of some functions that are easier to sample from. <br/>
 
*Importance sampling is a variance reduction technique that can be used in the Monte Carlo method. Although it is not exactly like a Markov Chain Monte Carlo (MCMC) algorithm, it also approximately samples a vector where the mass function is specified up to some constant.<br/>
 
*Importance sampling is a variance reduction technique that can be used in the Monte Carlo method. Although it is not exactly like a Markov Chain Monte Carlo (MCMC) algorithm, it also approximately samples a vector where the mass function is specified up to some constant.<br/>
 
*The idea behind importance sampling is that, certain values of the input random variables in a simulation have more impact on the parameter being estimated than the others. If these "important" values are emphasized by being sampled more frequently, then the estimator variance can be reduced.<br/>  
 
*The idea behind importance sampling is that, certain values of the input random variables in a simulation have more impact on the parameter being estimated than the others. If these "important" values are emphasized by being sampled more frequently, then the estimator variance can be reduced.<br/>  
*Hence, the basic methodology in importance sampling is to choose a distribution which "encourages" the important values. (http://en.wikipedia.org/wiki/Importance_sampling)<br/>
+
*Hence, the basic methodology in importance sampling is to choose a distribution which "encourages" the important values. This use of "biased" distributions will result in a biased estimator if it is applied directly in the simulation. (http://en.wikipedia.org/wiki/Importance_sampling)<br/>
Example:<br/>
+
*However, the simulation outputs are weighted to correct for the use of the biased distribution, and this ensures that the new importance sampling estimator is unbiased.  (http://en.wikipedia.org/wiki/Importance_sampling)
 +
 
 +
'''Example''':<br/>
 +
 
 
* Bit Error Rate on a channel.<br/>
 
* Bit Error Rate on a channel.<br/>
The bit error rate (BER) is the number of bit errors over the total number of bits during a specific time. BER has no unit associate to it. BER is often wrote in percentage. <br/>
+
The bit error rate (BER) is the number of bit errors over the total number of bits during a specific time. BER has no unit associated to it. BER is often written as a percentage. <br/>
 
* Failure Probability of a reliable system.<br/>
 
* Failure Probability of a reliable system.<br/>
 +
* A well chosen distribution can result in saving huge amount of running-time for importance sampling algorithm. 
  
 
Recall <math>I = \int h(x)f(x)\,dx </math>, where the preceding is an n-dimensional integral over all possible values of x.<br/>
 
Recall <math>I = \int h(x)f(x)\,dx </math>, where the preceding is an n-dimensional integral over all possible values of x.<br/>
  
We have <math>I = \int h(x)f(x)* g(x) / g(x)\, dx = \int w(x)g(x)\,dx</math>, where '''w(x)=h(x)*f(x)/g(x)''', and we know this integral since <math>g(x)</math> is a known distribution (we can assume <math> g(x)=b-a</math>) and <math>I</math> is the expectation of <math> w(x) </math> with respect to <math> g(x) </math>, or <math>\hat{I} = \sum_{i=1}^{n} \frac{w(x)}{n} </math>; <math>x ~\sim g(x)</math> <br/>
+
We have <math>I = \int \frac {h(x)f(x)}{g(x)} g(x)\, dx = \int w(x)g(x)\,dx</math>, where <math>w(x)= \frac{h(x)f(x)}{g(x)}</math>, and we know this integral since <math>g(x)</math> is a known distribution (we can assume <math> g(x)=b-a</math>) and <math>I</math> is the expectation of <math> w(x) </math> with respect to <math> g(x) </math>, or <math>\hat{I} = \sum_{i=1}^{n} \frac{w(x)}{n} </math>; <math>x ~\sim g(x)</math> <br/>
  
This is the importance sampling estimator of <math>I</math>, and is unbiased. That is, the estimation procedure is to generate i.i.d. samples from <math>g(x)</math>, and for each sample which exceeds the upper bound of the integral,, the estimate is incremented by the weight W, evaluated at the sample value. The results are averaged over N trials. <br/>
+
As n approaches infinity, <math>\hat{I}</math> approaches <math>{I}</math>
 +
 
 +
'''Note''': <br/>
 +
 
 +
Even though the uniform distribution sampling method only works for a definite integral, you can use still uniform distribution sampling method for I in the case of indefinite integral - this can be done by manipulating the function to adjust the integral range, such that the integral becomes definite.
 +
 
 +
w(x) is called the Importance Function. <br/>
 +
*A good importance function will be large when the integrand is large and small otherwise.<br/>
 +
 
 +
This is the importance sampling estimator of <math>I</math>, and is unbiased. That is, the estimation procedure is to generate i.i.d. samples from <math>g(x)</math>, and for each sample which exceeds the upper bound of the integral, the estimate is incremented by the weight W, evaluated at the sample value. The results are averaged over N trials. <br/>
 
http://en.wikipedia.org/wiki/Importance_sampling <br/>
 
http://en.wikipedia.org/wiki/Importance_sampling <br/>
  
 +
Choosing a good fit biased distribution is the key of importance sampling.<br/>
 
Note that <math> g(x) </math> is selected under the following assumptions:<br/>
 
Note that <math> g(x) </math> is selected under the following assumptions:<br/>
 
1. <math> g(x) </math> (or at least a constant times <math> g(x) </math>) is a pdf.<br/>
 
1. <math> g(x) </math> (or at least a constant times <math> g(x) </math>) is a pdf.<br/>
 
2. We have a way to generate from <math> g(x) </math> (known function that we know how to generate using software). <br/>
 
2. We have a way to generate from <math> g(x) </math> (known function that we know how to generate using software). <br/>
3. <math> h(x)f(x) / g(x)</math> ~ constant => hence small variability <br/>
+
3. <math> \frac{h(x)f(x)}{g(x)}</math> ~ constant => hence small variability <br/>
 
4. g(x) should not be 0 at the same time as f(x) "too often" (From Stat340w13 and Course Note Material) <br/>
 
4. g(x) should not be 0 at the same time as f(x) "too often" (From Stat340w13 and Course Note Material) <br/>
 +
5. g(x) is another density function whose support is the same as that of f(x) <br/>
 +
6. g(x) should have thicker tails compare to f to ensure f(x)/g(x) is reasonably small. <br/>
 +
7. g(x) should have a similar shape to f(x) in general. <br/>
 +
  
'''Example1:'''<br/>
 
<math>I=\int^{x}_{-\infty} f(x)\,dx</math>,<br/>
 
  
where f(x)~N(0,1) and x = -1<br/>
+
'''Example 1:'''<br/>
 +
<math>I=\int^{-1}_{-\infty} f(x)\,dx</math>, where <math>\displaystyle f(x) \sim N(0,1)</math><br/>
 +
Define
 
<math>
 
<math>
 
h(x) = \begin{cases}
 
h(x) = \begin{cases}
  
1, & \text{if } x <= -1 \\
+
1, & \text{if } x \leq -1 \\
 
0, & \text{if } x > -1
 
0, & \text{if } x > -1
 
\end{cases}</math><br/>
 
\end{cases}</math><br/>
Line 6,920: Line 7,182:
 
Therefore,<math>\widehat{I} = \frac{1}{n} \sum_{i = 1}^{n} h({x_{i})}</math> where <math>x_{i} ~\sim N(0,1)</math>  
 
Therefore,<math>\widehat{I} = \frac{1}{n} \sum_{i = 1}^{n} h({x_{i})}</math> where <math>x_{i} ~\sim N(0,1)</math>  
 
which gives <math>\widehat{I}= \frac{\text{number of observations }<= -1}{n}</math><br/>
 
which gives <math>\widehat{I}= \frac{\text{number of observations }<= -1}{n}</math><br/>
<br> '''Note''': I is acting like an indicator variable which follows bernoulli distribution with p = P(h<=-1)<br/>
+
<br>  
 +
 
 +
'''Note''': <br/>
 +
h(x) is acting as an indicator variable which follows a Bernoulli distribution with p = P(x<=-1).<br/>
 +
h(x) is used to count the points greater than -1.
 +
 
  
  
 +
 +
'''Consider <math>I= \int h(x)f(x)\,dx </math> again.'''  <br/>
 +
 +
Importance sampling is used to overcome the following two cases: <br/>
 +
*cases we don't know how to sample from f(x), because f(x) is a complicated distribution. <br/>
 +
*cases in which h(x) corresponds to a rare event over f (e.g. less than -3 in a standard normal distribution). <br/>
 +
---- In the second case, using the basic method without importance sampling will result in high variability in the simulated results (which goes against the purpose of variance reduction) <br/>
  
  
Consider <math>I= \int h(x)f(x)\,dx </math> again. Importance sampling is used to overcome the following two cases:
 
<br />*cases we don't know how to sample from f(x), because f(x) is a complicated distribution
 
<br />*cases in which h(x) corresponds to a rare event over f (e.g. less than -3 in a standard normal distribution)
 
<br />*In the second case, using the basic method without importance sampling will result in high variability in the simulated results (which goes against the purpose of variance reduction)
 
  
 
<math>\begin{align}
 
<math>\begin{align}
 
I &= \int h(x)f(x)dx \\
 
I &= \int h(x)f(x)dx \\
  
&= \int h(x)f(x) \frac{g(x)}{g(x)} dx, \text{ where g(x) is a pdf easy to sample from.}  \\
+
&= \int h(x)f(x) \frac{g(x)}{g(x)} dx, \text{ where g(x) is a pdf easy to sample from and f(x) is not.}  \\
  
 
&= \int \frac{h(x)f(x)}{g(x)} g(x) dx  \\
 
&= \int \frac{h(x)f(x)}{g(x)} g(x) dx  \\
Line 6,943: Line 7,213:
 
So <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n}w(x_i), x_i </math> from <math>g(x)</math><br />
 
So <math>\hat{I} = \frac{1}{n} \sum_{i=1}^{n}w(x_i), x_i </math> from <math>g(x)</math><br />
  
One can see <math>\frac{f(x)h(x)}{g(x)}</math> as weights. We can see it as we sample from <math>g(x)</math>, then re-weight our samples based on their importance.
+
One can think of <math>\frac{f(x)h(x)}{g(x)}</math> as weights. We sample from <math>g(x)</math>, and then re-weight our samples based on their importance.
  
 
Note that <math>\hat{I}</math> is an unbiased estimator for <math>I</math> as <math>\ E_x(\hat{I}) = E_x(\frac{1}{n} \sum_{i = 1}^{n} w(X_i)) = \frac{1}{n} \sum_{i = 1}^{n} E_x(\frac{h(X_i)f(X_i)}{g(X_i)}) = \frac{1}{n} \sum_{i = 1}^{n} \int \frac{h(x)f(x)}{g(x)}g(x)dx = \frac{1}{n} \sum_{i = 1}^{n} I = I</math>
 
Note that <math>\hat{I}</math> is an unbiased estimator for <math>I</math> as <math>\ E_x(\hat{I}) = E_x(\frac{1}{n} \sum_{i = 1}^{n} w(X_i)) = \frac{1}{n} \sum_{i = 1}^{n} E_x(\frac{h(X_i)f(X_i)}{g(X_i)}) = \frac{1}{n} \sum_{i = 1}^{n} \int \frac{h(x)f(x)}{g(x)}g(x)dx = \frac{1}{n} \sum_{i = 1}^{n} I = I</math>
  
[[Problem:]]The variance of <math> \widehat{I}</math>  could be very large with bad choice of g. <br/>
+
''''''[[Problem:]]''''''The variance of <math> \widehat{I}</math>  could be very large with bad choice of g. <br/>
'''Advice 1''':Choose g such that g has thicker tails compare to f.
+
 
 +
'''Advice 1''': <br/>
 +
Choose g such that g has thicker tails compare to f. <br/>
 
In general, if over a set A, g is small but f is large, then f(x)/g(x) could be large. ie: the variance could be large. (the values for which h(x) is exceedingly small) <br/>
 
In general, if over a set A, g is small but f is large, then f(x)/g(x) could be large. ie: the variance could be large. (the values for which h(x) is exceedingly small) <br/>
  
'''Advice 2''':Choose g to have similar shape with f.
+
'''Advice 2'''
In general, it is better to choose g , such that: it is similar to f in terms of shape, but has thicker tails.
+
Choose g to have similar shape with f. <br/>
 +
In general, it is better to choose g such that it is similar to f in terms of shape, but has thicker tails. <br/>
  
 
<br><br>
 
<br><br>
  
<b>Procedure</b><br>
+
<b>'''Procedure'''</b><br>
 +
 
 
1. Sample <math> x_{1}, x_{2}, ..., x_{n} ~\sim g(x) </math> <br /><br />
 
1. Sample <math> x_{1}, x_{2}, ..., x_{n} ~\sim g(x) </math> <br /><br />
 
2. <math>\widehat{I} = \frac{1}{n} \sum_{i = 1}^{n} w({x_{i})}</math> where <math> w(x_i) = \frac{h(x_i)f(x_i)}{g(x_i)} </math> for <math>i=1\dots n</math><br />
 
2. <math>\widehat{I} = \frac{1}{n} \sum_{i = 1}^{n} w({x_{i})}</math> where <math> w(x_i) = \frac{h(x_i)f(x_i)}{g(x_i)} </math> for <math>i=1\dots n</math><br />
  
  
'''Example2'''
+
 
 +
 
 +
'''Example 2'''
  
 
<math>I=\int^{-3}_{-\infty} f(x)\,dx =\int^{\infty}_{-\infty} h\left( x\right) f\left( x\right) dx </math><br/>
 
<math>I=\int^{-3}_{-\infty} f(x)\,dx =\int^{\infty}_{-\infty} h\left( x\right) f\left( x\right) dx </math><br/>
Line 6,976: Line 7,252:
 
which gives <math>\widehat{I}= \frac{\text{number of observations }< -3}{n}</math><br/>
 
which gives <math>\widehat{I}= \frac{\text{number of observations }< -3}{n}</math><br/>
  
'''Comments on Example 2''':<br/>
 
*Since the number of observations less than -3 is a relatively rare event, this method will give us a relatively high variance. <br/>
 
*To illustrate this, suppose we sample 100 points each time for many times, we will be getting mostly 0's and some 1's and occasionally 2's. This data has large variances.<br/>
 
'''Note''' : h(x) is counting the number of observations that are less than -3.
 
  
'''Remarks''':
+
 
 +
'''Matlab Code:''' <br/>
 +
 
 +
<pre style="font-size:16px">
 +
n = 200;
 +
x = randn(1,n);
 +
I= sum(x>3)./n;
 +
 
 +
>> mean(I)
 +
>> var(I) % to calculate the variance of the estimates
 +
</pre>
 +
<br>
 +
'''Comments on Example 2''':<br/>
 +
 
 +
*Since observations less than -3 are a relatively rare event, this method will give us a relatively high variance. <br/>
 +
*To illustrate this, suppose we sample 100 points each time for many times, we will be getting mostly 0's and some 1's and occasionally 2's. This data has large variances.<br/>
 +
 
 +
'''Note''' : h(x) is counting the number of observations that are less than -3.
 +
 
 +
 
 +
 
 +
'''Remarks''':
  
 
*We can actually compute the form of <math>\displaystyle g(x)</math> to have optimal variance. <br>Mathematically, it is to find <math>\displaystyle g(x)</math> subject to <math>\displaystyle \min_g [\ E_g([y(x)]^2) - (E_g[y(x)])^2\ ]</math><br>
 
*We can actually compute the form of <math>\displaystyle g(x)</math> to have optimal variance. <br>Mathematically, it is to find <math>\displaystyle g(x)</math> subject to <math>\displaystyle \min_g [\ E_g([y(x)]^2) - (E_g[y(x)])^2\ ]</math><br>
 +
 
It can be shown that the optimal <math>\displaystyle g(x)</math> is <math>\displaystyle {|h(x)|f(x)}</math>. Using the optimal <math>\displaystyle g(x)</math> will minimize the variance of estimation in Importance Sampling. This is of theoretical interest but not useful in practice. As we can see, if we can actually show the expression of g(x), we must first have the value of the integration---which is what we want in the first place. <br/>
 
It can be shown that the optimal <math>\displaystyle g(x)</math> is <math>\displaystyle {|h(x)|f(x)}</math>. Using the optimal <math>\displaystyle g(x)</math> will minimize the variance of estimation in Importance Sampling. This is of theoretical interest but not useful in practice. As we can see, if we can actually show the expression of g(x), we must first have the value of the integration---which is what we want in the first place. <br/>
  
*In practice, we shall choose <math>\displaystyle g(x)</math> which has similar shape as <math>\displaystyle f(x)</math> but with a thicker tail than <math>\displaystyle f(x)</math> in order to avoid the problem mentioned above.<br>
+
In practice, we shall choose <math>\displaystyle g(x)</math> which has similar shape as <math>\displaystyle f(x)</math> but with a thicker tail than <math>\displaystyle f(x)</math> in order to avoid the problem mentioned above.<br>
  
*The case when <math> g(x) </math> is important it should have the same support. If <math> g(x) </math> does not have the same support then it may not be able to sample from <math> f </math> like before. Also, if <math> g(x) </math> is not a good choice then it increases the variance very badly. <br/>
+
The case when <math> g(x) </math> is important it should have the same support. If <math> g(x) </math> does not have the same support then it may not be able to sample from <math> f </math> like before. Also, if <math> g(x) </math> is not a good choice then it increases the variance very badly. <br/>
 +
 
 +
 
 +
'''Note:'''
 +
Normalized imporatance sampling is biased, but it is asymptotically unbiased.<br/>
  
Normalize important sampling <br>
 
 
<math>I=\int h(x)f(x)dx</math> <br>
 
<math>I=\int h(x)f(x)dx</math> <br>
 
<math>I=\int \frac{h(x)f(x)}{g(x)} g(x) dx </math><br>
 
<math>I=\int \frac{h(x)f(x)}{g(x)} g(x) dx </math><br>
Line 7,003: Line 7,300:
 
[[File:IMP ex part 1.png|600px]]  
 
[[File:IMP ex part 1.png|600px]]  
 
[[File:IMP ex part 2.png|600px]] <br \>
 
[[File:IMP ex part 2.png|600px]] <br \>
Source: STAT 340 Spring 2010 Course Notes
+
Source: STAT 340 Spring 2010 Course Notes <br>
 +
 
 +
 
 +
 
 +
'''Example:'''<br>
 +
Suppose <math>I=\int^{\infty}_{0} \frac{1}{(1+x)^2} dx</math> <br>
 +
Since the range is from 0 to <math>\infty</math> here, we can use <math> g(x) = e^{-x} </math> ; x>0<br>
 +
So <math>I=\int^{\infty}_{0} w(x)g(x) dx</math> where <math>w(x) = \frac{f(x)}{g(x)} = \frac{e^x}{(1+x)^2}</math><br>
 +
 
 +
'''Algorithm:'''<br>
 +
1) Generate n number of U<sub>i</sub>~U(0,1)<br>
 +
2) Set <math>X_i=-log(1-U_i) </math> for i=1,...,n<br>
 +
3) Set <math>W(X_i)= \frac{e^{X_i}}{(1+X_i)^2}</math><br>
 +
4) <math>\widehat{I} = \frac{1}{n} \sum_{i = 1}^{n} W({X_i)}</math><br>
 +
Actual value of the integral is 1 <br>
 +
 
 +
'''Matlab Code''':<br>
 +
 
 +
<pre style="font-size:16px">
 +
>> clear all
 +
>> close all
 +
>> n=1000;
 +
>> u=rand(1,n);
 +
>> x=-log(u);  % Generates number from exponential distribution using inverse transformation method
 +
>> w=(1./(1+x).^2).*exp(x);
 +
>> sum(w)/n
 +
    ans = 0.8884
 +
 
 +
Similarly for n=1000000, we get 0.9376 which is even closer to 1.
 +
</pre>
 +
 
 +
'''Another Method'''<br />
 +
 
 +
By changing the variable so that the bounds is (0,1), we can apply the Unif(0,1) method: <br />
 +
 
 +
Let <math>y= \frac{1}{x+1},  dy= \frac{-1}{(x+1)^2}dx =-y^2dx</math><br />
 +
 
 +
We can express the integral as <br />
 +
<math>\int^{1}_{0} \frac {1}{y^2} y^2 dy =\int^{1}_{0} 1 dy </math><br />
 +
which we recognise that it is just a <math>Unif(0,1)</math> and the result follows. <br />
 +
<br />
 +
'''The following are general forms for the change of variable method for different cases'''
 +
:<math>
 +
\int_a^{+\infty}f(x) \, dx =\int_0^1 f\left(a + \frac{u}{1-u}\right) \frac{du}{(1-u)^2} </math>
  
===Problem of Importance Sampling===
+
:<math>
 +
\int_{-\infty}^a f(x) \, dx = \int_0^1 f\left(a - \frac{1-u}{u}\right) \frac{du}{u^2}</math>
 +
 
 +
:<math>
 +
\int_{-\infty}^{+\infty} f(x) \, dx = \int_{-1}^{+1} f\left( \frac{u}{1-u^2} \right) \frac{1+u^2}{(1-u^2)^2} \, du,
 +
</math>
 +
Source: Wikipedia Numerical Integration
 +
 
 +
<math>Insert formula here</math>===Problem of Importance Sampling===
 
The variance of <math>\hat{I}</math> '''could be very large''' (infinitely large) with a bad choice of <math>g</math> <br>
 
The variance of <math>\hat{I}</math> '''could be very large''' (infinitely large) with a bad choice of <math>g</math> <br>
  
<math>\displaystyle Var(x) = E(x^2) - (E(x))^2 </math> <br>
 
 
<math>\displaystyle Var(w) = E(w^2) - (E(w))^2 </math> <br>
 
<math>\displaystyle Var(w) = E(w^2) - (E(w))^2 </math> <br>
 
<math> \begin{align}
 
<math> \begin{align}
E(w^2) &= \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx \\
+
E(w^2) &= \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx , where w = (\frac{h(x)f(x)}{g(x)})\\
 
&= \int (\frac{h^2(x)f^2(x)}{g^2(x)}) g(x) dx \\
 
&= \int (\frac{h^2(x)f^2(x)}{g^2(x)}) g(x) dx \\
 
&= \int (\frac{h^2(x)f^2(x)}{g(x)}) dx
 
&= \int (\frac{h^2(x)f^2(x)}{g(x)}) dx
Line 7,018: Line 7,365:
  
 
Consider the term <math>\frac{f(x)}{g(x)}</math>.<br>
 
Consider the term <math>\frac{f(x)}{g(x)}</math>.<br>
• If <math>g(x)</math> has thinner tails compare to <math>f(x)</math>, then <math>\frac{f(x)}{g(x)}</math> could be infinitely large.
+
• If <math>g(x)</math> has thinner tails compared to <math>f(x)</math>, then <math>\frac{f(x)}{g(x)}</math> could be infinitely large.
 
i.e. <math>E(w^2)</math> is infinitely large and so is variance.
 
i.e. <math>E(w^2)</math> is infinitely large and so is variance.
  
Line 7,027: Line 7,374:
  
  
Therefore we need to have criteria for choosing good <math>g</math>: <br><br>
+
A bad choice for g(x) can cause a problem and a good choice can reduce the variance. Therefore we need to have criteria for choosing good <math>g</math>: <br><br>
 
<b>Advice 1:</b> Choose <math>g</math> such that <math>g</math> has thicker tails compared to <math>f</math><br>
 
<b>Advice 1:</b> Choose <math>g</math> such that <math>g</math> has thicker tails compared to <math>f</math><br>
 
- Also, if over a set <math>A</math>, <math>g</math> is small but <math>f</math> is large, then <math>\frac {f(x)}{g(x)}</math> could be large. (i.e. the variance could be large.)
 
- Also, if over a set <math>A</math>, <math>g</math> is small but <math>f</math> is large, then <math>\frac {f(x)}{g(x)}</math> could be large. (i.e. the variance could be large.)
  
 
<b>Advice 2:</b> Choose <math>g</math> to have similar shape with <math>f</math><br>
 
<b>Advice 2:</b> Choose <math>g</math> to have similar shape with <math>f</math><br>
-In general, it is better to choose <math>g</math> such that it is similar to <math>f</math> in term of shape but has thicker tails.
+
-In general, it is better to choose <math>g</math> such that it is similar to <math>f</math> in terms of shape but with thicker tails.
  
 
<div style="border:5px solid pink;">
 
<div style="border:5px solid pink;">
Line 7,068: Line 7,415:
 
MATLAB Code
 
MATLAB Code
 
<pre style="font-size:14px">
 
<pre style="font-size:14px">
x=randn(1,100)
+
x=randn(1,100);
sum(x>3)/100
+
sum(x>3)/100;
  
 
clc
 
clc
 
clear all
 
clear all
 
close all
 
close all
n = 100
+
n = 100;
 
for ii = 1:200;
 
for ii = 1:200;
 
     x = randn(1,n);
 
     x = randn(1,n);
Line 7,092: Line 7,439:
  
 
'''Note''' : <br/>
 
'''Note''' : <br/>
*<math>N(4,1)</math> was chosen according to our advice mentioned earlier. <math>N(4,1)</math> has the same shape (actually the exact same shape) as <math>N(0,1)</math>.hence it will not increase the variance of our simulation.<br/>
+
*<math>N(4,1)</math> was chosen according to our advice mentioned earlier. <math>N(4,1)</math> has the same shape (actually the exact same shape) as <math>N(0,1)</math>. Hence it will not increase the variance of our simulation.<br/>
  
*The reason why we do not choose uniform distribution is because the uniform distribution only distributed over a finite interval from a to b, where the required <math>N(4,1)</math> distributed over all x.<br/>
+
*The reason why we do not choose uniform distribution for this case is because the uniform distribution only distributed over a finite interval from a to b, where the required <math>N(4,1)</math> distributed over all x.<br/>
  
*To be more precise, we can see that choosing a distribution centered at 3 or nearby points (i.e. 2 or 4) will help us generate more points which are greater than 3. Thus the variance between different samples will be reduced. The reason behind this can be explained as follows:
+
*To be more precise, we can see that choosing a distribution centered at 3 or nearby points (i.e. 2 or 4) will help us generate more points which are greater than 3. Thus the size of variance between different samples will be reduced. The reason behind this can be explained as follows: <br/>
 
Eg: If we take a sample of 1000 points and very few points are above three and we take the sample again, we will have a huge variance as the probability of samples greater than 3 is low. We may even get 0 as our simulated answer as shown in class which is not the case. Thus using this method helps us overcome the problem of sampling from rare events.  
 
Eg: If we take a sample of 1000 points and very few points are above three and we take the sample again, we will have a huge variance as the probability of samples greater than 3 is low. We may even get 0 as our simulated answer as shown in class which is not the case. Thus using this method helps us overcome the problem of sampling from rare events.  
  
Line 7,111: Line 7,458:
 
close all
 
close all
 
clc  %% clc clears all input and output from the Command Window display, giving you a "clean screen."
 
clc  %% clc clears all input and output from the Command Window display, giving you a "clean screen."
n=100
+
n=100;
 
for ii=1:200
 
for ii=1:200
 
   x=randn(1,n);
 
   x=randn(1,n);
Line 7,117: Line 7,464:
  
 
   x=randn(1,n)+4;
 
   x=randn(1,n)+4;
   ls(ii)=(sum(x>3).*exp(8-4*x))/n; %w(x)/n
+
   ls(ii)=sum((x>3).*exp(8-4*x))/n; %w(x)/n
 
end
 
end
 
 
var(lb)
 
var(lb)
 
var(ls)
 
var(ls)
var(Is)/var(Ib)
+
var(ls)/var(lb)
 +
hist(ls,50)
 
</pre>
 
</pre>
 +
 +
[[File:hist(ls).jpg|450px]]
  
 
'''Note:'''
 
'''Note:'''
Line 7,131: Line 7,480:
  
 
'''Example3''' <br>
 
'''Example3''' <br>
If <math> g(x)=x </math> for x belongs to <math>[0,1]</math>, the integral of this function is not 1.<br>
+
If <math> g(x)=x </math> for x belongs to <math>[0,1]</math>, the integral of this function is not 1.<br/>
So, we need to add a constant number of make it a valid pdf.<br>
+
So, we need to add a constant number to make it a valid pdf.<br/>
 
Therefore, we '''change it to <math> g(x)=2x </math> for 0<x<1 '''
 
Therefore, we '''change it to <math> g(x)=2x </math> for 0<x<1 '''
  
This code is for calculating the variance.
+
This code is for calculating the variance.<br/>
The first method produced a variance with a power of 10<sup>-5</sup>, while the second method produced a variance with a power of 10<sup>-8</sup>. Hence, a clear variance reduction is evident.  
+
The first method produced a variance with a power of 10<sup>-5</sup>, while the second method produced a variance with a power of 10<sup>-8</sup>. Hence, a clear variance reduction is evident. <br/>
  
 
'''Side Note:'''  <br/>  
 
'''Side Note:'''  <br/>  
*The most effective variance reduction technique is to increase the sample size. For instance, in the above example, by using Importance Sampling, we were able to reduce the variance by 3 degrees of power. <br/>  
+
*The most effective variance reduction technique is to '''increase the sample size'''. For instance, in the above example, by using Importance Sampling, we are able to reduce the variance by 3 degrees of power. <br/>  
*However if we used Method 1 but increase the sample size from 200 to 1,000,000 or more, we are able to decrease the variance by 4 or more degrees of power. <br/>     
+
*However if we used Method 1 while increasing the sample size from 200 to 1,000,000 or more, we are able to decrease the variance by 4 or more degrees of power. <br/>     
*Also, note that since there is a high variance, then it is problematic. So by choosing a different distribution that is not centered around 0, but a distribution that is centered at 4 for example would result in less variation. So for example, choose <math>\displaystyle g(x)\sim N(4,1)</math> <br/>
+
*Also, note that since there is a large variance, it is problematic. So by choosing a different distribution that is not centered around 0, a distribution that is centered at 4 for example would result in less variation. For example, choose <math>\displaystyle g(x)\sim N(4,1)</math> <br/>
  
 
'''Important Notes on selection of g(x):'''  
 
'''Important Notes on selection of g(x):'''  
  
- g(x) must have the same support as f(x) in order for accurate sampling<br/>
+
*g(x) must have the same support as f(x) in order for accurate sampling<br/>
- g(x) must be such that it encourages the occurrence of rare points (rare h(x))<br/>
+
*g(x) must be such that it encourages the occurrence of rare points (rare h(x))<br/>
- selection of g(x) greatly affects E[w<sup>2</sup>] therefore affects the variance. A poor choice of g(x) can cause a significant increase in the variance, thus defeating the purpose of Importance Sampling.<br />
+
*Selection of g(x) greatly affects E[w<sup>2</sup>] therefore affects the variance. A poor choice of g(x) can cause a significant increase in the variance, thus defeating the purpose of Importance Sampling.<br />
- Specifically, it is recommended that g(x) have the following properties<ref>
+
*Specifically, it is recommended that g(x) have the following properties<ref>
 
http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf
 
http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf
 
</ref>
 
</ref>
Line 7,159: Line 7,508:
  
  
*There are many methods of variance reduction, however, the best way is to increase n.The larger n is, the closer your value is to the exact value.
+
*Although there are many methods of variance reduction, the best way is to '''increase n'''. The larger n is, the closer your value is to the exact value.
*Using various computer softwares is the most effective method of reducing variance.
+
*Using various computer software is the most effective method of reducing a variance.
 
</div>
 
</div>
  
 
==Class 24, Thursday, July 25, 2013==
 
==Class 24, Thursday, July 25, 2013==
 
===Importance Sampling===
 
===Importance Sampling===
definition of important sampling from Wikipedia:<br>
+
Importance Sampling is the most fundamental variance reduction technique and usually leads to a dramatic variance reduction. <br />
In statistics, importance sampling is a general technique for estimating properties of a particular distribution, while only having samples generated from a different distribution rather than the distribution of interest. It is related to umbrella sampling in computational physics. Depending on the application, the term may refer to the process of sampling from this alternative distribution, the process of inference, or both.
+
Importance sampling involves choosing a sampling distribution that favour important samples*.(Simulation and the Monte Carlo Method, Reuven Y. Rubinstein) <br />
 +
 
 +
* Here "favour important samples" implies encouraging the occurrence of the desired event or part of the desired event. For instance, if the event of interest is rare (probability of occurrence is close to zero), we "favour important samples" by choosing a sampling distribution such that the event has higher probability of occurrence.
 +
 
 +
Definition of importance sampling from Wikipedia:<br>
 +
Importance sampling is a general technique for estimating properties of a particular distribution, while samples are generated from a different distribution other than the distribution of interest. It is related to umbrella sampling in computational physics. Depending on the application, the term may refer to the process of sampling from this alternative distribution, the process of inference, or both.
 
<br>
 
<br>
Recall that using important sampling,we have the following:<br/>
+
Recall that using importance sampling,we have the following:<br/>
  
 
<math>I=\int_{a}^{b}f(x)dx = \int_{a}^{b}f(x)(b-a) \times \frac{1}{b-a}dx</math> <br />
 
<math>I=\int_{a}^{b}f(x)dx = \int_{a}^{b}f(x)(b-a) \times \frac{1}{b-a}dx</math> <br />
  
  
If g(x) is another probability density like f(x)=0 where g(x)=0 ,<br />
+
If g(x) is another probability density function, <br />
 +
note: in summary, a good importance sampling function g(x) should satisfies:<br />
 +
 
 +
1. g(x) > 0 whenever f(x)not equal to 0<br />
 +
2. g(x) should be equal or close to the absolute value of f(x)<br />
 +
3. easy to simulate values from g(x)<br />
 +
4. easy to compute the density of g(x)<br />
 +
 
 +
original source is here<br /> http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf<br />
  
 
then we have: <br />
 
then we have: <br />
Line 7,181: Line 7,543:
 
In order to estimate I we have:<br/>
 
In order to estimate I we have:<br/>
  
<math>\widehat{I}=\frac{1}{n}\sum_{i=1}^{n}w(x)</math> and <math>g^{*}(x) = \frac{|h(x)|f(x)}{\int |h(x)|f(x)dx}</math>
+
<math>\widehat{I}=\frac{1}{n}\sum_{i=1}^{n}w(x)</math> and <math>g^{*}(x) = \frac{|h(x)|f(x)}{\int |h(x)|f(x)dx}</math>, where <math> h(x)>=0 </math> for all x <br>
  
'''Note:''' g(x) should be chosen carefully. It should be easy to sample from, and since this method is for minimizing the variance, g(x) should be chosen in a manner such that the variance is minimized. g*(x) is the distribution that minimizes the variance.
+
Higher values of n correspond to values of <math>\widehat{I}</math> closer to <math>{I}</math>, which approaches <math>\widehat{I}</math> as n approaches infinity.
 +
 
 +
'''Note:''' g(x) should be chosen carefully. It should be easy to sample from. Also, since this method is for minimizing the variance, g(x) should be chosen in a manner such that the variance is minimized. g*(x) is the distribution that minimizes the variance.
  
  
Line 7,192: Line 7,556:
  
 
:<math>\, Var(I) = Var(\frac{1}{n} \sum_{i = 1}^{n} w({x_{i})})= Var(w)/n </math> <br>
 
:<math>\, Var(I) = Var(\frac{1}{n} \sum_{i = 1}^{n} w({x_{i})})= Var(w)/n </math> <br>
 +
 +
Note:
 +
This expression has equivalent to the summation of all variances of W, because W’s are independent, hence covariance terms are zero.