== Introduction, Class 1 - Tuesday, May 7, 2013 ==
  
  
 
(A simple practice might be investigating the hypothesis that higher levels of education cause higher levels of income.) <br />

3. Clustering: Use common features of objects in the same class or group to form clusters. (In this case, x is given, y is unknown.) <br />

4. Dimensionality Reduction (aka Feature Extraction, Manifold Learning): Used when we have a variable in a high-dimensional space and we want to reduce the dimension. <br />
  
 
=== Applications ===
 
<br>A computer cannot generate truly random numbers because computers can only run algorithms, which are deterministic in nature. They can, however, generate Pseudo Random Numbers.<br>

'''Pseudo Random Numbers''' are numbers that seem random but are actually deterministic. Although pseudo random numbers are deterministic, they form a sequence of values that has the appearance of independent uniform random variables. Being deterministic, pseudo random numbers are valuable and beneficial due to the ease of generation and manipulation.
  
 
If the generator is run many times with the same seed, the results will be exactly the same values, which makes the sequence look deterministic; however, within a single run, the numbers appear random.
 
''The number 31 will never appear. When you perform the operation <math>\mod m</math>, the largest possible answer that you could receive is <math>m-1</math>. Whether or not a particular number in the range from 0 to <math>m - 1</math> appears in the above algorithm will be dependent on the values chosen for <math>a, b</math> and <math>m</math>.''

<hr/>
 
'''Example'''<br/>
 
If <math>x_0=7</math> and <math>x_n=(9x_{n-1}+5)\mod 476</math>, find <math>x_1,\cdots,x_{5}</math>.<br />
 
'''Solution:'''<br />
 
<math>\begin{align}
x_1 &{}= (9 \times 7+5) &{}\mod{476} &{}= 68 \\
x_2 &{}= (9 \times 68+5) &{}\mod{476} &{}= 141 \\
x_3 &{}= (9 \times 141+5) &{}\mod{476} &{}= 322 \\
x_4 &{}= (9 \times 322+5) &{}\mod{476} &{}= 47 \\
x_5 &{}= (9 \times 47+5) &{}\mod{476} &{}= 428 \\
\end{align}</math><br/><br/>
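A quick way to check these values is to run the recursion directly. Here is a minimal MATLAB sketch (the constants a=9, b=5, m=476 and the seed 7 come from the example above):
<pre style="font-size:16px">
a = 9; b = 5; m = 476;    % multiplier, increment, modulus from the example
x = 7;                    % seed x0
for ii = 1:5
    x = mod(a*x + b, m);  % x_n = (9*x_{n-1} + 5) mod 476
    disp(x)               % prints 68, 141, 322, 47, 428
end
</pre>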
 
  
 
'''Examples: [From Textbook]'''<br />
  
 
In short, what the theorem tells us is that we can use a random number <math>U \sim U(0,1)</math> to randomly sample a point on the CDF of X, then apply the inverse of the CDF to map the given probability to its domain, which gives us the random variable X.<br/>
 
 
'''Proof that F(X) is uniformly distributed:'''

P(F(X)<u) = P(F<sup>-1</sup>(F(X))<F<sup>-1</sup>(u)) = P(X<F<sup>-1</sup>(u)) = F(F<sup>-1</sup>(u)) = u

So F(X) ~ U(0,1).
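As a quick sanity check, here is a minimal MATLAB sketch (the Exp(1) distribution is an assumed example, chosen because its CDF <math>F(x)=1-e^{-x}</math> has an easy closed form):
<pre style="font-size:16px">
u = rand(1,10000);        % U ~ U(0,1)
x = -log(1-u);            % X = F^{-1}(U), so X ~ Exp(1)
hist(1-exp(-x))           % plot F(X); the histogram is roughly flat, i.e. U(0,1)
</pre>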
 
  
  
 
Step 2: Compute <math>X = F^{-1}(U)</math>, i.e. <math>X = \theta  + \frac {1}{\lambda} \ln(2U)</math> for <math>U < 0.5</math>; else <math>X = \theta -\frac {1}{\lambda} \ln(2(1-U))</math>

'''MatLab Code''':<br />
<pre style="font-size:16px">
>> u = rand;
>> theta  = 1;
>> lambda = 1;
>> if u < 0.5
      X = theta + (1/lambda) * log(2*u);      % note the explicit *; log(2u) is a syntax error
  else
      X = theta - (1/lambda) * log(2*(1-u));
  end
</pre>
 
  
 
'''Example 3 - <math>F(x) = x^5</math>''':<br/>
 
Step 1: Draw U ~ rand[0, 1];<br />
Step 2: X=U^(1/5);<br />

'''MatLab Code''':<br />
<pre style="font-size:16px">
>> x = rand^(1/5)
</pre>
 
 
  
 
'''Example 4 - BETA(1,β)''':<br/>
 
<math>x = 1-(1-u)^\frac {1}{\beta}</math><br />

Let β=3, and use Matlab to construct N=1000 observations from Beta(1,3):<br />

'''MatLab Code''':<br />
<pre style="font-size:16px">
>> u = rand(1,1000);
>> x = 1-(1-u).^(1/3);    % element-wise power, since u is a vector
>> hist(x,50)
>> mean(x)
</pre>
 
  
 
'''Example 5 - Estimating <math>\pi</math>''':<br/>
 
=== Recall the Inverse Transform Method ===

Recall that the Inverse Transform Method is used to generate a random variable X from its CDF F(x) using the uniform distribution. To sample X with CDF F(x), <br />
  
 
'''1. Draw u~U(0,1) '''<br />
 
This is the c.d.f. of X.  <br />
<br />
This same technique can be used to sample from a discrete distribution.<br />
 
  
 
'''Note''': the CDF of a U(a,b) random variable is:
:<math>
F(x) = \begin{cases}
   0, & \text{for } x < a \\
   \frac{x-a}{b-a}, & \text{for } a \leq x \leq b \\
   1, & \text{for } x > b
   \end{cases}
</math>

Further, the pdf is <math>f(x) = \frac{1}{b-a}</math> for <math>a \leq x \leq b</math>, and 0 otherwise.
 
  
 
Thus, for <math> U </math> ~ <math>U(0,1) </math>, we have <math>P(U\leq 1) = 1</math> and <math>P(U\leq 1/2) = 1/2</math>.<br />
 
Note that a single point carries no probability mass (i.e. <math>u \leq 0.5</math> is the same as <math>u < 0.5</math>).

More formally, this says that <math> P(X = x) = F(x)- \lim_{s \to x^-}F(s)</math>, which equals zero for any continuous random variable.
 
====Advantages of the Inverse Transform Method====
 
 
*  It is very easy to use and apply if we are able to find the inverse cdf <math> F^{-1}(\cdot)</math>.
 
*  It preserves monotonicity and correlation, which consequently helps in order statistics, variance reduction methods, and also generating truncated distributions.
 
  
 
====Limitations of the Inverse Transform Method====

Though this method is very easy to use and apply, it does have some major disadvantages/limitations:

*  Since a number of comparisons are required, the speed of this method is often very slow.
*  We need to find the inverse cdf <math> F^{-1}(\cdot) </math>. In some cases the inverse function does not exist, or is difficult to find because it requires a closed form expression for F(x).
 
  
 
For example, it is too difficult to find the inverse cdf of the Gaussian distribution, so we must find another method to sample from the Gaussian distribution.

=== Discrete Case ===
 
The same technique can be used for the discrete case. We want to generate a discrete random variable X that has probability mass function: <br/>
 
else if U < 0.8 then output 2<br />
else if U < 0.9 then output -2<br />
else if U < 0.97 then output 0<br />
else output 1<br />
 
 
 
* '''Matlab Code'''<br />
<pre style="font-size:16px">
>> u = rand;              # draw u to look up in F(x)
>> if u < 0.5             # Pr(x = -1)=0.5
      x = -1;
  elseif u < 0.8          # Pr(x = 2) =0.8-0.5 =0.3
      x = 2;
  elseif u < 0.9          # Pr(x = -2)=0.9-0.8 =0.1
      x = -2;
  elseif u < 0.97         # Pr(x = 0) =0.97-0.9=0.07
      x = 0;
  else                    # Pr(x = 1) =1 - 0.97=0.03
      x = 1;
  end                     # total probability: 0.5+0.3+0.1+0.07+0.03=1
</pre>
 
  
 
'''Example 3.1 (from class):''' (Coin Flipping Example)<br />
 
3. else if 0.3<U<=0.5 deliver x=1<br />
4. else 0.5<U<=1 deliver x=2
  
 
* '''Code''' (as shown in class)<br />

[[File:Discrete_example.jpg|300px]]
  
Can you find a faster way to run this algorithm? Consider:
:<math>
x = \begin{cases}
2, & \text{if } U\leq 0.5 \\
1, & \text{if } 0.5 < U \leq 0.7 \\
0, & \text{if } 0.7 <U\leq 1
\end{cases}</math>
 
 
 
The logic for this is that U is most likely to fall into the largest range. Thus, by checking the largest range first (in this case <math>U \leq 0.5</math>), we can improve the expected run time of this algorithm. Could this algorithm be improved further using the same logic?
 
 
 
<pre style="font-size:16px">
close all
clear all
for ii=1:1000
    u=rand;
    if u<=0.5
      x(ii)=2;
    elseif u<=0.7
      x(ii)=1;
    else
      x(ii)=0;
    end
end
size(x)
hist(x)
</pre>
 
[[File:lec3.jpg|300px]]
 
 
 
 
 
'''Example 3.3''': Generating a random variable from pdf <br>
:<math>
f_{x}(x) = \begin{cases}
2x, & \text{if } 0\leq x \leq 1 \\
0, & \text{otherwise}
\end{cases}</math>
 
 
 
:<math>
F_{x}(x) = \begin{cases}
0, & \text{if } x < 0 \\
\int_{0}^{x}2sds = x^{2}, & \text{if } 0\leq x \leq 1 \\
1, & \text{if } x > 1
\end{cases}</math>

:<math>\begin{align} U = x^{2}, X = F_{x}^{-1}(U)= U^{\frac{1}{2}}\end{align}</math>
 
 
* '''Code'''<br />
<pre style="font-size:16px">
>> u = rand;
>> x = u ^ (1/2);
</pre>
 
 
  
 
'''Example 3.4''': Generating a Bernoulli random variable <br>
:<math>
x = \begin{cases}
0, & \text{if } U \leq 1-p \\
1, & \text{if } U > 1-p
\end{cases}</math>
  
* '''Code'''<br />
<pre style="font-size:16px">
>> u = rand;
>> p = 0.3;        % p from (0,1); MATLAB is case-sensitive, so use p consistently
>> if u < (1-p)
      x = 0;
  else
      x = 1;
  end
</pre>
 
  
 
'''Example 3.5''': Generating Binomial(n,p) Random Variable<br>
 
*Note: These steps can be found in Simulation 5th Ed. by Sheldon Ross.
*Note: Another method is to see the Binomial as a sum of n independent Bernoulli random variables.<br>

Another method:<br>
 
 
Step 1: Generate n uniform numbers U1 ... Un.<br>
Step 2: <math>X = \sum_{i=1}^{n} I(U_i \leq p)</math>, i.e. X is the number of <math>U_i</math> that are less than or equal to p, where p is the probability of success (see the sketch below).
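A minimal MATLAB sketch of this method (the values n=20 and p=0.4 are assumed for illustration):
<pre style="font-size:16px">
n = 20; p = 0.4;          % number of trials and success probability (assumed)
X = sum(rand(1,n) <= p)   % one Binomial(n,p) realization: count of uniforms below p
</pre>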
...
   Else if <math>U \leq P_{0} + ... + P_{k} </math> deliver <math>x = x_{k}</math><br />
then if the <math>x_{i}</math>, <math>i \geq 0</math>, are ordered so that <math>x_{0}<x_{1}<x_{2}<...</math> and if we let F denote the distribution function of X, then <math>F(x_{k}) = \sum_{i=0}^{k} p_{i}</math> and so
                      X will equal <math>x_{j}</math> if <math>F(x_{j-1}) \leq U < F(x_{j})</math>
 
  
 
===Inverse Transform Algorithm for Generating a Binomial(n,p) Random Variable (from textbook)===
 
</pre>
</div>
 
=== Continuous Case ===
 
'''Example 3.8''': Generating a Weibull(a,b) distribution <br>
 
 
Let X ~ Weibull(a,b). Write an algorithm to generate X. <br>
The PDF of X is: <br>

      <math>f(x) = \frac{a}{b^{a}}x^{a-1}e^{-(x/b)^{a}}</math> ; x > 0 <br>

The CDF of X is: <br>

      <math>F(x) = 1-e^{-(x/b)^{a}}</math> ; x > 0 <br>

Solve U = F(X) for X: <br>

      <math>U = 1-e^{-(X/b)^{a}}</math> <br>
<=>  <math>-(X/b)^{a}=\ln(1-U)</math> <br>
<=>  <math>X/b=(-\ln(1-U))^{1/a}</math><br>
<=>  <math>X=b(-\ln(1-U))^{1/a}</math><br>

'''Algorithm''': <br>
1. Generate U~U(0,1) <- ''Note: Generating U and 1-U is the same since both are U(0,1)''<br>
2. Return <math>X=b(-\ln (U) )^{1/a}</math> (a MATLAB sketch follows below)<br>
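A minimal MATLAB sketch of this algorithm (the shape a=2 and scale b=1 are assumed for illustration):
<pre style="font-size:16px">
a = 2; b = 1;                  % shape and scale parameters (assumed)
u = rand(1,1000);
x = b * (-log(u)).^(1/a);      % X = b(-ln U)^(1/a), element-wise
hist(x)
</pre>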
 
 
A simple example: Simulating an Exponential Random Variable:

If <math>F(x)=1-e^{-x}</math>, then <math>F^{-1}(u)</math> is the value of x such that

                          <math>1-e^{-x}=u</math>
or
                          <math>x= -\ln(1-u)</math>

Hence, if U is a Uniform(0,1) variable, then

                          <math>F^{-1}(U)=-\ln(1-U)</math>

is exponentially distributed with mean 1. Since 1-U is also uniformly distributed on (0,1), it follows that <math>-\ln U</math> is exponential with mean 1. Since cX is exponential with mean c when X is exponential with mean 1, it follows that <math>-c\ln U</math> is exponential with mean c.
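As a minimal MATLAB sketch (the mean c=2 is an assumed example value):
<pre style="font-size:16px">
c = 2;                      % desired mean (assumed)
x = -c * log(rand(1,1000)); % -c ln(U) ~ Exponential with mean c
mean(x)                     % should be close to 2
</pre>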
 
  
 
===Acceptance-Rejection Method===
 
  
 
Although the inverse transformation method does allow us to transform the uniform distribution into other distributions, it has two limitations:

[[File:AR_Method.png]]
  
 
The main logic behind the Acceptance-Rejection Method is that:<br>
 
At X<sub>2</sub>, there is a high probability of accepting the point, since <math>P(U\leq a)=a</math> for a Uniform(0,1) random variable.

Note:<br>
• Since U is uniform between 0 and 1, the acceptance probability <math>\frac{f(y)}{c g(y)}</math> must be at most 1 for every y; this depends on the constant c, and the condition amounts to <math>c\geq \frac{f(y)}{g(y)}</math> for all y.

• f(Y) and g(Y) are random variables, hence so is the ratio <math>\frac{f(y)}{\, c g(y)}</math>, and this ratio is independent of U in step 2.

• The relationship between cg(x) and f(x) determines which points are rejected: on the graph, a proposed point is rejected when the uniform draw lands in the region between f(x) and cg(x) above that x.

In the example above, x<sub>1</sub> is a bad point and x<sub>2</sub> is a good point for illustrating rejection and acceptance.
 
 
  
 
'''Some notes on the constant C'''<br>
 
<math>P(y|accepted)=f(y)=\frac{P(accepted|y)P(y)}{P(accepted)}</math><br />

<br />based on the concept from '''procedure-step1''':<br />
<math>P(y)=g(y)</math><br /> (first step: draw Y~g(.))
  
 
<math>P(accepted|y)=\frac{f(y)}{cg(y)}</math> <br />

<math>\begin{align}
P(accepted)&=\int P(accepted|y)P(y)\,dy \\
           &=\int \frac{f(y)}{cg(y)}g(y)\,dy \\
           &=\frac{1}{c}\int f(y)\,dy \\
           &=\frac{1}{c}
\end{align}</math><br />
(under any pdf, the total area is 1)
 
  
 
Therefore:<br />
<math>\begin{align}
P(y|accepted)&=\frac{\frac{f(y)}{cg(y)}g(y)}{1/c}\\
&=\frac{\frac{f(y)}{c}}{1/c}\\
&=f(y)\end{align}</math><br /><br /><br />
So a sample from g, kept with probability <math>\frac{f(y)}{cg(y)}</math>, results in a sample from f.
 
  
 
'''''Here is an alternative introduction of the Acceptance-Rejection Method'''''
 
*The relationship between the proposal distribution and target distribution is: <math> c \cdot g(x) \geq f(x) </math>, where c is a constant. This means that the area of f(x) is under the area of <math> c \cdot g(x)</math>. <br>
*The chance of acceptance is less if the distance between <math>f(x)</math> and <math> c \cdot g(x)</math> is big, and vice-versa; we use <math> c </math> to keep <math> \frac {f(x)}{c \cdot g(x)} </math> below 1 (so <math>f(x) \leq c \cdot g(x)</math>). Therefore, we must find the constant <math> c </math> to achieve this.<br />
*In other words, <math>c</math> is chosen to make sure <math> c \cdot g(x) \geq f(x) </math>. However, it will not make sense if <math>c</math> is simply chosen to be arbitrarily large: if <math>c</math> is too large, the acceptance probability becomes too small. We need to choose <math>c</math> such that <math>c \cdot g(x)</math> fits <math>f(x)</math> as tightly as possible.<br />
*The constant c cannot be a negative number.<br />
 
  
 
'''How to find C''':<br />
 
<pre style="font-size:16px">
>>u=rand(1,1000);
>>x=u.^0.5;    %square root of each element of u
>>hist(x)
</pre>
Line 1,399: Line 1,256:
  
 
<span style="font-weight:bold;color:green;">Matlab Tip:</span>
Periods, ".", meaning "element-wise", are used to describe the operation you want performed on each element of a vector. In the above example, to take the square root of every element in u, the notation u.^0.5 is used. If we leave out the ".", MATLAB attempts a matrix operation, which raises an error unless the input is a scalar or a square matrix. For a square matrix U, B = U^0.5 is the matrix square root, i.e. B*B = U. For example, for the vectors a = [1 2 3] and b = [2 3 4], a.*b = [2 6 12], but a*b gives an error since the matrix dimensions must agree.
  
 
 
:4: if <math>U2 \leq \frac { \frac{3}{4} (1-y^2)} { \frac{3}{4}} = {1-y^2}</math>, then x=y;  '''note that''' <math>\frac{3}{4}(1-y^2) / \frac{3}{4}</math> comes from f(y) / (cg(y))
:5: else: return to '''step 1'''
 
<span style="font-weight:bold;color:green;">Matlab Code</span>
<pre style="font-size:16px">
>> ii = 1;
>> while  ii <= 1000  % generate 1000 accepted points
      u1 = rand;
      u2 = rand;
      y = 2 * u1 - 1;  % make y uniform over (-1,1)
      if u2 <= (1 - y^2)
        x(ii) = y;
        ii = ii + 1;
      end
  end
</pre>
 
 
  
 
----
  
 
Simple example of Acceptance-Rejection Method:
 
Generate a random variable having density function f(x)=20x[(1-x)^3], 0<x<1
 
Find c such that C>=f(x)/g(x), we use calculus to determine the maximum value of
 
  f(x)/g(x)= 20x[(1-x)^3]
 
d/dx[f(x)/g(x)] = 20[(1-x)^3 - 3x(1-x^2) ]
 
Setting this equal to 0 shows that the maximal value is attained when x=0.25, and thus
 
f(x)/g(x) <= 20(0.25)[(0.25)^3] = 135/64 ≅c
 
Hence,
 
f(x)/cg(x) = (256/27)x[(1-x)^3]
 
 
Procedure:
 
i. Generate random numbers U1 and U2
 
ii.If U2 <= (256/27)U1[(1-U1)^3], stop and set X=U1. Otherwise return to step i.
 
(From textbook: Introduction to Probability,  10th Edition, Sheldon M.Ross)
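A minimal MATLAB sketch of this procedure:
<pre style="font-size:16px">
ii = 1;
while ii <= 1000
    u1 = rand;  u2 = rand;
    if u2 <= (256/27) * u1 * (1-u1)^3   % accept with probability f(u1)/(c*g(u1))
        x(ii) = u1;
        ii = ii + 1;
    end
end
hist(x)
</pre>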
 
  
 
=====Example of Acceptance-Rejection Method=====
 
To obtain a better proposal function g(x), we can first assume a new q(x) and then solve for the normalizing constant by integrating.<br>
In the previous example, we first assume q(x) = 3x. To find the normalizing constant, we need to solve <math>k \int_0^1 3x\,dx = 1</math>, which gives us k = 2/3. So, g(x) = k·q(x) = 2x.
=====Another example of Acceptance-Rejection Method=====
 
 
 
Let <math> f(x) = x^3 </math> for <math> 0<x<\sqrt{2} </math>. Use acceptance-rejection method with the proposal distribution, <math> g(x)=x </math> for <math> 0<x<\sqrt{2} </math>
 
 
 
<math> c=\max \frac{f(x)}{g(x)} = \max \frac{x^3}{x} = \max \, x^2 = (\sqrt{2})^2 = 2 \;\Rightarrow\; \frac{f(x)}{c \cdot g(x)} = \frac{x^2}{2} </math> <br>
Hence, the algorithm is: <br>
1. Generate <math> u_1 \sim~ U(0,1) </math> and set <math> y = \sqrt{2u_1} </math> (this draws y from g, since <math>G(y)=y^2/2</math> gives <math>G^{-1}(u)=\sqrt{2u}</math>) <br>
2. Generate <math> U \sim~ U(0,1) </math> <br>
3. If <math> U \leqslant \frac{y^2}{2} </math>, then X=y; else go to step 1. A sketch follows below.
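A minimal MATLAB sketch of this algorithm:
<pre style="font-size:16px">
ii = 1;
while ii <= 1000
    y = sqrt(2*rand);      % draw y from g(x)=x on (0,sqrt(2)) by inverse transform
    u = rand;
    if u <= y^2/2          % accept with probability f(y)/(c*g(y)) = y^2/2
        x(ii) = y;
        ii = ii + 1;
    end
end
hist(x)
</pre>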
 
 
 
  
 
'''Possible Limitations'''
  
 
3. Now Y follows <math>U(a,b)</math>
 
  
 
'''Example''': Generate a random variable z from the Semicircular density <math>f(x)= \frac{2}{\pi R^2} \sqrt{R^2-x^2}, -R\leq x\leq R</math>.

-> Proposal distribution: <math>g(x)=\frac{1}{2R}</math> for <math>-R \leq x \leq R</math>, i.e. UNIF(-R, R)

-> We know how to generate from <math> U \sim UNIF (0,1) </math>: let <math> Y= [R-(-R)]U+(-R)=2RU-R=R(2U-1)</math>; therefore Y follows <math>U(-R,R)</math>

-> In order to maximize the function we must maximize the top and minimize the bottom.
[[File:ARM_cont_example.jpg|300px]]

The histogram shows the sampled values of the variable x, with the bar heights giving the counts.
 
=== Discrete Examples ===
* '''Example 1''' <br>
 
Step 2. Draw <math>U \sim~ U(0,1)</math>.<br/>
Step 3. If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math>, then <b> X = Y </b>;<br/>
        else return to Step 1.
  
 
C can be found by maximizing the ratio <math> \frac{f(x)}{g(x)} </math>. To do this, we want to maximize <math> f(x) </math> and minimize <math> g(x) </math>. <br>
:<math>\frac{p(x)}{cg(x)} =  \frac{p(x)}{1.5 \times 0.2} = \frac{p(x)}{0.3} </math><br>
Note: U is independent of Y in Steps 2 and 3 above.
The constant c is an indicator of the rejection rate, i.e. the efficiency of the algorithm.

Since g follows a discrete uniform distribution, the probability is the same for all values; there are 5 possible values (1,2,3,4,5), so g(x) = 1/5 = 0.2.

Remember that we always want to choose <math> cg </math> to be equal to or greater than <math> f </math>, but as close as possible.
<br />Limitations: If the form of the proposal distribution g is very different from the target distribution f, then c is very large and the algorithm is not computationally effective.
  
 
* '''Code for example 1'''<br />
 
The acceptance rate is <math>\frac {1}{c}</math>, so the lower the c, the more efficient the algorithm. Theoretically, c equal to 1 is the best case, because all samples would be accepted; however this would only be true when the proposal and target distributions are exactly the same, which would never happen in practice.

For example, if c = 1.5, the acceptance rate would be <math>\frac {1}{1.5}=\frac {2}{3}</math>. This means about 66% of points are accepted. Thus, in order to generate 1000 random values, on average, a total of 1500 iterations would be required.

A histogram of 1000 random values from f(x); the more random values generated, the closer the empirical frequencies come to the specified probabilities.
</pre>
  
[[File:May21_Example2.jpg|300px]]
 
  
 
* '''Example 3'''<br>
 
2. <math>j = \lfloor \frac{\ln(U_{1})}{\ln(.75)} \rfloor+1;</math><br>
3. if <math>U_{2} < \frac{p_{j}}{cg(j)}</math>, set X = x<sub>j</sub>, else go to step 1.
 
 
* '''Code for example 3'''<br />
<pre style="font-size:16px">
>> ii = 1;
>> while ii < 1000
      u1 = rand;
      u2 = rand;
      j = floor(log(u1)/log(.75)) + 1;       % geometric proposal via inverse transform
      pj = exp(-3) * (3^j) / factorial(j);   % Poisson(3) pmf at j
      gj = .25 * (.75)^(j-1);                % geometric pmf at j
      if u2 < pj / (2.12 * gj)
        x(ii) = j;
        ii = ii + 1;
      end
  end
>> hist(x)
</pre>
 
  
 
Note: In this case, f(x)/g(x) is extremely difficult to differentiate, so we were required to test points. If the function is easily differentiable, we can calculate the max as if it were a continuous function, then check the two surrounding points for the highest discrete value.
  
 
Looking at the graph, the max of f(x) is 0.4945 and the max of g(x) is 0.4444, so we can calculate c as 0.4945/0.4444 ≈ 1.1127.
But from the graph, this c is not the best choice, because c·g(x) does not cover all the points of f(x); we need to raise the c·g(x) graph to cover all of f(x), while keeping the rejection ratio as low as possible.
  
 
Limitation: If the shape of the proposed distribution g is very different from the target distribution f, then the rejection rate will be high (high c value). Computationally, the algorithm is always right; however it is inefficient and requires many iterations. <br>
  
 
=== Other Sampling Method: Box-Muller ===

[[File:Unnamed_QQ_Screenshot20130521203625.png‎]]

* From Cartesian to polar coordinates <br />
 
   
 
   
 
*Box-Muller Transformation:<br>
It is a transformation that consumes two continuous uniform random variables <math> X \sim U(0,1), Y \sim U(0,1) </math> and outputs a bivariate normal random variable with <math> Z_1\sim N(0,1), Z_2\sim N(0,1). </math><br>
In other words, the Box-Muller method is a method of producing two independent standard normals from two independent uniforms. <br>
 
 
 
*Basic Form:<br>
Let U<sub>1</sub> and U<sub>2</sub> ~ U(0,1) be independent, and let: <br>
 
 
 
1)  <math>Z_0 = R \cos(\Theta) =\sqrt{-2 \ln U_1} \cos(2 \pi U_2)\,</math><br>
 
 
 
2)  <math>Z_1 = R \sin(\Theta) = \sqrt{-2 \ln U_1} \sin(2 \pi U_2)\,</math><br>
 
 
 
where both Z<sub>0</sub> and Z<sub>1</sub>~N(0,1) are independent, with corresponding polar coordinates:<br>
 
  <math>R^2 = -2\cdot\ln U_1\,</math> <br>
 
and <br>
 
  <math>\Theta = 2\pi U_2\,</math> <br>
 
 
 
'''Note:''' <br>
 
 
 
R<sup>2</sup> here has Chi-Squared distribution with df = 2 since it is just the square of the norm of the standard bivariate normal variable (X,Y). For the special case where df = 2, chi-squared distribution is the same as the exponential distribution. Hence, R<sup>2</sup> is simply obtainable by generating the required exponential variate.
 
 
 
Source: https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform
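A minimal MATLAB sketch of the basic form:
<pre style="font-size:16px">
u1 = rand(1,1000);  u2 = rand(1,1000);
z0 = sqrt(-2*log(u1)) .* cos(2*pi*u2);   % Z0 ~ N(0,1)
z1 = sqrt(-2*log(u1)) .* sin(2*pi*u2);   % Z1 ~ N(0,1), independent of Z0
hist(z0)                                 % bell-shaped histogram
</pre>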
 
  
 
=== '''Matlab''' ===
 
Alternative Method of Generating Standard Normal Random Variables

Step 1: Generate <math>U_1 \sim Unif(0,1)</math><br>
Step 2: Generate <math>Y_1 \sim Exp(1), Y_2 \sim Exp(1)</math><br>
Step 3: If <math>Y_2 \geq (Y_1-1)^2/2</math>, set <math>V=Y_1</math>; otherwise, go to step 1<br>
Step 4: If <math>U_1 \leq 1/2</math>, then <math>X=-V</math><br>
  
Line 2,248: Line 2,024:
 
where <math> \mu </math> is the mean or expectation of the distribution and <math> \sigma </math> is standard deviation <br />
 
where <math> \mu </math> is the mean or expectation of the distribution and <math> \sigma </math> is standard deviation <br />
  
The special case of the normal distribution is standard normal distribution, which the variance is 1 and the mean is zero. If X is a general normal deviate, Z = (X − μ)/σ will have a standard normal distribution.
+
The special case of the normal distribution is standard normal distribution, which the variance is 1 and the mean is zero. If X is a general normal deviate, then Z = (X − μ)/σ will have a standard normal distribution.
  
 
If Z ~ N(0,1) and we want <math>X \sim N(\mu, \sigma^2)</math>, then <math>X = \mu + \sigma Z</math>, since <math>E(X) = \mu +\sigma \times 0 = \mu </math> and <math>Var(X) = 0 +\sigma^2 \times 1 = \sigma^2</math>. A short sketch follows below.
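A minimal MATLAB sketch of this location-scale transformation (the values <math>\mu=3, \sigma=2</math> are assumed for illustration):
<pre style="font-size:16px">
mu = 3;  sigma = 2;             % target mean and standard deviation (assumed)
x = mu + sigma * randn(1,1000); % X = mu + sigma*Z with Z ~ N(0,1)
mean(x), var(x)                 % close to 3 and 4
</pre>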
<pre style="font-size:16px">
>>clear all
>>z1=randn(1,1000);    % generate variable from standard normal distribution
>>z2=randn(1,1000);    % generate another variable from standard normal distribution
>>z=[z1;z2];           % stack into a 2 x 1000 matrix of points
>>plot(z(1,:),z(2,:),'.')
</pre>
 
===Universality of the Uniform Distribution/Inverse Method===

The inverse method is universal in the sense that we can potentially sample from any distribution for which we can find the inverse of the cumulative distribution function. However, this is not always possible: some functions do not have an inverse, while for others the inverse may be difficult to find. For such cases there exist different procedures, such as Acceptance-Rejection, which is outlined in a further lecture.
  
 
Procedure:
 
'''Remark'''<br>
1. The preceding can be written algorithmically as
Generate a random number U.
If U<p<sub>0</sub> set X=x<sub>0</sub> and stop.
If U<p<sub>0</sub>+p<sub>1</sub> set X=x<sub>1</sub> and stop.
...

2. If the x<sub>i</sub>, i>=0, are ordered so that x<sub>0</sub><x<sub>1</sub><x<sub>2</sub><... and if we let F denote the distribution function of X, then <math>F(x_{k}) = \sum_{i=0}^{k} p_{i}</math>, and so X will equal x<sub>j</sub> if F(x<sub>j-1</sub>)<=U<F(x<sub>j</sub>)
 
 
 
  
 
'''Example 1'''<br>
 
Similarly if <math> Y = min(X_1,\ldots,X_n)</math> then the cdf of <math>Y</math> is <math>F_Y = 1- </math><math>\prod</math><math>(1- F_{X_i})</math><br>
<br>
'''Method 1:''' Following the above result we can see that in this example, F<sub>X</sub> = x<sup>n</sup> is the cumulative distribution function of the max of n uniform random variables between 0 and 1 (since for U~Unif(0, 1), F<sub>U</sub>(x) = x).<br>
'''Method 2:''' Generate X by drawing n independent samples from U~Unif(0, 1) and taking the max of the n samples to be x. However, the solution given above using the inverse-transform method only requires generating one uniform random number instead of n of them, so it is a more efficient method.
<br>
if 0<u<1/3, x = v

else if u<2/3, x = v<sup>1/2</sup>

else x = v<sup>1/3</sup><br>
  
  
 
'''Matlab Code:'''
<pre style="font-size:16px">
u=rand;       % generate a random variable
v=rand;       % generate another random variable
if u<1/3
x=v;
elseif u<2/3
x=sqrt(v);
else
x=v^(1/3);    % completed from the prose above: else x = v^(1/3)
end
</pre>
For More Details, please refer to http://www.stanford.edu/class/ee364b/notes/decomposition_notes.pdf
  
 
===Fundamental Theorem of Simulation===
 
(Basis of the Accept-Reject algorithm)

Pro: This method allows us to sample from an unknown distribution using an easy-to-sample distribution.
Con: This method may need to reject many points, which is inefficient.

Invert each part of the partial CDF; each partial CDF is obtained by dividing up the original CDF, and over each partial range the draw is uniform.
 
 
Suppose we want to sample from f(x); we can write <math>f(x)=\int _{0}^{f(x)}du</math>.
Thus, f(x) can be thought of as the marginal distribution of (X, U) ∼ U{(x, u): 0 < u < f(x)}.
 
 
Theorem: Simulating X ∼ f(x)is equivalent to simulating (X, U) ∼ U{(x, u):0 <u<f(x)}.
 
  
 
===Question 2===
 
==Class 8 - Thursday, May 30, 2013==

In this lecture, we will discuss algorithms to generate 3 well-known distributions: Binomial, Geometric, and Poisson. For each of these distributions, we will first state its general understanding, probability mass function, expectation, and variance. Then, we will derive one or more algorithms to sample from each of these distributions, and implement the algorithms in Matlab. <br \>
  
 
===The Bernoulli distribution===
  
The Bernoulli distribution is a special case of the binomial distribution, where n = 1. X ~ Bin(1, p) has the same meaning as X ~ Ber(p), where p is the probability of success and 1-p the probability of failure (we usually define a variate q, q = 1-p). Bin(n, p) is the distribution of the sum of n independent Bernoulli trials, Bernoulli(p), each with the same probability p, where 0<p<1. <br>
{| class="wikitable"
|-
! Mean
! Variance
! PMF
! CDF
! MGF
|-
| p
| p(1-p)
| <math>\begin{cases}
    1-p & \text{for }k=0 \\ p & \text{for }k=1
    \end{cases}</math>
| <math>\begin{cases}
    0 & \text{for }k<0 \\ 1-p & \text{for }0\leq k<1 \\ 1 & \text{for }k\geq 1
    \end{cases}
</math>
| <math>1-p+pe^t\,</math>
|}
 
 
For example, let X be the event that a coin toss results in a "head" with probability ''p''; then ''X~Bernoulli(p)''. <br>
P(X=1)=p, P(X=0)=1-p, P(X=0)+P(X=1)=p+q=1
 
when U>p, x=0<br>

3. Repeat as necessary

'''Code'''<br>
<pre style="font-size:16px">
i = 1;

while (i <= 1000)
    u = rand();
    p = 0.1;            # define the prob. p
    if (u <= p)
        x(i) = 1;
    else
        x(i) = 0;
    end
    i = i + 1;
end

hist(x)
</pre>

[[File:Bernoulli.jpg|300px]]
 
[[File:Bernoulli.jpg|300px]]
 
  
 
===The Binomial Distribution===
  
The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.
(Definition is copied from: http://en.wikipedia.org/wiki/Binomial_distribution)
Let X~Bin(n,p); then:
{| class="wikitable"
|-
! Mean
! Variance
! PMF
! CDF
! MGF
|-
| <math> np </math>
| <math> np(1-p) </math>
| (nCx) p<sup>x</sup>(1-p)<sup>(n-x)</sup>, x=0,1,...n<br />Or f(x) = <math>(n!/x!(n-x)!)</math> p<sup>x</sup>(1-p)<sup>(n-x)</sup>, x=0,1,...n <br />
| <math>\sum_{i=0}^{x}{n\choose i}p^i(1-p)^{n-i}</math>
| (1-p+pe<sup>t</sup>)<sup>n</sup>
|}
  
 
Generate n uniform random numbers <math>U_1,...,U_n</math> and let X be the number of <math>U_i</math> that are less than or equal to p.
The logic behind this algorithm is that the Binomial Distribution is simply a summation of '''n''' Bernoulli trials, each with probability of success '''p'''. Thus, we can say equivalently that sampling from a Bin(n, p) distribution is the same as sampling from '''n''' Bernoulli trials. In the example below, we are sampling 1000 realizations from 20 Bernoulli random variables. By summing each column of the 20 by 1000 matrix that is produced, we are summing up 20 Bernoulli outcomes to produce one binomial sample. We have 1000 columns, which means we have realizations of 1000 binomial random variables when this sum is done (the output of the sum is a 1 by 1000 vector).<br />
 
 
 
To continue with the previous example, let X be the number of heads in a series of ''n'' independent coin tosses - where for each toss, the probability of coming up with a head is ''p'' - then ''X~Bin(n, p)''. <br />

MATLAB tips: to sample directly from a Binomial(N,P) distribution, we can use binornd(N,P), where N is the number of trials and P is the probability of success. Logical comparisons are also useful: for a=[2 3 4], setting a<3 produces [1 0 0]; setting "a == 3" produces [0 1 0]. For a=[2 6 9 10], a<4 produces [1 0 0 0], because only the first element (2) is less than 4 while the rest are greater. So we can use this to count the entries that are less than p.<br />
 
  
 
Algorithm for Bernoulli is given as above
Line 2,903: Line 2,619:
 
ans= 1 0 0
 
ans= 1 0 0
  
>>rand(20,1000)   # if we want to generate 20 times; retry the trail 20 times.
+
>>rand(20,1000)
 
>>rand(20,1000)<0.4
 
>>rand(20,1000)<0.4
 
>>A = sum(rand(20,1000)<0.4)  #sum of raws ~ Bin(20 , 0.3)
 
>>A = sum(rand(20,1000)<0.4)  #sum of raws ~ Bin(20 , 0.3)
remark: a=[2 3 4]; if we set a<3, it will produce a=[1 0 0]. If you set "a == 3", it will produce [0 1 0].
Logical indexing like this is useful for picking out the values we want from a matrix.
using code to find some value what i want to get from the matrix. It`s useful to define some matrixs.
  
 
Relation between Bernoulli Distribution and Binomial Distribution:
  
 
===The Geometric Distribution===

The geometric distribution is a discrete distribution. There are two types of geometric distributions:

1. The distribution of the number X of Bernoulli trials (each failing with probability '''1-p''') needed until the first success (probability '''p'''); X belongs to the set {1, 2, 3, ...}. <br />
2. The distribution of the number Y = X - 1 of failures (probability '''1-p''') before the first success (probability '''p'''); Y belongs to the set {0, 1, 2, 3, ...}. <br />
 
  
 
For example,<br />
 
.    .<br />
.    .<br />
n    p(1-p)<sup>(n-1)</sup> (probability of n-1 failures followed by the first success)<br />

For example, suppose a die is thrown repeatedly until the first time a "6" appears. This is a question of the geometric distribution of the number of throws on the set { 1, 2, 3, ... } with p = 1/6.
Line 2,956: Line 2,671:
  
 
The CDF : P(X<n) = 1 - <math>(1-p)^n</math>
 
The CDF : P(X<n) = 1 - <math>(1-p)^n</math>
 
Memorylessness properties : P(X>m+n|X>=m)=P(X>n)
 
  
  
 
If <math>\displaystyle X \sim \text{Poi}(\lambda)</math>, its pmf is of the form <math>\displaystyle \, f(x) = \frac{e^{-\lambda}\lambda^x}{x!}</math> , where <math>\displaystyle \lambda </math> is the rate parameter.<br />
  
Definition: In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

For instance, suppose someone typically gets 4 pieces of mail per day on average. There will, however, be a certain spread: sometimes a little more, sometimes a little less, once in a while nothing at all. Given only the average rate for a certain period of observation (pieces of mail per day, phone calls per hour, etc.), and assuming the underlying process is essentially random, the Poisson distribution specifies how likely it is that the count will be 3, or 5, or 10, or any other number, during one period of observation. That is, it predicts the degree of spread around a known average rate of occurrence. (from Wikipedia)

In short, the Poisson distribution measures the number of occurrences in a particular time interval, given the rate of occurrences per unit time.
  
 
Understanding of Poisson distribution:

If customers arrive at a bank '''independently''' over time, with all inter-arrival times following an exponential distribution with rate <math>\lambda</math> per unit of time, then

X(t) = # of customers in [0,t] ~ Poi<math>(\lambda t)</math>. A counting sketch follows below.
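A minimal MATLAB sketch of this interpretation, counting exponential arrivals in one unit of time (the rate <math>\lambda = 3</math> is assumed for illustration):
<pre style="font-size:16px">
lambda = 3;
n = 0;
t = -log(rand)/lambda;           % first arrival time, Exp(lambda)
while t <= 1
    n = n + 1;
    t = t + (-log(rand)/lambda); % add the next inter-arrival time
end
n                                % one Poi(lambda) realization
</pre>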
  
  
 
=== Beta Distribution ===
The beta distribution is a continuous probability distribution. There are two positive shape parameters (i.e. greater than zero) in this distribution, defined as '''α''' and '''β''', and X falls within the interval [0,1]. The parameter '''α''' is used as an exponent of the random variable. The parameter '''β''' is used to control the shape of the distribution. We use the beta distribution to model the behavior of random variables that are limited to intervals of finite length. For example, we can use the beta distribution to analyze the time allocation of sunshine data and the variability of soil properties.
  
 
If X~Beta(<math>\alpha, \beta</math>) then its p.d.f. is of the form

<math>\displaystyle f(x;\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}</math> for <math>0 \leq x \leq 1</math>, and

<math>f(x;\alpha,\beta)= 0 </math> otherwise.

Note: <math>\Gamma(\alpha)=(\alpha-1)! </math> if <math>\alpha</math> is a positive integer.
However, several other authors, including W. Feller, choose to exclude the endpoints x = 0 and x = 1 (so that the two ends are not part of the density function) and consider instead 0 < x < 1.
 
Another notation for beta-distributed random variables is X~Be(<math>\alpha, \beta</math>).
 
  
{| class="wikitable"
|-
! Mean
! Variance
|-
| <math>\frac{\alpha}{\alpha + \beta}</math>
| <math>\frac{\alpha\beta}{(\alpha+\beta)^2 (\alpha + \beta + 1)}</math>
|}
 
 
When <math> \alpha = \beta </math>, the variance of the beta distribution decreases monotonically as <math> \alpha = \beta </math> increases.
The mode of a Beta distributed random variable X with α, β > 1 is <math>\frac{\alpha-1}{\alpha + \beta-2}</math>.
 
  
 
The formula for the cumulative distribution function of the beta distribution is also called the incomplete beta function ratio (commonly denoted by <math>I_x</math>) and is defined as <math>F(x) = I_x(\alpha,\beta)</math>.

For <math>\alpha = \beta = 1</math>, the p.d.f. reduces to
:<math>f(x) = \frac{\Gamma(2)}{\Gamma(1)\Gamma(1)}x^{0}(1-x)^{0} = 1 </math><br>
  
Note that by definition, 0! = 1. <br>

Hence, the distribution is:<br>

:<math>\displaystyle \text{Beta}(1,1) = U (0, 1) </math><br>

'''MATLAB Code for generating Beta Distribution'''

<pre style='font-size:16px'>
>>Y1 = sum(-log(rand(10,1000)));            %Gamma(10,1): sum 10 exponentials for each of the 1000 samples
>>Y2 = sum(-log(rand(5,1000)));             %Gamma(5,1): sum 5 exponentials for each of the 1000 samples

%NOTE: here, lambda is 1, since the scale parameters for Y1 & Y2 are both 1

>>Y = Y1./(Y1+Y2);                          %the ratio of the two Gammas is ~Beta(10,5)
>>hist(Y)                                   %Do this to check that the shape fits Beta(10,5)

>>disttool                                  %Check the beta plot. We can change alpha and beta here.
</pre>

[[File:325px-Beta_distribution_pdf.png|300px]]

[[File:untitled.jpg|300px]]<br />

MATLAB tips: rand(10,1000) produces a 10*1000 matrix whose elements follow the uniform distribution, while sum(rand(10,1000)) sums each column and produces a 1*1000 row vector.

====Case 1====
If the <math>x_1, x_2, \cdots, x_d</math>'s are independent, then<br/>
<math>f(x) = f(x_1,\cdots, x_d) = f(x_1)\cdots f(x_d)</math><br/>
We can sample from each component <math>x_1, x_2,\cdots, x_d</math> individually, and then form a vector.<br/>

Based on the property of independence, we can derive the pdf or pmf of <math>x=(x_1,x_2,\cdots,x_d)</math>; a small sketch of this componentwise approach is given below.
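The following is a minimal sketch of Case 1, assuming (purely for illustration) a 2-dimensional vector whose first component is Exp(1) and whose second component is U(0,1); the components are sampled independently and stacked into one vector per column.

<pre style='font-size:16px'>
n = 1000;                       % number of sample vectors (assumed)
x = zeros(2,n);
x(1,:) = -log(rand(1,n));       % first component: Exp(1) via inverse transform
x(2,:) = rand(1,n);             % second component: U(0,1)
scatter(x(1,:), x(2,:))         % each column of x is one sampled vector
</pre>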
  
 
====Case 2====

Algorithm: <br/>
1)  for i = 1 to d <br/>
2)    U<sub>i</sub> ~ U(0,1)  (we want to translate this to U(a<sub>i</sub>,b<sub>i</sub>)) <br/>
3)    x<sub>i</sub> = a<sub>i</sub> + U<sub>i</sub>(b<sub>i</sub>-a<sub>i</sub>) <br/>
4)  end <br/>
  
 
<pre style='font-size:16px'>
% (create this in the editor: File -> New -> Function)
function x = urectangle (d,n,a,b)   %d: dimension; n: number of samples; a: lower bound; b: upper bound

for ii = 1:d;
    u(ii,:) = rand(1,n);
    x(ii,:) = a + u(ii,:)*(b-a);
    %keyboard                       %"keyboard" makes the function stop at this step so you can evaluate the variables
end
  
  
>>x=urectangle(2, 100, 2, 5);
>>scatter(x(1,:),x(2,:))               %plot the sampled points

>>x=urectangle(2, 10000, 2, 5);        %generate 10000 points (instead of 100)
>>x=urectangle(3, 10000, 2, 5);        %changed to 3-dimensional
>>scatter3(x(1,:), x(2,:), x(3,:))
>>axis square
</pre>
  
 
Suppose we sample uniformly from a target region G inside a larger region W. Let A<sub>W</sub> and A<sub>G</sub> denote the areas of W and G, so that g(x)=1/A<sub>W</sub> on W and f(x)=1/A<sub>G</sub> on G.

This is the picture of the example:
[[File:Untitled.jpg]]

MATLAB code:
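The course code for this example is not reproduced here, so the following is a minimal sketch of the idea under assumed shapes: W is taken to be the square [0,1]×[0,1] and G the disk of radius 0.5 centred at (0.5,0.5); points drawn uniformly from W are kept only if they land in G.

<pre style='font-size:16px'>
n = 10000;                          % number of proposals (assumed)
u = rand(2,n);                      % uniform samples over the square W = [0,1]^2
inG = (u(1,:)-0.5).^2 + (u(2,:)-0.5).^2 <= 0.25;   % inside the assumed disk G
x = u(:,inG);                       % accepted points are uniform on G
scatter(x(1,:), x(2,:))
axis square
</pre>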
  
 
==Class 10 - Thursday June 6th 2013 ==
MATLAB code for using the Acceptance-Rejection Method to sample from a d-dimensional unit ball:

<pre style='font-size:16px'>
function output = Unitball(d,n)

u = rand(d,n);     %U1~U(0,1),...,Ud~U(0,1)
z = 1 - 2*u;       %each zi = 1 - 2*Ui is uniform on [-1,1]
R = sum(z.^2);     %R = sum of zi^2, i=1,...,d
jj=1;
</pre>

>> scatter(data(1,:), data(2,:))    %plot 2d graph
  
R(ii) determines whether the generated random coordinates fall within the unit ball. In 2-D we have a random x and y, so if x^2+y^2 <= 1 the point falls within the unit ball and we increase our count by 1.

x(:,jj) means all the numbers in the jj-th column.

Execution:

<pre style='font-size:16px'>
>>[x]=Unitball(2,10000);      %the larger the sample size (e.g. 10000), the better the estimate
>>scatter(x(1,:),x(2,:));     %plot 2D circle
>>axis square;                %make the x-y axes the same size

ans =

           2        7839      %7839 of the 10000 points were accepted

>>scatter(x(1,:),x(2,:))
</pre>
 
<pre style='font-size:16px'>
>>c=7839/10000                %efficiency of the acceptance: Efficiency = points accepted / total points

c =

     0.7839

>>4*.7839                     %in 2-D, efficiency ~ (area of circle)/(area of square) = pi/4, so 4c estimates pi

ans =

          3.1356

>>pi

ans =      3.1416
</pre>

=== Efficiency ===

In statistics, efficiency is a term used in the comparison of various statistical procedures; in particular, it refers to a measure of the optimality of an estimator, of an experimental design, or of a hypothesis testing procedure. Essentially, a more efficient estimator, experiment or test needs fewer samples than a less efficient one to achieve a given performance.

In the above example, the efficiency of the vector A/R method is equal to the ratio of the volume of the d-dimensional unit ball to the volume of the enclosing hypercube; this ratio shrinks quickly as the dimension d grows, so the method becomes inefficient in high dimensions.
  
 
<span style="color:red;padding:0 auto;"><br>The end of midterm coverage</span>

<span style="font-family:cursive, sans-serif;text-shadow:3px 3px 3px #330000;font-size:150%;font-variant:small-caps;">Good luck on the midterm</span>
==== Stochastic Process ====
The basic idea of a stochastic process (also called a random process) is a collection of random variables.

'''Definition:''' In probability theory, a stochastic process /stoʊˈkæstɪk/, or sometimes random process (widely used), is a collection of random variables; this is often used to represent the evolution of some random value, or system, over time. This is the probabilistic counterpart to a deterministic process (or deterministic system). Instead of describing a process which can only evolve in one way (as in the case, for example, of solutions of an ordinary differential equation), in a stochastic or random process there is some indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely many) directions in which the process may evolve. (from Wikipedia)

In other words, a stochastic process is non-deterministic: there is some indeterminacy in the final state, even if the initial condition is known.
  
 
We can illustrate this with an example of speech: if "I" is the first word in a sentence, the set of words that could follow would be limited (e.g. like, want, am), and the same happens for the third word and so on. The words then have some probabilities among them such that each of them is a random variable, and the sentence would be a collection of random variables. <br>
Also, different stochastic processes have different properties.

In this course, we study two stochastic process models:

1. Poisson Process - This is a continuous time counting process that satisfies a couple of properties that are listed in the next section. The Poisson process is understood to be a good model for independent events such as incoming phone calls, number of traffic accidents, and goals during a game of hockey or soccer. It is also an example of a birth-death process.<br>
2. Markov Process - This is a stochastic process that satisfies the Markov property, which can be understood as the memory-less property: the jump to a future state depends only on the current state of the process, and not on the process's history. This model is used to model random walks exhibited by particles, the health state of a life insurance policyholder, decision making by a memory-less mouse in a maze, etc. <br>

A stochastic process always has a state space, and the index set limits the range. For instance, in a stock market, the set of all non-negative numbers is the state space, while <math>x_t</math> are individual stock prices.

We can easily simulate a stochastic process by simulating a sequence of random variables. For instance, to simulate the first t time units of a renewal process having inter-arrival distribution F, we can simulate independent random variables X1, X2, ... having distribution F, stopping at
            N = min{n: X1+...+Xn > t}.
The Xi, i>=1, represent the inter-arrival times of the renewal process, so the preceding simulation yields N-1 events by time t -- the events occurring at times X1, X1+X2, ..., X1+...+X(N-1). A short simulation sketch of this construction is given below.
(From the textbook: Introduction to Probability Models, 10th Edition, Sheldon M. Ross)
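The following is a minimal sketch of that renewal simulation, assuming (for illustration only) that F is the Exp(2) distribution and t = 10; it collects the event times X1, X1+X2, ... up to time t.

<pre style='font-size:16px'>
lambda = 2; t = 10;                 % assumed inter-arrival rate and horizon
times = [];                         % event times X1, X1+X2, ...
S = -log(rand)/lambda;              % S = X1 ~ Exp(lambda)
while S <= t
    times(end+1) = S;               % record an event
    S = S + (-log(rand)/lambda);    % add the next inter-arrival time
end
length(times)                       % N-1 events occurred by time t
</pre>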
  
 
==== Poisson Process ====

E[N<sub>t</sub>] = <math>\lambda t</math> and Var[N<sub>t</sub>] = <math>\lambda t</math>
  
==== Multivariate Normal Example ====
<br />
'''How to generate a multivariate normal with the built-in function "randn": (example)'''<br />

<pre style='font-size:16px'>
X = Z*R + ones(n,1)*mu';
</pre>
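Only the last line of the class code survives above, so here is a self-contained sketch of the same construction; the mean vector mu and covariance Sigma below are assumed values, and R is the Cholesky factor with Sigma = R'R, so that each row of X is N(mu, Sigma).

<pre style='font-size:16px'>
n = 1000; d = 2;                 % assumed sample size and dimension
mu = [1; 2];                     % assumed mean vector
Sigma = [1 0.5; 0.5 1];          % assumed covariance matrix
R = chol(Sigma);                 % upper-triangular R with R'*R = Sigma
Z = randn(n,d);                  % iid standard normals
X = Z*R + ones(n,1)*mu';         % each row of X is N(mu, Sigma)
scatter(X(:,1), X(:,2))
</pre>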
==== '''Central Limit Theorem''' ====

<pre style='font-size:16px'>
>> X = exprnd (20,20,1000);  % 1000 instances of 20 exponential random numbers with mean 20
>> hist(X(1,:))              % a single row is exponential
>> hist(sum(X(1:2,:)))       % sum of the first 2 rows
...
>> hist(sum(X(1:20,:)))      % sum of all 20 rows -> approaches normal

>> u=exprnd(200,1000);       % 1000x1000 exponential random numbers with mean 200
>> hist(u(1,:))              % each row of u is exponential
>> hist(sum(u(1:200,:)))     % the sum of 200 rows is approximately normal
</pre>
  
'''Theorem: Central Limit Theorem'''
Let <math>X_1, ..., X_n</math> be iid random variables such that <math>E(X_i)=\mu</math> and <math> Var(X_i)=\sigma^2</math>, and let <math> \bar{X} = n^{-1} \left ( \sum_{i=1}^n X_i \right ) </math>. <br> Then <math> \ \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \xrightarrow{d}\ N(0,1)</math>

==Class 11 - Tuesday,June 11, 2013==
 
 
===Poisson Process===
A discrete stochastic variable ''X'' is said to have a Poisson distribution with parameter ''λ'' > 0 if
:<math>\!f(n)= \frac{\lambda^n e^{-\lambda}}{n!}  \qquad n= 0,1,\ldots,</math>.

'''Definition'''<br>
The number of arrivals N(t) in a time interval of length t follows a Poisson distribution with mean <math>\lambda t</math>, i.e.<br>
<math>P(N(t)=n) = \frac{e^{-\lambda t} (\lambda t)^n}{n!}</math>

In probability theory, a Poisson process is a stochastic process which counts the number of events and the times at which these events occur in a given time interval. The time between each pair of consecutive events has an exponential distribution with parameter λ, and each of these inter-arrival times is assumed to be independent of the other inter-arrival times. The process is named after the French mathematician Siméon-Denis Poisson and is a good model of radioactive decay, telephone calls and requests for a particular document on a web server, among many other phenomena. (from Wikipedia)
 
  
 
'''Properties of Homogeneous Poisson Process'''<br>
(a) '''Independence:''' The numbers of arrivals in non-overlapping intervals are independent.  <br>
(b) '''Homogeneity or Uniformity:''' The number of arrivals in each interval (a,b] has a Poisson distribution with mean <math>\lambda (b-a)</math>.<br/>
(c) '''Individuality:'''  For a sufficiently short time period of length h, the probability of 2 or more events occurring in the interval is close to 0; formally it is <math>o(h)</math>.<br>

'''Notation'''<br>
N<sub>t</sub> denotes the number of arrivals up to t, i.e. in (0,t]. <br>
N(a,b] = N<sub>b</sub> - N<sub>a</sub> denotes the number of arrivals in the interval (a, b]. <br>
  
  

Similarly, the probability of not observing an arrival in this interval is approximately 1 - <math>\lambda</math>h.<br>

*Note: Recall that an exponential random variable is the waiting time until the first event of interest occurs.<br>
In other words, the inter-arrival times are independent and follow an Exponential distribution with mean 1/λ.

'''Generate a Poisson Process'''<br />

<math>U_n \sim U(0,1)</math><br>
<math>T_n-T_{n-1}=-\frac {1}{\lambda} \log(U_n)</math><br>

1. Set T<sub>0</sub>=0 and n=1<br />

2. Generate U<sub>n</sub> ~ U(0,1)<br />

3. T<sub>n</sub> = T<sub>n-1</sub> <math> -\frac {1}{\lambda} </math>  log (U<sub>n</sub>) (declare an arrival)<br />

4. If T<sub>n</sub>>T stop<br />
&nbsp;&nbsp;&nbsp;&nbsp;else<br />
&nbsp;&nbsp;&nbsp;&nbsp;n=n+1, go to step 2<br />

The inter-arrival times of the process are independent Exp(<math>\lambda</math>) random variables, and the inverse-transform method generates each one as <math>T_n-T_{n-1} = -\frac {1}{\lambda} \log(U_n)</math>.<br>
*Note: Recall that an exponential random variable is the waiting time until the first event of interest occurs.
  
 
'''Review of Poisson - Example'''

When we use the inverse-transform method, we treat the inter-arrival times of the Poisson process as exponential random variables and obtain them from the inverse of the exponential CDF; the interval length h is assumed to be very small.
 +
'''Multi-dimensional Poisson Process'''<br>

The Poisson distribution arises as the distribution of counts of occurrences of events in (multidimensional) intervals in a multidimensional Poisson process, in a directly equivalent way to the result for unidimensional processes. That is, if ''D'' is any region of the multidimensional space for which |D|, the area or volume of the region, is finite, and if ''N''(''D'') is the count of the number of events in ''D'', then

<math> P(N(D)=k)=\frac{(\lambda|D|)^k e^{-\lambda|D|}}{k!} .</math>
  
 
=== Generating a Homogeneous Poisson Process ===

1) Set T<sub>0</sub> = 0 and n = 1 <br>
2) U<sub>n</sub> ~ U(0,1) <br>
3) T<sub>n</sub> - T<sub>n-1</sub> = <math> -\frac {1}{\lambda} </math>  log (U<sub>n</sub>)    (declare an arrival; repeat this process until T<sub>n</sub> > T)<br>
4) If T<sub>n</sub> > T stop;
else n = n + 1, go to step 2 (generate another uniform random number)<br>

h is a small range, and we assume the probability of every point in this range is the same (uniform), because h is small.

<b>Higher Dimensions:</b><br>
The multidimensional result stated above applies here as well, so to sample from a higher dimensional Poisson process:<br>
1. Generate a random number N that is Poisson distributed with parameter <math>{\lambda}</math>·A<sub>d</sub>, where A<sub>d</sub> is the size of the bounded region (i.e. A<sub>2</sub> is the area of the region, A<sub>3</sub> is the volume of the 3-d space).<br>
2. Given N, scatter the N points independently and uniformly over the region (a sketch of this is given below).<br>
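Below is a minimal sketch of this two-step recipe for d = 2, assuming (for illustration) the unit square as the region and rate λ = 5, so N ~ Poi(5·1).

<pre style='font-size:16px'>
lambda = 5;                       % assumed rate
A = 1;                            % area of the assumed region [0,1]x[0,1]
N = poissrnd(lambda*A);           % step 1: Poisson number of points
pts = rand(2,N);                  % step 2: N points uniform on the region
scatter(pts(1,:), pts(2,:))
axis square
</pre>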

MATLAB code for generating a homogeneous Poisson process on [0, TT]; only the loop survives in the notes, so the initialization of lambda, ii and T(1) below is a reconstruction under assumed values:

<pre style='font-size:16px'>
lambda = 2;                        % arrival rate (assumed value)
ii = 1;
T(1) = 0;                          % T0 = 0
TT = 5;                            % time horizon

while T(ii) <= TT                  % we want Tn > TT
    u = rand;
    ii = ii + 1;
    T(ii) = T(ii-1) - log(u)/lambda;   % add an Exp(lambda) inter-arrival time
end

plot(T, '.')
</pre>

<math> x_1 \rightarrow x_2\rightarrow...\rightarrow x_n</math>
  
===Formal Definition===
The process <math> \{x_n: n \in T\} </math> is a Markov chain if:<br />
<math> Pr(x_n|x_{n-1},...,x_1) = Pr(x_n|x_{n-1}) \ \ \forall n\in T </math> and <math> \forall x\in X</math>
 
The possible values of X<sub>i</sub> form a countable set S called the state space of the chain.
Markov chains are often described by a directed graph, where the edges are labeled by the probabilities of going from one state to the other states.

<span style="background:#F5F5DC">CONTINUOUS TIME MARKOV PROCESS</span>

====Transition Matrix====
Definition: A Markov transition matrix is a square matrix describing the probabilities of moving from one state to another in a dynamic system. Each row contains the probabilities of moving from the state represented by that row to the other states, so the rows of a Markov transition matrix each add to one. <br />
A transition matrix is used to describe the transitions of a Markov chain; each of its entries is a non-negative real number representing a probability.

Transition Probability: <math> P_{ij} = P(X_{t+1} =j | X_t =i) </math> is the one-step transition probability from state i to state j.

This means our model can be simulated as a sequence of random variables <math> (X_0, X_1, X_2, \ldots ) </math> with state space <math> \Omega </math> and transition matrix <math> P = [P_{ij}] </math> where <math> \forall t \in \N, 0 \leq s \leq t+1, x_s \in \Omega, </math> <br/>
we have the following property (Markov property): <br/>
<math> P(X_{t+1}= x_{t+1} \vert \cap^{t}_{s=0} X_s = x_s) = P(X_{t+1} =x_{t+1} \vert X_t =x_t) = P(x_t,x_{t+1}) </math> <br>
  

Then one might consider the periodicity of the chain and derive a notion of cyclic behavior. <br>
<br>
Example of a doubly stochastic matrix:<br>
Consider the following probability transition matrix, where q = 1 - p:<br>
<math> P= \left [ \begin{matrix}
0 & p & q \\
q & 0 & p \\
p & q & 0
\end{matrix} \right] </math>

Each row sums to 1 and each column also sums to 1; such a probability transition matrix is called a doubly stochastic matrix.<br>
For all stochastic matrices, each row always sums to 1, but the columns need not.
 
  
 
=== Examples of Transition Matrix ===
<div style="border:1px red solid">
'''Example 0''' (from Wikipedia)
 
A [[state diagram]] for a simple example is shown in the figure on the right, using a directed graph to picture the state transitions.  The states represent whether a hypothetical stock market is exhibiting a [[Market trend#Bull market|bull market]], [[Market trend#Bear market|bear market]], or stagnant market trend during a given week.  According to the figure, a bull week is followed by another bull week 90% of the time, a bear week 7.5% of the time, and a stagnant week the other 2.5% of the time. Labelling the state space {1&nbsp;=&nbsp;bull, 2&nbsp;=&nbsp;bear, 3&nbsp;=&nbsp;stagnant} the [[transition matrix]] for this example is
 
:<math>P = \begin{bmatrix}
 
0.9 & 0.075 & 0.025 \\
 
0.15 & 0.8 & 0.05 \\
 
0.25 & 0.25 & 0.5
 
\end{bmatrix}.</math>
 
The distribution over states can be written as a [[stochastic row vector]] ''x'' with the relation ''x''<sup>(''n''&nbsp;+&nbsp;1)</sup>&nbsp;=&nbsp;''x''<sup>(''n'')</sup>''P''. So if at time ''n'' the system is in state 2&nbsp;(bear), then three time periods later, at time ''n''&nbsp;+&nbsp;3 the distribution is
 
:<math>\begin{align}
 
x^{(n+3)} &= x^{(n+2)} P = \left(x^{(n+1)} P\right) P \\\\
 
  &= x^{(n+1)} P^2 = \left( x^{(n)} P^2 \right) P\\
 
  &= x^{(n)} P^3 \\
 
  &= \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \begin{bmatrix}
 
0.9 & 0.075 & 0.025 \\
 
0.15 & 0.8 & 0.05 \\
 
0.25 & 0.25 & 0.5
 
\end{bmatrix}^3 \\
 
  &= \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \begin{bmatrix}
 
0.7745 & 0.17875 & 0.04675 \\
 
0.3575 & 0.56825 & 0.07425 \\
 
0.4675 & 0.37125 & 0.16125 \\
 
\end{bmatrix} \\
 
& = \begin{bmatrix} 0.3575 & 0.56825 & 0.07425 \end{bmatrix}.
 
\end{align}</math>
 
Using the transition matrix it is possible to calculate, for example, the long-term fraction of weeks during which the market is stagnant, or the average number of weeks it will take to go from a stagnant to a bull market. Using the transition probabilities, the steady-state probabilities indicate that 62.5% of weeks will be in a bull market, 31.25% of weeks will be in a bear market and 6.25% of weeks will be stagnant, since:
 
 
<math>\lim_{N\to \infty } \, P^N=
 
\begin{bmatrix}
 
0.625 & 0.3125 & 0.0625 \\
 
0.625 & 0.3125 & 0.0625 \\
 
0.625 & 0.3125 & 0.0625 \\
 
\end{bmatrix}</math>
 
 
A thorough development and many examples can be found in the on-line monograph
 
Meyn & Tweedie 2005.<ref name=MCSS>S. P. Meyn and R.L. Tweedie, 2005.  [https://netfiles.uiuc.edu/meyn/www/spm_files/book.html Markov Chains and Stochastic Stability].
 
Second edition to appear, Cambridge University Press, 2008.</ref>
 
 
The appendix of Meyn 2007,<ref name=CTCN>S. P. Meyn, 2007.  [http://decision.csl.uiuc.edu/~meyn/pages/CTCN/CTCN.html Control Techniques for Complex Networks], Cambridge University Press, 2007.</ref> also available on-line, contains an abridged Meyn & Tweedie.
 
 
A [[finite state machine]] can be used as a representation of a Markov chain. Assuming a sequence of [[Independent and identically distributed random variables|independent and identically distributed]] input signals (for example, symbols from a binary alphabet chosen by coin tosses), if the machine is in state ''y'' at time ''n'', then the probability that it moves to state ''x'' at time ''n''&nbsp;+&nbsp;1 depends only on the current state.
 
<br>
'''Example 1'''

[[File:Mark13.png]]

<math> \begin{align} P(X_{1} &=0 &\mid X_{0} &=0) * P(X_{2} &=1 &\mid X_{1} &=0)+P(X_{1} &=1 &\mid X_{0} &=0) * P(X_{2} &=1 &\mid X_{1}&=1) &=1/3*2/3+ 2/3*1/4 &=7/18 \\
\end{align}</math><br />
</div>
 
 
'''Example 2'''
 
[[File:Tranmatrix.png]] <br>
 
The transition matrix in this case would be: <br>
 
<math> P=\left [ \begin{matrix} 0.9 & 0.1 & 0 \\ 0 & 1 & 0 \\ 0.3 & 0.7 & 0 \\ \end{matrix}\right] </math>. Notice the "1" entry at <math> P_{2,2} </math>, even though the image doesn't show it. This is because there is no way of getting out of state 2, hence the probability of staying in state 2 is 1 (can't get out).
 
  
== Class 12 - Thursday,June 13, 2013 ==
<b>Time</b>
Jun 17, 2013 2:30 PM - 3:30 PM
 
===Midterm Review===
  

===Acceptance-Rejection Method===
<math>c g(x) \geq f(x)</math>
<math>c=\max_x \left[\frac{f(x)}{g(x)}\right]</math>
<br><math>\frac{1}{c}</math> is the efficiency of the method/probability of acceptance
  
Line 4,151: Line 3,781:
 
Gamma(t,λ) <br>
 
Gamma(t,λ) <br>
 
t: The number of exponentials and the shape parameter<br>
 
t: The number of exponentials and the shape parameter<br>
1/λ: The mean of the exponentials and the scale parameter<br>  
+
λ: The mean of the exponentials and the scale parameter<br>  
  
 
Also, Gamma(t,λ) can be expressed into a summation of t exp(λ).<br>
 
Also, Gamma(t,λ) can be expressed into a summation of t exp(λ).<br>

<br />
X ~ Bin(n,p)<br/>
1. Generate n independent uniforms: U1, U2, ... Un ~ U(0,1)<br/>
2. <math> X= \sum^{n}_{1} I(U_i \leq p) </math>, where <math>I(U_i \leq p)</math> is an indicator for a successful trial.<br/>
Return to 1 (a one-line sketch of this is given below)<br/>
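A minimal sketch of the binomial algorithm above, assuming n = 10 and p = 0.3 purely for illustration:

<pre style='font-size:16px'>
n = 10; p = 0.3;              % assumed parameters
X = sum(rand(n,1) <= p)       % count the successful trials: X ~ Bin(n,p)
</pre>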

:<math>\displaystyle \text{Beta}(1,1) = U (0, 1) </math><br>
(generate a uniform directly)

:<math>\displaystyle \text{Beta}(\alpha,1)={f}(x) = \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)\Gamma(1)}x^{\alpha-1}(1-x)^{1-1}=\alpha x^{\alpha-1}</math><br>
(use the inverse method to generate; a sketch is given below)
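For Beta(<math>\alpha</math>,1) the CDF is <math>F(x)=x^{\alpha}</math> on [0,1], so the inverse method gives <math>X = U^{1/\alpha}</math>. A minimal sketch, assuming <math>\alpha</math> = 3:

<pre style='font-size:16px'>
alpha = 3;                    % assumed shape parameter
u = rand(1,1000);             % U(0,1) samples
x = u.^(1/alpha);             % inverse CDF: F^-1(u) = u^(1/alpha), so x ~ Beta(alpha,1)
hist(x)
</pre>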
  
 
'''Algorithm'''<br\>

===Geometric===
This distribution models the number of failures before the first success; a sketch of an inverse-transform generator is given below.

X~Geo(p)
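A minimal sketch of generating Geo(p) (the number of failures before the first success) by inverting the CDF, assuming p = 0.4:

<pre style='font-size:16px'>
p = 0.4;                          % assumed success probability
u = rand(1,1000);
x = floor(log(u)./log(1-p));      % inverse transform: X ~ Geo(p), counting failures
hist(x)
</pre>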
  

===Poisson===

This distribution models the number of times an event occurs in a given time period.

In probability theory and statistics, the Poisson distribution (pronounced [pwasɔ̃]), named after the French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

The number of events that happen today, the number that happen tomorrow, and so on, form a sequence; such a sequence of counts is called a Poisson process.

 
 
X~Poi<math>(\lambda)</math> <br>
X is the maximum number of iid Exp(<math>\lambda</math>) random variables whose sum is less than or equal to 1.<br>
<math>  = \max\{n: \sum\limits_{i=1}^n \frac{-1}{\lambda} \log(U_i) \leq 1 , U_i \sim U[0,1]\}</math><br>
<math>  = \max\{n: \prod\limits_{i=1}^n U_i \geq e^{-\lambda}, U_i \sim U[0,1]\}</math><br>
Note: <br>
if <math>u_1 \geq e^{-\lambda}</math>, continue (n=1);<br>
if <math>u_1 u_2 \geq e^{-\lambda}</math>, continue (n=2);<br>
if <math>u_1 u_2 u_3 < e^{-\lambda}</math>, stop and return X = 2.<br>
 
 
'''Algorithm'''<br\>
*1. Set n=1, a=1<br\>
*2. Generate <math>U_n \sim U(0,1)</math> and set a = a·U<sub>n</sub><br\>
*3. If <math>a \geq e^{-\lambda}</math>, set n = n+1 and go to step 2; otherwise, deliver X = n-1<br\>
(Steps 2 and 3 restate the product formula above; a short MATLAB sketch follows.)
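A minimal sketch of the product algorithm, assuming λ = 4:

<pre style='font-size:16px'>
lambda = 4;                   % assumed rate
n = 0; a = 1;
while a >= exp(-lambda)       % multiply uniforms until the product drops below e^(-lambda)
    a = a * rand;
    n = n + 1;
end
X = n - 1                     % X ~ Poi(lambda)
</pre>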

An alternate way to write an algorithm for Poisson is as follows:
  
1)  x = 0, F = <math>P(X=0) = e^{-\lambda} = p</math>

2)  Generate <math>U \sim U(0,1)</math>

3)  If <math>U < F</math>, deliver X = x

4)  Else <math>p = \frac{p\lambda}{x+1} </math>

     F = F + p

     x = x + 1, go to 3)
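A minimal sketch of this inverse-transform algorithm, assuming λ = 4:

<pre style='font-size:16px'>
lambda = 4;                   % assumed rate
u = rand;
x = 0;
p = exp(-lambda);             % P(X=0)
F = p;                        % running CDF
while u >= F                  % walk up the CDF until it passes u
    p = p * lambda / (x+1);   % P(X = x+1) from P(X = x)
    F = F + p;
    x = x + 1;
end
X = x                         % X ~ Poi(lambda)
</pre>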

== Class 13 - Tuesday June 18th 2013 ==

=== N-step Transition Matrix ===

<b>Definition</b>: The <b>n-step transition matrix</b> is the matrix <math>P_n</math> whose elements are the probabilities of moving from state <math>i</math> to state <math>j</math> in <math>n</math> steps: <br/><math>P_n(i,j)=Pr(X_{m+n}=j|X_m=i)</math> is the n-step transition probability.<br/>

One-step transition probability:<br/>
The probability of <math>X_{n+1}</math> being in state j, given that X<sub>n</sub> is in state i, is called the one-step transition probability and is denoted by <math>P_{i,j}(n,n+1)</math>. That is <br/>
<math>P_{i,j}(n,n+1) = Pr(X_{n+1}=j|X_n=i)</math><br/>

Two-step transition probability:<br/>
The probability of moving from state a to state a in two steps: <br/>
<math>P_2(a,a)=Pr(X_{m+2}=a| X_m=a)=Pr(X_{m+1}=a| X_m=a)Pr(X_{m+2}=a|X_{m+1}=a)+ Pr(X_{m+1}=b|X_m=a)Pr(X_{m+2}=a|X_{m+1}=b)</math> <br/>

In general, <math>P_n = P^n</math> with <math>P_n(i,j) \geq 0</math> and <math>\sum_{j} P_n(i,j) = 1</math>.<br/>
The equation above is a special case of the Chapman-Kolmogorov equations. It holds because of the Markov property, the memoryless property of Markov chains: the probability of moving to the next state depends only on the current state, not on the previous states. By intuition, we can multiply the 1-step transition matrix n times to get an n-step transition matrix.<br/>
 
<div style="border:5px pink solid">
'''Example from previous class:''' <br/>

<math> P= \left [ \begin{matrix}
0.7 & 0.3 \\
0.2 & 0.8
\end{matrix} \right] </math>

The two-step transition probability matrix is:

<math> P_2 = P P= \left [ \begin{matrix}
0.7 & 0.3 \\
0.2 & 0.8
\end{matrix} \right] \left [ \begin{matrix}
0.7 & 0.3 \\
0.2 & 0.8
\end{matrix} \right] =\left [ \begin{matrix}
0.7(0.7)+0.3(0.2) & 0.7(0.3)+0.3(0.8)              \\
0.2(0.7)+0.8(0.2) & 0.2(0.3)+0.8(0.8)
\end{matrix} \right] =\left [ \begin{matrix}
0.55 &  0.45                  \\
0.30  & 0.70
\end{matrix} \right] </math><br\>

In particular, the two-step transition probability of moving from state a to state a is<br/>
<math>P_2(a,a) =0.7(0.7)+0.3(0.2)=0.55 </math><br/>
 
0.55 &  0.45                  \\
 
0.3  & 0.7
 
\end{matrix} \right] </math><br\>
 
 
 
<math>P_2 = P_1 P_1 </math><br\>
 
 
 
Therefore,
 
<math>P_n = P_1^n </math><br\>
 
 
 
 
 
'''Example:''' <br>
 
We can see how <math>P_n = P^n</math> from the following:<br/>
 
 
<math>\mu_1=\mu_0\cdot P</math> <br/>
 
<math>\mu_1=\mu_0\cdot P</math> <br/>
 
<math>\mu_2=\mu_1\cdot P</math> <br/>
 
<math>\mu_2=\mu_1\cdot P</math> <br/>
 
<math>\mu_3=\mu_2\cdot P</math> <br/>
 
<math>\mu_3=\mu_2\cdot P</math> <br/>
 
 
 
Therefore,  
 
Therefore,  
 
<br/>
 
<br/>
Line 4,405: Line 3,990:
 
</math> <br/>
 
</math> <br/>
  
<math>P_n(i,j)</math> is called N-steps Transition Probability. <br>
+
<math>P_n(i,j)</math> is called n-steps transition probability. <br>
<math>\mu_0 </math> is called the '''initial distribution'''. <br>
<math>\mu_n = \mu_0 P^n </math> <br />
<br>
 
  
'''Example 1:''' <br>
Consider a two-state Markov chain {<math>X_t; t = 0, 1, 2,...</math>} with states {1,2} and transition probability matrix
  

<math> P= \left [ \begin{matrix}
1/2 & 1/2 \\
1/3 & 2/3
\end{matrix} \right] </math>

Given <math> X_0=1 </math>, compute the following:

b)<math> P(X_2=1, X_1=1 |X_0=1) = P(X_2=1|X_1=1)*P(X_1=1|X_0=1)= 1/2 * 1/2 = 1/4 </math>

c)<math> P(X_2=1|X_0=1)= P_2(1,1) = 5/12 </math> (start from state 1 and return to it in 2 steps: 1->1->1 = (1/2)*(1/2), or 1->2->1 = (1/2)*(1/3))
  
 
d)<math> P^2=P*P= \left [ \begin{matrix}
5/12 & 7/12 \\
7/18 & 11/18
\end{matrix} \right] </math>
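A quick MATLAB check of parts c) and d), using the transition matrix from this example:

<pre style='font-size:16px'>
P = [1/2 1/2; 1/3 2/3];       % transition matrix from the example
P2 = P^2                      % two-step transition matrix
P2(1,1)                       % = 5/12, matching part c)
</pre>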
  
'''Example 2:''' <br>
Consider a 3-state Markov chain {<math>X_t; t = 0, 1, 2,...</math>} with states {1,2,3} and transition probability matrix

<math> P= \left [\begin{matrix}
1/4 & 1/2 & 1/4 \\
1/3 & 1/3 & 1/3 \\
1/7 & 2/7 & 4/7
\end{matrix} \right] </math>

Given <math> X_0=1 </math>. Compute the following:
 
 
 
a)<math> P(X_2=1, X_1=3 | X_0 = 2) = P(X_1=3|X_0=2)*P(X_2=1|X_1=3)= 1/3 * 1/7 = 1/21 </math>
 
 
 
b)<math>P(X_2=3|X_0=1) = P(X_1=1|X_0=1)*P(X_2=3|X_1=1)+P(X_1=2|X_0=1)*P(X_2=3|X_1=2)+P(X_1=3|X_0=1)*P(X_2=3|X_1=3)=(1/4)(1/4)+(1/2)(1/3)+(1/4)(4/7)=125/336 </math>
 
</div>
 
 
 
=== Marginal Distribution of Markov Chain ===
We represent the probability of all states at time t with a vector <math>\underline{\mu_t}</math>:<br/>
<math>\underline{\mu_t}~=(\mu_t(1), \mu_t(2),...,\mu_t(n))</math>, where <math>\mu_t(1)</math> is the probability of being in state 1 at time t,<br/>
and in general, <math>\mu_t(i)</math> is the probability of being in state i at time t.<br/>
 
 
For example, if there are two states a and b, then <math>\underline{\mu_5}</math>=(0.1, 0.9) means that the chance of being in state a at time 5 is 0.1 and the chance of being in state b at time 5 is 0.9. <br/>
If we generate a chain many times, the frequency of states at each time shows the marginal distribution of the chain at that time (a simulation sketch is given below). <br/>
The vector <math>\underline{\mu_0}</math> is called the initial distribution. <br/>
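A minimal sketch of estimating a marginal distribution by simulating many chains; the transition matrix and initial state below are assumed values.

<pre style='font-size:16px'>
P = [0.7 0.3; 0.2 0.8];           % assumed transition matrix
T = 5; runs = 10000;              % look at time 5, over 10000 chains
states = zeros(1,runs);
for k = 1:runs
    s = 1;                        % assumed initial state
    for t = 1:T
        s = 1 + (rand > P(s,1));  % move to state 1 w.p. P(s,1), else state 2
    end
    states(k) = s;
end
mean(states == 1)                 % estimate of mu_5(1); compare with [1 0]*P^5
</pre>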
 
 
<math> P_2~=P_1 P_1 </math> (as verified above)

<math>\mu_n~=\mu_0 P_n</math><br/>
where <math>\mu_0</math> is the initial distribution,
and <math>\mu_{m+n}~=\mu_m P_n</math><br/>
n can be negative, if P is invertible.
  

<math>\mu_1~ = \mu_0P</math> <br>
<math>\mu_2~ = \mu_1P = \mu_0PP = \mu_0P^2</math> <br>

Given the marginal distribution at time n-1, we can compute the distribution at time n:<br>
<math>\mu_n~ = \mu_{n-1}~P</math> <br>
Repeating this recursion starting at time 0 gives the following relationship:<br>
In general, <math>\mu_n~ = \mu_0P^n</math><br />
Property: If <math>\mu_n~\neq\mu_t~</math> (for any t less than n), then we say P does not converge. <br />
  
  

==== Stationary Distribution ====
Stationary distribution may refer to:

-The limiting distribution in a Markov chain.<br>
-The marginal distribution of a stationary process or stationary time series.<br>
-The set of joint probability distributions of a stationary process or stationary time series.<br>
 
  
  
<math>\pi</math> is a stationary distribution of the transition matrix P if <math>\pi P = \pi</math>, which means <math>\pi</math> is a (left) eigenvector of P with eigenvalue 1,
where <math>\pi</math> is a probability vector <math>\pi</math>=(<math>\pi</math><sub>i</sub> | <math>i \in X</math>) such that all the entries are nonnegative and sum to 1.

In other words, if X''<sub>0</sub>'' is drawn from <math>\pi</math>, then marginally, X''<sub>n</sub>'' is drawn from the same distribution <math>\pi</math> for every n≥0.

Example: consider the following transition matrix<br>
<math> P= \left [ \begin{matrix}
0 & 1 \\
1 & 0
\end{matrix} \right] </math>

To compute the stationary distribution, solve <math>\pi = \pi P</math>.
Intuitively, the chain spends half of the time in each of the states, so <math>\pi</math> = (1/2,1/2).

Comments:<br/>
1. As n gets bigger and bigger, <math>\mu_n</math> will possibly stop changing, so the quantity <math>\pi_i</math> can also be interpreted as the limiting probability that the chain is in state <math>i</math>. <br>
2. <math>\pi</math> may not exist and, even if it exists, it may not be unique. <br/>
3. If <math>\pi</math> exists and is unique, then <math>\pi_i</math> is called the long-run proportion of the process in state i, and the stationary distribution is also the limiting distribution of the process.<br/>
 
  
 
==== MatLab Code ====

<pre style='font-size:14px'>
  

     0.3000    0.7000

>> mu=[.9 .1]                  %initial distribution

mu =

     0.9000    0.1000

>> mu*p                        % marginal distribution at time 1; enter mu=mu*p and repeat until the value of the vector mu remains unchanged

ans =

...

     0.4000    0.6000
     0.4000    0.6000

</pre>
The definition of stationary distribution is that <math>\pi</math> is the stationary distribution of the chain if <math>\pi=\pi P</math>, where <math>\pi</math> is a probability vector, for every n<math>\geq</math>0.

However, just because X<sub>''n''</sub> ~ <math>\pi</math> for every n<math>\geq</math>0 does ''not'' mean the states are independently identically distributed.
  
'''An alternate Method of Computing the Stationary Distribution''' <br>
+
'''Limiting distribution''' of the chain refers the transition matrix that reaches the stationary state. If the lim(n-> infinite)P^n -> c, where c is a constant, then, we say this Markov chain is coverage;  otherwise, it's not coverage.
  
Recall that if <math>\lambda v = A v</math>, then <math>\lambda</math> is the eigenvalue of <math>A</math> corresponding to the eigenvector <math>v</math><br>
+
Example: Find the stationary distribution of P= <math>\left[ {\begin{array}{ccc}
 +
1/3 & 1/3 & 1/3 \\
 +
1/4 & 3/4 & 0 \\
 +
1/2 & 0 & 1/2 \end{array} } \right]</math>
  
By definition of stationary distribution,  <math>\pi = \pi  P</math><br>
+
Solution:
Taking the transpose, <math>\pi^T  = (\pi  P)^T </math><br>
+
<math>\pi=\left[ {\begin{array}{ccc}
then  <math>I \pi^T  = P^T \pi^T \Rightarrow (P^T-I) \pi^T = 0 </math><br>
+
\pi_0 & \pi_1 & \pi_2 \end{array} } \right]</math>
So <math>\pi^T </math> is an eigenvector of <math>P^T</math> with corresponding eigenvalue 1. <br>
 
  
the transpose method to calculate the pi matrix probability.
+
Using the stationary distribution property <math>\pi=\pi~P</math> we get, <br>
 +
<math>\pi_0=\frac{1}{3}\pi_0+\frac{1}{4}\pi_1+\frac{1}{2}\pi_2 </math><br>
 +
<math>\pi_1=\frac{1}{3}\pi_0+\frac{3}{4}\pi_1+0\pi_2 </math><br>
 +
<math>\pi_2=\frac{1}{3}\pi_0+0\pi_1+\frac{1}{2}\pi_2 </math><br>
  
It is thus possible to compute the stationary distribution by taking the eigenvector of the transpose of the transition matrix corresponding to 1, and normalize it such that all elements are non-negative and sum to one so that the elements satisfy the definition of a stationary distribution. The transformed vector is still an eigenvector since a linear transformation of an eigenvector is still within the eigenspace. Taking the transpose of this transformed eigenvector gives the stationary distribution. <br>  
+
And since <math>\pi</math> is a probability vector, <br>
 +
<math> \pi_{0}~ + \pi_{1} + \pi_{2} = 1 </math>
  
<span style="background:#F5F5DC">
+
Solving the 4 equations for the 3 unknowns gets, <br>
Generating Random Initial distribution<br>
+
<math>\pi_{0}~=1/3</math>, <math>\pi_{1}~=4/9</math>, and <math>\pi_{2}~=2/9</math> <br>
 +
Therefore <math>\pi=\left[ {\begin{array}{ccc}
 +
1/3 & 4/9 & 2/9 \end{array} } \right]</math>
 +
 
 +
Example 2: Find the stationary distribution of P= <math>\left[ {\begin{array}{ccc}
 +
1/3 & 1/3 & 1/3 \\
 +
1/4 & 1/2 & 1/4 \\
 +
1/6 & 1/3 & 1/2 \end{array} } \right]</math>
 +
 
 +
Solution:
 +
<math>\pi=\left[ {\begin{array}{ccc}
 +
\pi_0 & \pi_1 & \pi_2 \end{array} } \right]</math>
 +
 
 +
Using the stationary distribution property <math>\pi=\pi~P</math> we get, <br>
 +
<math>\pi_0=\frac{1}{3}\pi_0+\frac{1}{4}\pi_1+\frac{1}{6}\pi_2 </math><br>
 +
<math>\pi_1=\frac{1}{3}\pi_0+\frac{1}{2}\pi_1+\frac{1}{3}\pi_2 </math><br>
 +
<math>\pi_2=\frac{1}{3}\pi_0+\frac{1}{4}\pi_1+\frac{1}{2}\pi_2 </math><br>
 +
 
 +
And since <math>\pi</math> is a probability vector, <br>
 +
<math> \pi_{0}~ + \pi_{1} + \pi_{2} = 1 </math>
 +
 
 +
Solving the 4 equations for the 3 unknowns gets, <br>
 +
<math>\pi_{0}=\frac {6}{25}</math>, <math>\pi_{1}~=\frac {2}{5}</math>, and <math>\pi_{2}~=\frac {9}{25}</math> <br>
 +
Therefore <math>\pi=\left[ {\begin{array}{ccc}
 +
\frac {6}{25} & \frac {2}{5} & \frac {9}{25} \end{array} } \right]</math>
 +
 
 +
The above two examples are designed to solve for the stationary distribution of the matrix P however they also give us the limiting distribution of the matrices as we have mentioned earlier that the stationary distribution is equivalent to the limiting distribution.
'''Alternate Method of Computing the Stationary Distribution''' <br>

Recall that if <math>\lambda v = A v</math>, then <math>\lambda</math> is the eigenvalue of <math>A</math> corresponding to the eigenvector <math>v</math><br>

By definition of stationary distribution, <math>\pi = \pi P</math><br>
Taking the transpose, <math>\pi^T  = (\pi  P)^T </math><br>
then <math>I \pi^T  = P^T \pi^T \Rightarrow (P^T-I) \pi^T = 0 </math><br>
So <math>\pi^T </math> is an eigenvector of <math>P^T</math> with corresponding eigenvalue 1. <br>

This is the transpose method for calculating the stationary probability vector <math>\pi</math>.

It is thus possible to compute the stationary distribution by taking the eigenvector of the transpose of the transition matrix corresponding to eigenvalue 1, and normalizing it such that all elements are non-negative and sum to one, so that the elements satisfy the definition of a stationary distribution. The normalized vector is still an eigenvector, since a scalar multiple of an eigenvector remains within the eigenspace. Taking the transpose of this normalized eigenvector gives the stationary distribution. <br>
<span style="background:#F5F5DC">
 +
Generating Random Initial distribution<br>
 
<math>\mu~=rand(1,n)</math><br>
 
<math>\mu~=rand(1,n)</math><br>
 
<math>\mu~=\mu/\Sigma(\mu)</math></span>
 
<math>\mu~=\mu/\Sigma(\mu)</math></span>
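As a short sketch of how such a random initial distribution can be used (assuming the 3-state matrix P from the examples above), propagating it through the chain approaches the stationary distribution regardless of the starting point:

<pre style='font-size:14px'>
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];
mu = rand(1,3);        % random non-negative row vector
mu = mu/sum(mu);       % normalize so the entries sum to one
mu*P^100               % approaches (1/3, 4/9, 2/9) for any starting mu
</pre>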
  
 
A Markov chain is a random process usually characterized as '''memoryless''': the next state depends only on the current state and not on the sequence of events that preceded it. This specific kind of "memorylessness" is called the Markov property. Markov chains have many applications as statistical models of real-world processes.

'''Formal Definition:''' <br>
A ''Markov Chain'' consists of a countable (possibly finite) set ''S'' (called the state space) together with a countable family of random variables X<sub>0</sub>,X<sub>1</sub>,X<sub>2</sub>,X<sub>3</sub>,... with values in S such that:<br>
<math>P(X_{t+1}=s \vert X_t=s_t,X_{t-1}=s_{t-1},...,X_0=s_0)=P(X_{t+1}=s \vert X_t=s_t)</math><br>
and we refer to this fundamental equation as the ''Markov property''. The essential point is that the next state is conditionally independent of all past states given the current state. Here are some more properties of Markov Chains:<br>
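To make the Markov property concrete, here is a small illustrative MATLAB sketch (not from the lecture) that simulates a trajectory of a finite-state Markov chain: each new state is drawn using only the current state's row of the transition matrix.

<pre style='font-size:14px'>
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];   % row i is the distribution of the next state given state i
T = 1000;                                   % number of steps to simulate
x = zeros(1,T);
x(1) = 1;                                   % start in state 1
for t = 2:T
    u = rand;                               % uniform(0,1), as in the inverse transform method
    x(t) = find(cumsum(P(x(t-1),:)) >= u, 1);  % sample the next state from row x(t-1) only
end
hist(x, 1:3)                                % state proportions approach (1/3, 4/9, 2/9)
</pre>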
 
  
 
1. Reducibility <br>

== Class 14 - Thursday June 20th 2013 ==
  
== Properties of Markov Chain (continued) ==

4. Ergodicity<br>
If state i is aperiodic and positive recurrent, state i is said to be ergodic. In other words, state i is ergodic if it has a period of 1 and has finite mean recurrence time. Consequently, an irreducible Markov Chain is said to be ergodic if all states in the chain are ergodic. <br>

In statistics, the term describes a random process for which the time average of one sequence of events is the same as the ensemble average. (Source: Wikipedia)

An extra note here is that if a finite state irreducible Markov chain has an aperiodic state, then it is ergodic. If there is a finite number N such that any state can be reached from any other state in exactly N steps, a model is said to have the ergodic property. For example, if we have a fully connected transition matrix where all transitions have a non-zero probability, this condition is fulfilled with N=1. A model with more than one state and just one out-going transition per state cannot be ergodic. We will have more examples to follow later on.<br>

5. Steady-state analysis and limiting distributions<br>
Definition: If a Markov Chain is time-homogeneous, so that the process is described by a single, time-independent matrix <math>p_{ij}</math>, then the vector <math>\boldsymbol{\pi}</math> is called a '''stationary distribution''' if <math>\forall j \in S</math> it satisfies:<br>
    1) <math>0\leq\pi_j\leq1</math><br>
    2) <math>\sum_{j \in S}\pi_j = 1</math><br>
    3) <math>\pi_j = \sum_{i \in S} \pi_i p_{ij}</math><br>

If all states of an irreducible chain are positive recurrent, it has a stationary distribution. In this case, <math>\boldsymbol{\pi}</math> is unique. It is also related to the expected return time <math>M_j</math>:<br>

    <math>\pi_j = \frac{C}{M_j}\,, </math><br>

where C is the normalizing constant. <br>

If the chain is both irreducible and aperiodic, then for any i and j,<br>

    <math>\lim_{n \rarr \infty} p_{ij}^{(n)} = \frac{C}{M_j} </math><br>

'''Final result:''' <br>
<math>\boldsymbol{\pi}</math> is called the equilibrium distribution of the chain if the chain converges to the stationary distribution regardless of where it begins.<br>

Source: https://en.wikipedia.org/wiki/Markov_chain

== Examples of finding stationary distribution ==
 
Example: Find the stationary distribution of <math> P= \left[ {\begin{array}{ccc}
 
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\[6pt]
 
\frac{1}{4} & \frac{3}{4} & 0 \\[6pt]
 
\frac{1}{2} & 0 & \frac{1}{2} \end{array} } \right]</math>
 
 
 
Solve the system of linear equations to find the Stationary Distribution<br>
 
<br>
 
<math>\pi_0=\frac{1}{3}\pi_0+\frac{1}{4}\pi_1+\frac{1}{2}\pi_2 </math><br>
 
<math>\pi_1=\frac{1}{3}\pi_0+\frac{3}{4}\pi_1+0\pi_2 </math><br>
 
<math>\pi_2=\frac{1}{3}\pi_0+0\pi_1+\frac{1}{2}\pi_2 </math><br>
 
<math>\pi_{0}~ + \pi_{1} + \pi_{2} = 1 </math><br>
 
<br>
 
Solving the 4 equations, <br>
 
<math>\pi_{0}=\frac {1}{3}</math>, <math>\pi_{1}~=\frac {4}{9}</math>, and <math>\pi_{2}~=\frac {2}{9}</math> <br>
 
 
 
Therefore
 
<math>\pi=(\frac{1}{3},\frac{4}{9}, \frac{2}{9})</math>
 
 
 
Similarly, this can be achieved by calculating <br/>
 
<math>P^{30}=\left[ {\begin{array}{ccc}
 
\frac{1}{3} & \frac{4}{9} & \frac{2}{9} \\[6pt]
 
\frac{1}{3} & \frac{4}{9} & \frac{2}{9} \\[6pt]
 
\frac{1}{3} & \frac{4}{9} & \frac{2}{9} \end{array} } \right]</math><br/>
 
Which produces the same result as solving the systems of equations.
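For instance, a two-line MATLAB check (an illustration, not from the lecture) of this computation:

<pre style='font-size:14px'>
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];
P^30        % every row approaches (1/3, 4/9, 2/9)
</pre>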
 
 
 
Alternatively, the system of equations can also be solved using matrix row reduction, which is useful when the system is more complicated and does not simplify easily. Continuing from the above example, the augmented matrix is

<math>\left[ {\begin{array}{ccc|c}
-2/3 & 1/4 & 1/2 & 0 \\
1/3 & -1/4 & 0 & 0 \\
1/3 & 0 & -1/2 & 0 \\
1 & 1 & 1 & 1 \end{array} } \right]</math>

Row reducing this matrix (as in Math 136), we can easily obtain the results for <math>\pi_0</math>, <math>\pi_1</math>, and <math>\pi_2</math>.
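One quick way to carry out this row reduction is MATLAB's rref function; a minimal sketch using the augmented matrix above:

<pre style='font-size:14px'>
A = [-2/3 1/4 1/2 0; 1/3 -1/4 0 0; 1/3 0 -1/2 0; 1 1 1 1];  % augmented matrix [ (P'-I) | 0 ; ones | 1 ]
rref(A)       % the last column of the reduced form gives pi0 = 1/3, pi1 = 4/9, pi2 = 2/9
</pre>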
 
  
 
<math>\lambda u=A u</math>

<math>\pi</math> can be considered as an eigenvector of P with eigenvalue = 1. But the vector u here needs to be a column vector, so we transform <math>\pi</math> into a column vector by taking the transpose:

<math>\pi</math><sup>T</sup> = P<sup>T</sup><math>\pi</math><sup>T</sup>

Then <math>\pi</math><sup>T</sup> is an eigenvector of P<sup>T</sup> with eigenvalue = 1. <br />
MatLab tip: [V D]=eig(A), where D is a diagonal matrix of eigenvalues and V is a matrix whose columns are the corresponding eigenvectors of matrix A<br />
==== Limiting Distribution ====

A Markov chain has '''limiting distribution''' <math>\pi</math> if

<math>\lim_{n\to \infty} P^n= \left[ {\begin{array}{ccc}
\pi_0 & \cdots & \pi_n \\
\vdots &  & \vdots \\
\pi_0 & \cdots & \pi_n \end{array} } \right]</math>

i.e. all rows of <math>\lim_{n\to \infty} P^n</math> are identical and equal to <math>\pi</math>. That is, <math>\pi_j=\lim_{n\to\infty}[P^n]_{ij}</math> exists and is independent of i.<br/>

A Markov Chain is convergent if and only if its limiting distribution exists. <br/>

If the limiting distribution <math>\pi</math> exists, it must be equal to the stationary distribution, and in this case the stationary distribution is unique.<br/>

In general, there are chains with stationary distributions that do not converge, which means that they have stationary distributions but do not converge in distribution to any one of them; this is especially true for periodic chains. Convergence here means that the distribution of <math>X_t</math>, for any initial state, converges in distribution to the stationary distribution.<br/>

It is also possible for the powers <math>P^n</math> to converge to a matrix whose rows are not all the same; such a chain converges, but the limit is not a limiting distribution, and the chain does not converge to the stationary distribution.<br />
  
 
<span style="font-size:20px;color:red">The following contents are problematic. Please correct it if possible.</span><br />
 
<span style="font-size:20px;color:red">The following contents are problematic. Please correct it if possible.</span><br />
Suppose we're given that the Limiting Distribution <math> \pi </math> exists for  stochastic matrix P, that is, <math> \pi = \pi * P </math> <br>
+
Suppose we're given that the limiting distribution <math> \pi </math> exists for  stochastic matrix P, that is, <math> \pi = \pi * P </math> <br>
  
 
WLOG assume P is diagonalizable, (if not we can always consider the Jordan form and the computation below is exactly the same. <br>
 
WLOG assume P is diagonalizable, (if not we can always consider the Jordan form and the computation below is exactly the same. <br>
  
Let <math> P = U * \Sigma * U^{-1} </math> be the eigenvalue decomposition of <math> P </math>, where <math>\Sigma = diag(\lambda_1,\ldots,\lambda_n) ; 1=|\lambda_1|; |\lambda_i| > |\lambda_j|, \forall i < j </math><br>
+
Let <math> P = U * \Sigma * U^{-1} </math> be the eigenvalue decomposition of <math> P </math>, where <math>\Sigma = diag(\lambda_1,\ldots,\lambda_n) ; |\lambda_i| > |\lambda_j|, \forall i < j </math><br>
  
Suppose <math> x^T = \sum a_i u_i </math> where <math> a_i \in \mathcal{R} </math> and <math> u_i </math> are eigenvectors of <math> P </math> for <math> i = 1\ldots n </math> <br>
+
Suppose <math> \pi^T = \sum a_i u_i </math> where <math> a_i \in \mathcal{R} </math> and <math> u_i </math> are eigenvectors of <math> P </math> for <math> i = 1\ldots n </math> <br>
  
 
By definition: <math> \pi^k = \pi*P = \pi*P^k \implies \pi = \pi*(U * \Sigma * U^{-1}) *(U * \Sigma * U^{-1} )*\ldots*(U * \Sigma * U^{-1}) </math> <br>
 
By definition: <math> \pi^k = \pi*P = \pi*P^k \implies \pi = \pi*(U * \Sigma * U^{-1}) *(U * \Sigma * U^{-1} )*\ldots*(U * \Sigma * U^{-1}) </math> <br>
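A quick numerical illustration of this decomposition (an illustration, not from the lecture, assuming the 3-state example matrix used throughout):

<pre style='font-size:14px'>
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];
[U S] = eig(P);           % columns of U are right eigenvectors; S holds the eigenvalues of P
S^20                      % the eigenvalue 1 persists while the others decay toward 0
U*S^20*inv(U)             % recovers P^20: every row is close to the stationary distribution
</pre>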
Line 4,753: Line 4,297:
 
=== MatLab Code ===
<pre style='font-size:14px'>
>> P=[1/3, 1/3, 1/3; 1/4, 3/4, 0; 1/2, 0, 1/2]      % We input a matrix P. This is the same matrix as in the last class.

P =

    0.3333    0.3333    0.3333
    0.2500    0.7500         0
    0.5000         0    0.5000

>> P^10                                             % Taking powers of P: all rows approach the stationary distribution.

ans =

    0.3360    0.4358    0.2282

>> P^100                                            % The stationary distribution is [0.3333 0.4444 0.2222] since the values remain unchanged.

ans =

    0.3333    0.4444    0.2222
    0.3333    0.4444    0.2222
    0.3333    0.4444    0.2222

>> [vec val]=eigs(P')                               % Eigenvalues of P' lie on the diagonal of val.

val =

    1.0000         0         0
         0    0.6477         0
         0         0   -0.0643

>> a=-vec(:,1)                                      % The eigenvector can be multiplied by (-1), since probabilities cannot be negative and λV=AV can be written as λ(-V)=A(-V).

a =

    0.5571
    0.7428
    0.3714

>> a=a/sum(a)                                       % Normalize a so its entries sum to one; this is the stationary distribution.

a =

    0.3333
    0.4444
    0.2222
</pre>

That is, <math>\pi_j = \lim_{n\to\infty}[P^n]_{ij}</math> exists and is independent of i.
  
Example: Find the stationary distribution of P= <math>\left[ {\begin{array}{ccc}
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0 \end{array} } \right]</math>

Solving <math>\pi=\pi~P</math> gives <math>\pi_0 = \pi_2,\; \pi_1 = \pi_0,\; \pi_2 = \pi_1</math>, and together with <math>\pi_0 + \pi_1 + \pi_2 = 1</math> this yields <math>\pi_0 = \frac{1}{3}</math>.<br>
Also, <math>\pi</math><sub>1</sub> = <math>\pi</math><sub>2</sub> = 1/3 <br>
So, <math>\pi</math> = <math>[\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]</math> <br>

Note: when the transition matrix is doubly stochastic (every row and every column sums to one), as it is here, all entries of the stationary distribution are equal.
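The cycling is easy to see numerically; a short illustrative MATLAB sketch computing successive powers of P:

<pre style='font-size:14px'>
>> P=[0 1 0; 0 0 1; 1 0 0];
>> P^2

ans =

     0     0     1
     1     0     0
     0     1     0

>> P^3                        % P^3 is the identity, so the powers cycle with period 3 and P^n never converges

ans =

     1     0     0
     0     1     0
     0     0     1
</pre>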
  
The first condition of a limiting distribution is satisfied; however, the second condition, where <math>\pi</math><sub>j</sub> has to be independent of i (i.e. all rows of the matrix are the same), is not met.<br>

This example shows the distinction between having a stationary distribution and having a limiting distribution (convergence). Note: <math>\pi=(1/3,1/3,1/3)</math> is the stationary distribution, as <math>\pi=\pi~P</math>. However, upon repeatedly multiplying P by itself (computing <math>P^n</math> as n goes to infinity) one will note that the results become a cycle (of period 3) of the same sequence of matrices. The chain has a stationary distribution but does not converge to it. Thus, there is no limiting distribution.<br>
  
 
Another example gives <math>\pi = [\frac{1}{2}, \frac{1}{4}, \frac{1}{4}]</math>. <br>

The definition of stationary distribution is that <math>\pi</math> is the stationary distribution of the chain if <math>\pi=\pi~P</math>, where <math>\pi</math> is a probability vector, for every n<math>\geq</math>0.

However, just because X<sub>''n''</sub> ~ <math>\pi</math> for every n<math>\geq</math>0 does ''not'' mean the X<sub>''n''</sub> are independently identically distributed.

=== Ergodic Chain ===

A Markov chain is called an ergodic chain if it is possible to go from every state to every state (not necessarily in one move). For instance, note that we can claim a Markov chain is ergodic if it is possible to somehow start at any state i and end at any state j in the matrix. We could have a chain with states 0, 1, 2, 3, 4 where it is not possible to go from state 0 to state 4 in just one step. However, it may be possible to go from 0 to 1, then from 1 to 2, then from 2 to 3, and finally 3 to 4, so we can claim that it is possible to go from 0 to 4, and this would satisfy a requirement of an ergodic chain. The example below will further explain this concept.

'''Note:''' if there is a finite number N such that any state can be reached from any other state in at most N steps, the chain is ergodic.
  
==== Example ====
<math> P= \left[ \begin{matrix}
\frac{1}{3} \; & \frac{1}{3} \; & \frac{1}{3} \\ \\
\frac{1}{4} \; & \frac{3}{4} \; & 0 \\ \\
\frac{1}{2} \; & 0 \; & \frac{1}{2}
\end{matrix} \right] </math><br />

<math> \pi=\left[ \begin{matrix}
\frac{1}{3} & \frac{4}{9} & \frac{2}{9}
\end{matrix} \right] </math><br />

There are three states in this example.

[[File:ab.png]]

In this case, state a can go to state a, b, or c; state b can go to state a, b, or c; and state c can go to state a, b, or c, so it is possible to go from every state to every state. (Although state b cannot go directly to c in one move, it can go to a first, and then to c.)

A k-by-k matrix indicates that the chain has k states.

- Ergodic Markov chains are irreducible. (Recall that a Markov chain is irreducible if all the states communicate with each other.)

- A Markov chain is called a '''regular''' chain if some power of the transition matrix has only positive elements.<br />
*Any transition matrix that has no zeros determines a regular Markov chain.
*However, it is possible for a regular Markov chain to have a transition matrix that has zeros.
<br />
For example, recall the matrix of the Land of Oz:

<math>P = \left[ \begin{matrix}
& R & N & S \\
R & 1/2 & 1/4 & 1/4 \\
N & 1/2 & 0 & 1/2 \\
S & 1/4 & 1/4 & 1/2 \\
\end{matrix} \right]</math><br />

=== Example ===
The following chain is <b>not</b> an ergodic chain.

[[File:Notergodic.jpg]]

Note that this is true because it is not possible to get to every state from any state: you cannot get to states C or D from states A or B, and vice versa.

The matrix looks like this:

<math>L=
\left[ {\begin{matrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{matrix} } \right]</math>

Obviously, you cannot get from A or B to C or D.

=== Theorem ===
An ergodic Markov chain has a unique stationary distribution <math>\pi</math>. The limiting distribution exists and is equal to <math>\pi</math>.<br/>
*Note 1: An ergodic Markov Chain is irreducible, aperiodic and positive recurrent, meaning all states have to be positive recurrent.<br>
*Note 2: A state is positive recurrent if the expected amount of time between recurrences is finite, i.e. E(T<sub>i</sub>) < <math>\infty</math>.<br>
*Note 3: The existence of a limiting distribution for a Markov chain does '''not''' always imply that the chain is ergodic.

'''Example:''' Consider the Markov chain with transition matrix <math>\left[\begin{matrix}0 & 1 \\ 1 & 0\end{matrix}\right]</math>. The stationary distribution is obtained by solving <math>\pi P = \pi</math>, giving <math>\pi=[0.5, 0.5]</math>, but from the assignment we know that the chain does not converge, i.e. there is no limiting distribution, because the Markov chain is not aperiodic and the cycle repeats: <math>P^2=\left[\begin{matrix}1 & 0 \\ 0 & 1\end{matrix}\right]</math> and <math>P^3=\left[\begin{matrix}0 & 1 \\ 1 & 0\end{matrix}\right]</math>
  
'''Another Example'''

<math>P=\left[ {\begin{array}{ccc}
\frac{1}{4} & \frac{3}{4} \\[6pt]
\frac{1}{5} & \frac{4}{5} \end{array} } \right]</math> <br>

[[File:Untitled*.jpg]]
This matrix means that there are two points in the space; let's call them a and b.<br/>
Starting from a, the probability of staying in a is 1/4 <br/>
Starting from a, the probability of going from a to b is 3/4 <br/>
Starting from b, the probability of going from b to a is 1/5 <br/>
Starting from b, the probability of staying in b is 4/5 <br/>

Solve the equation <math> \pi = \pi P </math> <br>
<math> \pi_0 = .25 \pi_0 + .2 \pi_1 </math> <br>
<math> \pi_1 = .75 \pi_0 + .8 \pi_1 </math> <br>
<math> \pi_0 + \pi_1 = 1 </math> <br>
Solving this system of equations we get: <br>
<math> \pi_0 = \frac{4}{15} \pi_1 </math> <br>
<math> \pi_1 = \frac{15}{19} </math> <br>
<math> \pi_0 = \frac{4}{19} </math> <br>
<math> \pi = [\frac{4}{19}, \frac{15}{19}] </math> <br>
<math> \pi </math> is the long run distribution.

We can use the stationary distribution to compute the expected waiting time to return to state 'a', given that we start at state 'a', and so on. The formula for this is <math> E[T_{i,i}]=\frac{1}{\pi_i}</math>.<br/>
In the example above, this means that the expected waiting time for the Markov process to return to state 'a', given that we start at state 'a', is 19/4.<br/>

Remark: if the chain satisfies detailed balance, <math>\pi_i P_{ij} = P_{ji} \pi_j</math>, this gives another way to calculate the stationary probabilities.

=== MatLab Code ===
In the following, P is the transition matrix and eye(n) refers to the n by n identity matrix. L is the Laplacian matrix, L = (I - P). The Laplacian matrix will have at least one zero eigenvalue. For every 0 on the diagonal of the eigenvalue matrix there is a component: if there is exactly one zero eigenvalue, then the graph is connected and has only one component. The number of zeros is the number of parts in your graph/process; if there is more than one zero, there is a disconnect in the graph.

<pre style='font-size:14px'>
>> P=[1/3, 1/3, 1/3; 1/4, 3/4, 0; 1/2, 0, 1/2]

P =

    0.3333    0.3333    0.3333
    0.2500    0.7500         0
    0.5000         0    0.5000

>> eye(3)                      % returns the 3x3 identity matrix

ans =

     1     0     0
     0     1     0
     0     0     1

>> L=(eye(3)-P)                % the Laplacian matrix L = I - P

L =

    0.6667   -0.3333   -0.3333
   -0.2500    0.2500         0
   -0.5000         0    0.5000

>> [vec val]=eigs(L)

vec =

   -0.7295    0.2329    0.5774
    0.2239   -0.5690    0.5774
    0.6463    0.7887    0.5774

val =

    1.0643         0         0
         0    0.3523         0
         0         0   -0.0000

%% Only one value of zero on the diagonal means the chain is connected

>> P=[0.8, 0.2, 0, 0;0.2, 0.8, 0, 0; 0, 0, 0.8, 0.2; 0, 0, 0.1, 0.9]

P =

    0.8000    0.2000         0         0
    0.2000    0.8000         0         0
         0         0    0.8000    0.2000
         0         0    0.1000    0.9000

>> eye(4)

ans =

     1     0     0     0
     0     1     0     0
     0     0     1     0
     0     0     0     1

>> L=(eye(4)-P)

L =

    0.2000   -0.2000         0         0
   -0.2000    0.2000         0         0
         0         0    0.2000   -0.2000
         0         0   -0.1000    0.1000

>> [vec val]=eigs(L)

vec =

    0.7071         0    0.7071         0
   -0.7071         0    0.7071         0
         0    0.8944         0    0.7071
         0   -0.4472         0    0.7071

val =

    0.4000         0         0         0
         0    0.3000         0         0
         0         0   -0.0000         0
         0         0         0   -0.0000

%% Two values of zero on the diagonal means there are two 'islands' of chains

</pre>
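Returning to the two-state example above, the waiting-time formula <math> E[T_{i,i}]=\frac{1}{\pi_i}</math> can be checked by simulation; a rough illustrative sketch (not from the lecture):

<pre style='font-size:14px'>
P = [1/4 3/4; 1/5 4/5];
T = 1e5; x = zeros(1,T); x(1) = 1;           % state 1 plays the role of 'a'
for t = 2:T
    x(t) = find(cumsum(P(x(t-1),:)) >= rand, 1);
end
visits = find(x == 1);                       % times at which the chain is in state 'a'
mean(diff(visits))                           % approaches E[T_aa] = 1/pi_a = 19/4 = 4.75
</pre>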
  
<math>\pi</math> satisfies detailed balance if <math>\pi_i P_{ij}=P_{ji} \pi_j</math>. Detailed balance guarantees that <math>\pi</math> is a stationary distribution.<br />

'''Adjacency matrix''' - a matrix <math>A</math> that dictates which states are connected, i.e. a way of portraying which vertices in the graph are adjacent. If we compute <math>A^2</math>, we can tell which states are connected by paths of length 2.<br />

A '''Markov chain''' is called an irreducible chain if it is possible to go from every state to every state (not necessarily in one move).<br />
Theorem: An '''ergodic''' Markov chain has a unique stationary distribution <math>\pi</math>. The limiting distribution exists and is equal to <math>\pi</math>. <br />

A Markov process satisfies detailed balance if and only if it is a '''reversible''' Markov process, where P is the Markov transition matrix.<br />

Satisfying the detailed balance condition guarantees that <math>\pi</math> is a stationary distribution: <math> \pi </math> satisfies detailed balance if <math> \pi_i P_{ij} = P_{ji} \pi_j </math>, which is the same as the reversibility equation of the Markov process.

== Reversible Markov chain ==

A Markov chain is said to be '''reversible''' if there is a probability distribution over states, '''π''', such that
:<math>\pi_i \Pr(X_{n+1} = j \mid X_{n} = i) = \pi_j \Pr(X_{n+1} = i \mid X_{n} = j)</math>
for all times ''n'' and all states ''i'' and ''j''.
This condition is also known as the '''detailed balance''' condition (some books call it the local balance equation).

With a time-homogeneous Markov chain, Pr(''X''<sub>n+1</sub>&nbsp;=&nbsp;''j''&nbsp;|&nbsp;''X''<sub>n</sub>&nbsp;=&nbsp;''i'') does not change with time ''n'' and it can be written more simply as <math>p_{ij}</math>. In this case, the detailed balance equation can be written more compactly as
:<math>\pi_i p_{ij} = \pi_j p_{ji}\,.</math>

Summing the original equation over ''i'' gives
:<math>\begin{align}\sum_i \pi_i \Pr(X_{n+1} = j \mid X_{n} = i) &= \sum_i \pi_j \Pr(X_{n+1} = i \mid X_{n} = j) \\ &= \pi_j \sum_i \Pr(X_{n+1} = i \mid X_{n} = j) = \pi_j\,,\end{align}</math>
so, for reversible Markov chains, '''π''' is always a steady-state distribution of Pr(''X''<sub>n+1</sub>&nbsp;=&nbsp;''j''&nbsp;|&nbsp;''X''<sub>n</sub>&nbsp;=&nbsp;''i'') for every ''n''.

If the Markov chain begins in the steady-state distribution, ''i.e.'', if Pr(''X''<sub>0</sub>&nbsp;=&nbsp;''i'')&nbsp;=&nbsp;π<sub>''i''</sub>, then Pr(''X''<sub>''n''</sub>&nbsp;=&nbsp;''i'')&nbsp;=&nbsp;π<sub>''i''</sub> for all ''n'' and the detailed balance equation can be written as
:<math>\Pr(X_{n} = i, X_{n+1} = j) = \Pr(X_{n+1} = i, X_{n} = j)\,.</math>
The left- and right-hand sides of this last equation are identical except for a reversing of the time indices ''n'' and&nbsp;''n''&nbsp;+&nbsp;1.

Kolmogorov's criterion gives a necessary and sufficient condition for a Markov chain to be reversible directly from the transition matrix probabilities. The criterion requires that the products of probabilities around every closed loop are the same in both directions around the loop.

Reversible Markov chains are common in Markov chain Monte Carlo (MCMC) approaches because the detailed balance equation for a desired distribution '''π''' necessarily implies that the Markov chain has been constructed so that '''π''' is a steady-state distribution. Even with time-inhomogeneous Markov chains, where multiple transition matrices are used, if each such transition matrix exhibits detailed balance with the desired '''π''' distribution, this necessarily implies that '''π''' is a steady-state distribution of the Markov chain.

(Source: Wikipedia)

Example in the class:
<math>P= \left[ {\begin{array}{ccc}
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\[6pt]
\frac{1}{4} & \frac{3}{4} & 0 \\[6pt]
\frac{1}{2} & 0 & \frac{1}{2} \end{array} } \right]</math>

and <math>\pi=(\frac{1}{3},\frac{4}{9}, \frac{2}{9})</math>

<math>\pi_1 P_{1,2} = 1/3 \times 1/3 = 1/9,\, P_{2,1} \pi_2 = 1/4 \times 4/9 = 1/9 \Rightarrow \pi_1 P_{1,2} = P_{2,1} \pi_2 </math><br>
<math>\pi_2 P_{2,3} = 4/9 \times 0 = 0,\, P_{3,2} \pi_3 = 0 \times 2/9 = 0 \Rightarrow \pi_2 P_{2,3} = P_{3,2} \pi_3</math><br>

Remark: detailed balance, <math> \pi_i P_{ij} = P_{ji} \pi_j</math>, gives another way to calculate the stationary probabilities.<br />
Note: in general, a distribution can be stationary without being limiting (as in the periodic example earlier).
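A tiny MATLAB sketch (an illustration, not from the lecture) that tests the detailed balance condition entrywise for the matrix and distribution above; detailed balance holds exactly when <math>diag(\pi)P</math> is symmetric:

<pre style='font-size:14px'>
P  = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];
p  = [1/3 4/9 2/9];                      % stationary distribution from the example above
DB = diag(p)*P;                          % DB(i,j) = pi_i * P_ij
max(max(abs(DB - DB')))                  % equals 0 only if detailed balance holds; here it is 0
</pre>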
== Class 15 - Tuesday June 25th 2013 ==
 +
=== Announcement ===
 +
Note to all students, the first half of today's lecture will cover the midterm's solution; however please do not post the solution on the Wikicoursenote.<br />
  
====Detailed balance====

Let <math>\displaystyle P</math> be the transition probability matrix of a Markov chain, where <math> P_{ij}</math> is the probability of moving from state i to state j. If there exists a distribution vector <math>\displaystyle \underline{\pi} = [\pi_1 \pi_2 ... \pi_n]</math> such that <math>\pi_i \cdot P_{ij}=P_{ji} \cdot \pi_j, \; \forall i,j</math>, then the Markov chain is said to have '''detailed balance'''. The principle of detailed balance was formulated for kinetic systems decomposed into elementary processes (collisions, or steps, or elementary reactions): at equilibrium, each elementary process should be equilibrated by its reverse process. Detailed balance holds if and only if the Markov Chain is reversible.

A detailed balanced Markov chain must have <math>\displaystyle \underline{\pi}</math> given above as a stationary distribution, that is <math>\displaystyle \underline{\pi} = \underline{\pi} P</math>, where <math>\displaystyle \underline{\pi}</math> is a 1 by n vector and <math>\displaystyle P</math> is an n by n matrix.<br>

'''Proof:''' <br>
<math>\; [\pi P]_j = \sum_i \pi_i P_{ij} =\sum_i P_{ji}\pi_j =\pi_j\sum_i P_{ji} =\pi_j  ,\forall j</math>

:Note: <math>[\pi P]_j</math> is the <math>\pi</math>-weighted sum over column j of P; since this argument works for every j, we have proven <math>\pi=\pi P</math>. In the last step, <math>\sum_i P_{ji} = 1</math> because we are summing across row j of the transition matrix, which is why <math>\pi P=\pi</math>.

Hence <math>\pi</math> is always a stationary distribution of <math>P(X_{n+1}=j|X_n=i)</math>, for every n.

In other terms, <math> P_{ij} = P(X_n = j| X_{n-1} = i) </math>, where <math>\pi_j</math> is the equilibrium probability of being in state j and <math>\pi_i</math> is the equilibrium probability of being in state i. <math>P(X_{n-1} = i) = \pi_i</math> is equivalent to <math>P(X_{n-1} = i,  X_n = j)</math> being symmetric in i and j.

Keep in mind that detailed balance is a sufficient but not necessary condition for a distribution to be stationary:
a distribution satisfying detailed balance is stationary, but a stationary distribution does not necessarily satisfy detailed balance.
=== PageRank (http://en.wikipedia.org/wiki/PageRank) ===

*PageRank is a link-analysis algorithm developed by Larry Page and Sergey Brin at Stanford University in 1996. Two years later they founded Google, using PageRank as their basis for measuring a website's importance, relevance and popularity.
*PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page.
*PageRank works on a graph containing web pages and their links to each other.
*Many social media sites use this (such as Facebook and Twitter).
*It can also be used to find criminals (e.g. thieves, hackers, terrorists) by following the links between them.

This is what made Google the search engine of choice over Yahoo, Bing, etc. What made Google's search engine a huge success is not its search function, but rather the algorithm it used to rank the pages. (E.g., if a query returns 100 million search results, how do you list them by relevance and importance so that users can easily find what they are looking for? Most users will not go past the first 3 or so search pages. It is this ability to rank pages that allows Google to remain more popular than Yahoo, Bing, AskJeeves, etc.)<br />

<br />'''The order of importance'''<br />
1. A web page is important if it has many other pages linked to it<br />
2. The more important a web page is, the more weight should be assigned to its outgoing links<br />
3. If a webpage has many outgoing links, then its links have less value (e.g. if a page links to everyone, like a directory such as 411, it is not as important as pages that have incoming links)<br />

[[File:diagram.jpg]]
<math>L=  
\left[ {\begin{matrix}
0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 \end{matrix} } \right]</math>

i.e., according to the above example: <br/>
Page 3 is the most important since it has the most links (3) pointing to it, therefore more weight should be placed on its outgoing links.<br/>
Page 4 comes after page 3 since it has the second most links (2) pointing to it.<br/>
Page 2 comes after page 4 since it has the third most links (1) pointing to it.<br/>
Page 1 and page 5 are the least important since no links point to them.<br/>
As pages 1 and 2 have the most outgoing links, their links have less value compared to those of the other pages. <br/>

<math>L_{ij} = 1</math> if j has a link to i;<br/>
<math>L_{ij} = 0</math> otherwise.<br />

C<sub>j</sub> = the number of outgoing links of page <math>j</math>:
<math>C_j=\sum_i L_{ij}</math>
(i.e. the sum of entries in column j)<br />

<math>P_j</math> is the rank of page <math>j</math>.<br />
Suppose we have <math>N</math> pages; <math>P</math> is a vector containing the ranks of all pages.<br />
- <math>P</math> is a <math>N \times 1</math> vector.

- <math>P_i</math> counts the number of incoming links of page <math>i</math>:
<math>P_i=\sum_j L_{ij}</math> <br />(i.e. the sum of entries in row i)

For a row i, if there is a 1 in the third column, it means page three points to page i.

===Alternate Example===
[[File:pagerank.jpg]]<br>
In this case, Page 5 does not have any pointers to or from the cluster of pages on its left. When we build the algorithm to conduct page rank in the next lecture, we will ensure that Page 5 is not ignored in the ranking system.

Obviously the rank is Page 3, Page 4, Page 2, Page 1, Page 5. Without an additional term (<math>d</math>), poor old Page 5 would not come up on the search. However, we will make sure that it does.

===Explanation===
A PageRank results from a mathematical algorithm based on the webgraph, created by all World Wide Web pages as nodes and hyperlinks as edges, taking into consideration authority hubs such as cnn.com or usa.gov. The rank value indicates the importance of a particular page. A hyperlink to a page counts as a vote of support. (This would be represented in our diagram as an arrow pointing towards the page; hence in our example, Page 3 is the most important, since it has the most votes of support.) The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). <br />

A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page, then there is no support for that page (in our example, this would be Page 1 and Page 5).
(source: http://en.wikipedia.org/wiki/PageRank#Description)

For those interested in PageRank, here is the original paper by Google co-founders Brin and Page: http://infolab.stanford.edu/pub/papers/google.pdf

Notice: Page and Brin confused a formula in the above paper.

=== Example of Page Rank Application in Real Life ===

'''Page Rank checker'''
- This is a free service to check Google™ page rank instantly via an online PR checker or by adding a PageRank checking button to web pages (http://www.prchecker.info/check_page_rank.php)

GoogleMatrix G = d * [ (Hyperlink Matrix H) + (Dangling Nodes Matrix A) ] + ((1-d)/N) * (NxN Matrix U of all 1's)

[[File:Google matrix.png]]

(source: https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/)
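As an illustrative sketch (not from the lecture, and anticipating the update formula derived in the next class), the ranks for the 5-page link matrix L above can be computed by fixed-point iteration; the guard for dangling pages (such as page 3, which has no outgoing links) is an assumption of this sketch:

<pre style='font-size:14px'>
d = 0.8; N = 5;
L = [0 0 0 0 0; 1 0 0 0 0; 1 1 0 1 0; 0 1 0 0 1; 0 0 0 0 0];  % L(i,j)=1 if page j links to page i
c = max(sum(L,1), 1);                    % outgoing-link counts c_j (the max guards dangling pages)
P = ones(N,1);                           % initial ranks
for k = 1:100
    P = (1-d)*ones(N,1) + d*L*(P./c');   % iterate P_i = (1-d) + d*sum_j L_ij*P_j/c_j
end
P'                                       % page 3 gets the highest rank, then 4, then 2; pages 1 and 5 get the minimum
</pre>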
== Class 16 - Thursday June 27th 2013 ==
=== Page Rank ===
<math>L_{ij}</math> equals 1 if j has a link to i, and equals 0 otherwise. <br>
<math>C_j</math>: the number of outgoing links for page j, where <math>c_j=\sum_i L_{ij}</math>

P is an N by 1 vector containing the rank of all N pages; for page i, the rank is <math>P_i</math>.

<math>P_i= (1-d) + d\cdot \sum_j \frac {L_{ij}P_j}{c_j}</math>

where 0 < d < 1 is a constant (in the original PageRank algorithm, d = 0.8).
Interpretation of the formula:<br/>
1) The sum over <math>L_{ij}</math> counts the incoming links of page i.<br/>
2) The sum is weighted by the page rank of the pages that contain the link to i (P<sub>j</sub>), i.e. if a high-rank page points to page i, then this link carries more weight than links from lower-rank pages.<br/>
3) The sum is also weighted by the inverse of the number of outgoing links from the pages that contain links to i (c<sub>j</sub>), i.e. if a page has more outgoing links than other pages, then each of its links carries less weight.<br/>
4) Finally, we take a linear combination of the page rank obtained from above and the constant 1. This ensures that every page has a rank greater than zero.<br/>

Note that this is a system of N equations with N unknowns.<br/>

<math>c_j</math> is the number of outgoing links of page j; the fewer outgoing links a page has, the more each of its links is worth.<br/>
Let D be a diagonal N by N matrix such that <math> D_{ii}</math> = <math>c_i</math>

<math>D=
\left[ {\begin{matrix}
c_1 & 0 & ... & 0  \\
0 & c_2 & ... &  0  \\
0 & 0 & ... &  0 \\
0 & 0 & ... & c_N \end{matrix} } \right]</math>

Then <math>P=~(1-d)e+dLD^{-1}P</math> where <math>e=\begin{bmatrix}
1\\
1\\
...\\
1
\end{bmatrix}</math>, i.e. an N by 1 vector of ones.

We assume that the ranks of all N pages sum to N. (The sum of the ranks could be any fixed number; only the proportions matter.) <br/>
i.e. <math>e^{T} P = N</math>, so <math>~\frac{e^{T}P}{N} = 1</math>

D<sup>-1</sup> will be:

D<sup>-1</sup><math>=
\left[ {\begin{matrix}
\frac {1}{c_1} & 0 & ... & 0  \\
0 & \frac {1}{c_2} & ...  &  0  \\
0 & 0 & ... &  0 \\
0 & 0 & ... & \frac {1}{c_N} \end{matrix} } \right]</math>

<math>P=~(1-d)e+dLD^{-1}P</math>

<math>P=(1-d)~\frac{ee^{T}P}{N}+dLD^{-1}P</math>

<math>P=[(1-d)~\frac{ee^T}{N}+dLD^{-1}]P</math>

<math>P=AP</math>, where <math>A=(1-d)~\frac{ee^T}{N}+dLD^{-1}</math>

Here <math>\frac{ee^T}{N}</math> is an N by N matrix, L is an N by N matrix, D<sup>-1</sup> is an N by N matrix, P is an N by 1 vector, and d is a constant between 0 and 1.
  
'''P=AP'''<br />
P is an eigenvector of A with corresponding eigenvalue equal to 1.<br>
'''P<sup>T</sup>=P<sup>T</sup>A<sup>T</sup>'''<br>
Notice that all entries in A are non-negative and each column of A sums to 1, so each row of A<sup>T</sup> sums to 1. Hence A<sup>T</sup> satisfies the definition of a transition probability matrix.<br>
P<sup>T</sup> is then the stationary distribution of a Markov Chain with transition probability matrix A<sup>T</sup>.

We can consider A to be the matrix describing all possible movements following links on the internet, and P<sup>T</sup>, normalized to sum to one, as the probability of being on any given webpage if we surfed the internet long enough.

To summarize: page rank is defined through N by N matrices (<math>\frac{ee^T}{N}</math>, L and D<sup>-1</sup>), one N by 1 vector P, and a constant d between 0 and 1; P is the stationary distribution, so P = AP.

=== Damping Factor "d" ===

The PageRank model assumes that any imaginary user who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will keep on clicking is the damping factor, "d". After many studies, the approximation of "d" is 0.85. Other values for "d" have been used in class and may appear on assignments/exams.

===Examples===

==== Example 1 ====

[[File:eg1.jpg]]

<math>L=  
\left[ {\begin{matrix}
0 & 0 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \end{matrix} } \right]</math>

<math>c=
\left[ {\begin{matrix}
1 & 1 & 1 \end{matrix} } \right]</math>

<math>D=  
\left[ {\begin{matrix}
1 & 0 & 0 \\
0 & 1 & 0  \\
0 & 0 & 1 \end{matrix} } \right]</math>

c = [1 1 1] since there are 3 pages, the pages link to each other in one cycle, and each page has exactly one outgoing link. Hence, D is the 3x3 identity matrix.

MATLAB Code:

<pre style='font-size:14px'>
d=0.8;
N=3;
A=(1-d)*ones(N)/N+d*L*pinv(D)    % pinv: Moore-Penrose pseudoinverse of the matrix.
                                 % We use pinv(D) instead of inv(D) because, in the case
                                 % of a non-invertible matrix, it would not crash the program.
[vec val]=eigs(A)                % eigen-decomposition of A
a=-vec(:,1)                      % the eigenvector corresponding to eigenvalue 1
a=a/sum(a)                       % normalize a
% Alternatively, to show that A transpose is a stationary transition matrix,
% (transpose(A))^200 has all rows equal to a/sum(a).
</pre>

'''NOTE:''' Changing the value of d does not change the ranking order of the pages.

By looking at each entry after normalizing a, we can tell the ranking order of each page.<br>

==== Example 2 ====

[[File:Screen_shot_2013-07-02_at_3.43.04_AM.png]]

<math>L=
\left[ {\begin{matrix}
0 & 0 & 1  \\
1 & 0 & 1  \\
0 & 1 & 0 \end{matrix} } \right]</math>

<math>c=  
\left[ {\begin{matrix}
1 & 1 & 2 \end{matrix} } \right]</math>

<math>D=
\left[ {\begin{matrix}
1 & 0 & 0  \\
0 & 1 & 0  \\
0 & 0 & 2 \end{matrix} } \right]</math>
<pre style='font-size:14px'>
  
 +
Matlab code
  
(source: https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/)
+
>> L=[0 0 1;1 0 1;0 1 0];
 
+
>> C=sum(L);
== Class 16 - Thursday June 27th 2013 ==
  
=== Page Rank ===
<math>L_{ij}</math> = 1 (if j has a link to i) & 0 (otherwise) <br>
<math>C_j</math>: the number of outgoing links for page j, where <math>c_j=\sum_i L_{ij}</math>

P is a Nx1 vector that contains the rank of all N pages.<br>
For page i, the rank is <math>P_i</math>

<math>P_i= (1-d) + d\cdot \sum_j \frac {L_{ij}P_j}{c_j}</math> <br>
where 0 < d < 1 is a constant (in the original page rank algorithm, d = 0.8).

<math>L=
\left[ {\begin{matrix}
0 & 1 & 1 \\
1 & 0 & 0 \\
1 & 0 & 0 \end{matrix} } \right]</math>

then C = [2 1 1]

<br/>'''Note:'''<br/>
The rank of page i is <br>
1) Proportional to the importance of each page that links to it, and <br>
2) Inversely proportional to the total number of links coming from each of those pages.

If given all other P<sub>j</sub>, we can easily calculate P<sub>i</sub>. However, we don't know any of them, thus we need to solve N unknowns with the N equations that we have.

<br/>'''Note:'''<br/>
We do not want a page with rank 0 to occur (to give new websites an opportunity to be clicked), so we use d (the damping factor).

The reason why we multiply the term <math>\sum_j \frac {L_{ij}P_j}{c_j}</math> by d is to get rid of the problem with extreme cases where one term is dominant over the others.

<br/>'''Interpretation of the formula:'''<br/>
1) The sum over L<sub>ij</sub> counts the incoming links of page i.<br/>
2) The sum is weighted by the page rank of the pages that contain the link to i (P<sub>j</sub>), i.e. if a high-rank page points to page i, then this link carries more weight than links from lower-rank pages.<br/>
3) The sum is also weighted by the inverse of the number of outgoing links from the pages that contain links to i (c<sub>j</sub>), i.e. if a page has more outgoing links than other pages, then each of its links carries less weight.<br/>
4) Finally, we take a linear combination of the page rank obtained from above and a constant 1. This ensures that every page has a rank greater than zero.<br/>

Note that this is a system of N equations with N unknowns.<br/>

<math>c_j</math> is the number of outgoing links of page j; the fewer outgoing links a page has, the more weight each of its links carries.<br/>

Let D be a diagonal N by N matrix such that <math> D_{ii}</math> = <math>c_i</math>

<math>D=
\left[ {\begin{matrix}
c_1 & 0 & ... & 0  \\
0 & c_2 & ...  & 0  \\
0 & 0 & ... &  0 \\
0 & 0 & ... & c_N \end{matrix} } \right]</math>

Then P = (1-d) e + dLD<sup>-1</sup>P <br/> where e = [1 1 ....]<sup>T</sup>, i.e. a N by 1 vector.<br/>

The relative value of page rank is valuable, but the absolute value is meaningless. <br/>
P is a vector of ranks which could contain any arbitrary numbers; we only care about whether one page is more important than another (whether P<sub>i</sub> <math>\geq</math> P<sub>j</sub>).

We assume that the ranks of all N pages sum to N. The sum could be any number, as long as the ranks keep the same proportions. <br/>
i.e. e<sup>T</sup> P = N, then <math>~\frac{e^{T}P}{N} = 1</math>

D<sup>-1</sup> will be:

D<sup>-1</sup><math>=
\left[ {\begin{matrix}
\frac {1}{c_1} & 0 & ... & 0  \\
0 & \frac {1}{c_2} & ...  &  0  \\
0 & 0 & ... &  0 \\
0 & 0 & ... & \frac {1}{c_N} \end{matrix} } \right]</math>

<math>P=~(1-d)e+dLD^{-1}P</math>  where <math>e=\begin{bmatrix}
1\\
1\\
...\\
1
\end{bmatrix}</math>

<math>P=(1-d)~\frac{ee^{T}P}{N}+dLD^{-1}P</math>

<math>P=[(1-d)~\frac{ee^T}{N}+dLD^{-1}]P</math>

<math>P=AP, \text{ where } A=[(1-d)~\frac{ee^T}{N}+dLD^{-1}]</math>

<br> P is a column vector; its transpose P<sup>T</sup> is the stationary distribution of A<sup>T</sup>.
<br> A<sup>T</sup> is a transition probability matrix (N<math>\times</math>N), as shown below.

<b>The following variables are necessary to calculate page rank:</b>
* <math>N</math>: the number of pages <br>
* <math>L</math>: an <math>N</math> by <math>N</math> binary matrix (every entry is 1 or 0) <br>
* <math>D^{-1}</math>: an <math>N</math> by <math>N</math> diagonal matrix <br>
* <math>P</math>: an <math>N</math> by 1 column vector <br>
* <math>d</math>: a constant between 0 and 1 (in the original algorithm, d = 0.8) <br>

<br/>'''Example:'''<br/>Given that <br>
<math>L=
\left[ {\begin{matrix}
L_{11} & L_{12} \\
L_{21} & L_{22} \end{matrix} } \right]</math>

<math>D^{-1}=
\left[ {\begin{matrix}
1/C_1 & 0 \\
0 & 1/C_2 \end{matrix} } \right]</math>

<math>P=
\left[ {\begin{matrix}
P_1 \\
P_2 \end{matrix} } \right]</math>

Then<br>
L D<sup>-1</sup> P =
<math>
\left[ {\begin{matrix}
L_{11}/C_1 & L_{12}/C_2\\
L_{21}/C_1 & L_{22}/C_2
\end{matrix} } \right] \left [ \begin{matrix}
P_1 \\
P_2
\end{matrix} \right] </math>=<math>\left [ \begin{matrix}
L_{11}P_1/C_1+L_{12}P_2/C_2            \\
L_{21}P_1/C_1+L_{22}P_2/C_2
\end{matrix} \right] </math><br />

'''P=AP'''<br />
P is an eigenvector of A with corresponding eigenvalue equal to 1.<br>

'''P<sup>T</sup>=P<sup>T</sup>A<sup>T</sup><br>'''

Notice that all entries in A<sup>T</sup> are non-negative and each row sums to 1. <br>
Hence A<sup>T</sup> satisfies the definition of a transition probability matrix.<br>
P<sup>T</sup> is the stationary distribution of a Markov Chain with transition probability matrix A<sup>T</sup>.

We can consider A<sup>T</sup> to be the matrix describing all possible movements following links on the internet, and P<sup>T</sup> as the probability of being on any given webpage if we have been on the internet long enough.

=== Damping Factor "d" (http://en.wikipedia.org/wiki/PageRank) ===

PageRank assumes that an imaginary user who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is the damping factor, "d". It is generally assumed that "d" is set around a value of 0.85. <br/>
(1-d) is the likelihood of a web surfer jumping to a page chosen at random throughout the entire web. <br/>
A web page with no outbound links is known as a "sink" page. If a web surfer lands on a sink page, they are randomly transferred to another page and continue surfing. Thus, it is assumed that a page with no outbound links is linked to all other pages in the web.<br/><br/>
Additionally, the higher the damping factor, the larger the effect of an additional inbound link on the PageRank of the page that receives the link, and the more evenly PageRank is distributed over the other pages of a site.

<span style="text-shadow: 0px 2px 3px hsl(310,15%,65%);margin-right:1em;font-family: 'Nobile', Helvetica, Arial, sans-serif;font-size:16px;line-height:25px;color:3399CC">Page-ranking systems are also used extensively in database management systems.</span>

===Tips===
<b>The following function can be used to calculate A in Matlab:</b>
<div>
function [A]=getA(d,N,L,D)
A=(1-d)*ones(N)/N+d*L*pinv(D);
end
</div>
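As a usage sketch, getA can be applied to the five-page example pictured earlier, where Pages 1 and 5 have no incoming links (the session below is illustrative, not output from the lecture):
<pre style='font-size:14px'>
% The 5x5 link matrix from the earlier diagram
L=[0 0 0 0 0; 1 0 0 0 0; 1 1 0 1 0; 0 1 0 0 1; 0 0 0 0 0];
d=0.8;
N=5;
D=diag(sum(L));          % D_ii = c_i, the number of outgoing links of page i
A=getA(d,N,L,D);         % pinv inside getA handles any zero diagonal entries of D
[vec val]=eigs(A);
a=vec(:,1); a=a/sum(a)   % normalized ranks; every page gets a positive rank because of the (1-d) term
</pre>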
 
  
===Examples===
<span style="background:#F5F5DC">
<div style="border:1px green solid">

==== Example 1 ====

[[File:eg1.jpg]]
<br />
</span>
<math>L=
\left[ {\begin{matrix}
0 & 0 & 1  \\
1 & 0 & 0  \\
0 & 1 & 0 \end{matrix} } \right]</math>

<math>c=
\left[ {\begin{matrix}
1 & 1 & 1 \end{matrix} } \right]</math>

<math>D=
\left[ {\begin{matrix}
1 & 0 & 0  \\
0 & 1 & 0  \\
0 & 0 & 1 \end{matrix} } \right]</math>

<pre style='font-size:14px'>

MATLAB Code

d=0.8
N=3
A=(1-d)*ones(N)/N+d*L*pinv(D)        % pinv: Moore-Penrose inverse (pseudoinverse) of a matrix
% We use the pinv(D) function [pseudo-inverse] instead of the inv(D) function because in the case of
% a non-invertible matrix, it would not crash the program.
[vec val]=eigs(A)                    % eigen-decomposition
a=-vec(:,1)                          % find the eigenvector corresponding to eigenvalue 1
a=a/sum(a)                           % normalize a
% Alternatively, to show that A transpose is a stationary transition matrix,
% (transpose(A))^200 will give the same result as a=a/sum(a).
</pre>

'''NOTE:''' <br>
Changing the value of d does not change the ranking order of the pages.

By looking at each entry after normalizing a, we can tell the ranking order of each page.<br>
<span style="background:#F5F5DC">

c = [1 1 1] since there are 3 pages, each page is one-way recurrent to each other and there is only one outgoing link for each page. <br>
Hence, D is the 3x3 identity matrix.
 
 
 
==== Example 2 ====

[[File:Screen_shot_2013-07-02_at_3.43.04_AM.png]]

<math>L=
\left[ {\begin{matrix}
0 & 0 & 1  \\
1 & 0 & 1  \\
0 & 1 & 0 \end{matrix} } \right]</math>

<math>c=
\left[ {\begin{matrix}
1 & 1 & 2 \end{matrix} } \right]</math>

<math>D=
\left[ {\begin{matrix}
1 & 0 & 0  \\
0 & 1 & 0  \\
0 & 0 & 2 \end{matrix} } \right]</math>

<pre style='font-size:14px'>

Matlab code

>> L=[0 0 1;1 0 1;0 1 0];
>> C=sum(L);
>> D=diag(C);
>> d=0.8;
>> N=3;
>> A=(1-d)*ones(N)/N+d*L*pinv(D);
>> [vec val]=eigs(A)

vec =

  -0.3707            -0.3536 + 0.3536i  -0.3536 - 0.3536i
  -0.6672            -0.3536 - 0.3536i  -0.3536 + 0.3536i
  -0.6461             0.7071             0.7071          

val =

   1.0000                  0                  0          
        0            -0.4000 - 0.4000i        0          
        0                  0            -0.4000 + 0.4000i

>> a=-vec(:,1)

a =

    0.3707
    0.6672
    0.6461

>> a=a/sum(a)   % normalize a

a =

    0.2201
    0.3962
    0.3836
</pre>

'''NOTE:''' <br>
Page 2 is the most important page, because it has 2 incoming links. <br>
Page 3 is more important than page 1 because page 3's incoming link comes from page 2, the highest-ranked page.

This example is similar to the first example, but here, page 3 can also go back to page 2.<br>
Hence, page 3 has two outgoing links, so the third diagonal entry of the D matrix is 2. We then use the code to solve p=Ap.
 
  
==== Example 3 ====

[[File:eg 3.jpg]]<br>

<math>L=
\left[ {\begin{matrix}
0 & 1 & 0  \\
1 & 0 & 1  \\
0 & 1 & 0 \end{matrix} } \right]</math>

<math>c=
\left[ {\begin{matrix}
1 & 2 & 1 \end{matrix} } \right]</math>

<math>D=
\left[ {\begin{matrix}
1 & 0 & 0  \\
0 & 2 & 0  \\
0 & 0 & 1 \end{matrix} } \right]</math>

<math>d=0.8</math><br>
<math>N=3</math><br>

</span>
In this example the second page has 2 incoming links (from the first and third pages), so page 2 is the most important.
Since pages 1 and 3 each receive their single incoming link from page 2, they are equally important.

</div>
+
>> [vec val]=eigs(A)
====Another Example:====

Consider: 1 <-> 2 -> 3

<math>L=
\left[ {\begin{matrix}
0 & 1 & 0 \\
1 & 0 & 0 \\
0 & 1 & 0 \end{matrix} } \right]</math>
<math>c=
\left[ {\begin{matrix}
1 & 2 & 0 \end{matrix} } \right]</math>
<math>D=
\left[ {\begin{matrix}
1 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 0 \end{matrix} } \right]</math>

Here page 3 has no outgoing links (c<sub>3</sub> = 0), which is why the code uses pinv(D) rather than inv(D).
 
  
==== Example 4 ====

1 <-> 2 -> 3 <-> 4

<math>L=
\left[ {\begin{matrix}
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 \\
0 & 0 & 1 & 0 \end{matrix} } \right]</math>

<math>c=
\left[ {\begin{matrix}
1 & 2 & 1 & 1 \end{matrix} } \right]</math>

<math>D=
\left[ {\begin{matrix}
1 & 0 & 0 & 0 \\
0 & 2 & 0 & 0 \\
0 & 0 & 1 & 0  \\
0 & 0 & 0 & 1 \end{matrix} } \right]</math><br />

Matlab code
<pre style='font-size:14px'>

>> L= [0 1 0 0;1 0 0 0;0 1 0 1;0 0 1 0];
>> C=sum(L);
>> D=diag(C);
>> d=0.8;
>> N=4;
>> A=(1-d)*ones(N)/N+d*L*pinv(D)

A =

    0.0500    0.4500    0.0500    0.0500
    0.8500    0.0500    0.0500    0.0500
    0.0500    0.4500    0.0500    0.8500
    0.0500    0.0500    0.8500    0.0500

>> [vec val]=eigs(A)

vec =

    0.1817   -0.0000   -0.4082    0.4082
    0.2336    0.0000    0.5774    0.5774
    0.7009   -0.7071    0.4082   -0.4082
    0.6490    0.7071   -0.5774   -0.5774

val =

    1.0000        0        0        0
         0  -0.8000        0        0
         0        0  -0.5657        0
         0        0        0   0.5657

>> a=vec(:,1)

a =

    0.1817
    0.2336
    0.7009
    0.6490

>> a=a/sum(a)

a =

    0.1029
    0.1324
    0.3971
    0.3676
</pre>
'''NOTE:'''<br>
The ranking of the pages is as follows: Page 3, Page 4, Page 2 and Page 1.<br>
Page 3 has the highest ranking, since it has the most incoming links. <br>
All of the other pages only have one incoming link, but Page 4 becomes the second highest ranked since Page 3 (the highest ranked page) links to Page 4. <br>
Page 2 is ranked next: it outranks Page 1 because Page 1 sends its entire weight to Page 2 (one outgoing link), while Page 2 splits its weight between Pages 1 and 3 (two outgoing links). <br>
 
  
==== Example 5 ====

<math>L=
\left[ {\begin{matrix}
0 & 1 & 0 & 1 \\
1 & 0 & 1 & 1 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 \end{matrix} } \right]</math>

<math>c=
\left[ {\begin{matrix}
3 & 1 & 1 & 3 \end{matrix} } \right]</math>

<math>D=
\left[ {\begin{matrix}
3 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0  \\
0 & 0 & 0 & 3 \end{matrix} } \right]</math>

<pre style='font-size:14px'>

Matlab code

>> L= [0 1 0 1; 1 0 1 1; 1 0 0 1;1 0 0 0];
>> d = 0.8;
>> N = 4;
>> C = sum(L);
>> D = diag(C);
>> A=(1-d)*ones(N)/N+d*L*pinv(D);
>> [vec val]=eigs(A);
>> a=vec(:,1);
>> a=a/sum(a)

a =

    0.3492
    0.3263
    0.1813
    0.1431
</pre>
 
 
 
 
 
 
 
==== Example 6 ====
<math>L=
\left[ {\begin{matrix}
0 & 1 & 0 & 0 & 1\\
1 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0\\
0 & 1 & 1 & 0 & 1\\
0 & 0 & 0 & 1 & 0 \end{matrix} } \right]</math>
<br />

Matlab Code<br />
<pre style="font-size:16px">
>> d=0.8

d =

    0.8000

>> L=[0 1 0 0 1;1 0 0 0 0;0 1 0 0 0;0 1 1 0 1;0 0 0 1 0]

L =

     0     1     0     0     1
     1     0     0     0     0
     0     1     0     0     0
     0     1     1     0     1
     0     0     0     1     0

>> c=sum(L)

c =

     1     3     1     1     2

>> D=diag(c)

D =

     1     0     0     0     0
     0     3     0     0     0
     0     0     1     0     0
     0     0     0     1     0
     0     0     0     0     2

>> N=5

N =

     5

>> A=(1-d)*ones(N)/N+d*L*pinv(D)

A =

    0.0400    0.3067    0.0400    0.0400    0.4400
    0.8400    0.0400    0.0400    0.0400    0.0400
    0.0400    0.3067    0.0400    0.0400    0.0400
    0.0400    0.3067    0.8400    0.0400    0.4400
    0.0400    0.0400    0.0400    0.8400    0.0400

>> [vec val]=eigs(A)

vec =

  Columns 1 through 4

  -0.4129             0.4845 + 0.1032i   0.4845 - 0.1032i  -0.0089 + 0.2973i
  -0.4158            -0.6586            -0.6586            -0.5005 + 0.2232i
  -0.1963             0.2854 - 0.0608i   0.2854 + 0.0608i  -0.2570 - 0.2173i
  -0.5700             0.1302 + 0.2612i   0.1302 - 0.2612i   0.1462 - 0.3032i
  -0.5415            -0.2416 - 0.3036i  -0.2416 + 0.3036i   0.6202          

  Column 5

  -0.0089 - 0.2973i
  -0.5005 - 0.2232i
  -0.2570 + 0.2173i
   0.1462 + 0.3032i
   0.6202          

val =

  Columns 1 through 4

   1.0000                  0                  0                  0          
        0            -0.5886 - 0.1253i        0                  0          
        0                  0            -0.5886 + 0.1253i        0          
        0                  0                  0             0.1886 - 0.3911i
        0                  0                  0                  0          

  Column 5

        0          
        0          
        0          
        0          
   0.1886 + 0.3911i

>> a=-vec(:,1)

a =

    0.4129
    0.4158
    0.1963
    0.5700
    0.5415

>> a=a/sum(a)

a =

    0.1933
    0.1946
    0.0919
    0.2668 % (the most important)
    0.2534
</pre>

For the matrix above (where c = [1 3 1 1 2], as computed in the session):<br>
page 4 has 3 incoming links;<br>
page 5 has 1 incoming link (from page 4);<br>
page 2 has 1 incoming link;<br>
page 1 has 2 incoming links;<br>
page 3 has 1 incoming link.<br>

The rank is then: page 4, page 5, page 2, page 1, page 3.<br />
 
 
 
== Class 17 - Tuesday July 2nd 2013 ==
 
=== Markov Chain Monte Carlo (MCMC) ===
 
 
 
====Definition:====
 
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps. (http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo)
 
 
 
<a style="color:red" href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-165.pdf">some notes form UCb</a>
 
 
 
'''One of the main purposes of MCMC''': to simulate samples from a joint distribution where the joint random variables are dependent. In general, such a distribution is not easy to sample from. The other methods learned in class allow us to simulate i.i.d. random variables, but not dependent ones. In this case, we can sample non-independent random variables using a Markov Chain, whose Markov properties help to simplify the simulation process.
 
 
 
 
 
<b>Basic idea:</b>  Given a probability distribution <math>\pi</math> on a set <math>\Omega</math>, we want to generate random elements of <math>\Omega</math> with distribution <math>\pi</math>. MCMC does that by constructing a Markov Chain with stationary distribution <math>\pi</math> and simulating the chain. After a large number of iterations, the Markov Chain will reach its stationary distribution. By sampling from the Markov chain for a large number of iterations, we are effectively sampling from the desired distribution, as the Markov Chain converges to its stationary distribution. <br/>
 
 
 
Idea: generate a Markov chain whose stationary distribution is the same as target distribution. <br/>
 
 
 
 
 
'''Note''' <br/>
 
1) Regardless of the chosen starting point, the Markov Chain will converge to its stationary distribution (if it exists). However, the time taken for the chain to converge depends on its chosen starting point. Typically, the burn-in period is longer if the chain is initialized with a value of low probability density<br/>
 
 
 
2) Markov Chain Monte Carlo can be used for sampling from a distribution, estimating the distribution, computing means, and optimization (e.g. simulated annealing, more on that later). <br>
 
 
 
3) Markov Chain Monte Carlo is used to sample using “local” information. It is used as a generic “problem solving technique” to solve decision/optimization/value problems, but is not necessarily very efficient.<br/>
 
 
 
4) MCMC methods do not suffer as badly from the "curse of dimensionality" that badly affects efficiency in the acceptance-rejection method. This is because a point is always generated at each time-step according to the Markov Chain regardless of how many dimensions are introduced.<br>
 
 
 
5) The goal when simulating with a Markov Chain is to create a chain with the same stationary distribution as the target distribution.<br/>
 
 
 
6) The MCMC method is usually used in continuous cases but a discrete example is given below.<br />
 
 
 
 
 
'''Some properties of the stationary distribution <math>\pi</math>'''
 
 
 
<math>\pi</math> indicates the proportion of time the process spends in each of the states 1,2,...,n. Therefore <math>\pi</math> satisfies the following two equations: <br>

1) <math>\pi_j = \sum_{i=1}^{n}\pi_i P_{ij}</math>.<br>
This is because <math>\pi_i</math> is the proportion of time the process spends in state i, and <math>P_{ij}</math> is the probability that the process transitions from state i to state j. Therefore, <math>\pi_i P_{ij}</math> is the long-run proportion of transitions that take the process from state i into state j, and <math>\pi_j</math> is the sum of this quantity over all states i. <br>

2) <math> \sum_{i=1}^{n}\pi_i= 1 </math>, as <math>\pi</math> shows the proportion of time the chain is in each state. If we view it as the probability of the chain being in state i at time t for t sufficiently large, then it should sum to one, as the chain must be in one of the states. <br>
 
 
 
====Motivation example====
 
- Suppose we want to generate a random variable X according to distribution <math>\pi=(\pi_1, \pi_2,  ...  , \pi_m)</math> <br/>
 
X can take m possible different values from <math>\{1,2,3,\cdots, m\}</math><br />
 
- We want to generate <math>\{X_t: t=0, 1, \cdots\}</math> according to <math>\pi</math><br />
 
 
 
Suppose our example is a biased die. <br/>
 
Now we have m=6, <math>\pi=[0.1,0.1,0.1,0.2,0.3,0.2]</math>, <math>X \in [1,2,3,4,5,6]</math><br/>
 
 
 
Suppose <math>X_t=i</math>. Consider an arbitrary probability transition matrix Q with entry <math>q_{ij}</math> being the probability of moving to state j from state i. (<math>q_{ij}</math> cannot be zero.) <br/>
 
 
 
<math> \mathbf{Q} =
 
\begin{bmatrix}
 
q_{11} & q_{12} & \cdots & q_{1m} \\
 
q_{21} & q_{22} & \cdots & q_{2m} \\
 
\vdots & \vdots & \ddots & \vdots \\
 
q_{m1} & q_{m2} & \cdots & q_{mm}
 
\end{bmatrix}
 
</math> <br/>
 
 
 
 
 
We generate Y = j according to the i-th row of Q. Note that the i-th row of Q is a probability vector that shows the probability of moving to any state j from the current state i, i.e.<math>P(Y=j)=q_{ij}</math><br />
 
 
 
In the following algorithm: <br>
 
<math>q_{ij}</math> is the <math>ij^{th}</math> entry of matrix Q. It is the probability of Y=j given that <math>x_t = i</math>. <br/>
 
<math>r_{ij}</math> is the probability of accepting Y as <math>x_{t+1}</math>. <br/>
 
 
 
 
 
'''How to get the acceptance probability?'''
 
 
 
If <math>\pi </math> is the stationary distribution, then it must satisfy the detailed balance condition:<br/>
 
<math>\pi_i P_{ij}</math> = <math>\pi_j P_{ji}</math><br/>
 
 
 
Since <math>P_{ij}</math> = <math>q_{ij} r_{ij}</math>, we have <math>\pi_i q_{ij} r_{ij}</math> = <math>\pi_j q_{ji} r_{ji}</math>.<br/>
 
We want to find a general solution: <math>r_{ij} = a(i,j) \pi_j q_{ji}</math>, where a(i,j) = a(j,i).<br/>
 
 
 
'''Recall'''
 
<math>r_{ij}</math> is the probability of acceptance, thus it must be that <br/>
 
 
 
1) <math>r_{ij} = a(i,j)\,\pi_j q_{ji} \leq 1</math>, so we need <math>a(i,j) \leq \frac{1}{\pi_j q_{ji}}</math>

2) <math>r_{ji} = a(j,i)\,\pi_i q_{ij} \leq 1</math>, so we need <math>a(j,i) \leq \frac{1}{\pi_i q_{ij}}</math>; since a(i,j) = a(j,i), this also bounds a(i,j).
 
 
 
So we choose a(i,j) as large as possible, but it needs to satisfy the two conditions above.<br/>
 
 
 
<math>a(i,j) = \min \{\frac{1}{\pi_j q_{ji}},\frac{1}{\pi_i q_{ij}}\} </math><br/>
 
 
 
Thus, <math> r_{ij} = \min \{\frac{\pi_j q_{ji}}{\pi_i q_{ij}}, 1\} </math><br/>
 
 
 
'''Note''':
 
1 is the upper bound to make r<sub>ij</sub> a probability
 
 
 
 
 
'''Algorithm:'''  <br/>
 
*<math>P(Y=j) = q_{ij} </math> (note that the ratio <math>\frac{\pi_j q_{ji}}{\pi_i q_{ij}}</math> below is always positive)
 
 
 
*<math> r_{ij} = \min \{\frac{\pi_j q_{ji}}{\pi_i q_{ij}}, 1\} </math> <br/>
 
*<math>
 
x_{t+1} = \begin{cases}
 
Y, & \text{with probability } r_{ij} \\
 
x_t, & \text{otherwise} \end{cases} </math> <br/>
 
* go back to the first step  <br/>
 
 
 
We can compare this with the Acceptance-Rejection model we learned before. <br/>
 
* <math>U</math> ~ <math>Uniform(0,1)</math> <br/>
 
* If <math>U < r_{ij}</math>, then accept. <br/>
 
EXCEPT that a point is always generated at each time-step. <br>
 
 
 
The algorithm generates a stochastic sequence that only depends on the last state, which is a Markov Chain.<br>
 
 
 
====Metropolis Algorithm====
 
 
 
'''Proposition: ''' Metropolis works:
 
 
 
The <math>P_{ij}</math>'s from Metropolis Algorithm satisfy detailed balance property w.r.t <math>\pi</math> . i.e. <math>\pi_i P_{ij} = \pi_j P_{ji}</math>. The new Markov Chain has a stationary distribution <math>\pi</math>. <br/>
 
'''Remarks:''' <br/>
 
1) We only need to know ratios of values of <math>\pi_i</math>'s.<br/>
 
2) The MC might converge to <math>\pi</math> at varying speeds depending on the proposal distribution and the value the chain is initialized with<br/>
 
 
 
 
 
This algorithm generates <math>\{x_t:  t=0,...,m\}</math>. <br/>
 
In the long run, the marginal distribution of <math> x_t </math> is the stationary distribution <math>\underline{\Pi} </math><br>
 
<math>\{x_t: t = 0, 1,...,m\}</math> is a Markov chain with probability transition matrix (PTM), P.<br>
 
 
 
This is a Markov Chain since <math> x_{t+1} </math> only depends on <math> x_t </math>, where <br>
 
<math> P_{ij}= \begin{cases}
 
q_{ij} r_{ij}, & \text{if }i \neq j \\[6pt]
 
1 - \displaystyle\sum_{k \neq i} q_{ik} r_{ik}, & \text{if }i = j \end{cases} </math><br />
 
 
 
<math>q_{ij}</math> is the probability of generating state j; <br/>
 
<math> r_{ij}</math> is the probability of accepting state j as the next state. <br/>
 
 
 
Therefore, the final probability of moving from state i to j when i does not equal to j is <math>q_{ij}*r_{ij}</math>. <br/>
 
For the probability of moving from state i to state i, we deduct all the probabilities of moving from state i to any j that are not equal to i, therefore, we get the second probability.
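As a numerical sketch of this transition matrix (using the biased-die <math>\pi</math> from the motivation example with a uniform proposal; the code below is illustrative), we can build P from <math>q_{ij}</math> and <math>r_{ij}</math> and check that <math>\pi P = \pi</math>:
<pre style='font-size:14px'>
% Build the Metropolis transition matrix P for the biased die
pii = [0.1 0.1 0.1 0.2 0.3 0.2];    % target distribution
m = 6;
Q = ones(m)/m;                       % uniform proposal: q_ij = 1/6
P = zeros(m);
for i = 1:m
    for j = 1:m
        if i ~= j
            r = min(pii(j)*Q(j,i)/(pii(i)*Q(i,j)), 1);   % acceptance probability
            P(i,j) = Q(i,j)*r;
        end
    end
    P(i,i) = 1 - sum(P(i,:));        % probability of staying in state i
end
pii*P                                 % returns pii, so pii is stationary
</pre>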
 
 
 
===Proof of the proposition:===
 
 
 
A good way to think of the detailed balance equation is that they balance the probability from state i to state j with that from state j to state i.
 
We need to show that the stationary distribition of the Markov Chain is <math>\underline{\Pi}</math>, i.e. <math>\displaystyle \underline{\Pi} = \underline{\Pi}P</math><br />
 
<div style="text-size:20px">
 
Recall<br/>
 
If a Markov chain satisfies the detailed balance property, i.e. <math>\displaystyle \pi_i P_{ij} = \pi_j P_{ji} \, \forall i,j</math>, then <math>\underline{\Pi}</math> is the stationary distribution of the chain.<br /><br />
 
</div>
 
 
 
'''Proof:'''
 
 
 
WLOG, we can assume that <math>\frac{\pi_j q_{ji}}{\pi_i q_{ij}}<1</math><br/>
 
 
 
LHS:<br />
 
<math>\pi_i P_{ij} = \pi_i q_{ij} r_{ij} = \pi_i q_{ij} \cdot \min(\frac{\pi_j q_{ji}}{\pi_i q_{ij}},1) = \cancel{\pi_i q_{ij}} \cdot \frac{\pi_j q_{ji}}{\cancel{\pi_i q_{ij}}} = \pi_j q_{ji}</math><br />
 
 
 
RHS:<br />
 
Note that by our assumption, since <math>\frac{\pi_j q_{ji}}{\pi_i q_{ij}}<1</math>, its reciprocal <math>\frac{\pi_i q_{ij}}{\pi_j q_{ji}} \geq 1</math><br />
 
So <math>\displaystyle \pi_j P_{ji} = \pi_ j q_{ji} r_{ji} = \pi_ j q_{ji} \cdot \min(\frac{\pi_i q_{ij}}{\pi_j q_{ji}},1) =  \pi_j q_{ji} \cdot 1 = \pi_ j q_{ji}</math><br />
 
 
 
Hence LHS=RHS
 
 
 
If we instead assume that <math>\frac{\pi_j q_{ji}}{\pi_i q_{ij}} \geq 1</math>:<br/>
 
 
 
LHS:<br />
 
<math>\pi_i P_{ij} = \pi_i q_{ij} r_{ij} = \pi_i q_{ij} \cdot \min(\frac{\pi_j q_{ji}}{\pi_i q_{ij}},1)  =\pi_i q_{ij} \cdot 1 = \pi_i q_{ij}</math><br />
 
 
 
RHS:<br />
 
'''Note''' <br/>
 
by our assumption, since <math>\frac{\pi_j q_{ji}}{\pi_i q_{ij}}\geq 1</math>, its reciprocal <math>\frac{\pi_i q_{ij}}{\pi_j q_{ji}} \leq 1 </math> <br />
 
 
 
So <math>\displaystyle \pi_j P_{ji} = \pi_ j q_{ji} r_{ji} = \pi_ j q_{ji} \cdot \min(\frac{\pi_i q_{ij}}{\pi_j q_{ji}},1) =  \cancel{\pi_j q_{ji}} \cdot \frac{\pi_i q_{ij}}{\cancel{\pi_j q_{ji}}} = \pi_i q_{ij}</math><br />
 
 
 
Hence LHS=RHS <math>\square</math><br /><br />
 
 
 
'''Note'''<br />
 
1) If we instead assume <math>\displaystyle \frac{\pi_i q_{ij}}{\pi_j q_{ji}} \geq 1</math>, the proof is similar with LHS= RHS =  <math> \pi_i q_{ij} </math> <br />
 
 
 
2) If <math>\displaystyle i = j</math>, then detailed balance is satisfied trivially.<br />
 
 
 
The two cases above cover every possibility, since the ratio <math>\frac{\pi_j q_{ji}}{\pi_i q_{ij}}</math> is either less than 1 or at least 1; hence detailed balance holds in all cases.
 
 
 
== Class 18 - Thursday July 4th 2013 ==
 
=== Last class ===
 
recall for the acceptance probability <math>r_{ij}</math> <br />
 
<math>r_{ij}=min(\frac {{\pi_j}q_{ji}}{{\pi_i}q_{ij}},1)</math> <br />
 
when <math>\frac {{\pi_j}q_{ji}}{{\pi_i}q_{ij}} < 1</math>, we have 
 
<math>r_{ij}=\frac {{\pi_j}q_{ji}}{{\pi_i}q_{ij}}</math>, and <math>r_{ji}=1 </math><br />
 
when <math>\frac {{\pi_j}q_{ji}}{{\pi_i}q_{ij}} \geq 1</math>, we have
 
<math>r_{ji}=\frac {{\pi_i}q_{ij}}{{\pi_j}q_{ji}}</math>, and <math> r_{ij}=1 </math><br />
 
 
 
===Example: Discrete Case===
 
Consider a biased die
 
<math>\pi</math>= [0.1, 0.1, 0.2, 0.4, 0.1, 0.1]
 
 
 
We could use any <math>6 \times 6</math> matrix <math> \mathbf{Q} </math> as the proposal distribution. <br>

For the sake of simplicity, we use a discrete uniform distribution.
 
 
 
<math> \mathbf{Q} =
 
\begin{bmatrix}
 
1/6 & 1/6 & \cdots & 1/6 \\
 
1/6 & 1/6 & \cdots & 1/6 \\
 
\vdots & \vdots & \ddots & \vdots \\
 
1/6 & 1/6 & \cdots & 1/6
 
\end{bmatrix}
 
</math> <br/>
 
 
 
'''Algorithm''' <br>
 
1. <math>x_t=5</math> (sample from the 5th row, although we can initialize the chain from anywhere within the support)<br />
 
2. Y~Unif[1,2,...,6]<br />
 
3. <math> r_{ij} = \min \{\frac{\pi_j q_{ji}}{\pi_i q_{ij}}, 1\} = \min \{\frac{\pi_j  1/6}{\pi_i  1/6}, 1\} = \pi_j/\pi_i </math><br>
 
 
 
Note:  current state i is X<sub>t</sub>,  the candidate state j is Y. <br>
 
Note: since q<sub>ij</sub> = q<sub>ji</sub> for all i and j, that is, the proposal distribution is symmetric, we have <math> r_{ij} = \min \{\frac{\pi_j}{\pi_i }, 1\} </math>
 
 
 
4. U~Unif(0,1)<br />
 
  if <math>u \leq r_{ij}</math>,<br />X<sub>t+1</sub>=Y<br />
 
  else<br />
 
  X<sub>t+1</sub>=X<sub>t</sub><br />
 
  end if<br />
 
  go to (2)<br>
 
 
 
Notice how a point is always generated for X<sub>t+1</sub> regardless of whether the candidate state Y is accepted <br>
 
 
 
'''Matlab'''
 
<pre style="font-size:14px">
 
pii=[.1,.1,.2,.4,.1,.1];
 
x(1)=5;
 
for ii=2:1000
 
  Y=unidrnd(6);                %%% unidrnd(n) is a built-in function which generates an integer uniformly between 1 and n
 
  r = min (pii(Y)/pii(x(ii-1)), 1);
 
  u=rand;
 
  if u<r
 
    x(ii)=Y;
 
  else
 
    x(ii)=x(ii-1);
 
  end
 
end
 
hist(x,6)    %generate histogram displaying all 1000 points
 
xx = x(501:end);    % discard the first 500 points; after that the chain has mixed well and converged
hist(xx,6)          % this histogram should match the target distribution better
 
</pre>
 
[[File:MH_example1.jpg|300px]]
 
 
 
 
 
'''NOTE:''' Generally, we will generate a large number of points (say, 1500) and throw away the first points (say, 500). Those first points are called the [[burn-in period]]. Since the chain is said to converge in the long run, the burn-in period is where the chain is converging toward the limiting distribution but has not converged yet; by discarding those 500 points, our data set will be more representative of the desired limiting distribution. Once the burn-in period is over, we say that the chain "mixes well".
 
 
 
 
 
'''Generalization of the above framework to the continuous case'''<br>
 
 
 
In place of <math>\pi</math> use <math>f(x)</math> <br>

In place of <math>q_{ij}</math> use <math>q(y|x)</math> <br>

In place of <math>r_{ij}</math> use <math>r(x,y)</math> <br>

Here, q(y|x) is a friendly distribution that is easy to sample from; usually a symmetric distribution is preferable, such that <math>q(y|x) = q(x|y)</math>, to simplify the computation of <math>r(x,y)</math>.
 
 
 
 
 
'''Remarks'''<br>
 
1. The chain may not reach the stationary distribution if the number of steps generated is small; it can take a very large number of steps to move through the whole support.<br>

2. The algorithm can be performed with a <math>\pi</math> that is not even a probability mass function; it merely needs to be proportional to the probability mass function we wish to sample from. This is useful as we do not need to calculate the normalization factor. <br>
 
 
 
For example, if we are given <math>\pi'=\pi\alpha=[5,10,11,2,100,1]</math>, we can normalize this vector by dividing it by the sum of all its entries, <math>s</math>.<br>

However, we notice that when calculating <math>r_{ij}</math>, <br>

<math>\frac{\pi'_j/s}{\pi'_i/s}\times\frac{q_{ji}}{q_{ij}}=\frac{\pi'_j}{\pi'_i}\times\frac{q_{ji}}{q_{ij}}</math> <br>

<math>s</math> cancels out. Therefore it is not necessary to calculate the sum and normalize the vector.<br>
 
 
 
This also applies to the continuous case,where we merely need <math> f(x) </math> to be proportional to the pdf of the distribution we wish to sample from. <br>
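As a one-line numerical check of this remark (with a hypothetical unnormalized vector and a symmetric proposal so that the q's cancel), the acceptance ratio is unaffected by the normalization:
<pre style='font-size:14px'>
ppi = [5 10 11 2 100 1];                         % unnormalized target (illustrative values)
s = sum(ppi);
i = 5; j = 2;                                    % any pair of states
r_unnormalized = min(ppi(j)/ppi(i), 1)
r_normalized   = min((ppi(j)/s)/(ppi(i)/s), 1)   % identical to r_unnormalized
</pre>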
 
 
 
===Metropolis–Hasting Algorithm===
 
 
 
<b>Definition:</b> Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. The <b>purpose</b> of the Metropolis-Hastings algorithm is to <b>generate a collection of states according to a desired distribution</b> <math>P(x)</math>. <math>P(x)</math> is chosen to be the stationary distribution of a Markov process, <math>\pi(x)</math>. (http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm)
 
 
 
Metropolis-Hastings is an algorithm for constructing a Markov chain with a given limiting probability distribution. In particular, we consider what happens if we apply the Metropolis-Hastings algorithm repeatedly to a “proposal” distribution which has already been updated.
 
 
 
The algorithm is named after Nicholas Metropolis, who proposed it, and W. K. Hastings, who extended it to the more general case in 1970.
 
 
 
There are several differences between the discrete and continuous case of the Markov Chain:<br/>
 
1. <math>q(y|x)</math> is used in continuous, instead of <math>q_{ij}</math> in discrete<br/>
 
2. <math>r(x,y)</math> is used in the continuous case, instead of <math>r_{ij}</math> in the discrete case<br/>
 
3. <math>f</math> is used instead of <math>\pi</math><br/>
 
 
 
Before we consider the algorithm there are a couple general steps to follow to build the acceptance ratio:<br/>
 
a) Find the distribution you wish to use to generate samples from<br/>
 
b) Find a candidate distribution that fits the desired distribution, q(y|x). (the proposed moves are independent of the current state)<br/>
 
c) Build the acceptance ratio <math>\displaystyle \frac{f(y)q(x|y)}{f(x)q(y|x)}</math>
 
 
 
 
 
Assume that f(y) is the target distribution; Choose q(y|x) such that it is a friendly distribution and easy to sample from.<br />
 
 
 
'''Algorithm:'''<br />
 
# Set <math>\displaystyle i = 0</math> and initialize the chain, i.e. <math>\displaystyle x_0 = s</math> where <math>\displaystyle s</math> is some state of the Markov Chain.
 
# Sample <math>\displaystyle Y \sim q(y|x)</math>
 
# Set <math>\displaystyle r(x,y) = min(\frac{f(y)q(x|y)}{f(x)q(y|x)},1)</math>
 
# Sample <math>\displaystyle u \sim \text{UNIF}(0,1)</math>
 
# If <math>\displaystyle u \leq r(x,y), x_{i+1} = Y</math><br /> Else <math>\displaystyle x_{i+1} = x_i</math>
 
# Increment i by 1 and go to Step 2, i.e. <math>\displaystyle i=i+1</math>
 
 
 
<br> Note: q(x|y) is moving from y to x and q(y|x) is moving from x to y.
 
<br>We choose q(y|x) so that it is simple to sample from. Usually, we choose a normal distribution.
 
 
 
<br />
 
Comparing with previous sampling methods we have learned, samples generated from M-H algorithm are not independent of each other, since we accept future sample based on the current sample. Furthermore, unlike acceptance and rejection method, we are not going to reject any points in Metropolis-Hastings. In the equivalent of the "reject" case, we just leave the state unchanged. In other words, if we need a sample of 1000 points, we only need to generate the sample 1000 times.<br/>
 
<div style="border:1px yellow solid">
 
<p style="font-size:20px;color:red;">
 
Remarks
 
</p>
 
===='''Remark 1'''====
 
<span style="text-shadow: 0px 2px 3px 3399CC;margin-right:1em;font-family: 'Nobile', Helvetica, Arial, sans-serif;font-size:16px;line-height:25px;color:3399CC">
 
A common choice for q(y|x) is a normal distribution centered at x with standard deviation b. q(y|x)=N(x,b<sup>2</sup>)
 
 
 
i.e.
 
<math>q(y|x)=q(x|y)</math>
 
<math>q(y|x)=\frac{1}{\sqrt{2\pi}b}\,e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2b^2} (y-x)^2}</math>
 
<math>q(x|y)=\frac{1}{\sqrt{2\pi}b}\,e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2b^2} (x-y)^2}</math>
 
<math>\Rightarrow (y-x)^2=(x-y)^2</math>
 
so <math>~q(y \mid x)=q(x \mid y)</math> <br>
 
In this case <math>\frac{q(x \mid y)}{q(y \mid x)}=1</math> and therefore <math> r(x,y)=\min \{\frac{f(y)}{f(x)}, 1\} </math> <br/><br />
 
This is true for any symmetric q. In general if q(y|x) is symmetric, then this algorithm is called Metropolis.<br/>
 
When choosing function q, it makes sense to choose a distribution with the same support as the distribution you want to simulate. eg. Beta ---> Choose q ~ Uniform(0,1)<br>
 
The chosen q is not necessarily symmetric. Depending on different target distribution, q can be uniform.</span>
 
 
 
===='''Remark 2'''====
 
 
 
The value y is accepted if u<=<math>min\{\frac{f(y)}{f(x)},1\}</math>, so it is accepted with the probability <math>min\{\frac{f(y)}{f(x)},1\}</math>.
 
Thus, if <math>f(y)>f(x)</math>, then y is always accepted.
 
The higher that value of the pdf is in the vicinity of a point <math>y_1</math> , the more likely it is that a random variable will take on values around <math>y_1</math>. As a result it makes sense that we would want a high probability of acceptance for points generated near <math>y_1</math>.<br>
 
[[File:Diag1.png‎]]<br>
 
Note: if the proposal comes from a region with low density, we may or may not accept; however, we accept for sure if the proposal comes from a region with high density.<br>
 
 
 
===='''Remark 3'''====
 
<span style="text-shadow: 0px 2px 3px hsl(310,15%,65%);margin-right:1em;font-family: 'Nobile', Helvetica, Arial, sans-serif;font-size:16px;line-height:25px;color:3399CC">
 
One strength of the Metropolis-Hastings algorithm is that normalizing constants, which are often quite difficult to determine, can be cancelled out in the ratio <math> r </math>. For example, consider the case where we want to sample from the beta distribution, which has the pdf:

<math>
\begin{align}
f(x;\alpha,\beta)& = \frac{1}{\mathrm{B}(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\end{align}
</math>

The beta function, ''B'', appears as a normalizing constant but it cancels out by construction of the method.
</span>

<span style="text-shadow: 0px 2px 3px hsl(310,15%,65%);margin-right:1em;font-family: 'Nobile', Helvetica, Arial, sans-serif;font-size:16px;line-height:25px;color:3399CC">Also notice that the Metropolis algorithm is just a special case of the Metropolis-Hastings algorithm, obtained when the proposal distribution is symmetric.</span>
 
</div>
 
====='''Example'''=====
 
 
 
<math>\,f(x)=\frac{1}{\pi^{2}}\frac{1}{1+x^{2}}</math>
 
 
 
Then, we have <math>\,f(x)\propto\frac{1}{1+x^{2}}</math>.
 
 
 
And let us take <math>\,q(x|y)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}</math>.
 
 
 
Then <math>\,q(x|y)</math> is symmetric since <math>\,(y-x)^{2} = (x-y)^{2}</math>.
 
 
 
Therefore <math>r(x,y)</math> can be simplified.
 
 
 
 
 
We get :
 
 
 
<math>\,\begin{align}
 
\displaystyle r(x,y)
 
& =min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} \\
 
& =min\left\{\frac{f(y)}{f(x)},1\right\} \\
 
& =min\left\{ \frac{ \frac{1}{1+y^{2}} }{ \frac{1}{1+x^{2}} },1\right\}\\
 
& =min\left\{ \frac{1+x^{2}}{1+y^{2}},1\right\}\\
 
\end{align}
 
</math>.
 
 
 
<br/>
 
<math>\pi=[0.1\,0.1\,...] </math><br/>
 
<math>\pi \propto [3\,2\, 10\, 100\, 1.5] </math><br/>
 
<math>\Rightarrow \pi=1/c \times [3\, 2\, 10\, 100\, 1.5]</math><br/>
 
<math>\Rightarrow c=3+2+10+100+1.5 </math><br/>
 
<br/>
 
<br/>
 
 
 
In practice, if elements of <math>\pi</math> are functions or random variables, we need c to be the normalization factor, the summation/integration over all members of <math>\pi</math>. This is usually very difficult. Since we are taking ratios, with the Metropolis-Hasting algorithm, it is not necessary to do this.
 
 
 
<br>
 
For example, to find the relationship between weather temperature and humidity, we only have a proportional function instead of a probability function. To make it into a probability function, we need to compute c, which is really difficult. However, we don't need to compute c as it will be cancelled out during calculation of r.<br>
 
 
 
======'''MATLAB'''======
 
The Matlab code of the algorithm is the following :
 
<pre style="font-size:12px">
 
clear all
 
close all
 
clc
 
b=2;
 
x(1)=0;
 
for i=2:10000
 
    y=b*randn+x(i-1);
 
    r=min((1+x(i-1)^2)/(1+y^2),1);
 
    u=rand;
 
    if u<r
 
        x(i)=y;
 
    else
 
        x(i)=x(i-1);
 
    end
 
   
 
end
 
hist(x,100);
 
%The Markov Chain usually takes some time to converge, and this is known as the burn-in period.
 
</pre>
 
[[File:MH_example2.jpg|300px]]
 
 
 
However, while the data does approximately fit the desired distribution, it takes some time until the chain gets to the stationary distribution. To generate a more accurate graph, we modify the code to ignore the initial points.<br>
 
 
 
'''MATLAB'''
 
<pre style="font-size:16px">
 
b=2;
 
x(1)=0;
 
for ii=2:10500
 
y=b*randn+x(ii-1);
 
r=min((1+x(ii-1)^2)/(1+y^2),1);
 
u=rand;
 
if u<=r
 
x(ii)=y;
 
else
 
x(ii)=x(ii-1);
 
end
 
end
 
xx=x(501:end) %we don't display the first 500 points because they don't show the limiting behaviour of the Markov Chain
 
hist(xx,100)
 
</pre>
 
<br>
 
'''If a function f(x) can only take values in <math>(0,\infty)</math>, but we need to use a normal distribution as the candidate distribution, then we can use <math>q(y|x)=\frac{2}{\sqrt{2\pi}}e^{-\frac{1}{2}(y-x)^2}</math> for y from 0 to <math>\infty</math>. (This is essentially the pdf of the absolute value of a normal distribution centered around x.)'''<br>
 
 
 
Example:

We want to sample from <math>\exp(2)</math>, with <math>q(y|x)\sim N(x,b^2)</math>.<br>

<math>\frac{f(y)}{f(x)}=\frac{2e^{-2y}}{2e^{-2x}}=e^{2(x-y)}</math><br>

<math>r(x,y)=\min(e^{2(x-y)},1)</math><br>
 
 
 
'''MATLAB'''
 
<pre style="font-size:16px">
 
b=1;                          % proposal spread; must be set before the loop

x(1)=0;

for ii=2:100

y=abs(b*randn+x(ii-1));       % candidate from |N(x,b^2)|, supported on (0,inf)

r=min(exp(2*(x(ii-1)-y)),1);
 
u=rand;
 
if u<=r
 
x(ii)=y;
 
else
 
x(ii)=x(ii-1);
 
end
 
end
 
</pre>
 
<br>
 
 
 
'''Definition of Burn in:'''
 
 
 
The Metropolis–Hasting Algorithm is started from an arbitrary initial value <math>x_0</math> and the algorithm is run for many iterations until this initial state is "forgotten". These samples, which are discarded, are known as ''burn-in''. The remaining
 
set of accepted values of <math>x</math> represent a sample from the distribution f(x).(http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm)
 
 
 
Several extensions have been proposed in the literature to speed up the convergence and reduce the so called “burn-in” period.
 
 
 
 
 
'''Aside''': The algorithm works best if the candidate density q(y|x) matches the shape of the target distribution f(x). If a normal distribution is used as a candidate distribution, the variance parameter b<sup>2</sup> has to be tuned during the burn-in period.
 
 
 
1. If b is chosen to be too small, the chain will mix slowly (with smaller proposed moves, the acceptance rate will be high but the chain will converge only slowly to f(x)).

2. If b is chosen to be too large, the acceptance rate will be low (with larger proposed moves, most candidates are rejected and the chain will again converge only slowly to f(x)).
 
 
 
 
 
 
 
Note: The histogram looks much nicer if we reject the points within the burning time.<br>
 
 
 
 
 
Example: Use the M-H method to generate a sample from f(x)=2x for 0<x<1, and 0 otherwise.
 
 
 
1) Initialize the chain: set x<sub>0</sub> and i=0

2) Y~q(y|x<sub>i</sub>),
where our proposal function is Uniform[0,1], since it matches the support of the target.
=> Y~Unif[0,1]

3) Consider <math>\frac{f(y)}{f(x)}=\frac{y}{x}</math>, so
<math>r(x,y)=\min (\frac{y}{x},1)</math>, since q(y|x<sub>i</sub>) and q(x<sub>i</sub>|y) cancel each other.

4) X<sub>i+1</sub>=Y with probability r(x,y),
X<sub>i+1</sub>=X<sub>i</sub> otherwise

5) i=i+1; go to step 2
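A minimal Matlab sketch of these five steps (the initial value and sample size are illustrative):
<pre style="font-size:16px">
x(1)=0.5;                 % step 1: initialize the chain inside (0,1)
for ii=2:5000
  y=rand;                 % step 2: Y ~ Unif[0,1]
  r=min(y/x(ii-1),1);     % step 3: acceptance probability r(x,y)
  u=rand;
  if u<r
    x(ii)=y;              % step 4: accept the candidate
  else
    x(ii)=x(ii-1);        % step 4: keep the current state
  end
end
hist(x(501:end),50)       % discard burn-in; histogram should resemble f(x)=2x
</pre>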
 
 
 
<br>
 
 
 
Example from Wikipedia
 
 
 
===Step-by-step instructions===
 
 
 
Suppose the most recent value sampled is <math>x_t\,</math>. To follow the Metropolis–Hastings algorithm, we next draw a new proposal state <math>x'\,</math> with probability density <math>Q(x'\mid x_t)\,</math>, and calculate a value
 
 
 
:<math>
 
a = a_1 a_2\,
 
</math>
 
 
 
where
 
 
 
:<math>
 
a_1 = \frac{P(x')}{P(x_t)} \,\!
 
</math>
 
 
 
is the likelihood ratio between the proposed sample <math>x'\,</math> and the previous sample <math>x_t\,</math>, and
 
 
 
:<math>
 
a_2 = \frac{Q(x_t \mid x')}{Q(x'\mid x_t)}
 
</math>
 
 
 
is the ratio of the proposal density in two directions (from <math>x_t\,</math> to <math>x'\,</math> and ''vice versa'').
 
This is equal to 1 if the proposal density is symmetric.
 
Then the new state <math>\displaystyle x_{t+1}</math> is chosen according to the following rules.
 
 
 
:<math>
 
\begin{matrix}
 
\mbox{If } a \geq 1: &  \\
 
& x_{t+1} = x',
 
\end{matrix}
 
</math>
 
:<math>
 
\begin{matrix}
 
\mbox{else} & \\
 
& x_{t+1} = \left\{
 
                  \begin{array}{lr}
 
                      x' & \mbox{ with probability }a \\
 
                      x_t & \mbox{ with probability }1-a.
 
                  \end{array}
 
            \right.
 
\end{matrix}
 
</math>
 
 
 
The Markov chain is started from an arbitrary initial value <math>\displaystyle x_0</math> and the algorithm is run for many iterations until this initial state is "forgotten". 
 
These samples, which are discarded, are known as ''burn-in''. The remaining set of accepted values of <math>x</math> represent a [[Sample (statistics)|sample]] from the distribution <math>P(x)</math>.
 
 
 
The algorithm works best if the proposal density matches the shape of the target distribution <math>\displaystyle P(x)</math> from which direct sampling is difficult, that is <math>Q(x'\mid x_t) \approx P(x') \,\!</math>.
 
If a Gaussian proposal density <math>\displaystyle Q</math> is used the variance parameter <math>\displaystyle \sigma^2</math> has to be tuned during the burn-in period.
 
This is usually done by calculating the ''acceptance rate'', which is the fraction of proposed samples that is accepted in a window of the last <math>\displaystyle N</math> samples.
 
The desired acceptance rate depends on the target distribution, however it has been shown theoretically that the ideal acceptance rate for a one dimensional Gaussian distribution is approx 50%, decreasing to approx 23% for an <math>\displaystyle N</math>-dimensional Gaussian target distribution.<ref name=Roberts/>
 
 
 
If <math>\displaystyle \sigma^2</math> is too small the chain will ''mix slowly'' (i.e., the acceptance rate will be high but successive samples will move around the space slowly and the chain will converge only slowly to <math>\displaystyle P(x)</math>).  On the other hand,
 
if <math>\displaystyle \sigma^2</math> is too large the acceptance rate will be very low because the proposals are likely to land in regions of much lower probability density, so <math>\displaystyle a_1</math> will be very small and again the chain will converge very slowly.
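
To make the tuning concrete, here is a minimal sketch (not from the lecture) that estimates the acceptance rate over a window of N proposals during burn-in, using the target f(x) proportional to 1/(1+x^2) from the earlier example and a normal proposal with standard deviation b; the values of N and b are arbitrary choices:

<pre style="font-size:16px">
b=2; N=500;
x(1)=0; acc=0;
for ii=2:N
    y=b*randn+x(ii-1);
    r=min((1+x(ii-1)^2)/(1+y^2),1);
    if rand<=r
        x(ii)=y; acc=acc+1;   % count accepted proposals
    else
        x(ii)=x(ii-1);
    end
end
acc/(N-1)                     % empirical acceptance rate; adjust b and repeat
</pre>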
 
 
 
== Class 19 - Tuesday July 9th 2013 ==
 
'''Recall: Metropolis–Hasting Algorithm'''
 
 
 
X<sub>t</sub>= state of chain at step t<br>
 
<math>Y</math>~<math>q(y|x)</math><br>
 
<math>\,r=min[\frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)}\,,1]</math><br>
 
<math>U</math>~<math>Uniform(0,1)</math><br>
 
If <math>U<r</math>, then<br>
 
x<sub>t+1</sub> = y<br>
 
else<br>
 
x<sub>t+1</sub> = x<sub>t</sub><br>
 
 
 
 
 
Why does this algorithm generate a Markov chain?<br>
Because the next state is produced using only the current state (both the proposal and the accept/reject step depend on x<sub>t</sub> alone), the process has the Markov property: given the present, the future does not depend on the past.<br>
 
 
 
==='''Choosing b: 3 cases'''===
 
 
 
In this example, q(y|x)=N(x, b<sup>2</sup>).<br>
As demonstrated below, the choice of b significantly affects the quality of the Metropolis algorithm: it controls the probability of accepting candidate states, and the algorithm performs poorly if that probability is too large or too small.<br />
 
 
 
'''MATLAB b=2, b= 0.2, b=20 '''
 
<pre style="font-size:12px">
 
clear all
close all
clc
b=2;  % also try b=0.2 and b=20
x(1)=0;
for i=2:10000
    y=b*randn+x(i-1);
    r=min((1+x(i-1)^2)/(1+y^2),1);
    u=rand;
    if u<r
        x(i)=y;
    else
        x(i)=x(i-1);
    end
end
hist(x(5000:end),100)
figure
plot(x(5000:end))
%The Markov chain usually takes some time to converge; this is known as the burn-in time,
%so we don't display the first 5000 points: they don't show the limiting behaviour of the chain.
%This generates a Markov chain of 10000 values; compare the results for a large and a small b.
 
</pre>
 
===='''b too small'''====
 
 
 
Then we are setting up our proposal distribution <math>q(y\mid{x})</math> to be much narrower than the target <math>\displaystyle f(x)</math>, so the chain will not have a chance to explore the sample space and visit many of the states of the target <math>\displaystyle f(x)</math>. For example, if we are generating samples from a mixture of two normal distributions, it is highly likely that the histogram of <math>\displaystyle X</math> will only show the distribution of one of the two components. <br>

[[File:CBSplx.jpg|300px]][[File:CBplotx.jpg|300px]]

With b = 0.02, the chain takes small steps and doesn't explore enough of the sample space. Thus, we want b large enough that the chain explores both components, but small enough that it remains efficient.<br>


When b is very small (e.g. b = 0.2), the trace plot wanders slowly and the histogram is badly distorted; the opposite failure mode, for very large b (e.g. b = 20), is described next.
 
 
 
===='''b too large'''====
 
Then the proposed y typically lands where f is small, so the fraction <math>\frac{f(y)}{f(x)}</math> is very small <math>\Rightarrow</math> <math>r=\min{\{\frac{f(y)}{f(x)},1}\}</math> is very small as well. It is then highly unlikely that <math>\displaystyle u<r</math>: the probability of rejecting <math>\displaystyle Y</math> is high, so the chain is likely to get stuck in the same state for a long time <math>\rightarrow</math> the chain may not converge to the right distribution within the run. This is easy to observe in the histogram of <math>\displaystyle X</math>: its shape will not resemble the shape of the target <math>\displaystyle f(x)</math>.
Most likely we reject y, and the chain gets stuck. <br>

[[File:CBLhistx.jpg|300px]][[File:CBLplotx.jpg|300px]]

With b = 20, jumps are very unlikely to be accepted.<br>


If b is too large, most proposals fall far into the tails of f(x), where the density is low, so they are rarely accepted.
 
 
 
===='''b just right'''====
 
A well-chosen b will help avoid the issues mentioned above, and we can then say that the chain is "mixing well": proposals are accepted at a reasonable rate and the chain moves freely around the support of f(x). <br>


[[File:CBRhx.jpg|300px]][[File:CBRpx.jpg|300px]]

With b = 2, the chain is mixing well.<br>


(Contrast with the b-too-large case: there <math> \frac{f(y)}{f(x)}</math>, and consequently r, is very small, so it is very unlikely that u < r and the current value is repeated;<br/>
i.e. y is rejected and x<sub>t+1</sub> = x<sub>t</sub>; as a result, we get many copies of the same point in the output.)  <br>
 
----
 
Note also that if b is chosen too small, the chain may never be able to move between the modes of a multimodal target distribution.
 
 
 
'''Recall detailed balance for discrete case'''
 
if <math>\pi_i P_{ij}=\pi_j P_{ji}</math> then <math>\pi=\pi P</math>
 
    <math>\; [\pi P]_j = \sum_i \pi_i P_{ij} =\sum_i P_{ji}\pi_j =\pi_j\sum_i P_{ji} =\pi_j  ,\forall j</math>
 
<br>
 
Recall: Each row of P must add up to 1
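
As a quick numeric illustration (a hypothetical 2-state chain, not from the lecture), one can verify detailed balance and stationarity directly:

<pre style="font-size:16px">
P=[0.9 0.1; 0.2 0.8];          % transition matrix; each row sums to 1
pi0=[2/3 1/3];                 % candidate stationary distribution
pi0(1)*P(1,2)-pi0(2)*P(2,1)    % detailed balance: should be 0
pi0*P-pi0                      % stationarity: should be [0 0]
</pre>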
 
 
 
Continuous case:
 
If <math>\displaystyle f(x)P(y|x)=f(y)P(x|y)</math>, then <math>\displaystyle f(x)</math> is a stationary distribution.<br>
 
Because <math>\int_y f(y)P(x|y)\,dy=\int_y f(x)P(y|x)\,dy=f(x)\int_y P(y|x)\,dy=f(x)\cdot 1 = f(x)</math>
 
 
 
'''Convergence of MH'''<br>
 
We generate y from q(y|x) and accept with probability<br>
 
<math>\,r=min[\frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)}\,,1]</math> <br>
 
without loss of generality, assume <math>\,\frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)}\,<1</math> <br>
 
then r(x,y) i.e. probability of accepting y given that the current state is x will be<br>
 
<math>\,r(x,y)=min[\frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)}\,,1]=\frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)}\,</math> <br>
 
 
 
Now suppose that the current state is y and we are generating x. The probability of accepting x, given that the current state is y is
 
<math>\,r(y,x)=min[\frac{f(x)}{f(y)}\,\frac{q(y|x)}{q(x|y)}\,,1]=1</math><br>
 
Based on our original assumption, <math>\,\frac{f(x)}{f(y)}\,\frac{q(y|x)}{q(x|y)}\,>1</math>, so r(y,x)=1.
 
 
 
In Metropolis-Hastings, the probability of jumping to y, given that the current state is x (denoted by P[y|x]), depends on (1) the probability of generating y, and (2) the probability of accepting y.<br>

<math>\,P(y|x)=q(y|x)\times r(x,y)=q(y|x)\,\frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)}\,=\frac{f(y)q(x|y)}{f(x)}\,</math><br>
 
 
 
The probability of jumping to x, given that the current state is y, is:<br>
 
<math>\,P(x|y)=q(x|y)×r(y,x)=q(x|y)</math><br>
 
Detailed balance for the chain requires <math>\,f(x)P(y|x)=f(y)P(x|y)</math>.<br>
 
 
 
<b>L.H.S</b>  <br>
 
<math>f(x)\frac{f(y)q(x|y)}{f(x)}\,=f(y)q(x|y)</math> <br>
 
<b>R.H.S</b>  <br>
 
<math>\,f(y)P(x|y)=f(y)q(x|y)</math> <br>
 
L.H.S = R.H.S<br>
 
Thus, detailed balance holds. Therefore f(x) is the stationary distribution of the chain generated by Metropolis-Hastings.<br>
 
<br>
 
'''If <math>\,\frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)}\,>1</math> <br>'''
 
then r(x,y) i.e. probability of accepting y given that the current state is x will be<br>
 
<math>\,r(x,y)=min[\frac{f(y)}{f(x)}\,\frac{q(x|y)}{q(y|x)}\,,1]=1</math><br>
 
<math>\,r(y,x)=min[\frac{f(x)}{f(y)}\,\frac{q(y|x)}{q(x|y)}\,,1]=\frac{f(x)}{f(y)}\,\frac{q(y|x)}{q(x|y)}\,</math> <br>
 
<math>\,P(y|x)=q(y|x)×r(x,y)=q(y|x)</math><br>
 
<math>\,P(x|y)=q(x|y)×r(y,x)=\frac{f(x)q(y|x)}{f(y)}\,</math><br>
 
'''LHS = RHS''' by the '''same rationale''' as in the previous case: here <math>\,f(x)P(y|x)=f(x)q(y|x)</math> and <math>\,f(y)P(x|y)=f(y)\frac{f(x)q(y|x)}{f(y)}=f(x)q(y|x)</math>, so detailed balance again holds.
 
 
 
Suppose we have two normal distributions, <math>N(2, \sigma^2)</math> and <math>N(10, \sigma^2)</math>, and we want
 
:<math>
 
f(x) = \begin{cases}
 
N(2, \sigma^2), & \text{if } j = 0 \\
 
N(10, \sigma^2), & \text{if } j = 1
 
\end{cases}</math>
 
 
 
where P(j = 0) = 0.5 and P(j = 1) = 0.5
 
 
 
<b>Mixture of Gaussians</b>  <br>

We want to sample from <math>f(x) \propto 0.5\,e^{-\frac{(x-\mu_1)^2}{\sigma_1^2}} + 0.5\,e^{-\frac{(x-\mu_2)^2}{\sigma_2^2}}</math> with <math>\mu_1=2,\ \mu_2=10</math>.



We can use MH with a normal proposal (which is symmetric, so the q terms cancel), where

<math>\,r(x,y)=min[\frac{e^{-(y-2)^2}+e^{-(y-10)^2}}{e^{-(x-2)^2}+e^{-(x-10)^2}}\,,1]</math>
 
 
 
'''MATLAB'''
 
<pre style="font-size:16px">
 
clear all
 
close all
 
b=2;  % also try b=0.02 and b=200
x(1)=randn;
for i=2:100000
    y=b*randn+x(i-1);
    r=min((exp(-(y-2)^2)+exp(-(y-10)^2))/(exp(-(x(i-1)-2)^2)+exp(-(x(i-1)-10)^2)),1);
 
    u=rand;
 
    if u<r
 
        x(i)=y;
 
    else
 
        x(i)=x(i-1);
 
    end
 
end
 
hist(x(5000:end), 100)
 
</pre>
 
 
 
[[File:mixture of gaussian.jpg|400px]]
 
 
 
As is clear from this plot, the MATLAB code above simulated a mixture of Gaussians, one component centred at 2 and the other centred at 10.



However, keep in mind that an inappropriate b (either too small or too large) may fail to simulate it properly.
 
 
 
For b = 0.02 (too small)
 
 
 
[[File:Gaussian (small b).jpg|400px]]
 
 
 
For b = 200 (too large)
 
 
 
[[File:Gaussian_(large_b).jpg|400px]]
 
 
 
Neither of these resembles a mixture of Gaussians centred at 2 and 10.
Whether b is too large or too small, the histogram will not look like the target distribution.
 
 
 
 
 
 
 
 
 
In general, a mixture of Normal distributions '''''can have any mixture of weights''''' as long as each weight is between 0 and 1 and the weights sum to 1. For a mixture of Normal distributions, it is important to be careful with the selection of b: <br>

1) If b is too small, the chain easily gets stuck in one of the two components, since candidates near the other component are essentially never proposed.
<br>
2) If b is too large, the chain gets stuck on a single value for long stretches of time, as we saw before.<br>
 
 
 
To simulate a multivariate distribution, we can use the above algorithm and modify the code so that the scalars x and y become vectors; a sketch of this modification follows the MATLAB tips below.
 
 
 
<pre style="font-size:16px">
 
% ezplot / ezsurf examples
>>syms x
>>ezplot(x^2)
>>ezsurf('exp(-(x1^2*x2^2+x1^2+x2^2-8*x1-8*x2)/2)')
 
</pre>
 
 
 
'''Matlab Tips:'''<br />
 
#Function "ezplot" is used to "easily" plot a curve <math>\displaystyle y=f(x)</math>. By default the domain is set to <math>-2 \pi\leq x \leq 2\pi</math>
 
#*Resource:http://www.mathworks.com/help/matlab/ref/ezplot.html
 
#Function "ezsurf" is used to "easily" plot a surface <math>\displaystyle z=f(x,y)</math>. By default the domain is set to <math>-2\pi\leq x\leq 2\pi, -2\pi\leq y\leq 2\pi</math>
 
#*Resource:http://www.mathworks.com/help/matlab/ref/ezsurf.html
 
#Function "syms" is a shortcut used to create symbolic variables.
 
#*Resource:http://www.mathworks.com/help/symbolic/syms.html
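
Following the note above about letting x and y be vectors, here is a minimal sketch (not lecture code; the proposal scale b=0.5, the chain length and the burn-in cutoff are arbitrary choices) of a bivariate MH sampler targeting the density plotted by the ezsurf call above:

<pre style="font-size:16px">
b=0.5;
x=zeros(2,10000);                       % each column is one 2-D state of the chain
f=@(v) exp(-(v(1)^2*v(2)^2+v(1)^2+v(2)^2-8*v(1)-8*v(2))/2);
for ii=2:10000
    y=b*randn(2,1)+x(:,ii-1);           % spherical normal proposal (symmetric)
    r=min(f(y)/f(x(:,ii-1)),1);
    if rand<r
        x(:,ii)=y;
    else
        x(:,ii)=x(:,ii-1);
    end
end
plot(x(1,5000:end),x(2,5000:end),'.')   % scatter of the chain after burn-in
</pre>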
 
 
 
We are interested in the probability of moving to y from x in the Markov chain generated by the MH algorithm, denoted p(y|x).
p(y|x) depends on two probabilities:
1) the probability of generating y, and
2) the probability of accepting y.
 
 
Other tips: http://www.matlabtips.com/page/2/
 
 
 
'''Mixture distribution of two random variables:'''<br>
 
In general if X is a mixture distribution of random variables Y and Z, then <br>
 
1) f(x) = p*f<sub>y</sub>(x) + (1-p)*f<sub>z</sub>(x)<br>
 
2) F(x) = p*F<sub>y</sub>(x) + (1-p)*F<sub>z</sub>(x)<br>
 
3) S(x) = 1-F(x) = p*S<sub>y</sub>(x) + (1-p)*S<sub>z</sub>(x)<br>
 
4) E(X<sup>k</sup>) = p*E(Y<sup>k</sup>) + (1-p)*E(Z<sup>k</sup>)
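
As a quick numerical sanity check of property 4 (a sketch, not lecture code; the choices p = 0.3, Y ~ N(2,1) and Z ~ Exp(2) are arbitrary), we can sample X by composition and compare the sample mean with p*E(Y) + (1-p)*E(Z) = 0.3*2 + 0.7*0.5 = 0.95:

<pre style="font-size:16px">
p=0.3; n=100000;
u=rand(1,n);
x=zeros(1,n);
x(u<p)=2+randn(1,sum(u<p));          % with probability p, draw from Y ~ N(2,1)
x(u>=p)=-log(rand(1,sum(u>=p)))/2;   % otherwise draw from Z ~ Exp(2) by inverse transform
mean(x)                              % should be close to 0.95
</pre>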
 
 
 
This lecture covered several key points; pay particular attention to the highlighted sentences.
 
 
 
== Class 20 - Thursday July 11th 2013 ==
 
=== Simulated Annealing ===
 
Wikipedia definition: Simulated annealing (SA) is a generic probabilistic metaheuristic for the global optimization problem of locating a good approximation to the global optimum of a given function in a large search space. It is often used when the search space is discrete (e.g., all tours that visit a given set of cities). For certain problems, simulated annealing may be more efficient than exhaustive enumeration, provided that the goal is merely to find an acceptably good solution in a fixed amount of time, rather than the best possible solution.
 
<br>
 
 
 
 
This notion of slow cooling is implemented in the Simulated Annealing algorithm as a slow decrease in the probability of accepting worse solutions as it explores the solution space. Accepting worse solutions is a fundamental property of metaheuristics because it allows for a more extensive search for the optimal solution.
 
 
 
 
 
 
The method was independently described by Scott Kirkpatrick, C. Daniel Gelatt and Mario P. Vecchi in 1983 and by Vlado Černý in 1985.
 
http://en.wikipedia.org/wiki/Simulated_annealing
 
 
 
In short, simulated annealing is a popular optimization algorithm and an application of the Metropolis algorithm.
It aims to find a global minimum (or a global maximum). <br><br>
 
 
 
Suppose that we want to minimize h(x). For a given (arbitrary) constant T>0, this is equivalent to maximizing <br/> <math>e^{-\frac{h(x)}{T}}</math>


Note that the exponential function is monotonic, meaning that it is only ever increasing/decreasing; observe, for instance, that when h(x)=0 we get <math>e^{-\frac{h(x)}{T}}=1</math>.


We want to find the x that minimizes h(x), and the minimum value itself.
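
In symbols, for any fixed T>0 (since <math>t\mapsto e^{-t/T}</math> is strictly decreasing):

:<math>\underset{x}{\operatorname{arg\,max}}\; e^{-\frac{h(x)}{T}} \;=\; \underset{x}{\operatorname{arg\,min}}\; h(x)</math>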
 
 
 
 
 
 
 
T is an arbitrary positive number. A '''small T will narrow the function''' and a '''large T will widen the function'''.
<br/> Note that the choice of T does not affect the location of the min/max. <br/>


This means that if T is large, <math>e^{-\frac{h(x)}{T}}</math> is spread out; if T is small, it is sharply peaked around the minimizer of h(x).<br>


This equivalence follows because the exponential function is monotonic:<br/>


<math> e^{-t} </math> is a decreasing function of t, so <math>e^{-\frac{h(x)}{T}}</math> is largest exactly where h(x) is smallest.
 
 
 
Consider the function <br/>
 
 
 
f ∝ e<sup>-h(x)/T</sup>


We do not need to know the normalization factor (alpha), since MH uses only the ratio f(y)/f(x), in which any constant factor cancels.


If T is small, then sampling from f gives points close to the mode of <math>e^{\frac{-h(x)}{T}}</math>, which are the minimizers of h(x). Based on this intuition, simulated annealing proceeds as follows:<br/>
 
 
 
 
 
<b>1.</b> Set T to a large number<br>
 
<b>2.</b> Initialize the chain. Set <math>x_t = x</math><br>
 
<b>3.</b> <math>y \sim~ q(y|x)</math> (q should be symmetric)<br>
 
<b>4.</b> <math>r = \min\{\frac{f(y)}{f(x)},1\}</math><br>
 
<b>5.</b> <math>U \sim~ U(0,1)</math><br>
 
<b>6.</b> If U < r: <math>x_{t+1}=y</math><br>
 
else: <math>x_{t+1}=x_t</math><br>
 
<b>7.</b> Decrease T. Go back to 3.<br>
 
 
 
Note: q(y|x) does not have to be symmetric. If q is non-symmetric, the original MH formula for r is used.<br>
In most academic papers, q(y|x) is chosen to be symmetric for convenience.
 
 
 
 
 
<math>\displaystyle \begin{align}
 
r &= \min ( \frac{f(y)}{f(x)} ,1 ) \\
 
&= \min (\frac{e^{\frac{-h(y)}{T}}}{e^\frac{-h(x)}{T}} ,1) \\
 
&= \min (e^{\frac{-h(y)}{T}-\frac{-h(x)}{T}} ,1) \\
 
&= \min (e^{\frac{h(x)-h(y)}{T}},1)
 
\end{align}</math> <br>
 
 
 
 
 
 
 
 
Suppose T is large, <br>
 
 
 
1. If <math>h(y) \leq h(x)</math>, then <math>e^{\frac {h(x)-h(y)}{T}} \geq 1</math>. Therefore r=1 and we will always accept y (a good move).



2. If <math>h(y)>h(x)</math>, then <math>e^{\frac {h(x)-h(y)}{T}} < 1</math>. Therefore r<1 and we accept with probability r (this helps the chain escape from local minima).<br>
Note: even though this does not seem to be a good move, we still give it some chance; although such a y moves in the "wrong" direction, it may help the chain escape from a local minimum toward the global minimum.
 
 
 
Suppose T is small (<math>T \rightarrow 0</math>) <br>
 
1. If <math>h(y)<h(x)</math>, then <math>e^{\frac{h(x)-h(y)}{T}} \rightarrow \infty</math>. Therefore r=1, we always accept y.
 
Since h(y) takes on a lower value, moving towards h(y) is considered a good move and we always accept such a good move.<br/>
 
 
 
2. If <math>h(y)>h(x)</math>, then <math>e^{\frac{h(x)-h(y)}{T}} \rightarrow 0</math>. Therefore <math>r \rightarrow 0</math>, we almost never accept y.<br/>
 
 
 
 
 
3. If r equals 1 or is close to 1, we accept the move; if r is close to 0, the probability of the move is close to 0, so we almost always reject.
Essentially, the smaller the value of T, the sharper the distribution and the higher the probability of rejection. We start by picking T large so that the rejection probability is lower, which is more efficient: the algorithm can then explore the target distribution instead of rejecting all proposed points and simply repeating the previous state. Note, though, that convergence of this algorithm to an accurate estimate of the global minimum is not guaranteed: in a complex example with many local minima, we can never be sure that we have escaped all of them. However, with a large enough initial T and a reasonable choice of b in the proposal density, the algorithm works well for most functions.
 
 
 
Initial T is large to make sure it can escape from the wrong region. (If initial T is small, it may be trapped in the wrong region) <br>
 
The decrease of T makes the result more and more accurate.<br>
 
 
 
Also, the main reason to choose T large at first is that we have no idea what the possible values of x<sub>t</sub> are. With that in mind, if we initialize at a value of x very far from the mode, we may never get any closer to it, because with a small T the probability of the moves needed to get there being rejected would be far too high under the mechanics of the algorithm.

<br/> In simple words, start with a large T so that the points have a higher chance to explore. <br/>
 
 
 
 
 
 
 
 
<br/>Note: The variable T is known in practice as the "Temperature", thus the higher T is, the more variability there is in terms of the expansion and contraction of materials. The term "Annealing" follows from here, as annealing is the process of heating materials and allowing them to cool slowly - in our case, starting the algorithm with a high T, and then lowering it.<br/>
 
 
 
 
 
Asymptotically, with a sufficiently slow cooling schedule, this algorithm is guaranteed to find the global optimum; in practice, however, we never sample forever, so this is not guaranteed.
 
 
 
 
 
 
 
Example: Consider h(x)=3x^2, 0<x<1.



<br/>1) Set T to be large, for example, T=100<br/>

<br/>2) Initialize the chain<br/>

<br/>3) Set q(y|x)=1, i.e. Y~Unif[0,1]<br/>

<br/>4) <math>r=\min\left(e^{\frac{3x^2-3y^2}{T}},1\right)</math>, with T=100 initially<br/>

<br/>5) U~U[0,1]<br/>

<br/>6) If <math>U<r</math>, then <math>X_{t+1}=y</math>; else <math>X_{t+1}=x_{t}</math><br/>

<br/>7) Decrease T; go back to 3<br/>
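
A minimal MATLAB sketch of these steps (not from the lecture; the cooling factor 0.99 and the stopping threshold are arbitrary choices):

<pre style="font-size:12px">
T=100;
x(1)=rand;                               % step 2: initialize the chain in (0,1)
ii=1;
while T>0.001
    y=rand;                              % step 3: q(y|x)=Unif[0,1]
    r=min(exp((3*x(ii)^2-3*y^2)/T),1);   % step 4: e^{(h(x)-h(y))/T} with h(x)=3x^2
    if rand<r
        x(ii+1)=y;                       % step 6: accept
    else
        x(ii+1)=x(ii);
    end
    T=0.99*T;                            % step 7: cool down
    ii=ii+1;
end
x(end)                                   % should be close to 0, the minimizer on (0,1)
</pre>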
 
 
 
<div style="border:1px red solid">
 
'''MATLAB '''
 
<pre style="font-size:12px">
 
syms x
ezplot((x-3)^2)
ezplot((x-3)^2,[-6,12])
ezplot(exp(-((x-3)^2)),[-6,12])
 
</pre>
 
ezplot((x-3)^2)<br>
 
[[File:Snip20130711 1.png|300px]]<br>
 
ezplot((x-3)^2,[-6, 12])<br>
 
[[File:snip2013.png|300px]]<br>
 
ezplot(exp(-((x-3)^2)),[-6, 12])<br>
 
[[File:snip20131.png|300px]]<br>
 
 
 
How the chain is initialized matters, since the starting point affects the early accept/reject behaviour of the algorithm.
 
 
 
'''MATLAB '''
 
<pre style="font-size:12px">
 
clear all
 
close all
 
T=100;
 
x(1)=rand;
 
ii=1;
 
b=1;
 
while T>0.001
 
  y=b*randn+x(ii);   % symmetric normal proposal (rand alone would only propose moves to the right)
 
  r=min(exp((H(x(ii))-H(y))/T),1);
 
  u=rand;
 
  if u<r
 
      x(ii+1)=y;
 
  else
 
      x(ii+1)=x(ii);
 
  end
 
T=0.99*T;
 
ii=ii+1;
 
end
 
plot(x)
 
</pre>
 
[[File:SA_example.jpg|300px]]
 
</div>
 
When T is large, the chain can move around freely, which helps it explore the function's domain.
 
 
 
Example: H(x)=(x-3)^2, written as a separate MATLAB function:
 
<pre style="font-size:12px">
 
function c=H(x)
 
c=(x-3)^2;
 
end
 
</pre>
 
 
 
'''Another Example:
 
<math>h(x)=((x-2)^2-4)((x-4)^2-8)</math>'''
 
 
 
<pre style="font-size:12px">
 
>>syms x
 
>>ezplot(((x-2)^2-4)*((x-4)^2-8),[-1,8])
 
</pre>
 
 
 
<pre style="font-size:12px">
 
function c=H(x)
 
c=((x-2)^2-4)*((x-4)^2-8);
 
end
 
</pre>
 
[[File:SA_example2.jpg|300px]]
 
 
 
Run earlier code with the new H(x) function
 
 
 
==Motivation: Simulated Annealing and the Travelling Salesman Problem==

The Travelling Salesman Problem asks: given n cities and the distances between each pair of them, what is the shortest possible route that visits each city exactly once and returns to the origin city?

[[File:Salesman_n5.png]]

An example of a solution of a travelling salesman problem with n=5. This is only one of many solutions, but we want to ensure we find the optimal one.

The idea of using the simulated annealing algorithm here is:
let Y be all possible routes (permutations of the city indices), generated by permuting the cities, and let the objective function f(x) be the total distance of the route given Y.
Then use the simulated annealing algorithm to find the minimum value of f(x).<br>
Note: in this case, the proposal Q is a permutation of the city indices. There will be many possible paths, especially when n is large; if n is very large, it would take forever to check every combination of routes.

This sort of knowledge is very useful for anyone on a limited budget or who must visit many points in a short period of time. For example, a truck driver may have to visit multiple cities in southern Ontario and make it back to his starting point within a 6-hour period; simulated annealing can help him find the fastest route.

Detailed balance (recall, discrete case): <math>\pi_i P_{ij} = \pi_j P_{ji}</math>

LHS, in the case <math>r_{ij}=\frac{\pi_j q_{ji}}{\pi_i q_{ij}}<1</math>:
<math>\pi_i P_{ij} = \pi_i q_{ij} r_{ij} = \pi_i q_{ij} \frac{\pi_j q_{ji}}{\pi_i q_{ij}} = \pi_j q_{ji}</math>,
which matches the RHS, since then <math>r_{ji}=1</math> and <math>\pi_j P_{ji} = \pi_j q_{ji} r_{ji} = \pi_j q_{ji}</math>.
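
To make this concrete, here is a minimal MATLAB sketch (not from the lecture) of simulated annealing for the TSP, where a candidate route is proposed by swapping two randomly chosen cities; the distance matrix D below is a hypothetical stand-in, and the cooling schedule is an arbitrary choice:

<pre style="font-size:12px">
n=5;
D=rand(n); D=(D+D')/2;                   % hypothetical symmetric distance matrix
routelen=@(rt) sum(D(sub2ind([n n],rt,[rt(2:end) rt(1)])));  % route length, returning to start
route=randperm(n);                       % initial route
T=100;
while T>0.001
    ij=randperm(n,2);                    % pick two positions to swap
    cand=route; cand(ij)=cand(fliplr(ij));
    r=min(exp((routelen(route)-routelen(cand))/T),1);
    if rand<r
        route=cand;                      % accept the proposed route
    end
    T=0.99*T;                            % cool down
end
route
routelen(route)
</pre>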
