Difference between revisions of "stat340s13"

From statwiki
Jump to: navigation, search
(Properties of Markov Chain)
m (Conversion script moved page Stat340s13 to stat340s13: Converting page titles to lowercase)
 
Line 1: Line 1:
 +
<div style = "align:left; background:#00ffff; font-size: 150%">
 +
If you
 +
use ideas, plots, text, code and other intellectual property developed by someone else
 +
in your `wikicoursenote' contribution , you have to cite the
 +
original source. If you copy a sentence or a paragraph from work done by someone
 +
else, in addition to citing the original source you have to use quotation marks to
 +
identify the scope of the copied material. Evidence of copying or plagiarism will
 +
cause a failing mark in the course.
 +
 +
Example of citing the original source
 +
 +
Assumptions Underlying Principal Component Analysis can be found here<ref>http://support.sas.com/publishing/pubcat/chaps/55129.pdf</ref>
 +
 +
</div>
 +
 +
==Important Notes==
 +
<span style="color:#ff0000;font-size: 200%"> To make distinction between the material covered in class and additional material that you have add to the course, use the following convention. For anything that is not covered in the lecture write:</span>
 +
 +
<div style = "align:left; background:#F5F5DC; font-size: 120%">
 +
In the news recently was a story that captures some of the ideas behind PCA. Over the past two years, Scott Golder and Michael Macy, researchers from Cornell University, collected 509 million Twitter messages from 2.4 million users in 84 different countries. The data they used were words collected at various times of day and they classified the data into two different categories: positive emotion words and negative emotion words. Then, they were able to study this new data to evaluate subjects' moods at different times of day, while the subjects were in different parts of the world. They found that the subjects generally exhibited positive emotions in the mornings and late evenings, and negative emotions mid-day. They were able to "project their data onto a smaller dimensional space" using PCS. Their paper, "Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures," is available in the journal Science.<ref>http://www.pcworld.com/article/240831/twitter_analysis_reveals_global_human_moodiness.html</ref>.
 +
 +
Assumptions Underlying Principal Component Analysis can be found here<ref>http://support.sas.com/publishing/pubcat/chaps/55129.pdf</ref>
 +
 +
</div>
 +
 
== Introduction, Class 1 - Tuesday, May 7 ==
 
== Introduction, Class 1 - Tuesday, May 7 ==
  
Line 56: Line 81:
 
=== Four Fundamental Problems ===
 
=== Four Fundamental Problems ===
 
<!-- br tag for spacing-->
 
<!-- br tag for spacing-->
1. Classification: Given an input object X, we have a function which will take in this input X and identify which 'class (Y)' it belongs to (Discrete Case) <br />
+
1 Classification: Given input object X, we have a function which will take this input X and identify which 'class (Y)' it belongs to (Discrete Case) <br />
   i.e taking value from x, we could predict y.
+
   <font size="3">i.e taking value from x, we could predict y.</font>
 
(For example, if you have 40 images of oranges and 60 images of apples (represented by x), you can estimate a function that takes the images and states what type of fruit it is - note Y is discrete in this case.) <br />
 
(For example, if you have 40 images of oranges and 60 images of apples (represented by x), you can estimate a function that takes the images and states what type of fruit it is - note Y is discrete in this case.) <br />
2. Regression: Same as classification but in the continuous case except y is non discrete. (Example of stock prices) <br />
+
2 Regression: Same as classification but in the continuous case except y is non discrete. Results from regression are often used for prediction,forecasting and etc. (Example of stock prices, height, weight, etc.) <br />
 
(A simple practice might be investigating the hypothesis that higher levels of education cause higher levels of income.) <br />
 
(A simple practice might be investigating the hypothesis that higher levels of education cause higher levels of income.) <br />
3. Clustering: Use common features of objects in same class or group to form clusters.(in this case, x is given, y is unknown) <br />
+
3 Clustering: Use common features of objects in same class or group to form clusters.(in this case, x is given, y is unknown; For example, clustering by provinces to measure average height of Canadian men.) <br />
4. Dimensionality Reduction (aka Feature extraction, Manifold learning): Used when we have a variable in high dimension space and we want to reduce the dimension. <br />
+
4 Dimensionality Reduction (also known as Feature extraction, Manifold learning): Used when we have a variable in high dimension space and we want to reduce the dimension <br />
  
 
=== Applications ===
 
=== Applications ===
Line 94: Line 119:
 
*Email all questions and concerns to UWStat340@gmail.com. Do not use your personal email address! Do not email instructor or TAs about the class directly to their personal accounts!
 
*Email all questions and concerns to UWStat340@gmail.com. Do not use your personal email address! Do not email instructor or TAs about the class directly to their personal accounts!
  
'''Wikicourse note (10% of final mark):'''
+
'''Wikicourse note (complete at least 12 contributions to get 10% of final mark):'''
 
When applying for an account in the wikicourse note, please use the quest account as your login name while the uwaterloo email as the registered email. This is important as the quest id will be used to identify the students who make the contributions.
 
When applying for an account in the wikicourse note, please use the quest account as your login name while the uwaterloo email as the registered email. This is important as the quest id will be used to identify the students who make the contributions.
 
Example:<br/>
 
Example:<br/>
Line 121: Line 146:
 
- Variance reduction<br />
 
- Variance reduction<br />
 
- Markov Chain Monte Carlo
 
- Markov Chain Monte Carlo
 
=== Tentative Marking Scheme ===
 
{| class="wikitable"
 
|-
 
! Item
 
! Value
 
|-
 
| Assignments (~6)
 
| 30%
 
|-
 
| WikiCourseNote
 
| 10%
 
|-
 
| Midterm
 
| 20%
 
|-
 
| Final
 
| 40%
 
|}
 
 
 
'''The final exam is going to be closed book and only non-programmable calculators are allowed.''' <br>
 
'''A passing mark must be achieved in the final to pass the course'''
 
  
 
==Class 2 - Thursday, May 9==
 
==Class 2 - Thursday, May 9==
Line 150: Line 152:
 
Simulation is the imitation of a process or system over time. Computational power has introduced the possibility of using simulation study to analyze models used to describe a situation.
 
Simulation is the imitation of a process or system over time. Computational power has introduced the possibility of using simulation study to analyze models used to describe a situation.
  
In order to perform a simulation study, we must first:
+
In order to perform a simulation study, we should:
<br\> 1. Use a computer to generate (pseudo) random numbers.<br>
+
<br\> 1 Use a computer to generate (pseudo*) random numbers (rand in MATLAB).<br>
2. Use these numbers to generate values of random variable from distributions.<br>
+
2 Use these numbers to generate values of random variable from distributions: for example, set a variable in terms of uniform u ~ U(0,1).<br>
3. Using the concept of discrete events, we show how the random variables can be used to generate the behavior of a stochastic model over time. (Note: A stochastic model is the opposite of deterministic model, where there are several directions the process can evolve to)<br>
+
3 Using the concept of discrete events, we show how the random variables can be used to generate the behavior of a stochastic model over time. (Note: A stochastic model is the opposite of deterministic model, where there are several directions the process can evolve to)<br>
4. After continually generating the behavior of the system, we can obtain estimators and other quantities of interest.<br>
+
4 After continually generating the behavior of the system, we can obtain estimators and other quantities of interest.<br>
  
 
The building block of a simulation study is the ability to generate a random number. This random number is a value from a random variable distributed uniformly on (0,1). There are many different methods of generating a random number: <br>
 
The building block of a simulation study is the ability to generate a random number. This random number is a value from a random variable distributed uniformly on (0,1). There are many different methods of generating a random number: <br>
  <br>Physical Method: Roulette wheel, lottery balls, dice rolling, card shuffling etc. <br>
+
  <br><font size="3">Physical Method: Roulette wheel, lottery balls, dice rolling, card shuffling etc. <br>
  <br>Numerically/Arithmetically: Use of a computer to successively generate pseudorandom numbers. The <br />sequence of numbers can appear to be random; however they are deterministically calculated with an <br />equation which defines pseudorandom. <br>
+
  <br>Numerically/Arithmetically: Use of a computer to successively generate pseudorandom numbers. The <br />sequence of numbers can appear to be random; however they are deterministically calculated with an <br />equation which defines pseudorandom. <br></font>
  
 
(Source: Ross, Sheldon M., and Sheldon M. Ross. Simulation. San Diego: Academic, 1997. Print.)
 
(Source: Ross, Sheldon M., and Sheldon M. Ross. Simulation. San Diego: Academic, 1997. Print.)
 +
 +
*We use the prefix pseudo because computer generates random numbers based on algorithms, which suggests that generated numbers are not truly random. Therefore pseudo-random numbers is used.
  
 
In general, a deterministic model produces specific results given certain inputs by the model user, contrasting with a '''stochastic''' model which encapsulates randomness and probabilistic events.
 
In general, a deterministic model produces specific results given certain inputs by the model user, contrasting with a '''stochastic''' model which encapsulates randomness and probabilistic events.
Line 166: Line 170:
 
<br>A computer cannot generate truly random numbers because computers can only run algorithms, which are deterministic in nature. They can, however, generate Pseudo Random Numbers<br>
 
<br>A computer cannot generate truly random numbers because computers can only run algorithms, which are deterministic in nature. They can, however, generate Pseudo Random Numbers<br>
  
'''Pseudo Random Numbers''' are the numbers that seem random but are actually deterministic. Although the pseudo random numbers are deterministic, these numbers have a sequence of value and all of them have the appearance of being independent uniform random variables. Being deterministic, pseudo random numbers are valuable and beneficial due to the ease of generation and manipulation.  
+
'''Pseudo Random Numbers''' are the numbers that seem random but are actually determined by a relative set of original values. It is a chain of numbers pre-set by a formula or an algorithm, and the value jump from one to the next, making it look like a series of independent random events. The flaw of this method is that, eventually the chain returns to its initial position and pattern starts to repeat, but if we make the number set large enough we can prevent the numbers from repeating too early. Although the pseudo random numbers are deterministic, these numbers have a sequence of value and all of them have the appearances of being independent uniform random variables. Being deterministic, pseudo random numbers are valuable and beneficial due to the ease to generate and manipulate.
  
When people do the test many times, the results will be the closed express values, which make the trial look deterministic, however for each trial, the result is random.
+
When people repeat the test many times, the results will be the closed express values, which make the trials look deterministic. However, for each trial, the result is random. So, it looks like pseudo random numbers.
So, it looks like pseudo random numbers.
 
  
 
==== Mod ====
 
==== Mod ====
Line 179: Line 182:
 
Generally, mod means taking the reminder after division by m.
 
Generally, mod means taking the reminder after division by m.
 
<br />
 
<br />
We say that n is congruent to r mod m if n = mq + r, where m is an integer. <br />
+
We say that n is congruent to r mod m if n = mq + r, where m is an integer.  
 +
Values are between 0 and m-1 <br />
 
if y = ax + b, then <math>b:=y \mod a</math>. <br />
 
if y = ax + b, then <math>b:=y \mod a</math>. <br />
  
For example:<br />
+
'''Example 1:'''<br />
30 = 4 * 7 + 2 <br />
+
 
2 = 30 mod 7<br />
+
<math>30 = 4 \cdot  7 + 2</math><br />
 +
 
 +
<math>2 := 30\mod 7</math><br />
 +
<br />
 +
<math>25 = 8 \cdot  3 + 1</math><br />
 +
 
 +
<math>1: = 25\mod 3</math><br />
 +
<br />
 +
<math>-3=5\cdot (-1)+2</math><br />
 +
 
 +
<math>2:=-3\mod 5</math><br />
 +
 
 +
<br />
 +
'''Example 2:'''<br />
 +
 
 +
If <math>23 = 3 \cdot  6 + 5</math> <br />
 +
 
 +
Then equivalently, <math>5 := 23\mod 6</math><br />
 +
<br />
 +
If <math>31 = 31 \cdot  1</math> <br />
 +
 
 +
Then equivalently, <math>0 := 31\mod 31</math><br />
 +
<br />
 +
If <math>-37 = 40\cdot (-1)+ 3</math> <br />
 +
 
 +
Then equivalently, <math>3 := -37\mod 40</math><br />
 +
 
 +
'''Example 3:'''<br />
 +
<math>77 = 3 \cdot  25 + 2</math><br />
 +
 
 +
<math>2 := 77\mod 3</math><br />
 +
<br />
 +
<math>25 = 25 \cdot  1 + 0</math><br />
 +
 
 +
<math>0: = 25\mod 25</math><br />
 +
<br />
  
25 = 8 * 3 + 1 <br />
 
1 = 25 mod 3<br />
 
  
  
Line 193: Line 230:
 
'''Note:''' <math>\mod</math> here is different from the modulo congruence relation in <math>\Z_m</math>, which is an equivalence relation instead of a function.
 
'''Note:''' <math>\mod</math> here is different from the modulo congruence relation in <math>\Z_m</math>, which is an equivalence relation instead of a function.
  
The modulo operation is useful for determining if an integer divided by another integer produces a non-zero remainder. But both integers should satisfy n = mq + r, where m, r, q, and n are all integers, and r is smaller than m.
+
The modulo operation is useful for determining if an integer divided by another integer produces a non-zero remainder. But both integers should satisfy <math>n = mq + r</math>, where <math>m</math>, <math>r</math>, <math>q</math>, and <math>n</math> are all integers, and <math>r</math> is smaller than <math>m</math>. The above rules also satisfy when any of <math>m</math>, <math>r</math>, <math>q</math>, and <math>n</math> is negative integer, see the third example.
  
 
==== Mixed Congruential Algorithm ====
 
==== Mixed Congruential Algorithm ====
We define the Linear Congruential Method to be <math>x_{k+1}=(ax_k + b) \mod m</math>, where <math>x_k, a, b, m \in \N, \;\text{with}\; a, m \neq 0</math>. Given a '''seed''' (i.e. an initial value <math>x_0 \in \N</math>), we can obtain values for <math>x_1, \, x_2, \, \cdots, x_n</math> inductively. The Multiplicative Congruential Method, invented by Berkeley professor D. H. Lehmer, may also refer to the special case where <math>b=0</math> and the Mixed Congruential Method is case where <math>b \neq 0</math> <br />
+
We define the Linear Congruential Method to be <math>x_{k+1}=(ax_k + b) \mod m</math>, where <math>x_k, a, b, m \in \N, \;\text{with}\; a, m \neq 0</math>. Given a '''seed''' (i.e. an initial value <math>x_0 \in \N</math>), we can obtain values for <math>x_1, \, x_2, \, \cdots, x_n</math> inductively. The Multiplicative Congruential Method, invented by Berkeley professor D. H. Lehmer, may also refer to the special case where <math>b=0</math> and the Mixed Congruential Method is case where <math>b \neq 0</math> <br />. Their title as "mixed" arises from the fact that it has both a multiplicative and additive term.
  
 
An interesting fact about '''Linear Congruential Method''' is that it is one of the oldest and best-known pseudo random number generator algorithms. It is very fast and requires minimal memory to retain state. However, this method should not be used for applications that require high randomness. They should not be used for Monte Carlo simulation and cryptographic applications. (Monte Carlo simulation will consider possibilities for every choice of consideration, and it shows the extreme possibilities. This method is not precise enough.)<br />
 
An interesting fact about '''Linear Congruential Method''' is that it is one of the oldest and best-known pseudo random number generator algorithms. It is very fast and requires minimal memory to retain state. However, this method should not be used for applications that require high randomness. They should not be used for Monte Carlo simulation and cryptographic applications. (Monte Carlo simulation will consider possibilities for every choice of consideration, and it shows the extreme possibilities. This method is not precise enough.)<br />
  
 +
[[File:Linear_Congruential_Statment.png‎|600px]] "Source: STAT 340 Spring 2010 Course Notes"
 +
 +
'''First consider the following algorithm'''<br />
 +
<math>x_{k+1}=x_{k} \mod m</math> <br />
  
 +
such that: if <math>x_{0}=5(mod 150)</math>, <math>x_{n}=3x_{n-1}</math>, find <math>x_{1},x_{8},x_{9}</math>. <br />
 +
<math>x_{n}=(3^n)*5(mod 150)</math> <br />
 +
<math>x_{1}=45,x_{8}=105,x_{9}=15</math> <br />
  
'''First consider the following algorithm'''<br />
 
<math>x_{k+1}=x_{k} \mod m</math>
 
  
  
Line 269: Line 311:
 
2. close all: closes all figures.<br />
 
2. close all: closes all figures.<br />
 
3. who: displays all defined variables.<br />
 
3. who: displays all defined variables.<br />
4. clc: clears screen.<br /><br />
+
4. clc: clears screen.<br />
5. ; : prevents the results from printing.<br /><br />
+
5. ; : prevents the results from printing.<br />
 +
6. disstool: displays a graphing tool.<br /><br />
  
 
<pre style="font-size:16px">
 
<pre style="font-size:16px">
Line 325: Line 368:
 
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\
 
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\
 
\end{align}</math><br />
 
\end{align}</math><br />
 
+
Another Example, a =3, b =2, m = 5, x_0=1
 
etc.
 
etc.
 
<hr/>
 
<hr/>
Line 337: Line 380:
  
 
'''Examples:[From Textbook]'''<br />
 
'''Examples:[From Textbook]'''<br />
If <math>x_0=3</math> and <math>x_n=(5x_{n-1}+7)\mod 200</math>, find <math>x_1,\cdots,x_{10}</math>.<br />
+
<math>\text{If }x_0=3 \text{ and } x_n=(5x_{n-1}+7)\mod 200</math>, <math>\text{find }x_1,\cdots,x_{10}</math>.<br />
 
'''Solution:'''<br />
 
'''Solution:'''<br />
 
<math>\begin{align}
 
<math>\begin{align}
Line 353: Line 396:
  
 
'''Comments:'''<br />
 
'''Comments:'''<br />
Typically, it is good to choose <math>m</math> such that <math>m</math> is large, and <math>m</math> is prime. Careful selection of parameters '<math>a</math>' and '<math>b</math>' also helps generate relatively "random" output values, where it is harder to identify patterns. For example, when we used a composite (non prime) number such as 40 for <math>m</math>, our results were not satisfactory in producing an output resembling a uniform distribution.<br />
 
  
The computed values are between 0 and <math>m-1</math>. If the values are normalized by dividing by '''<math>m-1</math>''', their result is numbers uniformly distributed on the interval <math>\left[0,1\right]</math> (similar to computing from uniform distribution).<br />
+
Matlab code:
 +
a=5;
 +
b=7;
 +
m=200;
 +
x(1)=3;
 +
for ii=2:1000
 +
x(ii)=mod(a*x(ii-1)+b,m);
 +
end
 +
size(x);
 +
hist(x)
 +
 
 +
 
 +
 
 +
Typically, it is good to choose <math>m</math> such that <math>m</math> is large, and <math>m</math> is prime. Careful selection of parameters '<math>a</math>' and '<math>b</math>' also helps generate relatively "random" output values, where it is harder to identify patterns. For example, when we used a composite (non prime) number such as 40 for <math>m</math>, our results were not satisfactory in producing an output resembling a uniform distribution.<br />
 +
 
 +
The computed values are between 0 and <math>m-1</math>. If the values are normalized by dividing by '''<math>m-1</math>''', their result is numbers uniformly distributed on the interval <math>\left[0,1\right]</math> (similar to computing from uniform distribution).<br />
  
 
From the example shown above, if we want to create a large group of random numbers, it is better to have large, prime <math>m</math> so that the generated random values will not repeat after several iterations. Note: the period for this example is 8: from '<math>x_2</math>' to '<math>x_9</math>'.<br />
 
From the example shown above, if we want to create a large group of random numbers, it is better to have large, prime <math>m</math> so that the generated random values will not repeat after several iterations. Note: the period for this example is 8: from '<math>x_2</math>' to '<math>x_9</math>'.<br />
Line 374: Line 431:
  
 
Example:<br />
 
Example:<br />
  Xn=(15Xn-1 + 4) mod 7<br />
+
  <font size="3">Xn=(15Xn-1 + 4) mod 7</font><br />
 
(i) m=7 c=4 -> coprime;<br />
 
(i) m=7 c=4 -> coprime;<br />
 
(ii) a-1=14 and a-1 is divisible by 7;<br />
 
(ii) a-1=14 and a-1 is divisible by 7;<br />
Line 404: Line 461:
 
</pre>
 
</pre>
 
</div>
 
</div>
 +
Another algorithm for generating pseudo random numbers is the multiply with carry method. Its simplest form is similar to the linear congruential generator. They differs in that the parameter b changes in the MWC algorithm. It is as follows: <br>
  
<math><math>Insert formula here</math><math><math>Insert formula here</math></math></math>=== Inverse Transform Method ===
+
1.) x<sub>k+1</sub> = ax<sub>k</sub> + b<sub>k</sub> mod m <br>
 +
2.) b<sub>k+1</sub> = floor((ax<sub>k</sub> + b<sub>k</sub>)/m) <br>
 +
3.) set k to k + 1 and go to step 1
 +
[http://www.javamex.com/tutorials/random_numbers/multiply_with_carry.shtml Source]
 +
 
 +
=== Inverse Transform Method ===
 
Now that we know how to generate random numbers, we use these values to sample form distributions such as exponential. However, to easily use this method, the probability distribution consumed must have a cumulative distribution function (cdf) <math>F</math> with a tractable (that is, easily found) inverse <math>F^{-1}</math>.<br />
 
Now that we know how to generate random numbers, we use these values to sample form distributions such as exponential. However, to easily use this method, the probability distribution consumed must have a cumulative distribution function (cdf) <math>F</math> with a tractable (that is, easily found) inverse <math>F^{-1}</math>.<br />
  
Line 417: Line 480:
 
'''Proof of the theorem:'''<br />
 
'''Proof of the theorem:'''<br />
 
The generalized inverse satisfies the following: <br />
 
The generalized inverse satisfies the following: <br />
<math>\begin{align}
+
 
\forall u \in \left[0,1\right], \, x \in \R, \\
+
:<math>P(X\leq x)</math> <br />
&{} F^{-1}\left(u\right) \leq x &{} \\
+
<math>= P(F^{-1}(U)\leq x)</math> (since <math>X= F^{-1}(U)</math> by the inverse method)<br />
\Rightarrow &{} F\Big(F^{-1}\left(u\right)\Big) \leq F\left(x\right) &&{} F \text{ is non-decreasing} \\
+
<math>= P((F(F^{-1}(U))\leq F(x))</math>  (since <math>F </math> is monotonically increasing) <br />
\Rightarrow &{} F\Big(\inf \{y \in \R | F(y)\geq u \}\Big) \leq F\left(x\right) &&{} \text{by definition of } F^{-1} \\
+
<math>= P(U\leq F(x)) </math> (since <math> P(U\leq a)= a</math> for <math>U \sim U(0,1), a \in [0,1]</math>,<br />
\Rightarrow &{} \inf \{F(y) \in [0,1] | F(y)\geq u \} \leq F\left(x\right) &&{} F \text{ is right continuous and non-decreasing} \\
+
<math>= F(x) , \text{ where } 0 \leq F(x) \leq 1 </math>  <br />
\Rightarrow &{} u \leq F\left(x\right) &&{} \text{by definition of } \inf \\
+
 
\Rightarrow &{} x \in \{y \in \R | F(y) \geq u\} &&{} \\
+
This is the c.d.f. of X.  <br />
\Rightarrow &{} x \geq \inf \{y \in \R | F(y)\geq u \}\Big) &&{} \text{by definition of } \inf \\
+
<br />
\Rightarrow &{} x \geq F^{-1}(u) &&{} \text{by definition of } F^{-1} \\
 
\end{align}</math>
 
  
 
That is <math>F^{-1}\left(u\right) \leq x \Leftrightarrow u \leq F\left(x\right)</math><br />
 
That is <math>F^{-1}\left(u\right) \leq x \Leftrightarrow u \leq F\left(x\right)</math><br />
Line 440: Line 501:
  
 
In short, what the theorem tells us is that we can use a random number <math> U from U(0,1) </math> to randomly sample a point on the CDF of X, then apply the inverse of the CDF to map the given probability to its domain, which gives us the random variable X.<br/>
 
In short, what the theorem tells us is that we can use a random number <math> U from U(0,1) </math> to randomly sample a point on the CDF of X, then apply the inverse of the CDF to map the given probability to its domain, which gives us the random variable X.<br/>
 
 
'''Proof of F(x) is Uniformly Distributed:'''
 
 
P(F(x)<u)=P(F<sup>-1</sup>(F(x))<F<sup>-1</sup>(u))=P(x<F<sup>-1</sup>(u))=F(F<sup>-1</sup>(u))=u
 
 
So F(x)~U(0,1).
 
  
  
Line 462: Line 516:
 
Step 2: <math>  x=\frac{-ln(U)}{\lambda} </math> <br /><br />
 
Step 2: <math>  x=\frac{-ln(U)}{\lambda} </math> <br /><br />
  
 +
 +
EXAMPLE 2 Normal distribution
 +
G(y)=P[Y<=y)
 +
      =P[-sqr (y) < z < sqr (y))
 +
      =integrate from -sqr(z) to Sqr(z) 1/sqr(2pi) e ^(-z^2/2) dz
 +
      = 2 integrate from 0 to sqr(y)  1/sqr(2pi) e ^(-z^2/2) dz
 +
its the cdf of Y=z^2
 +
 +
pdf g(y)= G'(y)
 +
pdf pf x^2 (1)
  
 
'''MatLab Code''':<br />
 
'''MatLab Code''':<br />
Line 467: Line 531:
 
<pre style="font-size:16px">
 
<pre style="font-size:16px">
 
>>u=rand(1,1000);
 
>>u=rand(1,1000);
>>hist(u)      #will generate a fairly uniform diagram
+
>>hist(u)      # this will generate a fairly uniform diagram
 
</pre>
 
</pre>
 
[[File:ITM_example_hist(u).jpg|300px]]
 
[[File:ITM_example_hist(u).jpg|300px]]
Line 498: Line 562:
 
Step 2: Compute <math>X = F^-1(U)</math> i.e. <math>X = \theta  + \frac {1}{\lambda} ln(2U)</math> for U < 0.5 else <math>X = \theta -\frac {1}{\lambda} ln(2(1-U))</math>
 
Step 2: Compute <math>X = F^-1(U)</math> i.e. <math>X = \theta  + \frac {1}{\lambda} ln(2U)</math> for U < 0.5 else <math>X = \theta -\frac {1}{\lambda} ln(2(1-U))</math>
  
'''MatLab Code''':<br />
 
<pre style="font-size:16px">
 
>> u = rand;
 
>> theta  = 1;
 
>> lambda = 1;
 
>> if u < 0.5
 
      X = theta + (1/lambda) * log(2u);
 
  else
 
      X = theta - (1/lambda) * log(2(1-u));
 
  end
 
</pre>
 
  
 
'''Example 3 - <math>F(x) = x^5</math>''':<br/>
 
'''Example 3 - <math>F(x) = x^5</math>''':<br/>
Line 514: Line 567:
 
Sol:  
 
Sol:  
 
Let <math>y=x^5</math>, solve for x: <math>x=y^\frac {1}{5}</math>. Therefore, <math>F^{-1} (x) = x^\frac {1}{5}</math><br />
 
Let <math>y=x^5</math>, solve for x: <math>x=y^\frac {1}{5}</math>. Therefore, <math>F^{-1} (x) = x^\frac {1}{5}</math><br />
Hence, to obtain a value of x from F(x), we first set u as an uniform distribution, then obtain the inverse function of F(x), and set
+
Hence, to obtain a value of x from F(x), we first set 'u' as an uniform distribution, then obtain the inverse function of F(x), and set
 
<math>x= u^\frac{1}{5}</math><br /><br />
 
<math>x= u^\frac{1}{5}</math><br /><br />
  
Line 521: Line 574:
 
Step 1: Draw U ~ rand[0, 1];<br />
 
Step 1: Draw U ~ rand[0, 1];<br />
 
Step 2: X=U^(1/5);<br />
 
Step 2: X=U^(1/5);<br />
 
'''MatLab Code''':<br />
 
<pre style="font-size:16px">
 
>>x=rand^(1/5)
 
</pre>
 
 
  
 
'''Example 4 - BETA(1,β)''':<br/>
 
'''Example 4 - BETA(1,β)''':<br/>
Line 538: Line 585:
 
<math>x = 1-(1-u)^\frac {1}{\beta}</math><br />
 
<math>x = 1-(1-u)^\frac {1}{\beta}</math><br />
 
let β=3, use Matlab to construct N=1000 observations from Beta(1,3)<br />
 
let β=3, use Matlab to construct N=1000 observations from Beta(1,3)<br />
Matlab Code:<br />
+
'''MatLab Code''':<br />
>> u = rand(1,1000);<br />
+
 
  x = 1-(1-u)^(1/3);<br />
+
<pre style="font-size:16px">
>> hist(x,50)<br />
+
>> u = rand(1,1000);
>> mean(x)<br />
+
x = 1-(1-u)^(1/3);
 +
>> hist(x,50)
 +
>> mean(x)
 +
</pre>
  
 
'''Example 5 - Estimating <math>\pi</math>''':<br/>
 
'''Example 5 - Estimating <math>\pi</math>''':<br/>
Line 552: Line 602:
 
Thus <math>\pi= 4(\frac {N_c}{N})</math><br />
 
Thus <math>\pi= 4(\frac {N_c}{N})</math><br />
  
   For example, '''UNIF(a,b)'''<br />
+
   <font size="3">For example, '''UNIF(a,b)'''<br />
 
   <math>y = F(x) = (x - a)/ (b - a) </math>
 
   <math>y = F(x) = (x - a)/ (b - a) </math>
 
   <math>x = (b - a ) * y + a</math>
 
   <math>x = (b - a ) * y + a</math>
 
   <math>X = a + ( b - a) * U</math><br />
 
   <math>X = a + ( b - a) * U</math><br />
   where U is UNIF(0,1)
+
   where U is UNIF(0,1)</font>
  
 
'''Limitations:'''<br />
 
'''Limitations:'''<br />
Line 571: Line 621:
 
</pre>  
 
</pre>  
  
This command allows users to explore the effect of changes of parameters on the plot of either a CDF or PDF.  
+
This command allows users to explore different types of distribution and see how the changes affect the parameters on the plot of either a CDF or PDF.
 +
 
  
 
[[File:Disttool.jpg|450px]]
 
[[File:Disttool.jpg|450px]]
Line 578: Line 629:
 
== Class 3 - Tuesday, May 14 ==
 
== Class 3 - Tuesday, May 14 ==
 
=== Recall the Inverse Transform Method ===
 
=== Recall the Inverse Transform Method ===
 +
Let U~Unif(0,1),then the random variable  X = F<sup>-1</sup>(u) has distribution F.  <br />
 +
To sample X with CDF F(x), <br />
  
To sample X with CDF F(x), <br />
+
<math>1) U~ \sim~ Unif [0,1] </math>
 +
'''2) X = F<sup>-1</sup>(u)   '''<br />
  
'''1. Draw u~U(0,1) '''<br />
 
'''2. X = F<sup>-1</sup>(u)  '''<br />
 
  
  
'''Proof''' <br />
 
First note that
 
<math>P(U\leq a)=a, \forall a\in[0,1]</math> <br />
 
  
:<math>P(X\leq x)</math> <br />
 
<math>= P(F^{-1}(U)\leq x)</math> (since <math>X= F^{-1}(U)</math> by the inverse method)<br />
 
<math>= P((F(F^{-1}(U))\leq F(x))</math>  (since <math>F </math> is monotonically increasing) <br />
 
<math>= P(U\leq F(x)) </math> (since <math> P(U\leq a)= a</math> for <math>U \sim U(0,1), a \in [0,1]</math>, this is explained further below)<br />
 
<math>= F(x) , \text{ where } 0 \leq F(x) \leq 1 </math>  <br />
 
  
This is the c.d.f. of X.  <br />
 
 
<br />
 
<br />
  
This same technique can be used to sample from discrete distribution.<br />
+
'''Note''': CDF of a U(a,b) random variable is:
 
 
'''Note''': that the CDF of a U(a,b) random variable is:
 
 
:<math>
 
:<math>
 
   F(x)= \begin{cases}
 
   F(x)= \begin{cases}
Line 608: Line 649:
 
   \end{cases}
 
   \end{cases}
 
</math>  
 
</math>  
 
Further, the pdf <math>f(x) = \frac{1}{b-a}</math> and 0 otherwise.
 
  
 
Thus, for <math> U </math> ~ <math>U(0,1) </math>, we have <math>P(U\leq 1) = 1</math> and <math>P(U\leq 1/2) = 1/2</math>.<br />
 
Thus, for <math> U </math> ~ <math>U(0,1) </math>, we have <math>P(U\leq 1) = 1</math> and <math>P(U\leq 1/2) = 1/2</math>.<br />
Line 624: Line 663:
  
 
Note that on a single point there is no mass probability (i.e. <math>u</math> <= 0.5, is the same as <math> u </math> < 0.5)  
 
Note that on a single point there is no mass probability (i.e. <math>u</math> <= 0.5, is the same as <math> u </math> < 0.5)  
More formally, this is saying that <math> P(X = x) = F(x)- \lim_{s \to x^-}F(x)</math> which equals zero for any continuous random variable
+
More formally, this is saying that <math> P(X = x) = F(x)- \lim_{s \to x^-}F(x)</math> , which equals zero for any continuous random variable
 
 
====Advantages of the Inverse Transform Method====
 
 
 
*  It is very easy to use and apply if we are able to find the inverse cdf <math> F^{-1}(\cdot)</math>.
 
*  It preserves monotonicity and correlation, which consequently helps in order statistics, variance reduction methods, and also generating truncated distributions.
 
  
 
====Limitations of the Inverse Transform Method====
 
====Limitations of the Inverse Transform Method====
  
Though this method is very easy to use and apply,  it does have some major disadvantages/limitations:
+
Though this method is very easy to use and apply,  it does have a major disadvantage/limitation:
  
*  Since a number of comparisons are required, the speed of this method is often very slow.
+
*  We need to find the inverse cdf <math> F^{-1}(\cdot) </math>. In some cases the inverse function does not exist, or is difficult to find because it requires a closed form expression for F(x).
*  We need to find the inverse cdf <math> F^{-1}(\cdot) </math>. In some cases the inverse function does not exist, or is difficult to find.
 
  
 
For example, it is too difficult to find the inverse cdf of the Gaussian distribution, so we must find another method to sample from the Gaussian distribution.
 
For example, it is too difficult to find the inverse cdf of the Gaussian distribution, so we must find another method to sample from the Gaussian distribution.
  
[Discrete Case]
+
In conclusion, we need to find another way of sampling from more complicated distributions
 +
 
 +
=== Discrete Case ===
 
The same technique can be used for discrete case. We want to generate a discrete random variable x, that has probability mass function: <br/>
 
The same technique can be used for discrete case. We want to generate a discrete random variable x, that has probability mass function: <br/>
  
Line 655: Line 690:
  
 
Note that after generating a random U, the value of X can be determined by finding the interval <math>[F(x_{j-1}),F(x_{j})]</math> in which U lies. <br />
 
Note that after generating a random U, the value of X can be determined by finding the interval <math>[F(x_{j-1}),F(x_{j})]</math> in which U lies. <br />
 +
 +
In summary:
 +
Generate a discrete r.v.x that has pmf:<br />
 +
  P(X=xi)=Pi,    x0<x1<x2<... <br />
 +
1. Draw U~U(0,1);<br />
 +
2. If F(x(i-1))<U<F(xi), x=xi.<br />
  
  
Line 684: Line 725:
 
else if U < 0.9 then output -2<br />
 
else if U < 0.9 then output -2<br />
 
else if U < 0.97 then output 0 else output 1<br />
 
else if U < 0.97 then output 0 else output 1<br />
 
* '''Matlab Code'''<br />
 
<pre style="font-size:16px">
 
>> u = rand;              # pick up F(x);
 
>> if u < 0.5            # Pr(x = -1)=0.5;
 
      x = -1;           
 
  elseif u< 0.8          # Pr(x = 2) =0.8-0.5 =0.3
 
      x = 2;
 
  elseif u < 0.9        # Pr(x = -2)=0.9-0.8 =0.1
 
      x = -2;
 
  elseif u < 0.97        # Pr(x = 0) =0.97-0.9=0.07
 
      x = 0;
 
  else                  # Pr(x = 1) =1 - 0.97=0.03
 
      x = 1;
 
  end                    # total probability:0.5+0.2+0.2+0.07+0.03=1
 
</pre>
 
  
 
'''Example 3.1 (from class):''' (Coin Flipping Example)<br />
 
'''Example 3.1 (from class):''' (Coin Flipping Example)<br />
Line 706: Line 731:
 
We can define the U function so that:  
 
We can define the U function so that:  
  
If U <= 0.5, then X = 0
+
If <math>U\leq 0.5</math>, then X = 0
  
and if  0.5 < U <= 1, then X =1.  
+
and if  <math>0.5 < U\leq 1</math>, then X =1.  
  
 
This allows the probability of Heads occurring to be 0.5 and is a good generator of a random coin flip.
 
This allows the probability of Heads occurring to be 0.5 and is a good generator of a random coin flip.
Line 776: Line 801:
 
3. else if 0.3<U<=0.5 deliver x=1<br />
 
3. else if 0.3<U<=0.5 deliver x=1<br />
 
4. else 0.5<U<=1 deliver x=2
 
4. else 0.5<U<=1 deliver x=2
 +
 +
Can you find a faster way to run this algorithm? Consider:
 +
 +
:<math>
 +
x = \begin{cases}
 +
2, & \text{if } U\leq 0.5 \\
 +
1, & \text{if } 0.5 < U \leq 0.7 \\
 +
0, & \text{if } 0.7 <U\leq 1
 +
\end{cases}</math>
 +
 +
The logic for this is that U is most likely to fall into the largest range. Thus by putting the largest range (in this case x >= 0.5) we can improve the run time of this algorithm. Could this algorithm be improved further using the same logic?
  
 
* '''Code''' (as shown in class)<br />
 
* '''Code''' (as shown in class)<br />
Line 797: Line 833:
 
[[File:Discrete_example.jpg|300px]]
 
[[File:Discrete_example.jpg|300px]]
  
Can you find a faster way to run this algorithm? Consider:
+
The algorithm above generates a vector (1,1000) containing 0's ,1's and 2's in differing proportions. Due to the criteria for accepting 0, 1 or 2 into the vector we get proportions of 0,1 &2 that correspond to their respective probabilities. So plotting the histogram (frequency of 0,1&2) doesn't give us the pmf but a frequency histogram that shows the proportions of each, which looks identical to the pmf.
 
 
:<math>
 
x = \begin{cases}
 
2, & \text{if } U\leq 0.5 \\
 
1, & \text{if } 0.5 < U \leq 0.7 \\
 
0, & \text{if } 0.7 <U\leq 1
 
\end{cases}</math>
 
 
 
The logic for this is that U is most likely to fall into the largest range. Thus by putting the largest range (in this case x >= 0.5) we can improve the run time of this algorithm. Could this algorithm be improved further using the same logic?
 
 
 
<pre style="font-size:16px">
 
close all
 
clear all
 
for ii=1:1000
 
    u=rand;
 
    if u<=0.5
 
      x(ii)=2;
 
    elseif u<=0.7
 
      x(ii)=1;
 
    else
 
      x(ii)=0;
 
    end
 
end
 
size(x)
 
hist(x)
 
</pre>
 
[[File:lec3.jpg|300px]]
 
 
 
  
 
'''Example 3.3''': Generating a random variable from pdf <br>
 
'''Example 3.3''': Generating a random variable from pdf <br>
Line 843: Line 851:
 
:<math>\begin{align} U = x^{2}, X = F^{-1}x(U)= U^{\frac{1}{2}}\end{align}</math>
 
:<math>\begin{align} U = x^{2}, X = F^{-1}x(U)= U^{\frac{1}{2}}\end{align}</math>
  
 
+
'''Example 3.4''': Generating a Bernoulli random variable <br>
* '''Code'''<br />
+
:<math>\begin{align} P(X = 1) = p,  P(X = 0) = 1 - p\end{align}</math>
<pre style="font-size:16px">
+
:<math>
>> u = rand;
 
>> x = u ^ (1/2);
 
</pre>
 
 
 
 
 
'''Example 3.4''': Generating a Bernoulli random variable <br>
 
:<math>\begin{align} P(X = 1) = p,  P(X = 0) = 1 - p\end{align}</math>
 
:<math>
 
 
F(x) = \begin{cases}
 
F(x) = \begin{cases}
 
1-p, & \text{if } x < 1 \\
 
1-p, & \text{if } x < 1 \\
Line 865: Line 865:
 
\end{cases}</math>
 
\end{cases}</math>
  
* '''Code'''<br />
 
<pre style="font-size:16px">
 
>> u = rand;
 
>> P = .3  % p from (0,1)
 
>> if u < (1-p)
 
      x = 0;
 
  else
 
      x = 1;
 
  end
 
</pre>
 
  
 
'''Example 3.5''': Generating Binomial(n,p) Random Variable<br>
 
'''Example 3.5''': Generating Binomial(n,p) Random Variable<br>
Line 885: Line 875:
 
Step 5: Go to step 3<br>
 
Step 5: Go to step 3<br>
 
*Note: These steps can be found in Simulation 5th Ed. by Sheldon Ross.
 
*Note: These steps can be found in Simulation 5th Ed. by Sheldon Ross.
*Note: Another method by seeing the Binomial as a sum of n independent Bernoulli random variables.<br>
+
*Note: Another method by seeing the Binomial as a sum of n independent Bernoulli random variables, U1, ..., Un. Then set X equal to the number of Ui that are less than or equal to p. To use this method, n random numbers are needed and n comparisons need to be done. On the other hand, the inverse transformation method is simpler because only one random variable needs to be generated and it makes 1 + np comparisons.<br>
 
Step 1: Generate n uniform numbers U1 ... Un.<br>
 
Step 1: Generate n uniform numbers U1 ... Un.<br>
 
Step 2: X = <math>\sum U_i < = p</math> where P is the probability of success.
 
Step 2: X = <math>\sum U_i < = p</math> where P is the probability of success.
Line 891: Line 881:
 
'''Example 3.6''': Generating a Poisson random variable <br>
 
'''Example 3.6''': Generating a Poisson random variable <br>
  
Let X ~ Poi(u). Write an algorithm to generate X.
+
"Let X ~ Poi(u). Write an algorithm to generate X.
 
The PDF of a poisson is:
 
The PDF of a poisson is:
 
:<math>\begin{align} f(x) = \frac {\, e^{-u} u^x}{x!} \end{align}</math>
 
:<math>\begin{align} f(x) = \frac {\, e^{-u} u^x}{x!} \end{align}</math>
Line 904: Line 894:
 
   <math>\begin{align} F = P(X = 0) = e^{-u}*u^0/{0!} = e^{-u} = p \end{align}</math>
 
   <math>\begin{align} F = P(X = 0) = e^{-u}*u^0/{0!} = e^{-u} = p \end{align}</math>
 
3) If U<F, output x <br>
 
3) If U<F, output x <br>
   Else, <math>\begin{align} p = (u/(x+1))^p \end{align}</math> <br>
+
   <font size="3">Else,</font> <math>\begin{align} p = (u/(x+1))^p \end{align}</math> <br>
 
         <math>\begin{align} F = F + p \end{align}</math> <br>
 
         <math>\begin{align} F = F + p \end{align}</math> <br>
 
         <math>\begin{align} x = x + 1 \end{align}</math> <br>
 
         <math>\begin{align} x = x + 1 \end{align}</math> <br>
4) Go to 1 <br>
+
4) Go to 1" <br>
 
   
 
   
Acknowledgements: This is from Stat 340 Winter 2013
+
Acknowledgements: This is an example from Stat 340 Winter 2013
  
  
Line 917: Line 907:
 
<math>P(X=x_i) = \, p (1-p)^{x_{i}-1}</math>
 
<math>P(X=x_i) = \, p (1-p)^{x_{i}-1}</math>
 
We have CDF:
 
We have CDF:
<math>F(x)=P(X \leq x)=1-P(X>x) = 1-(1-p)^x</math>, P(X>x) means we get at least x failures before observe the first success.
+
<math>F(x)=P(X \leq x)=1-P(X>x) = 1-(1-p)^x</math>, P(X>x) means we get at least x failures before we observe the first success.
 
Now consider the inverse transform:
 
Now consider the inverse transform:
 
:<math>
 
:<math>
Line 940: Line 930:
 
4. Else if <math>U \leq P_{0} + P_{1} + P_{2} </math> deliver <math>x = x_{2}</math><br />
 
4. Else if <math>U \leq P_{0} + P_{1} + P_{2} </math> deliver <math>x = x_{2}</math><br />
 
...  
 
...  
   Else if <math>U \leq P_{0} + ... + P_{k} </math> deliver <math>x = x_{k}</math><br />
+
   <font size="3">Else if</font> <math>U \leq P_{0} + ... + P_{k} </math> <font size="3">deliver</font> <math>x = x_{k}</math><br />
then if the <math>x_{i}</math>,<math>i \geq </math>, are ordered so that <math>x_{0}<x_{1}<x_{2}<...</math> and if we let F denote the distribution function of X, then <math>F(x_{k}) = \sum p_{i}</math> and so
 
                      X will equal <math>x_{j}</math> if <math>F(x_{j-1}) \leq U \leq F(x_{j})</math>
 
  
 
<br /'''>===Inverse Transform Algorithm for Generating a Binomial(n,p) Random Variable(from textbook)==='''
 
<br /'''>===Inverse Transform Algorithm for Generating a Binomial(n,p) Random Variable(from textbook)==='''
Line 953: Line 941:
  
 
'''Problems'''<br />
 
'''Problems'''<br />
1. We have to find <math> F^{-1} </math>
+
Though this method is very easy to use and apply, it does have a major disadvantage/limitation:
 
+
We need to find the inverse cdf  F^{-1}(\cdot) . In some cases the inverse function does not exist, or is difficult to find because it requires a closed form expression for F(x).
2. For many distributions, such as Gaussian, it is too difficult to find the inverse of <math> F(x)</math>.<br>
+
For example, it is too difficult to find the inverse cdf of the Gaussian distribution, so we must find another method to sample from the Gaussian distribution.
Flipping a coin is a discrete case of uniform distribution, and the code below shows an example of flipping a coin 1000 times; the result is closed to the expected value 0.5.<br>
+
In conclusion, we need to find another way of sampling from more complicated distributions
 +
Flipping a coin is a discrete case of uniform distribution, and the code below shows an example of flipping a coin 1000 times; the result is close to the expected value 0.5.<br>
 
Example 2, as another discrete distribution, shows that we can sample from parts like 0,1 and 2, and the probability of each part or each trial is the same.<br>
 
Example 2, as another discrete distribution, shows that we can sample from parts like 0,1 and 2, and the probability of each part or each trial is the same.<br>
 
Example 3 uses inverse method to figure out the probability range of each random varible.
 
Example 3 uses inverse method to figure out the probability range of each random varible.
Line 998: Line 987:
 
</div>
 
</div>
  
=== Continuous Case ===
+
=== Generalized Inverse-Transform Method ===
'''Example 3.8''': Generating a Weibull(a,b) distribution <br>
+
 
 +
Valid for any CDF F(x): return X=min{x:F(x)<math>\leq</math> U}, where U~U(0,1)
 +
 
 +
1. Continues, possibly with flat spots (i.e. not strictly increasing)
 +
 
 +
2. Discrete
 +
 
 +
3. Mixed continues discrete
  
Let X ~ Weibull(a,b). Write an algorithm to generate X. <br>
 
The PDF of X is: <br>
 
  
      <math>f(x) = ab^{-a}x^{a-1}exp((-x/b)^a)</math> ; x > 0 <br>
+
'''Advantages of Inverse-Transform Method'''
  
The CDF of X is: <br>
+
Inverse transform method preserves monotonicity and correlation
  
      <math>F(x) = 1-exp((-x/b)^a)</math> ; x > 0 <br>
+
which helps in
  
Solve for U = F(X) for X: <br>
+
1. Variance reduction methods ...
  
      <math>U = 1-exp((-x/b)^a)</math> <br>
+
2. Generating truncated distributions ...
<=>  <math>-(X/b)^a=ln(1-U)</math> <br>
 
<=>  <math>(X/b)=(-ln(1-U))^{1/a}</math><br>
 
<=>  <math>X=b(-ln(1-U))^{1/a}</math><br>
 
  
'''Algorithm''': <br>
+
3. Order statistics ...
1. Generate U~U(0,1) <- ''Note: Generating U and 1-U is the same since they are still U(0,1)''<br>
 
2. Return <math>X=b(-ln (U) )^{1/a}</math><br>
 
 
A simple example: Simulating an Exponential Random Variable:
 
If F(x)=1-exp(-x), then F-1(u) is that value of x such that
 
                          1-exp(-x)=u
 
or
 
                          x= -log(1-u)
 
Hence, if U is a uniform(0,1) variable, then
 
                          F-1(u)=-log(1-U)
 
is exponentially distributed with mean1. Since 1-U is also uniformly distributed on (0,1) it follows that -logU is exponential with mean 1. Since cX is exponential with mean c when X is exponential with mean 1, it follows that -clogU is exponential with mean c.
 
  
 
===Acceptance-Rejection Method===
 
===Acceptance-Rejection Method===
 
<span style="text-shadow:5px 5px 5px #555;">this is well worth reading</span>
 
  
 
Although the inverse transformation method does allow us to change our uniform distribution, it has two limits;
 
Although the inverse transformation method does allow us to change our uniform distribution, it has two limits;
Line 1,043: Line 1,021:
  
 
[[File:AR_Method.png]]
 
[[File:AR_Method.png]]
 +
  
 
The main logic behind the Acceptance-Rejection Method is that:<br>
 
The main logic behind the Acceptance-Rejection Method is that:<br>
Line 1,049: Line 1,028:
 
3. For each value of x, we accept and reject some points based on a probability, which will be discussed below.<br>
 
3. For each value of x, we accept and reject some points based on a probability, which will be discussed below.<br>
  
Note: If the red line was only g(x) as opposed to <math>\,c g(x)</math> (i.e. c=1), then <math>g(x) \geq f(x)</math> for all values of x if and only if g and f are the same functions. This is because the sum of pdf of g(x)=1 and the sum of pdf of f(x)=1, hence, <math>g(x) \ngeqq f(x)</math> &forall;x. <br>
+
Note: If the red line was only g(x) as opposed to <math>\,c g(x)</math> (i.e. c=1), then <math>g(x) \geq f(x)</math> for all values of x if and only if g and f are the same functions. This is because the sum of pdf of g(x)=1 and the sum of pdf of f(x)=1, hence, <math>g(x) \ngeqq f(x)</math> \,&forall;x. <br>
  
 
Also remember that <math>\,c g(x)</math> always generates higher probability than what we need. Thus we need an approach of getting the proper probabilities.<br><br>
 
Also remember that <math>\,c g(x)</math> always generates higher probability than what we need. Thus we need an approach of getting the proper probabilities.<br><br>
Line 1,059: Line 1,038:
 
3. Verify that <math>f(x)\leqslant c g(x)</math> at all the local maximums as well as the absolute maximums.<br>
 
3. Verify that <math>f(x)\leqslant c g(x)</math> at all the local maximums as well as the absolute maximums.<br>
 
4. Verify that <math>f(x)\leqslant c g(x)</math> at the tail ends by calculating <math>\lim_{x \to +\infty} \frac{f(x)}{\, c g(x)}</math> and <math>\lim_{x \to -\infty} \frac{f(x)}{\, c g(x)}</math> and seeing that they are both < 1. Use of L'Hopital's Rule should make this easy, since both f and g are p.d.f's, resulting in both of them approaching 0.<br>
 
4. Verify that <math>f(x)\leqslant c g(x)</math> at the tail ends by calculating <math>\lim_{x \to +\infty} \frac{f(x)}{\, c g(x)}</math> and <math>\lim_{x \to -\infty} \frac{f(x)}{\, c g(x)}</math> and seeing that they are both < 1. Use of L'Hopital's Rule should make this easy, since both f and g are p.d.f's, resulting in both of them approaching 0.<br>
5.Efficiency: the number of times N that steps 1 and 2 need to be called(also the number of iterations needed to successfully generate X) is a random variable and has a geometric distribution with success probability p=P(U<= f(Y)/(cg(Y))) , P(N=n)=(1-p^(n-1))p ,n>=1.Thus on average the number of iterations required is given by E(N)=1/p
+
5.Efficiency: the number of times N that steps 1 and 2 need to be called(also the number of iterations needed to successfully generate X) is a random variable and has a geometric distribution with success probability <math>p=P(U \leq f(Y)/(cg(Y)))</math> , <math>P(N=n)=(1-p(n-1))p ,n \geq 1</math>.Thus on average the number of iterations required is given by <math> E(N)=\frac{1} p</math>
  
 
c should be close to the maximum of f(x)/g(x), not just some arbitrarily picked large number. Otherwise, the Acceptance-Rejection method will have more rejections (since our probability <math>f(x)\leqslant c g(x)</math> will be close to zero). This will render our algorithm inefficient.  
 
c should be close to the maximum of f(x)/g(x), not just some arbitrarily picked large number. Otherwise, the Acceptance-Rejection method will have more rejections (since our probability <math>f(x)\leqslant c g(x)</math> will be close to zero). This will render our algorithm inefficient.  
Line 1,118: Line 1,097:
 
<math>P(y|accepted)=f(y)=\frac{P(accepted|y)P(y)}{P(accepted)}</math><br />         
 
<math>P(y|accepted)=f(y)=\frac{P(accepted|y)P(y)}{P(accepted)}</math><br />         
 
<br />based on the concept from '''procedure-step1''':<br />
 
<br />based on the concept from '''procedure-step1''':<br />
<math>P(y)=g(y)</math><br /> (first step:draw Y~g(.))
+
<math>P(y)=g(y)</math><br />
  
 
<math>P(accepted|y)=\frac{f(y)}{cg(y)}</math> <br />
 
<math>P(accepted|y)=\frac{f(y)}{cg(y)}</math> <br />
Line 1,131: Line 1,110:
 
           &=\frac{1}{c}
 
           &=\frac{1}{c}
 
\end{align}</math><br />
 
\end{align}</math><br />
(under any pdf, the area=1)
 
  
 
Therefore:<br />
 
Therefore:<br />
Line 1,139: Line 1,117:
 
&=\frac{\frac{f(y)}{c}}{1/c}\\
 
&=\frac{\frac{f(y)}{c}}{1/c}\\
 
&=f(y)\end{align}</math><br /><br /><br />
 
&=f(y)\end{align}</math><br /><br /><br />
so a sample of g results to a sample of f.
 
  
 
'''''Here is an alternative introduction of Acceptance-Rejection Method'''''
 
'''''Here is an alternative introduction of Acceptance-Rejection Method'''''
Line 1,145: Line 1,122:
 
'''Comments:'''
 
'''Comments:'''
  
-Acceptance-Rejection Method is not good for all cases. The limitation with this method is that sometimes many points will be rejected. One obvious cons is that it could be very hard to pick the <math>g(y)</math> and the constant <math>c</math> in some cases. We have to pick the SMALLEST C such that <math>cg(x) \leq f(x)</math> else the the algorithm will not be efficient. This is because <math>f(x)/cg(x)</math> will become smaller and probability <math>u \leq f(x)/cg(x)</math> will go down and many points will be rejected making the algorithm inefficient.  
+
-Acceptance-Rejection Method is not good for all cases. The limitation with this method is that sometimes many points will be rejected. One obvious disadvantage is that it could be very hard to pick the <math>g(y)</math> and the constant <math>c</math> in some cases. We have to pick the SMALLEST C such that <math>cg(x) \leq f(x)</math> else the the algorithm will not be efficient. This is because <math>f(x)/cg(x)</math> will become smaller and probability <math>u \leq f(x)/cg(x)</math> will go down and many points will be rejected making the algorithm inefficient.  
  
 
-'''Note:''' When <math>f(y)</math> is very different than <math>g(y)</math>, it is less likely that the point will be accepted as the ratio above would be very small and it will be difficult for <math>U</math> to be less than this small value. <br/>An example would be when the target function (<math>f</math>) has a spike or several spikes in its domain - this would force the known distribution (<math>g</math>) to have density at least as large as the spikes, making the value of <math>c</math> larger than desired. As a result, the algorithm would be highly inefficient.
 
-'''Note:''' When <math>f(y)</math> is very different than <math>g(y)</math>, it is less likely that the point will be accepted as the ratio above would be very small and it will be difficult for <math>U</math> to be less than this small value. <br/>An example would be when the target function (<math>f</math>) has a spike or several spikes in its domain - this would force the known distribution (<math>g</math>) to have density at least as large as the spikes, making the value of <math>c</math> larger than desired. As a result, the algorithm would be highly inefficient.
Line 1,153: Line 1,130:
 
We wish to generate X~Bi(2,0.5), assuming that we cannot generate this directly.<br/>
 
We wish to generate X~Bi(2,0.5), assuming that we cannot generate this directly.<br/>
 
We use a discrete distribution DU[0,2] to approximate this.<br/>
 
We use a discrete distribution DU[0,2] to approximate this.<br/>
<math>f(x)=Pr(X=x)=2Cx*(0.5)^2</math><br/>
+
<math>f(x)=Pr(X=x)=2Cx×(0.5)^2\,</math><br/>
  
 
{| class=wikitable  align=left
 
{| class=wikitable  align=left
Line 1,174: Line 1,151:
 
1. Generate <math>u,v~U(0,1)</math><br/>
 
1. Generate <math>u,v~U(0,1)</math><br/>
 
2. Set <math>y= \lfloor 3*u \rfloor</math> (This is using uniform distribution to generate DU[0,2]<br/>
 
2. Set <math>y= \lfloor 3*u \rfloor</math> (This is using uniform distribution to generate DU[0,2]<br/>
3. If <math>(y=0)</math> and <math>(v<1/2), output=0</math> <br/>
+
3. If <math>(y=0)</math> and <math>(v<\tfrac{1}{2}), output=0</math> <br/>
If <math>(y=2) </math> and <math>(v<1/2), output=2 </math><br/>
+
If <math>(y=2) </math> and <math>(v<\tfrac{1}{2}), output=2 </math><br/>
 
Else if <math>y=1, output=1</math><br/>
 
Else if <math>y=1, output=1</math><br/>
  
  
 
An elaboration of “c”<br/>
 
An elaboration of “c”<br/>
c is the expected number of times the code runs to output 1 random variable.  Remember that when <math>u < f(x)/(cg(x))</math> is not satisfied, we need to go over the code again.<br/>
+
c is the expected number of times the code runs to output 1 random variable.  Remember that when <math>u < \tfrac{f(x)}{cg(x)}</math> is not satisfied, we need to go over the code again.<br/>
  
 
Proof<br/>
 
Proof<br/>
Line 1,201: Line 1,178:
 
=== Example of Acceptance-Rejection Method===
 
=== Example of Acceptance-Rejection Method===
  
Generating a random variable having p.d.f. <br />
<math>\displaystyle f(x) = 20x(1 - x)^3, \quad 0< x <1  </math><br />
Since this random variable (which is beta with parameters (2,4)) is concentrated in the interval (0, 1), let us consider the acceptance-rejection method with<br />
<math>\displaystyle g(x) = 1, \quad 0<x<1</math><br />
To determine the constant c such that <math>f(x)/g(x) \leq c</math>, we use calculus to determine the maximum value of<br />
<math>\displaystyle f(x)/g(x) = 20x(1 - x)^3 </math><br />
Differentiation of this quantity yields<br />
<math>\displaystyle \frac{d}{dx}\left[\frac{f(x)}{g(x)}\right]=20\left[(1-x)^3-3x(1-x)^2\right]</math><br />
Setting this equal to 0 shows that the maximal value is attained when x = 1/4, and thus,<br />
<math>\displaystyle f(x)/g(x) \leq 20 \cdot (1/4) \cdot (3/4)^3=135/64=c </math><br />
Hence,<br />
<math>\displaystyle f(x)/cg(x)=(256/27)\,x(1-x)^3</math><br />
and thus the simulation procedure is as follows:
Line 1,230: Line 1,207:
 
===Another Example of Acceptance-Rejection Method===
 
===Another Example of Acceptance-Rejection Method===
 
Generate a random variable from:<br />  
 
Generate a random variable from:<br />  
  <math>f(x)=3*x^2</math>, 0< x <1<br />
+
<math>\displaystyle f(x)=3*x^2, 0<x<1 </math><br />
 
Assume g(x) to be uniform over interval (0,1), where 0< x <1<br />
 
Assume g(x) to be uniform over interval (0,1), where 0< x <1<br />
 
Therefore:<br />
 
Therefore:<br />
  <math>c = max(f(x)/(g(x)))= 3</math><br />   
+
<math>\displaystyle c = max(f(x)/(g(x)))= 3</math><br />   
  
 
the best constant c is the max(f(x)/(cg(x))) and the c make the area above the f(x) and below the g(x) to be small.
 
the best constant c is the max(f(x)/(cg(x))) and the c make the area above the f(x) and below the g(x) to be small.
because g(.) is uniform so the g(x) is 1. max(g(x)) is 1
+
because g(.) is uniform so the g(x) is 1. max(g(x)) is 1<br />
  <math>f(x)/(cg(x))= x^2</math><br />
+
<math>\displaystyle f(x)/(cg(x))= x^2</math><br />
 
Acknowledgement: this is example 1 from http://www.cs.bgu.ac.il/~mps042/acceptance.htm
 
Acknowledgement: this is example 1 from http://www.cs.bgu.ac.il/~mps042/acceptance.htm
  
 
== Class 4 - Thursday, May 16 ==

'''Goals'''<br>
*When we want to sample from a target distribution <math>f(x)</math>, we need to first find a proposal distribution <math>g(x)</math> that is easy to sample from. <br>
*The relationship between the proposal distribution and the target distribution is: <math> c \cdot g(x) \geq f(x) </math>, where c is a constant. This means that the curve of f(x) lies under the curve of <math> c \cdot g(x)</math>. <br>
*The chance of acceptance is lower when the distance between <math>f(x)</math> and <math> c \cdot g(x)</math> is large, and vice versa; we use <math> c </math> to keep <math> \frac {f(x)}{c \cdot g(x)} </math> below 1 (so <math>f(x) \leq c \cdot g(x)</math>). Therefore, we must find the constant <math> c </math> to achieve this.<br />
*In other words, <math>c</math> is chosen to make sure <math> c \cdot g(x) \geq f(x) </math>. However, it will not make sense if <math>c</math> is simply chosen to be arbitrarily large. We need to choose <math>c</math> such that <math>c \cdot g(x)</math> fits <math>f(x)</math> as tightly as possible; that is, we must find the minimum c such that the curve of f(x) is under the curve of c·g(x). <br />
*The constant c cannot be a negative number.<br />
  
 
'''How to find C''':<br />

<math>\begin{align}
&c \cdot g(x) \geq f(x)\\
&c \geq \frac{f(x)}{g(x)}\\
&c= \max \left(\frac{f(x)}{g(x)}\right)
\end{align}</math><br>

If <math>f</math> and <math> g </math> are continuous, we can find the extremum by taking the derivative and solving for <math>x_0</math> such that:<br/>
<math> 0=\frac{d}{dx}\frac{f(x)}{g(x)}\bigg|_{x=x_0}</math> <br/>

Thus <math> c = \frac{f(x_0)}{g(x_0)} </math><br/>
  
Note: This procedure is called the Acceptance-Rejection Method.<br>

'''The Acceptance-Rejection method''' involves finding a distribution that we know how to sample from, g(x), and multiplying g(x) by a constant c so that <math>c \cdot g(x)</math> is always greater than or equal to f(x). Mathematically, we want <math> c \cdot g(x) \geq f(x) </math>.
This means c has to be greater than or equal to <math>\frac{f(x)}{g(x)}</math>, so the smallest possible c that satisfies the condition is the maximum value of <math>\frac{f(x)}{g(x)}</math>.
If c is too large, the chance of acceptance of generated values will be small, thereby reducing the efficiency of the algorithm. Therefore, it is best to use the smallest possible c such that <math> c \cdot g(x) \geq f(x)</math>. <br>

'''Important points:'''<br>
  
 
*For this method to be efficient, the constant c must be selected so that the rejection rate is low. (The efficiency of this method is <math>\left ( \frac{1}{c} \right )</math>)<br>
*It is easy to show that the expected number of acceptances out of a fixed number of trials is <math> \frac{\text{Total Number of Trials}}{c} </math>. <br>
*Recall that the '''acceptance rate is 1/c''' (not the rejection rate).
:Let <math>X</math> be the number of trials for an acceptance, <math> X \sim~ Geo(\frac{1}{c})</math><br>
:<math>\mathbb{E}[X] = \frac{1}{\frac{1}{c}} = c </math>
*The number of trials needed to generate a sample size of <math>N</math> follows a negative binomial distribution. The expected number of trials needed is then <math>cN</math>.<br>
*So far, the only distribution we know how to sample from is the '''UNIFORM''' distribution. <br>
 
'''Procedure''': <br>

1. Choose <math>g(x)</math> (a simple density function that we know how to sample from, i.e. Uniform so far) <br>
The easiest case is <math>U \sim~ Unif [0,1] </math>. However, in other cases we need to generate UNIF(a,b); we may need to perform a linear transformation on the <math>U \sim~ Unif [0,1] </math> variable. <br>
2. Find a constant c such that :<math> c \cdot g(x) \geq f(x) </math>, otherwise return to step 1.
  

#If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math> then X=Y; else return to step 1 (This is not the way to find C. This is the general procedure.)
  
<hr><b>Example: Generate a random variable from the pdf</b><br>
 
<math> f(x) =
\begin{cases}
2x, & 0 \leq x \leq 1 \\
0, & \text{otherwise}
\end{cases}</math>

This is a beta(2,1) density; recall<br>
<math>beta(a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{(a-1)}(1-x)^{(b-1)}</math><br>
  
Where &Gamma;(n) = (n - 1)! if n is a positive integer.

<math>\Gamma(z)=\int _{0}^{\infty }t^{z-1}e^{-t}dt</math>
 
[[File:Beta(2,1)_example.jpg|750x750px]]
  
'''Note:''' g follows the uniform distribution and only covers half of the graph, running from 0 to 1 on the y-axis. Thus we need to multiply by c to ensure that <math>c\cdot g</math> covers the entire area under f(x). In this case, c=2, which makes g run from 0 to 2 on the y-axis and cover f(x).

'''Comment:'''<br>
From the picture above, we can observe that the area under f(x)=2x is half of the area under the pdf of UNIF(0,1). This is why, in order to sample 1000 points from f(x), we need to sample approximately 2000 points from UNIF(0,1).
In general, if we want to sample n points from a distribution with pdf f(x), we need to draw approximately <math>n\cdot c</math> points from the proposal distribution (g(x)) in total. <br>
 
</ol>

'''Note:''' In the above example, we sample 2 numbers. If the second number (u) is less than or equal to the first number (y), then accept x=y; if not, start all over.
  
 
<span style="font-weight:bold;color:green;">Matlab Code</span>

     end
   end
>>hist(x)          % plot a histogram of the accepted samples
>>jj
   jj = 2024        % should be around 2000
 
<pre style="font-size:16px">
>>u=rand(1,1000);
>>x=u.^0.5;        % square root of each element of u
>>hist(x)
</pre>
  
 
<span style="font-weight:bold;color:green;">Matlab Tip:</span>
Periods, ".", meaning "element-wise", are used to describe the operation you want performed on each element of a vector. In the above example, to take the square root of every element in U, the notation U.^0.5 is used. However, if you want to take the square root of the entire matrix U, the period "." would be excluded, i.e. let matrix B=U^0.5; then <math>B^T*B=U</math>. For example, if we have two 1 x 3 matrices and we want to find their element-wise product, using "." in the code will give us that product. However, if we don't use ".", it will just give us an error. For example, a=[1 2 3] and b=[2 3 4] are vectors; a.*b=[2 6 12], but a*b does not work since the matrix dimensions must agree.
  
 
'''Example:''' Generate a random variable from <math>f(x)=\frac{3}{4}(1-x^2), -1\leq x\leq 1</math>, using the proposal distribution <math>g(x)=\frac{1}{2}</math> on (-1,1) (i.e. UNIF(-1,1)):
 
<math> cg(x)\geq f(x), \quad
c\cdot\frac{1}{2} \geq \frac{3}{4} (1-x^2), \quad
c=\max\, 2\cdot\frac{3}{4} (1-x^2) = \frac{3}{2} </math>
  
 
The process:
:1: Draw <math>u_1 \sim~ U(0,1)</math>
:2: Set <math>y = 2u_1 - 1</math> (so that y is uniform over (-1,1))
:3: Draw <math>u_2 \sim~ U(0,1)</math>
:4: If <math>u_2 \leq \frac { \frac{3}{4} (1-y^2)} { \frac{3}{4}} = {1-y^2}</math>, then x=y. '''Note that''' <math>\frac{3}{4}(1-y^2)/\frac{3}{4}</math> comes from f(y) / (cg(y)).
:5: Else return to '''step 1'''
 
<span style="font-weight:bold;color:green;">Matlab Code</span>
<pre style="font-size:16px">
>> ii = 1;
>> while ii <= 1000   % example of generating 1000 points
      u1 = rand;
      u2 = rand;
      y = 2 * u1 - 1;  % make y uniform over (-1,1)
      if u2 <= (1 - y^2)
        x(ii) = y;
        ii = ii + 1;
      end
   end
</pre>

----
  
  
=====Example of Acceptance-Rejection Method=====

<math>\begin{align} f(x) = 3x^2, \quad 0<x<1 \end{align}</math><br/>

<math>\begin{align} g(x)=1, \quad 0<x<1 \end{align}</math><br/>

<math>c = \max \frac{f(x)}{g(x)} = \max \frac{3x^2}{1} = 3 </math><br>
  
 
1. Generate two uniform numbers in the unit interval <math>U_1, U_2 \sim~ U(0,1)</math><br>
2. If <math>U_2 \leqslant {U_1}^2</math>, accept <math>U_1</math> as the random variable with pdf <math>f</math>; if not, return to Step 1.

We can also use <math>g(x)=2x</math> for a more efficient algorithm:
  
 
<math>c = \max \frac{f(x)}{g(x)} = \max \frac {3x^2}{2x} = \max \frac {3x}{2} = \frac{3}{2} </math>.
Use the inverse method to sample from <math>g(x)</math>: here <math>G(x)=x^2</math>, so generate <math>U_1</math> from <math>U(0,1)</math> and set <math>y=\sqrt{U_1}</math>.

1. Generate two uniform numbers in the unit interval <math>U_1, U_2 \sim~ U(0,1)</math>, and set <math>y=\sqrt{U_1}</math><br>
2. If <math>U_2 \leq \sqrt{U_1}</math>, accept <math>y=\sqrt{U_1}</math> as the random variable with pdf <math>f</math>; if not, return to Step 1. (Here <math>\frac{f(y)}{c\,g(y)} = \frac{3y^2}{(3/2) \cdot 2y} = y = \sqrt{U_1}</math>.)

*Note: the function <math>q(x) = c \cdot g(x)</math> is called an envelope or majorizing function.<br>
To obtain a better proposal function <math>g(x)</math>, we can first assume a new <math>q(x)</math> and then solve for the normalizing constant by integrating.<br>
In the previous example, we first assume <math>q(x) = 3x</math>. To find the normalizing constant, we need to solve <math>k \int_0^1 3x\,dx = 1</math>, which gives us k = 2/3. So <math>g(x) = k \cdot q(x) = 2x</math>.
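
A minimal Matlab sketch of the more efficient version (our own illustration; the variable names and sample size are assumptions):

<pre style="font-size:16px">
ii = 1;
while ii <= 1000
    u1 = rand;
    u2 = rand;
    y = sqrt(u1);        % candidate from g(x)=2x via the inverse method
    if u2 <= y           % f(y)/(c*g(y)) = y
        x(ii) = y;
        ii = ii + 1;
    end
end
hist(x)
</pre>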
 
 
=====Another example of Acceptance-Rejection Method=====

Let <math> f(x) = x^3 </math> for <math> 0<x<\sqrt{2} </math>. Use the acceptance-rejection method with the proposal distribution <math> g(x)=x </math> for <math> 0<x<\sqrt{2} </math>.

<math> c=\max\left(\frac{f(x)}{g(x)}\right) = \max\left(\frac{x^3}{x}\right) = \max(x^2) = (\sqrt{2})^2 = 2 \Rightarrow \frac{f(x)}{c \cdot g(x)} = \frac{x^2}{2} </math> <br>
Hence, the algorithm is: <br>
1. Generate <math> u \sim~ U(0,1) </math> and set <math> y=\sqrt{2u} </math> (this samples y from g by the inverse method, since <math>G(y)=y^2/2</math>) <br>
2. Generate <math> U \sim~ U(0,1) </math> <br>
3. If <math> U \leqslant \frac{y^2}{2} </math>, then X=y; else go to step 1.

*Source: http://www.cs.bgu.ac.il/~mps042/acceptance.htm*
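
A brief Matlab sketch of this algorithm (our own illustration, using the inverse-method step 1 above):

<pre style="font-size:16px">
ii = 1;
while ii <= 1000
    y = sqrt(2*rand);     % candidate from g(x)=x on (0,sqrt(2)) via the inverse method
    u = rand;
    if u <= y^2/2         % f(y)/(c*g(y))
        x(ii) = y;
        ii = ii + 1;
    end
end
hist(x)
</pre>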
  
 
'''Possible Limitations'''
  
 
3. Now Y follows <math>U(a,b)</math>
 
  
 
'''Example''': Generate a random variable z from the semicircular density <math>f(x)= \frac{2}{\pi R^2} \sqrt{R^2-x^2}, -R\leq x\leq R</math>.

-> Proposal distribution: UNIF(-R, R), i.e. <math>g(x)=\frac{1}{2R}</math> for <math>-R\leq x\leq R</math>

-> We know how to generate this using <math> U \sim UNIF (0,1) </math>: let <math> Y= [R-(-R)]U+(-R)=2RU-R=R(2U-1)</math>; therefore Y follows <math>U(-R,R)</math>.

-> In order to maximize the ratio we must maximize the top and minimize the bottom.
 
'''One more example about the AR method''' <br/>

(In this example, we will see how to determine the value of c when c is a function with unknown parameters instead of a single value.)
Let <math>f(x)=xe^{-x}, x > 0 </math> <br/>
Use <math>g(x)=ae^{-ax}</math> to generate the random variable <br/>
<br/>
Solution: First of all, we need to find c<br/>
 
4) A uniform draw<br/>

==== Interpretation of 'C' ====
We can use the value of c to calculate the acceptance rate by <math>\tfrac{1}{c}</math>.

For instance, assume c=1.5; then we can tell that 66.7% of the points will be accepted (<math>\tfrac{1}{1.5} = 0.667</math>). We can also say that the efficiency of the method is 66.7%.

Likewise, if the minimum possible value of c is <math>\tfrac{4}{3}</math>, then <math>1/ \tfrac{4}{3} = \tfrac{3}{4}</math> of the generated random variables will be accepted. Thus the efficiency of the algorithm is 75%.

In order to ensure the algorithm is as efficient as possible, the value of c should be as close to one as possible, such that <math>\tfrac{1}{c}</math> approaches 1, i.e. a 100% acceptance rate.

<pre style="font-size:16px">
>> close all
>> clear all
>> ii=1;
>> while ii<=1000
     y=rand;
     u=rand;
     if u<=y          % accept with probability f(y)/(c*g(y)) = y for f(x)=2x
       x(ii)=y;
       ii=ii+1;
     end
   end
</pre>
 
== Class 5 - Tuesday, May 21 ==

   end
>>hist(x,20)                  % 20 is the number of bars
>>hist(x,30)                  % 30 is the number of bars
</pre>

Calculation process:<br>
<math>u_{1} \leq \sqrt{1-(2u-1)^2} </math> <br>
<math>(u_{1})^2 \leq 1-(2u-1)^2 </math> <br>
<math>(u_{1})^2 -1 \leq -(2u-1)^2 </math> <br>
<math>1-(u_{1})^2 \geq (2u-1)^2 </math> <br>

MATLAB tips: hist(x,y) plots a histogram of variable x, where y is the number of bars in the graph.
 
[[File:ARM_cont_example.jpg|300px]]
 
 
=== Discrete Examples ===
 
* '''Example 1''' <br>
 
The following algorithm then yields our X:

Step 1: Draw Y from the discrete uniform distribution on 1, 2, 3, 4 and 5, <math>Y \sim~ g</math>.<br/>
Step 2: Draw <math>U \sim~ U(0,1)</math>.<br/>
Step 3: If <math>U \leq \frac{f(Y)}{c \cdot g(Y)}</math>, then <b> X = Y </b>;<br/>
else return to Step 1.<br/>

c can be found by maximizing the ratio :<math> \frac{f(x)}{g(x)} </math>. To do this, we want to maximize <math> f(x) </math> and minimize <math> g(x) </math>. <br>

:<math>\frac{p(x)}{cg(x)} =  \frac{p(x)}{1.5 \cdot 0.2} = \frac{p(x)}{0.3} </math><br>

Note: The U is independent from y in Steps 2 and 3 above.
The constant c is an indicator of the rejection rate, i.e. of the efficiency of the algorithm. It also represents the expected number of trials per accepted value; thus, a higher c means the algorithm is comparatively inefficient.

Since g follows a discrete uniform distribution, the probability is the same for all values. And since there are 5 possible values (1,2,3,4,5), g(x)=1/5=0.2.

Remember that we always want to choose <math> c \cdot g </math> to be equal to or greater than <math> f </math>, but as close as possible.
<br />Limitations: If the form of the proposal distribution g is very different from the target distribution f, then c is very large and the algorithm is not computationally efficient.
  
 
* '''Code for example 1'''<br />
 
The acceptance rate is <math>\frac {1}{c}</math>, so the lower the c, the more efficient the algorithm. Theoretically, c = 1 is the best case because all samples would be accepted; however, this would only be true when the proposal and target distributions are exactly the same, which would never happen in practice.

For example, if c = 1.5, the acceptance rate would be <math>\frac {1}{1.5}=\frac {2}{3}</math>. This means about 66% of points are accepted. Thus, in order to generate 1000 random values, on average, a total of 1500 iterations would be required.

A histogram shows the 1000 random values of f(x); the more random values we generate, the closer the empirical frequencies get to the stated probability values.
 
Let g be the uniform distribution on 1, 2, or 3<br />
g(x)= 1/3<br />
<math>c=\max(\tfrac{p_{x}}{g(x)})=\tfrac{0.6}{1/3}=1.8</math><br />
Hence <math>\tfrac{p(x)}{cg(x)} = \tfrac{p(x)}{1.8 \cdot (1/3)}= \tfrac{p(x)}{0.6}</math>

1. y~g<br />
 
>>close all
>>clear all
>>p=[.1 .3 .6];     % vector holding the probability values
>>ii=1;
>>while ii < 1000
     y=unidrnd(3);  % a draw from the discrete uniform distribution on 1,2,3
     u=rand;
     if u<= p(y)/0.6
       x(ii)=y;
       ii=ii+1;
     end
   end
</pre>
  
[[File:May21_Example2.jpg|300px]]
 
  
 
* '''Example 3'''<br>

Suppose <math>p_{x} = e^{-3}3^{x}/x!, \; x\geq 0</math> (Poisson distribution)

'''First:''' Try the first few <math>p_{x}</math>'s: 0.0498, 0.149, 0.224, 0.224, 0.168, 0.101, 0.0504, 0.0216, 0.0081, 0.0027 for <math> x = 0,1,2,3,4,5,6,7,8,9 </math><br>

'''Proposed distribution:''' Use the geometric distribution for <math>g(x)</math>;<br>

<math>g(x)=p(1-p)^{x}</math>, choose <math>p=0.25</math><br>

Look at <math>p_{x}/g(x)</math> for the first few numbers: 0.199, 0.797, 1.59, 2.12, 2.12, 1.70, 1.13, 0.647, 0.324, 0.144 for <math> x = 0,1,2,3,4,5,6,7,8,9 </math><br>

We want <math>c=\max(p_{x}/g(x))</math>, which is approximately 2.12<br>

'''The general procedure to generate <math>p(x)</math> is as follows:'''

1. Generate <math>U_{1} \sim~ U(0,1); U_{2} \sim~ U(0,1)</math><br>

2. <math>j = \lfloor \frac{\ln(U_{1})}{\ln(0.75)} \rfloor+1</math><br>

3. If <math>U_{2} < \frac{p_{j}}{cg(j)}</math>, set <math>X = x_{j}</math>; else go to step 1.

Note: In this case, <math>p_{x}/g(x)</math> is extremely difficult to differentiate, so we were required to test points. If the function is easy to differentiate, we can calculate the max as if it were a continuous function and then check the two surrounding points for the highest discrete value.

* Source: http://www.math.wsu.edu/faculty/genz/416/lect/l04-46.pdf*
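
An earlier revision of these notes included Matlab code for this example; a corrected sketch (our own reconstruction, assuming the Poisson(3) target and Geometric(0.25) proposal above, with values indexed from 0) might look like:

<pre style="font-size:16px">
ii = 1;
while ii <= 1000
    u1 = rand;
    u2 = rand;
    j  = floor(log(u1)/log(0.75));        % j ~ Geometric(0.25) on {0,1,2,...}
    pj = exp(-3) * 3^j / factorial(j);    % Poisson(3) pmf at j
    gj = 0.25 * 0.75^j;                   % geometric pmf at j
    if u2 < pj / (2.12 * gj)
        x(ii) = j;
        ii = ii + 1;
    end
end
hist(x)
</pre>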
  
 
*'''Example 4''' (Hypergeometric & Binomial)<br>
 
<math> F(x) = \int_0^{x} \frac{\lambda^{t} e^{-\lambda y}y^{t-1}}{(t-1)!} \mathrm{d}y, \; \forall x \in (0,+\infty)</math>, where <math>t \in \N^+ \text{ and } \lambda \in (0,+\infty)</math>.<br>

Note that the CDF of the Gamma distribution does not have a closed form.

The gamma distribution is often used to model waiting times between a certain number of events. It can also be expressed as the sum of a finite number of independent and identically distributed exponential random variables. This distribution has two parameters: the number of exponential terms n, and the rate parameter <math>\lambda</math>. In this distribution there is the Gamma function, <math>\Gamma</math>, which has some very useful properties. "Source: STAT 340 Spring 2010 Course Notes" <br/>

Neither the Inverse Transformation nor the Acceptance-Rejection Method can be easily applied to the Gamma distribution.
However, we can use the additive property of the Gamma distribution to generate random variables.
 
If we want to sample from the Gamma distribution, we can consider sampling from <math>t</math> independent exponential distributions using the Inverse Method for each <math> X_i</math> and adding them up. Note that this only works for the specific set of gamma distributions where t is a positive integer.

According to this property, a random variable that follows the Gamma distribution is the sum of i.i.d. (independent and identically distributed) exponential random variables. Now we want to generate 1000 values of <math>Gamma(20,10)</math> random variables, so we need to obtain the value of each one by adding 20 values of <math>X_i \sim~ Exp(10)</math>. To achieve this, we generate a 20-by-1000 matrix whose entries follow <math>Exp(10)</math> and add the rows together.<br />
<math> x_1 \sim~Exp(\lambda)</math><br />
<math>x_2 \sim~Exp(\lambda)</math><br />
...<br />
<math>x_t \sim~Exp(\lambda)</math><br />
<math>x_1+x_2+...+x_t \sim~ Gamma(t,\lambda)</math>

<pre style="font-size:16px">
 
                             all the elements are generated by rand
>>x = (-1/lambda)*log(1-u);      Note: log(1-u) is essentially the same as log(u), since u~U(0,1)
>>xx = sum(x)                    Note: sum(x) will sum all elements in the same column.
                                       size(xx) can help you to verify
>>size(sum(x))                   Note: see the size of x if we forget it
 
</pre>

In the matrix rand(20,1000), this means 20 rows with 1000 entries each.
This code generalizes the sampling to the multidimensional case: we sum independent draws <math>x_i</math> (each <math>x_i</math> independent of the others) down each column of the matrix, and the conclusion is shown by the histogram.
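
For reference, a complete version of this procedure might look like the following (a sketch under the assumptions above; the elided lines generated the uniform matrix u):

<pre style="font-size:16px">
lambda = 10;
u = rand(20,1000);               % 20-by-1000 matrix of U(0,1) draws
x = (-1/lambda)*log(1-u);        % each entry ~ Exp(10) by the inverse method
xx = sum(x);                     % column sums: 1000 draws from Gamma(20,10)
hist(xx)
</pre>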
  
 
=== Other Sampling Method: Box Muller ===

[[File:Unnamed_QQ_Screenshot20130521203625.png‎]]
* From cartesian to polar coordinates <br />
 
   
 
   
 
*Box-Muller Transformation:<br>
It is a transformation that consumes two continuous uniform random variables <math> X \sim U(0,1), Y \sim U(0,1) </math> and outputs a bivariate normal random variable with <math> Z_1\sim N(0,1), Z_2\sim N(0,1). </math><br>
In other words, the Box-Muller method is a method of producing two independent standard normals from two independent uniforms. <br>

*Basic Form:<br>
Let U<sub>1</sub> and U<sub>2</sub> ~ U(0,1) be independent, and let: <br>

1)  <math>Z_0 = R \cos(\Theta) =\sqrt{-2 \ln U_1} \cos(2 \pi U_2)\,</math><br>

2)  <math>Z_1 = R \sin(\Theta) = \sqrt{-2 \ln U_1} \sin(2 \pi U_2)\,</math><br>

where both Z<sub>0</sub> and Z<sub>1</sub>~N(0,1) are independent, with corresponding polar coordinates:<br>
  <math>R^2 = -2\cdot\ln U_1\,</math> <br>
and <br>
  <math>\Theta = 2\pi U_2\,</math> <br>

'''Note:''' <br>
R<sup>2</sup> here has a Chi-Squared distribution with df = 2, since it is just the squared norm of the standard bivariate normal variable (X,Y). For the special case df = 2, the chi-squared distribution is the same as the exponential distribution. Hence, R<sup>2</sup> is simply obtainable by generating the required exponential variate.

Source: https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform

=== '''Matlab''' ===
 
:<math>f(x) = \frac{1}{\sqrt{2\pi}}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} x^2}</math>

*Warning: the General Normal distribution is:

<table>
<tr>
  
 
Let <math> \theta </math> and R denote the polar coordinates of the vector (X, Y),
where <math> X = R \cdot \sin\theta </math> and <math> Y = R \cdot \cos \theta </math>

[[File:rtheta.jpg]]
 
We know that
<math>R^{2}= X^{2}+Y^{2}</math> and <math> \tan(\theta) = \frac{y}{x} </math>, where X and Y are two independent standard normals:
:<math>f(x) = \frac{1}{\sqrt{2\pi}}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} x^2}</math>
:<math>f(y) = \frac{1}{\sqrt{2\pi}}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} y^2}</math>
:<math>f(x,y) = \frac{1}{\sqrt{2\pi}}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} x^2} * \frac{1}{\sqrt{2\pi}}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} y^2}=\frac{1}{2\pi}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} (x^2+y^2)} </math><br /> - since for independent random variables the joint density is the product of the individual densities.

It can also be shown using a 1-1 transformation that the joint distribution of R and <math>\theta</math> is given by the following.

1-1 transformation:<br />
'''Let <math>d=R^2</math>'''<br />

  <math>x= \sqrt {d}\cos \theta </math>
  <math>y= \sqrt {d}\sin \theta </math>

then

<math>\left| J\right| = \left| \dfrac {1} {2}d^{-\frac {1} {2}}\cos \theta d^{\frac{1}{2}}\cos \theta +\sqrt {d}\sin \theta \dfrac {1} {2}d^{-\frac{1}{2}}\sin \theta \right| = \dfrac {1} {2}</math>

It can be shown that the joint density of <math> d = R^2</math> and <math> \theta </math> is:
:<math>\begin{matrix}  f(d,\theta) = \frac{1}{2}e^{-\frac{d}{2}}*\frac{1}{2\pi} \end{matrix},\quad for\quad 0\leq d<\infty\ and\quad 0\leq \theta\leq 2\pi </math>
Line 2,055: Line 2,007:
 
Note that <math> \begin{matrix}f(r,\theta)\end{matrix}</math> consists of two density functions, Exponential and Uniform, so assuming that r and <math>\theta</math> are independent
 
Note that <math> \begin{matrix}f(r,\theta)\end{matrix}</math> consists of two density functions, Exponential and Uniform, so assuming that r and <math>\theta</math> are independent
 
<math> \begin{matrix} \Rightarrow d \sim~ Exp(1/2),  \theta \sim~ Unif[0,2\pi] \end{matrix} </math>
 
<math> \begin{matrix} \Rightarrow d \sim~ Exp(1/2),  \theta \sim~ Unif[0,2\pi] \end{matrix} </math>
::* <math> \begin{align} R^2 = x^2 + y^2 \end{align} </math>
+
::* <math> \begin{align} R^2 = d = x^2 + y^2 \end{align} </math>
 
::* <math> \tan(\theta) = \frac{y}{x} </math>
 
::* <math> \tan(\theta) = \frac{y}{x} </math>
 
<math>\begin{align} f(d) = Exp(1/2)=\frac{1}{2}e^{-\frac{d}{2}}\ \end{align}</math>  
 
<math>\begin{align} f(d) = Exp(1/2)=\frac{1}{2}e^{-\frac{d}{2}}\ \end{align}</math>  
Line 2,061: Line 2,013:
 
<math>\begin{align} f(\theta) =\frac{1}{2\pi}\ \end{align}</math>
 
<math>\begin{align} f(\theta) =\frac{1}{2\pi}\ \end{align}</math>
 
<br>
 
<br>
 +
 
To sample from the normal distribution, we can generate a pair of independent standard normal X and Y by:<br />
 
To sample from the normal distribution, we can generate a pair of independent standard normal X and Y by:<br />
 +
 
1) Generating their polar coordinates<br />
 
1) Generating their polar coordinates<br />
 
2) Transforming back to rectangular (Cartesian) coordinates.<br />
 
2) Transforming back to rectangular (Cartesian) coordinates.<br />
  
Alternative Method of Generating Standard Normal Random Variables 
 
  
Step 1: Generate <math>u</math><sub>1</sub>~<math>Unif(0,1)</math>
+
'''Alternative Method of Generating Standard Normal Random Variables'''<br /> 
Step 2: Generate <math>Y</math><sub>1</sub>~<math>Exp(1),Y</math><sub>2</sub>~<math>Exp(2)</math>
+
 
Step 3: If <math>Y2 \geq(Y-1)^2/2</math>,set <math>V=Y1</math>,otherwise,go to step 1
+
Step 1: Generate <math>u_{1}</math> ~<math>Unif(0,1)</math><br />
Step 4: If <math>u1 \leq 1/2</math>,then <math>X=-V</math>
+
Step 2: Generate <math>Y_{1}</math> ~<math>Exp(1)</math>,<math>Y_{2}</math>~<math>Exp(2)</math><br />
 +
Step 3: If <math>Y_{2} \geq(Y_{1}-1)^2/2</math>,set <math>V=Y1</math>,otherwise,go to step 1<br />
 +
Step 4: If <math>u_{1} \leq 1/2</math>,then <math>X=-V</math><br />
 +
 
 +
===Expectation of a Standard Normal distribution===<br />
 +
 
 +
The expectation of a standard normal distribution is 0<br />
  
==== Expectation of a Standard Normal distribution ====
+
'''Proof:''' <br />
The expectation of a standard normal distribution is 0
 
:Below is the proof:  
 
  
 
:<math>\operatorname{E}[X]= \;\int_{-\infty}^{\infty} x \frac{1}{\sqrt{2\pi}}  e^{-x^2/2} \, dx.</math>
 
:<math>\operatorname{E}[X]= \;\int_{-\infty}^{\infty} x \frac{1}{\sqrt{2\pi}}  e^{-x^2/2} \, dx.</math>
Line 2,083: Line 2,040:
 
:<math>= - \left[\phi(x)\right]_{-\infty}^{\infty}</math>
 
:<math>= - \left[\phi(x)\right]_{-\infty}^{\infty}</math>
 
:<math>= 0</math><br />
 
:<math>= 0</math><br />
More intuitively, because x is an odd function (f(x)+f(-x)=0). Taking integral of x will give <math>x^2/2 </math> which is an even function (f(x)=f(-x)). If support is from negative infinity to infinity, then the integral will return 0.<br />
 
  
* '''Procedure (Box-Muller Transformation Method):''' <br />
+
'''Note,''' more intuitively, because x is an odd function (f(x)+f(-x)=0). Taking integral of x will give <math>x^2/2 </math> which is an even function (f(x)=f(-x)). This is in relation to the symmetrical properties of the standard normal distribution. If support is from negative infinity to infinity, then the integral will return 0.<br />
 +
 
 +
 
 +
'''Procedure (Box-Muller Transformation Method):''' <br />
 +
 
 
Pseudorandom approaches to generating normal random variables used to be limited. Inefficient methods such as inverse Gaussian function, sum of uniform random variables, and acceptance-rejection were used. In 1958, a new method was proposed by George Box and Mervin Muller of Princeton University. This new technique was easy to use and also had the accuracy to the inverse transform sampling method that it grew more valuable as computers became more computationally astute. <br>
 
Pseudorandom approaches to generating normal random variables used to be limited. Inefficient methods such as inverse Gaussian function, sum of uniform random variables, and acceptance-rejection were used. In 1958, a new method was proposed by George Box and Mervin Muller of Princeton University. This new technique was easy to use and also had the accuracy to the inverse transform sampling method that it grew more valuable as computers became more computationally astute. <br>
 
The Box-Muller method takes a sample from a bivariate independent standard normal distribution, each component of which is thus a univariate standard normal. The algorithm is based on the following two properties of the bivariate independent standard normal distribution: <br>
 
The Box-Muller method takes a sample from a bivariate independent standard normal distribution, each component of which is thus a univariate standard normal. The algorithm is based on the following two properties of the bivariate independent standard normal distribution: <br>
 
if <math>Z = (Z_{1}, Z_{2}</math>) has this distribution, then <br>
 
if <math>Z = (Z_{1}, Z_{2}</math>) has this distribution, then <br>
 +
 
1.<math>R^2=Z_{1}^2+Z_{2}^2</math> is exponentially distributed with mean 2, i.e. <br>
 
1.<math>R^2=Z_{1}^2+Z_{2}^2</math> is exponentially distributed with mean 2, i.e. <br>
 
<math>P(R^2 \leq x) = 1-e^{-x/2}</math>. <br>
 
<math>P(R^2 \leq x) = 1-e^{-x/2}</math>. <br>
 
2.Given <math>R^2</math>, the point <math>(Z_{1},Z_{2}</math>) is uniformly distributed on the circle of radius R centered at the origin. <br>
 
2.Given <math>R^2</math>, the point <math>(Z_{1},Z_{2}</math>) is uniformly distributed on the circle of radius R centered at the origin. <br>
 
We can use these properties to build the algorithm: <br>
 
We can use these properties to build the algorithm: <br>
 +
  
 
1) Generate random number <math> \begin{align} U_1,U_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
 
1) Generate random number <math> \begin{align} U_1,U_2 \sim~ \mathrm{Unif}(0, 1) \end{align} </math> <br />
Line 2,111: Line 2,073:
  
  
Note: In steps 2 and 3, we are using a similar technique as that used in the inverse transform method. <br />
+
'''Note:''' In steps 2 and 3, we are using a similar technique as that used in the inverse transform method. <br />
 
The Box-Muller Transformation Method generates a pair of independent Standard Normal distributions, X and Y (Using the transformation of polar coordinates). <br />
 
The Box-Muller Transformation Method generates a pair of independent Standard Normal distributions, X and Y (Using the transformation of polar coordinates). <br />
 +
 
If you want to generate a number of independent standard normal distributed numbers (more than two), you can run the Box-Muller method several times.<br/>
 
If you want to generate a number of independent standard normal distributed numbers (more than two), you can run the Box-Muller method several times.<br/>
 
For example: <br />
 
For example: <br />
Line 2,119: Line 2,082:
  
  
* '''Code'''<br />
+
'''Matlab Code'''<br />
 +
 
 
<pre style="font-size:16px">
 
<pre style="font-size:16px">
 
>>close all
 
>>close all
Line 2,134: Line 2,098:
 
>>hist(y)
 
>>hist(y)
 
</pre>
 
</pre>
 +
'''Remember''': For the above code to work, the "." needs to be after the d to ensure that each element of d is raised to the power of 0.5. Otherwise Matlab will raise the entire matrix to the power of 0.5.<br>

'''Note:'''<br>the first graph is hist(tet) and it is a uniform distribution.<br>The second one is hist(d) and it is an exponential distribution.<br>The third one is hist(x) and it is a normal distribution.<br>The last one is hist(y) and it is also a normal distribution.

Attention: There is a "dot" between sqrt(d) and "*". It is because d and tet are vectors. <br>
 
>>hist(x)
>>hist(x+2)
>>hist(x*2+2)
</pre>
'''Note:'''<br>
1. randn gives a random sample from a standard normal distribution.<br />
2. hist(x+2) will be centered at 2 instead of at 0. <br />
3. hist(x*3+2) is also centered at 2. The mean doesn't change, but the variance of x*3+2 becomes nine times (3^2) the variance of x.<br />
 
[[File:Normal_x.jpg|300x300px]][[File:Normal_x+2.jpg|300x300px]][[File:Normal(2x+2).jpg|300px]]
 
<br />

<b>Comment</b>:<br />
Box-Muller transformations are not computationally efficient. The reason for this is the need to compute sine and cosine functions. A way to get around this time-consuming difficulty is an indirect computation of the sine and cosine of a random angle (as opposed to a direct computation which generates U and then computes the sine and cosine of 2πU). <br />
  
 
'''Alternative Methods of generating normal distribution'''<br />

1. Even though we cannot use the inverse transform method, we can approximate this inverse using different functions. One method would be '''rational approximation'''.<br />
2. '''Central limit theorem''': If we sum 12 independent U(0,1) random variables and subtract 6 (which is E(u<sub>i</sub>)*12), we will approximately get a standard normal distribution.<br />
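
A quick Matlab sketch of the central limit theorem approach (our own illustration, not from the lecture):

<pre style="font-size:16px">
u = rand(12,1000);     % 12-by-1000 matrix of U(0,1) draws
z = sum(u) - 6;        % each column sum, centered: approximately N(0,1)
hist(z)
</pre>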
 
=== Proof of Box Muller Transformation ===
  
'''Definition:'''<br />
A transformation which transforms from a '''two-dimensional continuous uniform''' distribution to a '''two-dimensional bivariate normal''' distribution (or complex normal distribution).

Let U<sub>1</sub> and U<sub>2</sub> be independent uniform (0,1) random variables. Then
<math>X_{1} = \sqrt{-2\ln U_{1}}\,\cos(2\pi U_{2})</math>

<math>X_{2} = \sqrt{-2\ln U_{1}}\,\sin(2\pi U_{2})</math>

are '''independent''' N(0,1) random variables.
  

       u<sub>2</sub> = g<sub>2</sub><sup>-1</sup>(x<sub>1</sub>,x<sub>2</sub>)

Inverting the above transformation, we have

     u<sub>1</sub> = exp{-(x<sub>1</sub><sup>2</sup> + x<sub>2</sub><sup>2</sup>)/2}
     u<sub>2</sub> = (1/(2&pi;)) tan<sup>-1</sup>(x<sub>2</sub>/x<sub>1</sub>)
 
   f(x<sub>1</sub>,x<sub>2</sub>) = exp{-(x<sub>1</sub><sup>2</sup>+x<sub>2</sub><sup>2</sup>)/2} / (2&pi;)

which factors into two standard normal pdfs.

(The quote is from http://mathworld.wolfram.com/Box-MullerTransformation.html)
(The proof is from http://www.math.nyu.edu/faculty/goodman/teaching/MonteCarlo2005/notes/GaussianSampling.pdf)

=== General Normal distributions ===
 
where <math> \mu </math> is the mean or expectation of the distribution and <math> \sigma </math> is the standard deviation. <br />

The probability density must be scaled by 1/&sigma; so that the integral is still 1. (Acknowledgement: https://en.wikipedia.org/wiki/Normal_distribution)
The special case of the normal distribution is the standard normal distribution, for which the variance is 1 and the mean is zero. If X is a general normal deviate, then <math> Z=\dfrac{X - \mu}{\sigma} </math> will have a standard normal distribution.

If Z ~ N(0,1), and we want <math>X \sim~ N(\mu, \sigma^2)</math>, then <math>X = \mu + \sigma Z</math>, since <math>E(X) = \mu +\sigma \cdot 0 = \mu </math> and <math>Var(X) = 0 +\sigma^2 \cdot 1 = \sigma^2</math>.
 
   
 
   
 
The Bernoulli distribution is a special case of binomial distribution, where the variate x only has two outcomes; so that the Bernoulli also can use the probability density function of the binomial distribution with the variate x taking values 0 and 1.
 
The Bernoulli distribution is a special case of binomial distribution, where the variate x only has two outcomes; so that the Bernoulli also can use the probability density function of the binomial distribution with the variate x taking values 0 and 1.
 +
 +
The most famous example for the Bernoulli Distribution would be the "Flip Coin" question, which has only two possible outcomes(Success or Failure) with the same probabilities of 0.5
  
 
Let x1,x2 denote the lifetime of 2 independent particles, x1~exp(<math>\lambda</math>), x2~exp(<math>\lambda</math>)
 
Let x1,x2 denote the lifetime of 2 independent particles, x1~exp(<math>\lambda</math>), x2~exp(<math>\lambda</math>)
Line 2,417: Line 2,393:
 
for k=1:5000
 
for k=1:5000
 
     i = 1;
 
     i = 1;
     while (i <= n)
+
     for i=1:n
 
         u=rand();
 
         u=rand();
 
         if (u <= p)
 
         if (u <= p)
Line 2,424: Line 2,400:
 
             y(i) = 0;
 
             y(i) = 0;
 
         end
 
         end
        i = i + 1;
 
 
     end
 
     end
  
Line 2,436: Line 2,411:
  
  
</pre>
+
 
 
</div>
 
</div>
 
Note: We can also regard the Bernoulli Distribution as either a conditional distribution or <math>f(x)= p^{x}(1-p)^{(1-x)}</math>, x=0,1.
 
Note: We can also regard the Bernoulli Distribution as either a conditional distribution or <math>f(x)= p^{x}(1-p)^{(1-x)}</math>, x=0,1.
Line 2,453: Line 2,428:
 
===Universality of the Uniform Distribution/Inverse Method===
 
===Universality of the Uniform Distribution/Inverse Method===
  
The inverse method is universal in the sense that we can potentially sample from any distribution where we can find the inverse of the cumulative distribution function. However, this is not always the case as some functions do not have an inverse while others may be difficult to find. As such, there exist different procedures such as Acceptance-Rejection which is outlined in a further lecture.
+
The inverse method is universal in the sense that we can potentially sample from any distribution where we can find the inverse of the cumulative distribution function.
  
 
Procedure:
 
Procedure:
  
1.Generate U~Unif [0, 1)<br>
+
1) Generate U~Unif (0, 1)<br>
2.set <math>x=F^{-1}(u)</math><br>
+
2) Set <math>x=F^{-1}(u)</math><br>
3.X~f(x)<br>
+
3) X~f(x)<br>
  
 
'''Remark'''<br>
 
'''Remark'''<br>
1. The preceding can be written algorithmically as
+
1) The preceding can be written algorithmically for discrete random variables as <br>
Generate a random number U.
+
Generate a random number U ~ U(0,1] <br>
If U<<sub>p0</sub> set X=<sub>x0</sub> and stop
+
If U < p<sub>0</sub> set X = x<sub>0</sub> and stop <br>
If U<<sub>p0</sub>+<sub>p1</sub> set X=x1 and stop
+
If U < p<sub>0</sub> + p<sub>1</sub> set X = x<sub>1</sub> and stop <br>
...
+
... <br>
2. If the <sub>xi</sub>, i>=0, are ordered so that <sub>x0</sub><<sub>x1</sub><<sub>x2</sub><... and if we let F denote the distribution function of X, then <math>F(<sub>xk</sub>=<sub>/sum/pi</sub>)</math> and so X will equal <sub>xj</sub> if F(<sub>x(j-1)</sub>)<=U<F(<sub>xj</sub>)
+
2) If the x<sub>i</sub>, i>=0, are ordered so that x<sub>0</sub> < x<sub>1</sub> < x<sub>2</sub> <... and if we let F denote the distribution function of X, then X will equal x<sub>j</sub> if F(x<sub>j-1</sub>) <= U < F(x<sub>j</sub>)
  
 
'''Example 1'''<br>
 
'''Example 1'''<br>
Line 2,494: Line 2,469:
  
 
Step1: Generate U~ U(0, 1)<br>
 
Step1: Generate U~ U(0, 1)<br>
Step2: set <math>y=\, {-\frac {1}{{\lambda_1 +\lambda_2}}} ln(u)</math><br>
+
 
 +
Step2: set <math>y=\, {-\frac {1}{{\lambda_1 +\lambda_2}}} ln(1-u)</math><br>
 +
 
 +
    or set <math>y=\, {-\frac {1} {{\lambda_1 +\lambda_2}}} ln(u)</math><br>
 +
Since it is a uniform distribution, therefore after generate a lot of times 1-u and u are the same.
 +
 
 +
 
 +
* '''Matlab Code'''<br />
 +
<pre style="font-size:16px">
 +
>> lambda1 = 1;
 +
>> lambda2 = 2;
 +
>> u = rand;
 +
>> y = -log(u)/(lambda1 + lambda2)
 +
</pre>
  
 
If we generalize this example from two independent particles to n independent particles we will have:<br>
 
If we generalize this example from two independent particles to n independent particles we will have:<br>
Line 2,530: Line 2,518:
 
'''Solution:'''<br>
 
'''Solution:'''<br>
 
<br>
 
<br>
1. generate u ~ Unif[0, 1)<br>
+
1. Generate <math>U ~\sim~ Unif[0, 1)</math><br>
2. Set x = U<sup>1/n</sup><br>
+
2. Set <math>X = U^{1/n}</math><br>
 
<br>
 
<br>
For example, when n = 20,<br>
+
For example, when <math>n = 20</math>,<br>
u = 0.6 => x = u<sub>1/20</sub> = 0.974<br>
+
<math>U = 0.6</math> => <math>X = U^{1/20} = 0.974</math><br>
u = 0.5 => x = u<sub>1/20</sub> = 0.966<br>
+
<math>U = 0.5 =></math> <math>X = U^{1/20} = 0.966</math><br>
u = 0.2 => x = u<sub>1/20</sub> = 0.923<br>
+
<math>U = 0.2</math> => <math>X = U^{1/20} = 0.923</math><br>
 
<br>
 
<br>
Observe from above that the values of X for n = 20 are close to 1, this is because we can view X<sup>n</sup> as the maximum of n independent random variables X, X~Unif(0,1) and is much likely to be close to 1 as n increases. This is because when n is large the exponent tends towards 0. This observation is the motivation for method 2 below.<br>
+
Observe from above that the values of X for n = 20 are close to 1, this is because we can view <math>X^n</math> as the maximum of n independent random variables <math>X,</math> <math>X~\sim~Unif(0,1)</math> and is much likely to be close to 1 as n increases. This is because when n is large the exponent tends towards 0. This observation is the motivation for method 2 below.<br>
  
 
Recall that
 
Recall that
Line 2,589: Line 2,577:
 
The general algorithm to generate random variables from a composition CDF is:
 
The general algorithm to generate random variables from a composition CDF is:
  
1)  Generate U, V ~ <math>U(0,1)</math>
+
1)  Generate U,V ~ <math> Unif(0,1)</math>
  
2)  If u < p<sub>1</sub>, v=<math>F_{X_{1}}(x)</math><sup>-1</sup>
+
2)  If U < p<sub>1</sub>, V = <math>F_{X_{1}}(x)</math><sup>-1</sup>
  
3)  Else if u < p<sub>1</sub>+p<sub>2</sub>, v=<math>F_{X_{2}}(x)</math><sup>-1</sup>
+
3)  Else if U < p<sub>1</sub> + p<sub>2</sub>, V = <math>F_{X_{2}}(x)</math><sup>-1</sup>
  
4)  ....
+
4)  Repeat from Step 1 (if N randomly generated variables needed, repeat N times)
  
 
<b>Explanation</b><br>
 
<b>Explanation</b><br>
Line 2,650: Line 2,638:
 
=== Example of Decomposition Method ===
 
=== Example of Decomposition Method ===
  
F<sub>x</sub>(x) = 1/3*x+1/3*x<sup>2</sup>+1/3*x<sup>3</sup>, 0<= x<=1
+
<math>F_x(x) = \frac {1}{3} x+\frac {1}{3} x^2+\frac {1}{3} x^3, 0\leq x\leq 1</math>
  
let U =F<sub>x</sub>(x) = 1/3*x+1/3*x<sup>2</sup>+1/3*x<sup>3</sup>, solve for x.
+
Let <math>U =F_x(x) = \frac {1}{3} x+\frac {1}{3} x^2+\frac {1}{3} x^3</math>, solve for x.
  
P<sub>1</sub>=1/3, F<sub>x1</sub>(x)= x, P<sub>2</sub>=1/3,F<sub>x2</sub>(x)= x<sup>2</sup>,  
+
<math>P_1=\frac{1}{3}, F_{x1} (x)= x, P_2=\frac{1}{3},F_{x2} (x)= x^2,  
P<sub>3</sub>=1/3,F<sub>x3</sub>(x)= x<sup>3</sup>
+
P_3=\frac{1}{3},F_{x3} (x)= x^3</math>
  
 
'''Algorithm:'''
 
'''Algorithm:'''
  
Generate U ~ Unif [0,1)
+
Generate <math>\,U \sim Unif [0,1)</math>
  
Generate V~ Unif [0,1)
+
Generate <math>\,V \sim  Unif [0,1)</math>
  
if 0<u<1/3, x = v
+
if <math>0\leq u \leq \frac{1}{3}, x = v</math>
  
else if <math>u<\frac{2}{3}, x = v<sup>\frac{1}{2}</sup></math>
+
else if <math>u \leq \frac{2}{3}, x = v^{\frac{1}{2}}</math>
  
else <math>x = v<sup>\frac{1}{3}</sup></math><br>
+
else <math>x=v^{\frac{1}{3}}</math> <br>
  
  
 
'''Matlab Code:'''  
 
'''Matlab Code:'''  
 
<pre style="font-size:16px">
 
<pre style="font-size:16px">
u=rand
+
u=rand # U is
 
v=rand
 
v=rand
if <math>u<\frac{1}{3}</math>
+
if u<1/3
 
x=v
 
x=v
elseif <math>u<\frac{2}{3}</math>
+
elseif u<2/3
 
x=sqrt(v)
 
x=sqrt(v)
 
else
 
else
Line 2,739: Line 2,727:
 
(Basis of the Accept-Reject algorithm)
 
(Basis of the Accept-Reject algorithm)
  
The advantage of this method is that we can sample a unknown distribution from a easy distribution. The disadvantage of this method is that it may need to reject many points, which is inefficient.
+
The advantage of this method is that we can sample a unknown distribution from a easy distribution. The disadvantage of this method is that it may need to reject many points, which is inefficient.<br />
 +
Inverse each part of partial CDF, the partial CDF is divided by the original CDF, partial range is uniform distribution.<br />
 +
More specific definition of the theorem can be found here.<ref>http://www.bus.emory.edu/breno/teaching/MCMC_GibbsHandouts.pdf</ref>
  
inverse each part of partial CDF, the partial CDF is divided by the original CDF, partial range is uniform distribution.
+
Matlab code:
  
Suppose we want to sample from f(x), we can write f(x)=<math>\int _{0}^{f(x)}du</math>.
+
<pre style="font-size:16px">
Thus, f(x) can be thought as the marginal distribution of (X, U) ∼ U{(x, u):0 <u<f(x)}.
+
close all
 
+
clear all
Theorem: Simulating X ∼ f(x)is equivalent to simulating (X, U) ∼ U{(x, u):0 <u<f(x)}.
+
ii=1;
 +
while ii<1000
 +
u=rand
 +
y=R*(2*U-1)
 +
if (1-U^2)>=(2*u-1)^2
 +
x(ii)=y;
 +
ii=ii+1
 +
end
 +
</pre>
  
 
===Question 2===
 
===Question 2===
Line 2,765: Line 2,763:
  
 
The beta distribution maximized at 0.5 with value <math>(1/4)^n</math>.
 
The beta distribution maximized at 0.5 with value <math>(1/4)^n</math>.
So, <math>c=b*(1/4)^n</math>
+
So, <math>c=b*(1/4)^n</math><br />
Algorithm:
+
Algorithm: <br />
1.Draw <math>U_1</math> from <math>U(0, 1)</math>.<math> U_2</math> from <math>U(0, 1)<math>
+
1.Draw <math>U_1</math> from <math>U(0, 1)</math>. <math> U_2</math> from <math>U(0, 1)</math> <br />
2.If <math>U_2<=b*(U_1)^n*(1-(U_1))^n/b*(1/4)^n=(4*(U_1)*(1-(U_1)))^n</math>
+
2.If <math>U_2<=b*(U_1)^n*(1-(U_1))^n/b*(1/4)^n=(4*(U_1)*(1-(U_1)))^n</math><br />
 
   then X=U_1
 
   then X=U_1
 
   Else return to step 1.
 
   Else return to step 1.
Line 2,785: Line 2,783:
 
==Class 8 - Thursday, May 30, 2013==
 
==Class 8 - Thursday, May 30, 2013==
  
In this lecture, we will discuss algorithms to generate 3 well-known distributions: Binomial, Geometric, and Poisson. For each of these distributions, we will first state its general understanding, probability mass function, expectation, and variance. Then, we will derive one or more algorithms to sample from each of these distributions, and implement the algorithms utilizing Matlab. <br \>
+
In this lecture, we will discuss algorithms to generate 3 well-known distributions: Binomial, Geometric and Poisson. For each of these distributions, we will first state its general understanding, probability mass function, expectation and variance. Then, we will derive one or more algorithms to sample from each of these distributions, and implement the algorithms on Matlab. <br \>
  
 
===The Bernoulli distribution===
 
===The Bernoulli distribution===
  
The Bernoulli distribution is a special case of the binomial distribution, where n = 1. X ~ Bin(1, p) has the same meaning as X ~ Ber(p), where p is the probability if the event success, otherwise the probability is 1-p (we usually define a variate q, q= 1-p). The mean of Bernoulli is p, variance is p(1-p). Bin(n, p), is the distribution of the sum of n independent Bernoulli trials - Bernoulli(p), each with the same probability p, where 0<p<1. <br>
+
The Bernoulli distribution is a special case of the binomial distribution, where n = 1. X ~ Bin(1, p) has the same meaning as X ~ Ber(p), where p is the probability of success and 1-p is the probability of failure (we usually define a variate q, q= 1-p). The mean of Bernoulli is p and the variance is p(1-p). Bin(n, p), is the distribution of the sum of n independent Bernoulli trials, Bernoulli(p), each with the same probability p, where 0<p<1. <br>
 
For example, let X be the event that a coin toss results in a "head" with probability ''p'', then ''X~Bernoulli(p)''. <br>
 
For example, let X be the event that a coin toss results in a "head" with probability ''p'', then ''X~Bernoulli(p)''. <br>
P(X=1)=p,P(X=0)=1-p, P(x=0)+P(x=1)=p+q=1
+
P(X=1)= p
 +
P(X=0)= q = 1-p
 +
Therefore, P(X=0) + P(X=1) = p + q = 1
  
 
'''Algorithm: '''
 
'''Algorithm: '''
  
1. Generate u~Unif(0,1) <br>
+
1) Generate <math>u\sim~Unif(0,1)</math> <br>
2. If u p, then x = 1 <br>
+
2) If <math>u \leq p</math>, then <math>x = 1 </math><br>
else x = 0 <br>
+
else <math>x = 0</math> <br>
 
The answer is: <br>
 
The answer is: <br>
when U≤p, x=1 <br>
+
when <math> U \leq p, x=1</math> <br>
when U>p, x=0<br>
+
when <math>U \geq p, x=0</math><br>
3.Repeat as necessary
+
3) Repeat as necessary
  
'''Code'''<br>
+
* '''Matlab Code'''<br />
 
<pre style="font-size:16px">
 
<pre style="font-size:16px">
i = 1;
+
>> p = 0.8    % an arbitrary probability for example
 +
>> for i = 1: 100
 +
>>  u = rand;
 +
>>  if u < p
 +
>>      x(ii) = 1;
 +
>>  else
 +
>>      x(ii) = 0;
 +
>>  end
 +
>> end
 +
>> hist(x)
 +
</pre>
 +
 
 +
===The Binomial Distribution===
  
while (i <=1000)
+
In general, if the random variable X follows the binomial distribution with parameters n and p, we write X ~ Bin(n, p).
    u =rand();
+
(Acknowledge: https://en.wikipedia.org/wiki/Binomial_distribution)
    p = 0.1;
+
If X ~ B(n, p), then its pmf is of form:
    if (u <= p)
 
        x(i) = 1;
 
    else
 
        x(i) = 0;
 
    end
 
    i = i + 1;
 
end
 
 
 
hist(x)
 
</pre>
 
 
 
[[File:Bernoulli.jpg|300px]]
 
 
 
===The Binomial Distribution===
 
 
 
If X~Bin(n,p), then its pmf is of form:
 
  
 
f(x)=(nCx) p<sup>x</sup>(1-p)<sup>(n-x)</sup>, x=0,1,...n<br />
 
f(x)=(nCx) p<sup>x</sup>(1-p)<sup>(n-x)</sup>, x=0,1,...n<br />
 
Or f(x) = <math>(n!/x!(n-x)!)</math> p<sup>x</sup>(1-p)<sup>(n-x)</sup>, x=0,1,...n <br />
 
Or f(x) = <math>(n!/x!(n-x)!)</math> p<sup>x</sup>(1-p)<sup>(n-x)</sup>, x=0,1,...n <br />
  
Mean (x) = E(x) = <math> np </math><br/>
+
Mean (x) = E(x) = <math> np </math>
 
Variance = <math> np(1-p) </math><br/>
 
Variance = <math> np(1-p) </math><br/>
  
 
Generate n uniform random number <math>U_1,...,U_n</math> and let X be the number of <math>U_i</math> that are less than or equal to p.
 
Generate n uniform random number <math>U_1,...,U_n</math> and let X be the number of <math>U_i</math> that are less than or equal to p.
The logic behind this algorithm is that the Binomial Distribution is simply a summation of '''n''' Bernoulli trials, each with probability of success '''p'''. Thus, we can say equivalently that sampling from a Bin (n, p) distribution is the same as sampling from '''n''' Bernoulli trials. In the example below, we are sampling 1000 realizations from 20 Bernoulli random variables. By summing up the rows of the 20 by 1000 matrix that is produced, we are summing up the 20 Bernoulli outcomes to produce one binomial sampling. We have 1000 rows, which means we have realizations from 1000 binomial random variables when this sum is done (the output of the sum is a 1 by 1000 sized vector).<br />
+
The logic behind this algorithm is that the Binomial Distribution is simply a Bernoulli Trial, with a probability of success of p, repeated n times. Thus, we can sample from the distribution by sampling from n Bernoulli. The sum of these n bernoulli trials will represent one binomial sampling. Thus, in the below example, we are sampling 1000 realizations from 20 Bernoulli random variables. By summing up the rows of the 20 by 1000 matrix that is produced, we are summing up the 20 bernoulli outcomes to produce one binomial sampling. We have 1000 rows, which means we have realizations from 1000 binomial random variables when this sum is done (the output of the sum is a 1 by 1000 sized vector).<br />
 
 
 
To continue with the previous example, let X be the number of heads in a series of ''n'' independent coin tosses - where for each toss, the probability of coming up with a head is ''p'' - then ''X~Bin(n, p)''. <br />
 
To continue with the previous example, let X be the number of heads in a series of ''n'' independent coin tosses - where for each toss, the probability of coming up with a head is ''p'' - then ''X~Bin(n, p)''. <br />
 
+
MATLAB tips: to get a pdf f(x), we can use code binornd(N,P). N means number of trials and p is the probability of success. a=[2 3 4],if set a<3, will produce a=[1 0 0]. If you set "a == 3", it will produce [0 1 0]. If a=[2 6 9 10], if set a<4, will produce a=[1 0 0 0], because only the first element (2) is less than 4, meanwhile the rest are greater. So we can use this to get the number which is less than p.<br />
MATLAB tips: to get the pdf f(x), we can use code binornd(N,P). N represents the number of trials and P is the probability of success. a=[2 3 4],if set a<3, will produce a=[1 0 0]. If you set "a == 3", it will produce [0 1 0]. If a=[2 6 9 10], if set a<4, will produce a=[1 0 0 0], because only the first element (2) is less than 4, meanwhile the rest are greater. So we can use this to get the number which is less than p.<br />
 
  
 
Algorithm for Bernoulli is given as above
 
Algorithm for Bernoulli is given as above
Line 2,848: Line 2,842:
 
ans= 1 0 0
 
ans= 1 0 0
  
>>rand(20,1000)   # if we want to generate 20 times; retry the trail 20 times.
+
>>rand(20,1000)
 
>>rand(20,1000)<0.4
 
>>rand(20,1000)<0.4
 
>>A = sum(rand(20,1000)<0.4)  #sum of raws ~ Bin(20 , 0.3)
 
>>A = sum(rand(20,1000)<0.4)  #sum of raws ~ Bin(20 , 0.3)
Line 2,863: Line 2,857:
  
 
remark: a=[2 3 4],if set a<3, will produce a=[1 0 0]. If you set "a == 3", it will produce [0 1 0].
 
remark: a=[2 3 4],if set a<3, will produce a=[1 0 0]. If you set "a == 3", it will produce [0 1 0].
 +
using code to find some value what i want to get from the matrix. It`s useful to define some matrixs.
  
 
Relation between Bernoulli Distribution and Binomial Distribution:
 
Relation between Bernoulli Distribution and Binomial Distribution:
Line 2,868: Line 2,863:
  
 
===The Geometric Distribution===
 
===The Geometric Distribution===
Geometric distribution is a discrete distribution. There are two types of geometric distributions.
+
Geometric distribution is a discrete distribution. There are two types geometric distributions, the first one is the probability distribution of the number of X Bernoulli fail trials, with probability 1-p, needed until the first success situation happened, X come from the set { 1, 2, 3, ...}; the other one is the probability distribution of the number Y = X 1 of failures, with probability 1-p, before the first success, Y comes from the set { 0, 1, 2, 3, ... }.
1. Generate the probability distribution of X Bernoulli fail trials (probability '''1-p'''), until the first success (probability '''p'''). X belongs in the set {1, 2, 3, ...}.
 
2. Generate probability distribution of Y = X - 1 failures (probability '''1-p''') before the first success (probability '''p'''). Y belongs in the set {0, 1, 2, 3, ...}.
 
  
 
For example,<br />
 
For example,<br />
 
If the success event showed at the first time, which x=1, then f(x)=p.<br />
 
If the success event showed at the first time, which x=1, then f(x)=p.<br />
If the success event showed at the second time and failed at the first time, which x=2, then f(x)=p(1-p).<br />
+
If the success event showed at the second time and failed at the first time, which x = 2, then f(x)= p(1-p).<br />
If the success event showed at the third time and failed at the first and second time, which x=3, then f(x)= p(1-p)<sup>(k-1)</sup>. etc.<br />
+
If the success event showed at the third time and failed at the first and second time, which x = 3, then f(x)= p(1-p)<sup>2 </sup>. etc.<br />
If the success event showed at the x time and all failed before time x, which x=x, then f(x)= p(1-p)<sup>(k-1)</sup><br />
+
If the success event showed at the k time and all failed before time k, which implies x = k, then f(k)= p(1-p)<sup>(k-1)</sup><br />
 
which is,<br />
 
which is,<br />
 
x    Pr<br />
 
x    Pr<br />
Line 2,885: Line 2,878:
 
.    .<br />
 
.    .<br />
 
.    .<br />
 
.    .<br />
n    P(1-P)<sup>(n-1)</sup>(number of x-1 failures)<br />
+
n    P(1-P)<sup>(n-1)</sup><br />
 +
Also,  the sequence of the outputs of the probability is a geometric sequence.
 +
 
 
For example, suppose a die is thrown repeatedly until the first time a "6" appears. This is a question of geometric distribution of the number of times on the set { 1, 2, 3, ... } with p = 1/6.
 
For example, suppose a die is thrown repeatedly until the first time a "6" appears. This is a question of geometric distribution of the number of times on the set { 1, 2, 3, ... } with p = 1/6.
  
Line 2,901: Line 2,896:
  
 
The CDF : P(X<n) = 1 - <math>(1-p)^n</math>
 
The CDF : P(X<n) = 1 - <math>(1-p)^n</math>
 
Memorylessness properties : P(X>m+n|X>=m)=P(X>n)
 
  
  
Line 2,927: Line 2,920:
  
  
If Y~Exp(<math>\lambda</math>) then X=floor(Y)+1 is geometric.<br />
+
If Y~Exp(<math>\lambda</math>) then <math>X=\left \lfloor Y \right \rfloor+1</math> is geometric.<br />
 
Choose e^(-<math>\lambda</math>)=1-p. Then X ~ geo (p) <br />
 
Choose e^(-<math>\lambda</math>)=1-p. Then X ~ geo (p) <br />
  
 
P (X > x) = (1-p)<sup>x</sup>(because first x trials are not successful) <br/>
 
P (X > x) = (1-p)<sup>x</sup>(because first x trials are not successful) <br/>
 +
 +
NB: An advantage of using this method is that nothing is rejected. We accept all the points, and the method is more efficient. Also, this method is closer to the inverse transform method as nothing is being rejected. <br />
  
 
'''Proof''' <br/>
 
'''Proof''' <br/>
  
P(X>x) = P( floor(Y) + 1 > X) = P(floor (Y) > x- 1) = P(Y>= x) = e<sup>(-<math>\lambda</math> * x)</sup> <br>
+
<math>P(X>x) = P( \left \lfloor Y \right \rfloor + 1 > X) = P(\left \lfloor Y \right \rfloor > x- 1) = P(Y>= x) = e^{-\lambda × x} </math> <br>
  
 
SInce p = 1- e<sup>-<math>\lambda</math></sup> or <math>\lambda</math>= <math>-log(1-p)</math>(compare the pdf of exponential distribution and Geometric distribution,we can look at e<sup>-<math>\lambda</math></sup> the probability of the fail trial), then <br>
 
SInce p = 1- e<sup>-<math>\lambda</math></sup> or <math>\lambda</math>= <math>-log(1-p)</math>(compare the pdf of exponential distribution and Geometric distribution,we can look at e<sup>-<math>\lambda</math></sup> the probability of the fail trial), then <br>
Line 3,026: Line 3,021:
 
We have X ~Geo(1/6), <math>f(x)=(1/6)*(5/6)^{x-1}</math>, x=1,2,3....  
 
We have X ~Geo(1/6), <math>f(x)=(1/6)*(5/6)^{x-1}</math>, x=1,2,3....  
  
Now, let <math>Y=e^{\lambda}</math> => x=floor(Y) +1  
+
Now, let <math>\left \lfloor Y \right \rfloor=e^{\lambda}</math> => x=floor(Y) +1  
  
 
Let <math>e^{-\lambda}=5/6</math>  
 
Let <math>e^{-\lambda}=5/6</math>  
Line 3,038: Line 3,033:
 
1) Let Y be <math>e^{\lambda}</math>, exponentially distributed  
 
1) Let Y be <math>e^{\lambda}</math>, exponentially distributed  
  
2) Set X= floor(Y)+1, to generate X  
+
2) Set <math>X= \left \lfloor Y \right \rfloor +1 </math>, to generate X  
  
 
<math> E[x]=6, Var[X]=5/6 /(1/6^2) = 30 </math>
 
<math> E[x]=6, Var[X]=5/6 /(1/6^2) = 30 </math>
Line 3,059: Line 3,054:
 
If <math>\displaystyle X \sim \text{Poi}(\lambda)</math>, its pdf is of the form <math>\displaystyle \, f(x) = \frac{e^{-\lambda}\lambda^x}{x!}</math> , where <math>\displaystyle \lambda </math> is the rate parameter.<br />
 
If <math>\displaystyle X \sim \text{Poi}(\lambda)</math>, its pdf is of the form <math>\displaystyle \, f(x) = \frac{e^{-\lambda}\lambda^x}{x!}</math> , where <math>\displaystyle \lambda </math> is the rate parameter.<br />
  
Definition: In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume. (from Wikipedia) In short, the Poisson distribution measures the number of occurrences at a particular time interval, given the rate of occurrences per unit time.
+
definition:In probability theory and statistics, the Poisson distribution (pronounced [pwasɔ̃]) is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.
 +
For instance, suppose someone typically gets 4 pieces of mail per day on average. There will be, however, a certain spread: sometimes a little more, sometimes a little less, once in a while nothing at all.[2] Given only the average rate, for a certain period of observation (pieces of mail per day, phonecalls per hour, etc.), and assuming that the process, or mix of processes, that produces the event flow is essentially random, the Poisson distribution specifies how likely it is that the count will be 3, or 5, or 10, or any other number, during one period of observation. That is, it predicts the degree of spread around a known average rate of occurrence.
 +
The Derivation of the Poisson distribution section shows the relation with a formal definition.(from Wikipedia)
  
 
Understanding of Poisson distribution:
 
Understanding of Poisson distribution:
  
If customers '''independently''' come to bank over time, all of the following exponential distributions with rate <math>\lambda</math> per unit of time, then  
+
If customers '''independently''' come to bank over time, all following exponential distributions with rate <math>\lambda</math> per unit of time, then  
 
X(t) = # of customer in [0,t] ~ Poi<math>(\lambda t)</math>
 
X(t) = # of customer in [0,t] ~ Poi<math>(\lambda t)</math>
  
Line 3,142: Line 3,139:
  
 
=== Beta Distribution ===
 
=== Beta Distribution ===
The beta distribution is a continuous probability distribution. There are two positive shape parameters (i.e. greater than zero) in this distribution, defined as '''α''' and '''β'''. X falls within the interval [0,1]. The parameter '''α''' is used as exponents of the random variable. The parameter '''β''' is used to control the shape of the this distribution. We use the beta distribution to build the model of the behavior of random variables, which are limited to intervals of finite length. For example, we can use the beta distribution to analyze the time allocation of sunshine data and variability of soil properties.
+
The beta distribution is a continuous probability distribution. <br>
 +
PDF:<math>\displaystyle \text{ } f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1} </math><br>  where <math>0 \leq x \leq 1</math> and <math>\alpha</math>>0, <math>\beta</math>>0<br/>
 +
<div style = "align:left; background:#F5F5DC; font-size: 120%">
 +
Definition:
 +
In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution.<br/.>
 +
More can be find in the link: <ref>http://en.wikipedia.org/wiki/Beta_distribution</ref>
 +
</div>
 +
 
 +
There are two positive shape parameters in this distribution defined as alpha and beta: <br>
 +
-Both parameters are greater than 0, and X is within the interval [0,1]. <br>
 +
-Alpha is used as exponents of the random variable. <br>
 +
-Beta is used to control the shape of the this distribution. We use the beta distribution to build the model of the behavior of random variables, which are limited to intervals of finite length. <br>
 +
-For example, we can use the beta distribution to analyze the time allocation of sunshine data and variability of soil properties. <br>
  
 
If X~Beta(<math>\alpha, \beta</math>) then its p.d.f. is of the form
 
If X~Beta(<math>\alpha, \beta</math>) then its p.d.f. is of the form
Line 3,150: Line 3,159:
 
<math>f(x;\alpha,\beta)= 0 </math> otherwise
 
<math>f(x;\alpha,\beta)= 0 </math> otherwise
 
Note: <math>\Gamma(\alpha)=(\alpha-1)! </math> if <math>\alpha</math> is a positive integer.
 
Note: <math>\Gamma(\alpha)=(\alpha-1)! </math> if <math>\alpha</math> is a positive integer.
However, several other authors, including W. Feller choose to exclude the ends x = 0 and x = 1, (such that the two ends are not actually part of the density function) and consider instead 0 < x < 1.
 
Another notation for beta-distributed random variables is X~Be(<math>\alpha, \beta</math>).
 
  
 +
Note: Gamma Function Properties
 +
 +
If <math>\alpha=\frac{1}{2} ,
 +
 +
\Gamma(\frac {1}{2})=\sqrt\pi </math>
  
 
The mean of the beta distribution is <math>\frac{\alpha}{\alpha + \beta}</math>. The variance is <math>\frac{\alpha\beta}{(\alpha+\beta)^2 (\alpha + \beta + 1)}</math>
 
The mean of the beta distribution is <math>\frac{\alpha}{\alpha + \beta}</math>. The variance is <math>\frac{\alpha\beta}{(\alpha+\beta)^2 (\alpha + \beta + 1)}</math>
 
The variance of the beta distribution decreases monotonically if <math> \alpha = \beta </math> and as <math> \alpha = \beta </math> increases, the variance decreases.  
 
The variance of the beta distribution decreases monotonically if <math> \alpha = \beta </math> and as <math> \alpha = \beta </math> increases, the variance decreases.  
The mode of a Beta distributed random variable X with α, β > 1 is <math>\frac{\alpha-1}{\alpha + \beta-2}</math>.
 
  
 
The formula for the cumulative distribution function of the beta distribution is also called the incomplete beta function ratio (commonly denoted by Ix) and is defined as F(x) = I(x)(p,q)  
 
The formula for the cumulative distribution function of the beta distribution is also called the incomplete beta function ratio (commonly denoted by Ix) and is defined as F(x) = I(x)(p,q)  
Line 3,187: Line 3,198:
 
:<math>\displaystyle \text{f}(x) = \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)\Gamma(1)}x^{\alpha-1}(1-x)^{1-1}=\alpha x^{\alpha-1}</math><br>
 
:<math>\displaystyle \text{f}(x) = \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)\Gamma(1)}x^{\alpha-1}(1-x)^{1-1}=\alpha x^{\alpha-1}</math><br>
  
The CDF is <math>F(x) = x^{\alpha}</math> (using integration of <math>f(x)</math>)
+
By integrating <math>f(x)</math>, we find the CDF of X is <math>F(x) = x^{\alpha}</math>.
WIth CDF F(x) = x^α, if U have CDF, it is very easy to sample:
+
As <math>F(x)^{-1} = x^\frac {1}{\alpha}</math>, using the inverse transform method, <math> X = U^\frac {1}{\alpha} </math> with U ~ U[0,1].
y=x^α --> x=y^α --> inverseF(x)= x^(1/α)
 
U~U(0,1) --> x=u^(1/α)
 
Applying the inverse transform method with <math>y = x^\alpha \Rightarrow x = y^\frac {1}{\alpha}</math>
 
 
 
<math>F(x)^{-1} = y^\frac {1}{\alpha}</math>
 
 
 
between case 1 and case 2, when alpha and beta be different value, the beta distribution can simplify to other distribution.
 
  
 
'''Algorithm'''
 
'''Algorithm'''
Line 3,210: Line 3,214:
 
</pre>
 
</pre>
  
'''Case 3:'''<br\> To sample from beta in general. we use the property that <br\>
+
'''Case 3:'''<br\> To sample from beta in general, we use the property that <br\>
  
 
:if <math>Y_1</math> follows gamma <math>(\alpha,1)</math><br\>
 
:if <math>Y_1</math> follows gamma <math>(\alpha,1)</math><br\>
Line 3,222: Line 3,226:
  
 
'''Algorithm'''<br\>
 
'''Algorithm'''<br\>
*1. Sample from Y1 ~ Gamma (<math>\alpha</math>,1) #<math>\alpha</math> is the shape, and 1 is the scale. <br\>
+
*1. Sample from Y1 ~ Gamma (<math>\alpha</math>,1) <math>\alpha</math> is the shape, and 1 is the scale. <br\>
 
*2. Sample from Y2 ~ Gamma (<math>\beta</math>,1)  <br\>
 
*2. Sample from Y2 ~ Gamma (<math>\beta</math>,1)  <br\>
 
*3. Set  
 
*3. Set  
Line 3,242: Line 3,246:
 
'''MATLAB Code for generating Beta Distribution'''
 
'''MATLAB Code for generating Beta Distribution'''
 
<pre style='font-size:16px'>
 
<pre style='font-size:16px'>
>>Y1 = sum(-log(rand(10,1000)))            #Gamma(10,1), sum 10 exponential for each of the 1000 samples
+
>>Y1 = sum(-log(rand(10,1000)))            #Gamma(10,1), sum 10 exponentials for each of the 1000 samples
  
>>Y2 = sum(-log(rand(5,1000)))              #Gamma(5,1), sum 5 exponential for each of the 1000 samples
+
>>Y2 = sum(-log(rand(5,1000)))              #Gamma(5,1), sum 5 exponentials for each of the 1000 samples
  
 
%NOTE: here, lamda is 1, since the scale parameter for Y1 & Y2 are both 1
 
%NOTE: here, lamda is 1, since the scale parameter for Y1 & Y2 are both 1
Line 3,262: Line 3,266:
 
>>hist(Y)                                    #Do this to check that the shape fits beta. ~Beta(10,5).
 
>>hist(Y)                                    #Do this to check that the shape fits beta. ~Beta(10,5).
  
>>disttool                                  #Check the beta plot. We can change beta here.
+
>>disttool                                  #Check the beta plot.
  
 
</pre>
 
</pre>
Line 3,274: Line 3,278:
 
[[File:325px-Beta_distribution_pdf.png|300px]]
 
[[File:325px-Beta_distribution_pdf.png|300px]]
  
 +
[[File:untitled.jpg|300px]]<br />
 
MATLAB tips: rand(10,1000) produces one 10*1000 matrix and sum(rand(10,1000)) produces a 10*1000 matrix
 
MATLAB tips: rand(10,1000) produces one 10*1000 matrix and sum(rand(10,1000)) produces a 10*1000 matrix
 
and each element in the matrix follows CDF of uniform distribution.
 
and each element in the matrix follows CDF of uniform distribution.
Line 3,302: Line 3,307:
 
=== Random Vector Generation ===
 
=== Random Vector Generation ===
 
We want to sample from <math>X = (X_1, X_2, </math>…,<math> X_d)</math>, a d-dimensional vector from a known pdf <math>f(x)</math> and cdf <math>F(x)</math>.
 
We want to sample from <math>X = (X_1, X_2, </math>…,<math> X_d)</math>, a d-dimensional vector from a known pdf <math>f(x)</math> and cdf <math>F(x)</math>.
We need to take into account the following two cases:
+
We need to take into account the following two cases:  
  
 
====Case 1====
 
====Case 1====
If <math>x_1, x_2 \cdots, x_d</math>'s are independent, then<br/>
+
if the <math>x_1, x_2 \cdots, x_d</math>'s are independent, then<br/>
 
<math>f(x) = f(x_1,\cdots, x_d) = f(x_1)\cdots f(x_d)</math><br/>
 
<math>f(x) = f(x_1,\cdots, x_d) = f(x_1)\cdots f(x_d)</math><br/>
We can sample from each component <math>x_1, x_2,\cdots, x_d</math> individually, and then form a vector.<br/>
+
we can sample from each component <math>x_1, x_2,\cdots, x_d</math> individually, and then form a vector.<br/>
  
Based on the property of independence, we can derive the pdf or pmf of <math>x=x_1,x_2,x_3,x_4,x_5,\cdots</math>
+
based on the property of independence, we can derive the pdf or pmf of <math>x=x_1,x_2,x_3,x_4,x_5,\cdots</math>
  
 
====Case 2====
 
====Case 2====
Line 3,338: Line 3,343:
 
 
 
Algorithm: <br/>
 
Algorithm: <br/>
1)  for i = 1 to d <br/>
+
1)  For i = 1 to d <br/>
2)    U<sub>i</sub> ~ U(0,1)  (we want to translate this to U(a,b)) <br/>
+
2)    U<sub>i</sub> ~ U(0,1) <br/>
 
3)    x<sub>i</sub> = a<sub>i</sub> + U(b<sub>i</sub>-a<sub>i</sub>) <br/>
 
3)    x<sub>i</sub> = a<sub>i</sub> + U(b<sub>i</sub>-a<sub>i</sub>) <br/>
4)  end <br/>
+
4)  End <br/>
  
 
*Note: x<sub>i</sub> = a<sub>i</sub> + U(b<sub>i</sub>-a<sub>i</sub>) denotes X<sub>i</sub> ~U(a<sub>i</sub>,b<sub>i</sub>) <br/>
 
*Note: x<sub>i</sub> = a<sub>i</sub> + U(b<sub>i</sub>-a<sub>i</sub>) denotes X<sub>i</sub> ~U(a<sub>i</sub>,b<sub>i</sub>) <br/>
Line 3,347: Line 3,352:
 
An example of the 2-D case is given below:
 
An example of the 2-D case is given below:
  
<pre style='font-size:16px'>
+
<pre style='font-size:14px'>
 
 
 
>>a=[1 2];  
 
>>a=[1 2];  
 
>>b=[4 6];  
 
>>b=[4 6];  
Line 3,364: Line 3,368:
 
[[File:2d_ex.jpg|300px]]
 
[[File:2d_ex.jpg|300px]]
  
==== Code: ====
+
==== Matlab Code: ====
  
<pre style='font-size:16px'>
+
<pre style='font-size:14px'>
 
function x = urectangle (d,n,a,b)
 
function x = urectangle (d,n,a,b)
 
for ii = 1:d;
 
for ii = 1:d;
Line 3,373: Line 3,377:
 
     %keyboard                      #makes the function stop at this step so you can evaluate the variables
 
     %keyboard                      #makes the function stop at this step so you can evaluate the variables
 
end
 
end
 
  
 
>>x=urectangle(2, 100, 2, 5);
 
>>x=urectangle(2, 100, 2, 5);
Line 3,412: Line 3,415:
 
Suppose we sampled from the target area W uniformly, let Aw, Ag indicate the area of W and G, g(x)=1/Aw and f(x)=1/Ag
 
Suppose we sampled from the target area W uniformly, let Aw, Ag indicate the area of W and G, g(x)=1/Aw and f(x)=1/Ag
  
This is the picture of the example
 
[[File:AA.jpg]]
 
  
matlab code:
+
The following is a picture relating to the example
 +
 
 +
[[File:Untitled.jpg]]
 +
 
 +
Matlab code:
 
<pre style='font-size:16px'>
 
<pre style='font-size:16px'>
 
u = rand(d,n);
 
u = rand(d,n);
Line 3,439: Line 3,444:
  
 
==Class 10 - Thursday June 6th 2013 ==  
 
==Class 10 - Thursday June 6th 2013 ==  
MATLAB code for using Acceptance-Rejection Method to sample from a d-dimensional unit ball.
+
MATLAB code for using Acceptance/Rejection Method to sample from a d-dimensional unit ball.
 +
G: d-dimensional unit ball G
 +
W: d-dimensional Hypercube
  
 
<pre style='font-size:16px'>
 
<pre style='font-size:16px'>
1. U1~UNIF(0, 1)
+
1) U1~UNIF(0,1)
     U2~UNIF(0, 1)
+
     U2~UNIF(0,1)
 +
    ...
 +
    Ud~UNIF(0,1)
 +
2)  X1 = 1-2U1
 +
    X2 = 1-2U2
 
     ...
 
     ...
     Ud~UNIF(0, 1)
+
     Xd = 1-2Ud
 +
    R = sum(Xi^2)
 +
3)  If R<=1
 +
    X = (X1,X2,...,Xd),
 +
    else go to step 1
 
</pre>
 
</pre>
  
Line 3,454: Line 3,469:
  
 
u = rand(d,n);
 
u = rand(d,n);
z = 1 - 2*u;
+
z = 1- 2 *u;
 
R = sum(z.^2);
 
R = sum(z.^2);
 
jj=1;
 
jj=1;
Line 3,476: Line 3,491:
 
>> scatter(data(1,:), data(2,:))    %plot 2d graph
 
>> scatter(data(1,:), data(2,:))    %plot 2d graph
  
R(ii) determines whether the generated random coordinates fall within the unit ball. In 2-D we have a random x and y, thus if x^2+y^2 <=1 then it falls within the unit ball and we increase our count by 1.
+
R(ii) computes the sum of the square of each element of a vector, so if it is less than 1,
 +
then the vector is in the unit ball.
  
 
x(:,jj) means all the numbers in the jj column.
 
x(:,jj) means all the numbers in the jj column.
Line 3,482: Line 3,498:
 
z(:,ii) means all the numbers in the ii column starting from 1st column until the nth
 
z(:,ii) means all the numbers in the ii column starting from 1st column until the nth
 
column, which is the last one.
 
column, which is the last one.
 +
 +
higher dimension, less efficient and we need more data points
  
 
Save it with the name of the pattern.
 
Save it with the name of the pattern.
Line 3,554: Line 3,572:
 
[[File:3-dimensional unitball.jpg|400px]]
 
[[File:3-dimensional unitball.jpg|400px]]
  
Note that c increases exponentially as d increases, which will result in rejections of more points. So this method is not efficient for large values of d.
+
Note that c increases exponentially as d increases, which will result in a lower acceptance rate and more points being rejected. So this method is not efficient for large values of d.
  
 
In practice, when we need to vectorlize a high quality image or genes then d would have to be very large.  So AR method is not an efficient way to solve the problem.
 
In practice, when we need to vectorlize a high quality image or genes then d would have to be very large.  So AR method is not an efficient way to solve the problem.
Line 3,570: Line 3,588:
 
For example, for approximating value of <math>\pi</math>, when <math>d \text{(dimension)} =2</math>, the efficiency is around 0.7869; when <math>d=3</math>, the efficiency is around 0.5244; when <math>d=10</math>, the efficiency is around 0.0026: it is getting close to 0.
 
For example, for approximating value of <math>\pi</math>, when <math>d \text{(dimension)} =2</math>, the efficiency is around 0.7869; when <math>d=3</math>, the efficiency is around 0.5244; when <math>d=10</math>, the efficiency is around 0.0026: it is getting close to 0.
  
Thus, when we want to generate high dimension vectors, Acceptance-Rejection Method is not efficient to be used.
+
A 'C' value of 1 implies an acceptance rate of 100% (most efficient scenario) but as we sample from higher dimensions, 'C' usually gets larger. Thus, when we want to generate high dimension vectors, Acceptance-Rejection Method is not efficient to be used.
  
 
<span style="color:red;padding:0 auto;"><br>The end of midterm coverage</span>
 
<span style="color:red;padding:0 auto;"><br>The end of midterm coverage</span>
<div style="background-color:#CCCCFF;width:100%;height:200px;">
+
<div style="border:1px solid #cccccc;border-radius:10px;box-shadow: 0 5px 15px 1px rgba(0, 0, 0, 0.6), 0 0 200px 1px rgba(255, 255, 255, 0.5);padding:20px;margin:20px;background:#FFFFAD;">
<div style="float:left;margin:5px 5px 5px 5px;width:100%;cursor:wait;position:absolute;left:350px">
+
<h2 style="text-align:center;">Summary of vector acceptance-rejection sampling</h2>
<span style="font-family:cursive, sans-serif;
+
<p><b>Problem:</b> <math> f(x_1, x_2, ...x_n)</math> is difficult to sample from</p>
text-shadow:3px 3px 3px #330000;font-size:150%;font-variant:small-caps;font-size-adjust:0.49;font-stretch: expanded;float:left">Good luck on the midterm
+
<p><b>Plan:</b></p>
</span>
+
Let W represent the sample space covered by <math> f(x_1, x_2, ...x_n)</math>
 +
<ol>
 +
<li>1.Draw <math>\vec{y}=y_1,y_2...y_n\sim~g()</math> where g has sample space G which is greater than W. g is a distribution that is easy to sample from (i.e. uniform)</li>
 +
<li>2.if <math>\vec{y} \subseteq W </math> then <math>\vec{x}=\vec{y} </math><br /> else go 1) </li>
 +
</ol>
 +
<p>x will have the desired distribution.</p>
 +
 
 
</div>
 
</div>
<div style="position:absolute;margin-top:40px;margin-left:370px;height:160px;">
 
  
[[File:15g6454656.gif]]
 
‎</div>
 
</div>
 
<div style="margin-top:2px">
 
 
==== Stochastic Process ====
 
==== Stochastic Process ====
 
The basic idea of Stochastic Process (also called random process) is a collection of some random variables,  
 
The basic idea of Stochastic Process (also called random process) is a collection of some random variables,  
Line 3,590: Line 3,609:
  
 
'''Definition:''' In probability theory, a stochastic process /stoʊˈkæstɪk/, or sometimes random process (widely used) is a collection of random variables; this is often used to represent the evolution of some random value, or system, over time. This is the probabilistic counterpart to a deterministic process (or deterministic system). Instead of describing a process which can only evolve in one way (as in the case, for example, of solutions of an ordinary differential equation), in a stochastic or random process there is some indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely many) directions in which the process may evolve. (from Wikipedia)
 
'''Definition:''' In probability theory, a stochastic process /stoʊˈkæstɪk/, or sometimes random process (widely used) is a collection of random variables; this is often used to represent the evolution of some random value, or system, over time. This is the probabilistic counterpart to a deterministic process (or deterministic system). Instead of describing a process which can only evolve in one way (as in the case, for example, of solutions of an ordinary differential equation), in a stochastic or random process there is some indeterminacy: even if the initial condition (or starting point) is known, there are several (often infinitely many) directions in which the process may evolve. (from Wikipedia)
</div>
+
 
In other words, stochastic process is non-deterministic. This means that there is some indeterminacy in the final state, even if the initial condition is known.
+
A stochastic process is non-deterministic. This means that even if we know the initial condition(state), and we know some possibilities of the states to follow, the exact value of the final state remains to be uncertain.  
  
 
We can illustrate this with an example of speech: if "I" is the first word in a sentence, the set of words that could follow would be limited (eg. like, want, am), and the same happens for the third word and so on. The words then have some probabilities among them such that each of them is a random variable, and the sentence would be a collection of random variables. <br>
 
We can illustrate this with an example of speech: if "I" is the first word in a sentence, the set of words that could follow would be limited (eg. like, want, am), and the same happens for the third word and so on. The words then have some probabilities among them such that each of them is a random variable, and the sentence would be a collection of random variables. <br>
Also, different stochastic process has different properties.
+
Also, Different Stochastic Process has different properties.
  
In the course, we study two stochastic process models.
+
In the course, we study two Stochastic Process models.
  
The two stochastic process models we will study are:
+
The two stochastic Process models we will study are:
  
1. Poisson Process - This is a continuous time counting process that satisfies a couple of properties that are listed in the next section. The Poisson process is understood to be a good model for independent events such as incoming phone calls, number of traffic accidents, and goals during a game of hockey or soccer. It is also an example of a birth-death process.<br>
+
1. Poisson Process-This is continuous time counting process that satisfies a couple of properties that are listed in the next section. The Poisson process is understood to be a good model for events such as incoming phone calls, number of traffic accidents, and goals during a game of hockey or soccer. It is also an example of a birth-death process.<br>
2. Markov Process - This is a stochastic process that satisfies the Markov property which can be understood as the memory-less property. The property states that the jump to a future state only depends on the current state of the process, and not of the process's history. This model is used to model random walks exhibited by particles, the health state of a life insurance policyholder, decision making by a memory-less mouse in a maze, etc. <br>
+
2. Markov Process- This is a stochastic process that satisfies the Markov property which can be understood as the memory-less property. The property states that the jump to a future state only depends on the current state of the process, and not of the process's history. This model is used to model random walks exhibited by particles, the health state of a life insurance policyholder, decision making by a memory-less mouse in a maze, etc. <br>
 +
 +
 
 +
=====Example=====
 +
The state space is the set of English words, and <math>x_t</math> are words that are said. Another example involves the stock market: the set of all non-negative numbers is the state space, and <math>x_t</math> are stock prices.
 +
 
 +
stochastic process always has state space and the index set to limit the range.
 +
 
 +
The state space is the set of cars, while <math>x_t</math> are sport cars.
  
Stochastic process always has a state space and the index is set to limit the range. For instances, in a stock market, the set of all non-negative numbers is the state space, while <math>x_t</math> are individual stock prices.
+
Births in a hospital occur randomly at an average rate
 +
 
 +
The number of cases of a disease in different towns
  
 
==== Poisson Process ====
 
==== Poisson Process ====
The Poisson process, which is discrete, arises when we count the number of occurrences of events over time.
+
[[File:Possionprocessidiagram.png‎]]
 +
 
 +
The Poisson process is a discrete counting process which counts the number of<br\>
 +
of events and the time that these occur in a given time interval.<br\>
 +
 
 +
e.g traffic accidents , arrival of emails. Emails arrive at random time <math>T_1, T_2 ... T_n</math> for example (2, 7, 3) is the number of emails received on day 1, day 2, day 3. This is a stochastic process and Poisson process with condition.
  
e.g traffic accidents , arrival of emails. Emails arrive at random time <math>T_1, T_2</math> ...
+
The probability of observing x events in a given interval is given by
 +
<math> P(X = x) = e^{-\lambda}* \lambda^x/ x! </math>
 +
where x = 0; 1; 2; 3; 4; ....
  
-Let <math>N_t</math> denote the number of arrivals in the time interval <math>(0,t]</math><br\>
+
-Let <math>N_t</math> denote the number of arrivals within the time interval <math>(0,t]</math><br\>
 
-Let <math>N(a,b]</math> denote the number of arrivals in the time interval (a,b]<br\>
 
-Let <math>N(a,b]</math> denote the number of arrivals in the time interval (a,b]<br\>
 
-By definition, <math>N(a,b]=N_b-N_a</math><br\>
 
-By definition, <math>N(a,b]=N_b-N_a</math><br\>
Line 3,624: Line 3,660:
 
E[N<sub>t</sub>] = <math>\lambda t</math> and Var[N<sub>t</sub>] = <math>\lambda t</math>
 
E[N<sub>t</sub>] = <math>\lambda t</math> and Var[N<sub>t</sub>] = <math>\lambda t</math>
  
==== ====
+
the rate parameter may change over time; such a process is called a non-homogeneous Poisson process
 +
 
 +
==== Examples ====
 
<br />
 
<br />
'''How to generate a multivariate normal with build in function "randn": (example)'''<br />
+
'''How to generate a multivariate normal with the built-in function "randn": (example)'''<br />
 
(please check the general idea at the end of lecture 6 course note.)
 
(please check the general idea at the end of lecture 6 course note.)
  
Line 3,636: Line 3,674:
 
                       %matrix to 1*n matrix;
 
                       %matrix to 1*n matrix;
 
</pre>
 
</pre>
 +
For example, if we use mu = [2 5], we would get <br/>
 +
<math> = \left[ \begin{array}{ccc}
 +
3.8214 & 0.3447 \\
 +
6.3097 & 5.6157 \end{array} \right]</math>
 +
  
and if we want to use box-muller to generate a multivariate normal, we could use the code in lecture 6:
+
If we want to use box-muller to generate a multivariate normal, we could use the code in lecture 6:
 
<pre style='font-size:16px'>
 
<pre style='font-size:16px'>
 
d = length(mu);
 
d = length(mu);
Line 3,651: Line 3,694:
 
X = Z*R + ones(n,1)*mu';
 
X = Z*R + ones(n,1)*mu';
 
</pre>
 
</pre>
 
  
 
==== '''Central Limit Theorem''' ====
 
==== '''Central Limit Theorem''' ====
  
We can use simulation to test results in probability and statistics. For example, with the central limit theorem, if we sample from a sufficiently large number of independently distributed random variables, the mean will be approximately normally distributed. We illustrate with an example using 1000 observations each of 20 independent exponential random variables.
+
Theorem: "Given a distribution with mean μ and variance σ², the sampling distribution of the mean approaches a  normal distribution with a mean (μ) and a variance σ²/N as N, the sample size, increases". Furthermore, the original distribution can be of any arbitrary shape, the sampling distribution of the mean will approach a normal distribution with a large enough N.<ref>
 +
http://davidmlane.com/hyperstat/A14043.html
 +
</ref>
  
'''Definition:'''
+
Applying the central limit theorem to simulations, we may revise the definition to be the following: Given certain conditions, the mean of a sufficiently large number of independent random variables, each with a well defined mean and variance, will be approximately normal distributed.(i.e. if we simulate sufficiently many independent observations based on well defined mean and variance, the mean of these observations will follow an approximately normal distribution.)
  
Given certain conditions, the mean of a sufficiently large number of independent random variables, each with a well defined mean and variance, will be approximately normal distributed.
+
We illustrate with an example using 1000 observations each of 20 independent exponential random variables.
  
 
<pre style='font-size:16px'>
 
<pre style='font-size:16px'>
 
>> X = exprnd (20,20,1000); % 1000 instances of 20 exponential random numbers with mean 20
 
>> X = exprnd (20,20,1000); % 1000 instances of 20 exponential random numbers with mean 20
 
>> hist(X(1,:))
 
>> hist(X(1,:))
>> hist(sum(X(1:2,:)))
+
>> hist(X(1:2,:))
 
...
 
...
>>hist(sum(X(1:20,:))) -> approaches to normal
+
>>hist(X(1:20,:)) -> approaches to normal
 
</pre>
 
</pre>
  
'''Theorem: Central Limit Theorem'''
+
(The definition of CLT is from http://en.wikipedia.org/wiki/Central_limit_theorem)
Let <math>X_1, ..., X_n</math> be iid random variables such that <math>E(X_i)=\mu</math> and <math> Var(X_i)=\sigma^2</math>, and <math> \bar{X} = n^{-1} \left ( \sum_{i=1}^n x_i \right ) </math>. <br> Then <math> \ \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \xrightarrow{d}\ N(0,1)</math>
+
 
 +
<math> \lim_{n \to \infty} P*[{\frac{X_1 + ... + X_n -n*\mu}{\sigma*\surd n}} < x] = \Phi (x)</math>
  
 
==Class 11 - Tuesday,June 11, 2013==
 
==Class 11 - Tuesday,June 11, 2013==
 +
=== Announcement ===
 +
Midterm covers up to the middle of last lecture, which means stochastic process will not be on midterm. There won't be any Matlab syntax questions. And Students can contribute to any previous classes. We might however be asked to write down algorithms.
 +
 
===Poisson Process===
 
===Poisson Process===
 +
A Poisson Process is a stochastic approach to count number of events in a certain time period. <s>Strike-through text</s>
 
A discrete stochastic variable ''X'' is said to have a Poisson distribution with parameter ''λ'' > 0 if
 
A discrete stochastic variable ''X'' is said to have a Poisson distribution with parameter ''λ'' > 0 if
:<math>\!f(n)= \frac{\lambda^n e^{-\lambda}}{n!}  \qquad n= 0,1,\ldots,</math>.
+
:<math>\!f(n)= \frac{\lambda^n e^{-\lambda}}{n!}  \qquad n= 0,1,2,3,4,5,\ldots,</math>.
  
'''Definition'''<br>
+
<math>\{X_t:t\in T\}</math> where <math>\ X_t </math> is state space and T is index set.
The number of arrivals N(t) in a time interval of length t follows Poisson distribution with mean <math>lambda*t</math>,i.e.<br>
 
<math>P(N(t)=n) = \frac{e^{-\lambda t} (\lambda t)^n}{n!}</math>
 
  
In probability theory, a Poisson process is a stochastic process which counts the number of events[note 1] and the time that these events occur in a given time interval. The time between each pair of consecutive events has an exponential distribution with parameter λ and each of these inter-arrival times is assumed to be independent of other inter-arrival times. The process is named after the French mathematician Siméon-Denis Poisson and is a good model of radioactive decay,[1] telephone calls[2] and requests for a particular document on a web server,[3] among many other phenomena.
 
By wikipedia
 
  
 
'''Properties of Homogeneous Poisson Process'''<br>
 
'''Properties of Homogeneous Poisson Process'''<br>
 
(a) '''Independence:''' The numbers of arrivals in non-overlapping intervals are independent  <br>
 
(a) '''Independence:''' The numbers of arrivals in non-overlapping intervals are independent  <br>
(b) '''Homogeneity or Uniformity:''' The number of arrival in each interval I(a,b] is Poisson distribution with rate <math>\lambda (b-a)</math><br/>
+
(b) '''Homogeneity or Uniformity:''' The number of arrivals in each interval I(a,b] is Poisson distribution with rate <math>\lambda (b-a)</math><br/>
(c) '''Individuality:'''  for a sufficiently short time period of length h, the probability of 2 or more events occurring in the interval is close to 0
+
(c) '''Individuality:'''  for a sufficiently short time period of length h, the probability of 2 or more events occurring in the interval is close to 0, or formally <math>\mathcal{O}(h)</math><br>
 
 
 
 
'''Notation'''<br>
 
N<sub>t</sub> denotes the number of arrivals up to t, i.e.(0,t] <br>
 
N(a,b] = N<sub>b</sub> - N<sub>a</sub> denotes the number of arrivals in I(a, b]. <br> where I denotes the an interval.
 
  
 +
NOTE: it is very important to note that the time between the occurrence of consecutive events (in a Poisson Process) is exponentially distributed with the same parameter as that in the Poisson distribution. This characteristic is used when trying to simulate a Poisson Process.
  
 
For a small interval (t,t+h], where h is small<br>
 
For a small interval (t,t+h], where h is small<br>
1. The number of arrivals in this interval is independent of the number of arrivals up to t(N<sub>t</sub>)<br>
+
1. The number of arrivals up to time t(N<sub>t</sub>) is independent of the number of arrival in the interval<br>
 
2. <math> P (N(t,t+h)=1|N_{t} ) = P (N(t,t+h)=1) =\frac{e^{-\lambda h} (\lambda h)^1}{1!} =e^{-\lambda h} {\lambda h} \approx \lambda h </math> since <math>e^{-\lambda h} \approx 1</math> when h is small.<br>
 
2. <math> P (N(t,t+h)=1|N_{t} ) = P (N(t,t+h)=1) =\frac{e^{-\lambda h} (\lambda h)^1}{1!} =e^{-\lambda h} {\lambda h} \approx \lambda h </math> since <math>e^{-\lambda h} \approx 1</math> when h is small.<br>
  
Line 3,703: Line 3,744:
 
Similarly, the probability of not observing an arrival in this interval is 1-<math>\lambda </math> h.<br>
 
Similarly, the probability of not observing an arrival in this interval is 1-<math>\lambda </math> h.<br>
  
*Note : Recall that exponential random variable is the waiting time  until one event of interested occurs.<br>
 
In other words, the inter-arrival times are independent and follows Exponential distribution with mean λ.
 
  
 +
'''Generate a Poisson Process'''<br />
 +
 +
1. set <math>T_{0}=0</math> and n=1<br/>
 +
 +
2. <math>U_{n} \sim~ U(0,1)</math><br />
 +
 +
3. <math>T_{n} = T_{n-1}-\frac {1}{\lambda} log (U_{n})  </math> (declare an arrival)<br />
 +
 +
4. if <math>T_{n} \gneq T</math> stop<br />
 +
&nbsp;&nbsp;&nbsp;&nbsp;else<br />
 +
&nbsp;&nbsp;&nbsp;&nbsp;n=n+1 go to step 2<br />
 +
 +
Since <math>P(N(t,t+h)=1) = e^{-{\lambda} h}\lambda h</math>, we can regard <math>\lambda </math>h as a exponential distribution, and according to what we learnt, <math>T_n-T_{n-1} = h = -\frac {1}{\lambda} log(U_n)</math>.<br>
 +
*Note : Recall that exponential random variable is the waiting time  until one event of interested occurs.
  
 
'''Review of Poisson - Example'''
 
'''Review of Poisson - Example'''
Line 3,718: Line 3,771:
  
 
<span style="background:#F5F5DC">
 
<span style="background:#F5F5DC">
P(N<sub>3</sub>>3|N<sub>2</sub>)=P(N<sub>1</sub>>2)
+
<math>P(N_3> 3 | N_2)=P(N_1 > 2)</math>
 
</span>
 
</span>
  
when we use the inverse-transfer method, we can assume the poisson process to be exp distribution, and get the h function from the  inverse method, and sometimes we assume h is very small.
+
When we use the inverse-transform method, we can assume the poisson process to be an exponential distribution, and get the h function from the  inverse method. Sometimes we assume that h is very small.
 +
 
 +
'''Multi-dimensional Poisson Process'''<br>
 +
 
 +
The poisson distribution arises as the distribution of counts of occurrences of events in (multidimensional) intervals in multidimensional poisson process in a directly equivalent way to the result for unidimensional processes. This,is ''D'' is any region the multidimensional space for which |D|, the area or volume of the region, is finite, and if {{nowrap|''N''(''D'')}} is count of the number of events in ''D'', then
 +
 
 +
<math> P(N(D)=k)=\frac{(\lambda|D|)^k e^{-\lambda|D|}}{k!} .</math>
  
 
=== Generating a Homogeneous Poisson Process ===
 
=== Generating a Homogeneous Poisson Process ===
Line 3,745: Line 3,804:
  
 
<b>Higher Dimensions:</b><br>
 
<b>Higher Dimensions:</b><br>
 
The poisson distribution arises as the distribution of counts of occurrences of events in (multidimensional) intervals in multidimensional poisson process in a directly equivalent way to the result for unidimensional processes. This,is ''D'' is any region the multidimensional space for which |D|, the area or volume of the region, is finite, and if {{nowrap|''N''(''D'')}} is count of the number of events in ''D'', then
 
 
<math> P(N(D)=k)=\frac{(\lambda|D|)^k e^{-\lambda|D|}}{k!} .</math>
 
 
 
To sample from higher dimensional Poisson process:<br>
 
To sample from higher dimensional Poisson process:<br>
 
1. Generate a random number N that is Poisson distributed with parameter <math>{\lambda}</math>*A<sub>d</sub>, where A<sub>d</sub> is the area under the bounded region. (ie A<sub>2</sub> is area of the region, A<sub>3</sub> is the volume of the 3-d space.<br>
 
1. Generate a random number N that is Poisson distributed with parameter <math>{\lambda}</math>*A<sub>d</sub>, where A<sub>d</sub> is the area under the bounded region. (ie A<sub>2</sub> is area of the region, A<sub>3</sub> is the volume of the 3-d space.<br>
Line 3,770: Line 3,824:
 
end
 
end
  
plot(T)
+
plot(T, '.')
  
 
</pre>
 
</pre>
 
+
<br>
  
 
The following plot is using TT = 50.<br>
 
The following plot is using TT = 50.<br>
 
The number of points generated every time on average should be <math>\lambda</math> * TT. <br>
 
The number of points generated every time on average should be <math>\lambda</math> * TT. <br>
 
The maximum value of the points should be TT. <br>
 
The maximum value of the points should be TT. <br>
[[File:Poisson.jpg]]
+
[[File:Poisson.jpg]]<br>
 
when TT be big, the plot of the graph will be linear, when we set the TT be 5 or small number, the plot graph looks like discrete distribution.
 
when TT be big, the plot of the graph will be linear, when we set the TT be 5 or small number, the plot graph looks like discrete distribution.
  
 
===Markov chain===
 
===Markov chain===
A Markov Chain is a stochastic process where: <br/>
+
"A Markov Chain is a stochastic process where: <br/>
  
 
1) Each stage has a fixed number of states, <br/>
 
1) Each stage has a fixed number of states, <br/>
2) the (conditional) probabilities at each stage only depend on the previous state. <br/>
+
2) the (conditional) probabilities at each stage only depend on the previous state." <br/>
  
 
Source: "http://math.scu.edu/~cirving/m6_chapter8_notes.pdf" <br/>
 
Source: "http://math.scu.edu/~cirving/m6_chapter8_notes.pdf" <br/>
  
A Markov Chain is said to be irreducible if for each pair of states i and j there is a positive probability, starting in state i, that the process will ever enter state j.
+
A Markov Chain is said to be irreducible if for each pair of states i and j there is a positive probability, starting in state i, that the process will ever enter state j.(source:"https://en.wikipedia.org/wiki/Markov_chain")
  
 
Markov Chain is the simplification of assumption, for instance, the result of GPA in university depend on the GPA's in high school, middle school, elementary school, etc., but during a job interview after university graduation, the interviewer would most likely to ask about the GPA in university of the interviewee but not the GPA from early years because they assume what happened before are summarized and adequately represented by the information of the GPA earned during university. Therefore, it's not necessary to look back to elementary school. A Markov Chain works the same way, we assume that everything that has occurred earlier in the process is only important for finding out where we are now, the future only depends on the present of where we are now, not the past of how we got here. So the n<sub>th</sub>random variable would only depend on the n-1<sub>th</sub>term but not all the previous ones. A Markov process exhibits the memoryless property.<br/>
 
Markov Chain is the simplification of assumption, for instance, the result of GPA in university depend on the GPA's in high school, middle school, elementary school, etc., but during a job interview after university graduation, the interviewer would most likely to ask about the GPA in university of the interviewee but not the GPA from early years because they assume what happened before are summarized and adequately represented by the information of the GPA earned during university. Therefore, it's not necessary to look back to elementary school. A Markov Chain works the same way, we assume that everything that has occurred earlier in the process is only important for finding out where we are now, the future only depends on the present of where we are now, not the past of how we got here. So the n<sub>th</sub>random variable would only depend on the n-1<sub>th</sub>term but not all the previous ones. A Markov process exhibits the memoryless property.<br/>
Line 3,799: Line 3,853:
 
*Technology: The Google link analysis algorithm "PageRank"<br />
 
*Technology: The Google link analysis algorithm "PageRank"<br />
  
 +
 +
'''Definition''' An irreducible Markov Chain is said to be aperiodic if for some n <math>\ge 0 </math> and some state j.<br />
 +
<math> P*(X_n=j | X_0 =j) > 0 </math>    and    <math>  P*(X_{n+1} | X_0=j) > 0 </math> <br />
 +
 +
It can be shown that if the Markov Chain is irreducible and aperiodic then, <br />
 +
<math> \pi_j = \lim_{n -> \infty} P*(X_n = j) for j=1...N </math> <br />
 +
Source: From Simulation textbook <br />
  
 
Product Rule (Stochastic Process):<br />
 
Product Rule (Stochastic Process):<br />
Line 3,825: Line 3,886:
  
 
====Transition Matrix====
 
====Transition Matrix====
Definition: A Markov transition matrix is a square matrix describing the probabilities of moving from one state to another in a dynamic system. In each row are the probabilities of moving from the state represented by that row, to the other states. Thus the rows of a Markov transition matrix each add to one. <br />
 
A Transition Matrix is used to describe the transitions of a Markov chain. Each of its entries is a non-negative real number representing a probability.
 
  
 
Transition Probability: <math> P_{ij} = P(X_{t+1} =j | X_t =i) </math> is the one-step transition probability from state i to state j.
 
Transition Probability: <math> P_{ij} = P(X_{t+1} =j | X_t =i) </math> is the one-step transition probability from state i to state j.
Line 3,849: Line 3,908:
 
0.2 & 0.8
 
0.2 & 0.8
 
\end{matrix} \right] </math>
 
\end{matrix} \right] </math>
 +
 +
Note: Column 1 corresponds to the state at time t and Column 2 corresponds to the state at time t+1.
  
 
The above matrix can be drawn into a state transition diagram
 
The above matrix can be drawn into a state transition diagram
Line 3,858: Line 3,919:
 
2. <math>\sum_{j}^{}{P_{ij}=1}</math>  which means the rows of P should sum to 1.<br />
 
2. <math>\sum_{j}^{}{P_{ij}=1}</math>  which means the rows of P should sum to 1.<br />
  
 +
Remark: <math>\sum_{i}^{}{P_{ij}\neq1}</math> in general. If equality holds, the matrix is called a doubly stochastic matrix.
  
 
In general, one would consider a (finite) set of elements <math> \Omega </math> such that: <br>
 
In general, one would consider a (finite) set of elements <math> \Omega </math> such that: <br>
Line 3,874: Line 3,936:
  
 
Then one might consider the periodicity of the chain and derive a notion of cyclic behavior. <br>
 
Then one might consider the periodicity of the chain and derive a notion of cyclic behavior. <br>
<br>
 
 
Example of Double-Stochastic Matrix:<br>
 
 
Consider the following probability transition matrix.<br>
 
<math> P= \left [ \begin{matrix}
 
0 & p & q \\
 
q & 0 & p \\
 
p & q & 0
 
\end{matrix} \right] </math>
 
Where q=1-p<br>
 
Each row is sum to 1 and each column is also sum to 1. Thus, this kind of probability transition matrix called Double-Stochastic Matrix.<br>
 
  
 
=== Examples of Transition Matrix ===
 
=== Examples of Transition Matrix ===
  
'''Example 1'''
+
[[File:Mark13.png]]<br>
 
+
The picture is from http://www.google.ca/imgres?imgurl=http://academic.uprm.edu/wrolke/esma6789/graphs/mark13.png&imgrefurl=http://academic.uprm.edu/wrolke/esma6789/mark1.htm&h=274&w=406&sz=5&tbnid=6A8GGaxoPux9kM:&tbnh=83&tbnw=123&prev=/search%3Fq%3Dtransition%2Bmatrix%26tbm%3Disch%26tbo%3Du&zoom=1&q=transition+matrix&usg=__hZR-1Cp6PbZ5PfnSjs2zU6LnCiI=&docid=PaQvi1F97P2urM&sa=X&ei=foTxUY3DB-rMyQGvq4D4Cg&sqi=2&ved=0CDYQ9QEwAQ&dur=5515)
[[File:Mark13.png]]
 
  
 
1.There are four states: 0,1,2, and 3.
 
1.There are four states: 0,1,2, and 3.
Line 3,915: Line 3,964:
 
\end{align}</math><br />
 
\end{align}</math><br />
  
'''Example 2'''
+
== Class 12 - Thursday,June 13, 2013 ==
[[File:Tranmatrix.png]] <br>
+
<b>Time</b>
The transition matrix in this case would be: <br>
+
Jun 17, 2013 2:30 PM - 3:30 PM
<math> P=\left [ \begin{matrix} 0.7 & 0.1 & 0 \\ 0 & 1 & 0 \\ 0.3 & 0.7 & 0 \\ \end{matrix}\right] </math>. Notice the "1" entry at <math> P_{2,2} </math>, even though the image doesn't show that. This is because there is no way of getting out of state 2, hence the probability of staying in state 2 is 1 (cant get out).
 
 
 
 
===Midterm Review===
 
===Midterm Review===
  
 
=== Multiplicative Congruential Algorithm ===
 
=== Multiplicative Congruential Algorithm ===
x<sub>k+1</sub>= (ax<sub>k</sub>+c) mod m
+
<div style="border:1px solid red">
 +
A Linear Congruential Generator (LCG) yields a sequence of randomized numbers calculated with a linear equation. The method represents one of the oldest and best-known pseudorandom number generator algorithms.[1] The theory behind them is easy to understand, and they are easily implemented and fast, especially on computer hardware which can provide modulo arithmetic by storage-bit truncation.<br>
 +
from wikipedia
 +
</div>
  
Where a, c, m and x<sub>1</sub> (the seed) are values we must chose before running the algorithm. While there is no set value for each, it is best for m to be large and prime.
+
<math>\begin{align}x_k+1= (ax_k+c) \mod  m\end{align}</math><br />
  
Examples:
+
Where a, c, m and x<sub>1</sub> (the seed) are values we must chose before running the algorithm. While there is no set value for each, it is best for m to be large and prime. For example, Matlab uses a = 75,b = 0,m = 231 − 1.
      X<sub>0</sub> = 10 ,a = 2 , c = 1 , m = 13
 
     
 
          X<sub>1</sub> = 2 * 10 + 1  mod 13 = 8
 
          X<sub>2</sub> = 2 * 8  + 1 mod 13 = 4
 
          ... and so on
 
  
      X<sub>0</sub> = 44 ,a = 13 , c = 17 , m = 211
+
'''Examples:'''<br>  
        
+
1. <math>\begin{align}X_{0} = 10 ,a = 2 , c = 1 , m = 13 \end{align}</math><br> 
          X<sub>1</sub> = 13 * 44 + 17 mod 211 = 167
+
   
          X<sub>2</sub> = 13 * 167  + 17 mod 211 = 78
+
<math>\begin{align}X_{1} = 2 * 10 + 1\mod 13 = 8\end{align}</math><br>
          X<sub>3</sub> = 13 * 78  + 17 mod 211 = 187
+
 
          ... and so on
+
<math>\begin{align}X_{2} = 2 * 8  + 1\mod 13 = 4\end{align}</math> ... and so on<br>
 +
 
 +
 
 +
2. <math>\begin{align}X_{0} = 44 ,a = 13 , c = 17 , m = 211\end{align}</math><br>
 +
        
 +
<math>\begin{align}X_{1} = 13 * 44 + 17\mod 211 = 167\end{align}</math><br>
 +
 
 +
<math>\begin{align}X_{2} = 13 * 167  + 17\mod 211 = 78\end{align}</math><br>
 +
 
 +
<math>\begin{align}X_{3} = 13 * 78  + 17\mod 211 = 187\end{align}</math> ... and so on<br>
  
 
=== Inverse Transformation Method ===
 
=== Inverse Transformation Method ===
Line 3,956: Line 4,010:
 
===Acceptance-Rejection Method===
 
===Acceptance-Rejection Method===
 
cg(x)>=f(x)
 
cg(x)>=f(x)
<math>c=max\left[\frac{f(x)}{g(x)}\right]</math>
+
<math>c=max\frac{f(x)}{g(x)}</math>
 
<br><math>\frac{1}{c}</math> is the efficiency of the method/probability of acceptance
 
<br><math>\frac{1}{c}</math> is the efficiency of the method/probability of acceptance
  
Line 3,978: Line 4,032:
  
 
- Sample uniformly from a space W that contains the sample space G of interest<br/>
 
- Sample uniformly from a space W that contains the sample space G of interest<br/>
-Accept if the point is inside G<br/>
+
- Accept if the point is inside G <br/>
Sample uniformly from W<br/>
+
 
 +
Steps:
 +
1. Sample uniformly from W <br/>
 
g(x)=<math>\frac{1}{A_W}</math>, where A<sub>W</sub> is the area of W.<br/>
 
g(x)=<math>\frac{1}{A_W}</math>, where A<sub>W</sub> is the area of W.<br/>
 
f(x)=<math>\frac{1}{A_G}</math>, where A<sub>G</sub> is the area of G.<br/>
 
f(x)=<math>\frac{1}{A_G}</math>, where A<sub>G</sub> is the area of G.<br/>
 +
2. If the point is inside G, accept the point. Else, reject and repeat step 1.
  
  
Line 3,992: Line 4,049:
 
===Exponential===
 
===Exponential===
  
Models the waiting until the first success.<br>
+
Models the waiting time until the first success.<br>
 +
<math>X\sim~Exp(\lambda)</math> <br />
  
X~Exp<math>(\lambda) </math><br>
+
<math>f(x) = \lambda e^{-\lambda x} \, , x>0 </math><br/>
<math> f (x) = \lambda e^{-\lambda x}</math> , <math>x>0 </math><br/>
 
  
1. U~Unif(0,1)
+
<math>1.\, U\sim~U(0,1)</math>
 
+
<br />
2. The inverse of exponential function is x = <math>\frac{-1}{\lambda} log(U)</math>
+
<math>2.\, x = \frac{-1}{\lambda} log(U)</math>
  
 
===Normal===
 
===Normal===
Line 4,017: Line 4,074:
 
In the multivariate case,<br/ >
 
In the multivariate case,<br/ >
 
<math>\underline{Z}\sim N(\underline{0},I)\rightarrow  \underline{X} \sim N(\underline{\mu},\Sigma)</math> <br/ >
 
<math>\underline{Z}\sim N(\underline{0},I)\rightarrow  \underline{X} \sim N(\underline{\mu},\Sigma)</math> <br/ >
<math>\underline{X} = \underline{\mu} +\Sigma ^{1/2} \underline{Z}</math>
+
<math>\underline{X} = \underline{\mu} +\Sigma ^{1/2} \underline{Z}</math><br/>
 +
Note: <math>\Sigma^{1/2}</math> can be obtained from Cholesky decomposition (chol(A) in MATLAB), which is guaranteed to exist, as  <math>\Sigma</math> is positive semi-definite.
  
 
=== Gamma ===
 
=== Gamma ===
 
Gamma(t,λ) <br>
 
Gamma(t,λ) <br>
 
t: The number of exponentials and the shape parameter<br>
 
t: The number of exponentials and the shape parameter<br>
1/λ: The mean of the exponentials and the scale parameter<br>  
+
λ: The mean of the exponentials and the scale parameter<br>  
  
 
Also, Gamma(t,λ) can be expressed into a summation of t exp(λ).<br>
 
Also, Gamma(t,λ) can be expressed into a summation of t exp(λ).<br>
Line 4,030: Line 4,088:
  
 
<math>=\frac {-1}{\lambda}\log(\prod_{j=1}^{t} U_j)</math>
 
<math>=\frac {-1}{\lambda}\log(\prod_{j=1}^{t} U_j)</math>
 +
 +
This is a special property of gamma distribution.
  
 
=== Bernoulli ===
 
=== Bernoulli ===
Line 4,035: Line 4,095:
 
A Bernoulli random variable can only take two possible values: 0 and 1. 1 represents "success" and 0 represents "failure." If p is the probability of success, we have pdf
 
A Bernoulli random variable can only take two possible values: 0 and 1. 1 represents "success" and 0 represents "failure." If p is the probability of success, we have pdf
  
<math> f(x)= p^x (1-p)^{1-x}, x=0,1 </math><br>
+
<math> f(x)= p^x (1-p)^{1-x},\,  x=0,1 </math><br>
  
 
To generate a Bernoulli random variable we use the following procedure:
 
To generate a Bernoulli random variable we use the following procedure:
  
sample u~U(0,1)<br>
+
<math> 1. U\sim~U(0,1)</math><br>
if u <= p, then x=1<br>
+
<math> 2. if\, u <= p, then\, x=1\,</math><br />  
else x=0<br>
+
<math> else\, x=0</math><br/>
 
where 1 stands for success and 0 stands for failure.<br>
 
where 1 stands for success and 0 stands for failure.<br>
  
Line 4,048: Line 4,108:
 
The sum of n independent Bernoulli trials
 
The sum of n independent Bernoulli trials
 
<br\>
 
<br\>
X~ Bin(n,p)<br/>
+
<math> X\sim~ Bin(n,p)</math><br/>
1. U1, U2, ... Un ~ U(0,1)<br/>
+
1.<math> U1, U2, ... Un \sim~U(0,1)</math><br/>
 
2. <math> X= \sum^{n}_{1} I(U_i \leq p) </math> ,where <math>I(U_i \leq p)</math> is an indicator for a successful trial.<br/>
 
2. <math> X= \sum^{n}_{1} I(U_i \leq p) </math> ,where <math>I(U_i \leq p)</math> is an indicator for a successful trial.<br/>
 
Return to 1<br/>
 
Return to 1<br/>
  
I is an indicator variable if for U <= P, then I(U<=P)=1; else I(U>P)=0.
+
I is an indicator variable if for <math>U \leq P,\, then\, I(U\leq P)=1;\, else I(U>P)=0.</math>
  
 
Repeat this N times if you need N samples.
 
Repeat this N times if you need N samples.
Line 4,067: Line 4,127:
 
simulate this binomial distribution.
 
simulate this binomial distribution.
  
1) Generate <math> U_1....U_{10} </math> ~ <math> U(0,1) </math>  <br>
+
1) Generate <math>U_1....U_{10} \sim~ U(0,1) </math>  <br>
 
2) <math> X= \sum^{10}_{1} I(U_i \leq \frac{1}{6}) </math> <br>
 
2) <math> X= \sum^{10}_{1} I(U_i \leq \frac{1}{6}) </math> <br>
3)Return to one.
+
3)Return to 1)
  
 
=== Beta Distribution ===
 
=== Beta Distribution ===
Line 4,077: Line 4,137:
  
 
:<math>\displaystyle \text{Beta}(1,1) = U (0, 1) </math><br>
 
:<math>\displaystyle \text{Beta}(1,1) = U (0, 1) </math><br>
generate uniform directly
 
  
  
 
:<math>\displaystyle \text{Beta}(\alpha,1)={f}(x) = \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)\Gamma(1)}x^{\alpha-1}(1-x)^{1-1}=\alpha x^{\alpha-1}</math><br>
 
:<math>\displaystyle \text{Beta}(\alpha,1)={f}(x) = \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)\Gamma(1)}x^{\alpha-1}(1-x)^{1-1}=\alpha x^{\alpha-1}</math><br>
use inverse method to generate
+
 
 +
 
 +
'''Gamma Distribution'''
  
 
'''Algorithm'''<br\>
 
'''Algorithm'''<br\>
Line 4,091: Line 4,152:
  
 
This distribution models the number of failures before the first success.
 
This distribution models the number of failures before the first success.
<span><p style="color:#ECF1EF">this requires attention</p></span>
+
 
 
X~Geo(p)
 
X~Geo(p)
  
Line 4,114: Line 4,175:
 
<math>y-1=-(1-p)^x</math><br>
 
<math>y-1=-(1-p)^x</math><br>
 
<math>1-y=(1-p)^x</math><br>
 
<math>1-y=(1-p)^x</math><br>
<math>1-y=(e^(-\lambda))^x=e^(-\lambda*x)</math> since <math>1-p=e^-\lambda</math><br>
+
<math>1-y=(e^{-\lambda})^x=e^{-\lambda*x}</math> since <math>1-p=e^{-\lambda}</math><br>
 
<math>log(1-y)=-\lambda*x</math><br>
 
<math>log(1-y)=-\lambda*x</math><br>
 
<math>x=-1/(\lambda)*log(1-y)</math><br>
 
<math>x=-1/(\lambda)*log(1-y)</math><br>
<math>F^(-1)(x)=-1/(\lambda)*log(1-x)</math><br>
+
<math>F^-1(x)=-1/(\lambda)*log(1-x)</math><br>
 
<br>
 
<br>
 
'''Algorithm:'''<br />
 
'''Algorithm:'''<br />
Line 4,129: Line 4,190:
 
If X~Unif (0,1), Y= floor(5U)-2 = [5U]-2 -> Y~ DU[-2,2]  
 
If X~Unif (0,1), Y= floor(5U)-2 = [5U]-2 -> Y~ DU[-2,2]  
 
<br>
 
<br>
 +
 +
There is also another intuitive method:<br>
 +
1. Draw U ~ U(0,1)<br>
 +
2. i = 1, Pi = 1 - (1 - P)^i. <br>
 +
3. If u <= Pi = 1 - (1 - P)^i, set X = i.
 +
Else, i = i + 1. <br>
  
 
===Poisson===
 
===Poisson===
  
 +
This distribution models the number of times and event occurs in a given time period
  
In probability theory and statistics, the Poisson distribution (pronounced [pwasɔ̃]), named after French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.[1] The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.
 
 
 
This distribution models the number of times and event occurs in a given time period
 
<span style="text-shadow: 0px 2px 3px hsl(310,15%,65%);margin-right:1em;font-family: 'Nobile', Helvetica, Arial, sans-serif;font-size:16px;line-height:25px;color:3399CC">
 
 
X~Poi<math>(\lambda)</math> <br>
 
X~Poi<math>(\lambda)</math> <br>
 
X is the maximum number of iid Exp(<math>\lambda</math>) whose sum is less than or equal to 1.<br>
 
X is the maximum number of iid Exp(<math>\lambda</math>) whose sum is less than or equal to 1.<br>
Line 4,143: Line 4,206:
 
<math>  = \max\{n: \sum\limits_{i=1}^n \frac{-1}{\lambda} log(U_i)<=1 , U_i \sim U[0,1]\}</math><br>
 
<math>  = \max\{n: \sum\limits_{i=1}^n \frac{-1}{\lambda} log(U_i)<=1 , U_i \sim U[0,1]\}</math><br>
 
<math>  = \max\{n: \prod\limits_{i=1}^n U_i >= e^{-\lambda}, U_i \sim U[0,1]\}</math><br>
 
<math>  = \max\{n: \prod\limits_{i=1}^n U_i >= e^{-\lambda}, U_i \sim U[0,1]\}</math><br>
</span>
+
 
 
'''Algorithm'''<br\>
 
'''Algorithm'''<br\>
 
*1. Set n=1, a=1<br\>
 
*1. Set n=1, a=1<br\>
Line 4,151: Line 4,214:
 
An alternate way to write an algorithm for Poisson is as followings:
 
An alternate way to write an algorithm for Poisson is as followings:
  
1)  x = 0, F = <math>P(X=0) = e^{-\lambda} = p</math>
+
1)  x = 0, F = P(X=0) = e^-λ = p
  
  
Line 4,160: Line 4,223:
  
  
4)  Else <math>p = \frac{p*\lambda}{x+1} </math>
+
4)  Else p = (λ/(x+1)) * p
  
 
     F = F + p
 
     F = F + p
Line 4,171: Line 4,234:
  
 
== Class 13 - Tuesday June 18th 2013 ==
 
== Class 13 - Tuesday June 18th 2013 ==
 +
'''Markov Chain'''
 +
<br>N-Step Transition Matrix: a matrix <math> P_n </math> whose elements are the probability of moving from state i to state j in n steps. <br/>
 +
<math>P_n (i,j)=Pr⁡(X_{m+n}=j|X_m=i)</math> <br/>
  
=== N-step Transition Matrix ===
+
Explanation: (with an example) Suppose there 10 states { 1, 2, ..., 10}, and suppose you are on state 2, then P<sub>8</sub>(2, 5) represent the probability of moving from state 2 to state 5 in 8 steps.
 
 
<b>Definition</b>: The <b><font color='red'>n-step transition matrix</font></b> is the matrix <math>P_n</math> whose elements are the probability of moving from state <math>i</math> to state <math>j</math> in <math>n</math> steps. <br><b> <math> \quad P_n(i,j) = Pr(X_{m+n)}=j | X_m = i)</math><br><math>P_n(i,j)</math></b> is called <b><font color='red'>n-steps transition probability</font></b>
 
 
 
  
 
One-step transition probability:<br/>
 
One-step transition probability:<br/>
The probability of  X<sub>n+1</sub> being in state j given that X<sub>n</sub> is in state, i is called the  
+
The probability of  X<sub>n+1</sub> being in state j given that X<sub>n</sub> is in state i is called the  
 
one-step transition probability  and is denoted by P<sub>i,j</sub><sup>n,n+1</sup>. That is <br/>
 
one-step transition probability  and is denoted by P<sub>i,j</sub><sup>n,n+1</sup>. That is <br/>
P<sub>i,j</sub><sup>n,n+1</sup> = Pr(X<sub>n+1</sub>=j|X<sub>n</sub>=i)
+
P<sub>i,j</sub><sup>n,n+1</sup> = Pr(X<sub>n+1</sub> =j/X<sub>n</sub> =i)
 
 
 
 
Two-step transition probability:<br/>
 
The probability of moving from state a to state a: <br/>
 
<math>P_2 (a,a)=Pr⁡ (X_{m+2}=a| X_m=a)=Pr⁡(X_{m+1}=a| X_m=a)Pr⁡(X_{m+2}=a|X_{m+1}=a)+ Pr⁡(X_{m+1}=b|X_m=a)Pr⁡(X_{m+2}=a|X_{m+1}=b)</math> <br/>
 
<math> =0.7(0.7)+0.3(0.2)=0.55 </math><br/>
 
 
 
 
 
N-step Transition Matrixi:<br/>
 
A matrix <math> P_n </math> whose elements are the probability of moving to state j from state i in n steps. <br/>
 
<math>P_n (i,j)=Pr⁡(X_{m+n}=j|X_m=i)</math>  is the n-step transition probability.<br/>
 
  
In general <math>P_n = P^n</math> with <math>P_n(i,j) \geq 0</math> and <math>\sum_{j} P_n(i,j) = 1</math><br/>     
+
Example from previous class: <br/>
 
 
'''Example from previous class:''' <br/>
 
  
 
<math> P= \left [ \begin{matrix}
 
<math> P= \left [ \begin{matrix}
 
 
0.7 & 0.3 \\
 
0.7 & 0.3 \\
 
0.2 & 0.8
 
0.2 & 0.8
 
 
\end{matrix} \right] </math>
 
\end{matrix} \right] </math>
  
Line 4,219: Line 4,267:
 
0.3  & 0.7
 
0.3  & 0.7
 
\end{matrix} \right] </math><br\>
 
\end{matrix} \right] </math><br\>
 +
 +
Interpretation:<br\>
 +
- If at time 0 we are in state 1, then the probability of us being in state 1 at time 2 is 0.55 and 0.45 for state 2.<br\>
 +
- If at time 0 we are in state 2, then the probability of us being in state 1 at time 2 is 0.3 and 0.7 for state 2.<br\>
  
 
<math>P_2 = P_1 P_1 </math><br\>
 
<math>P_2 = P_1 P_1 </math><br\>
  
<math>P_n = P_1^n </math><br\>
+
<math>P_3 = P_1 P_2 </math><br\>
  
The equation above is a special case of the Chapman-Kolmogorov Equations.<br />
+
<math>P_n = P_1 P_(n-1) </math><br\>
  
It is true because of the Markov property or the memoryless property of Markov chains, where the probabilities of going forward to the next state <br />
+
<math>P_n = P_1^n </math><br\>
only depends on your current state, not your previous states. By intuition, we can multiply the 1-step transition <br />
 
matrix n-times to get a n-step transition matrix.<br />
 
 
 
<br>
 
  
'''Example:''' <br>
 
We can see how <math>P_n = P^n</math> from the following:<br/>
 
<math>\mu_1=\mu_0\cdot P</math> <br/>
 
<math>\mu_2=\mu_1\cdot P</math> <br/>
 
<math>\mu_3=\mu_2\cdot P</math> <br/>
 
  
  
Therefore,
+
The two-step transition probability of moving from state a to state a:
 
<br/>
 
<br/>
<math>\mu_3=\mu_0\cdot P^3
+
<math>P_2 (a,a)=Pr⁡ (X_{m+2}=a| X_m=a)=Pr⁡(X_{m+1}=a| X_m=a)Pr⁡(X_{m+2}=a|X_{m+1}=a)+ Pr⁡(X_{m+1}=b|X_m=a)Pr⁡(X_{m+2}=a|X_{m+1}=b)</math> <br/>
</math> <br/>
 
  
<math>P_n(i,j)</math> is called N-steps Transition Probability. <br>
+
<math> =0.7(0.7)+0.3(0.2)=0.55 </math><br/>
<math>\mu_0 </math> is called the '''Initial Distribution'''. <br>
 
<math>\mu_n = \mu_0* P^n </math> <br />
 
<br>
 
  
'''Example 1:''' <br>
+
Another Example: <br/>
Consider a two-state Markov chain {<math>X_t; t = 0, 1, 2,...</math>} with states {1,2} and transition probability matrix
 
  
 
<math> P= \left [ \begin{matrix}
 
<math> P= \left [ \begin{matrix}
1/2 & 1/2 \\
+
1 & 0 \\
1/3 & 2/3
+
0.7 & 0.3
 +
\end{matrix} \right] </math>
 +
 
 +
The two step transition probability matrix is:
 +
 
 +
<math> P P= \left [ \begin{matrix}
 +
1 & 0 \\
 +
0.7 & 0.3
 +
\end{matrix} \right] \left [ \begin{matrix}
 +
1 & 0 \\
 +
0.7 & 0.3
 +
\end{matrix} \right] </math>=<math>\left [ \begin{matrix}
 +
1(1)+ 0(0.7) & 1(0) + 0(0.3)              \\
 +
1(0.7)+0.7(0.3) & 0(0.7)+0.3(0.3)
 +
\end{matrix} \right] </math>=<math>\left [ \begin{matrix}
 +
1 &  0                  \\
 +
0.91  & 0.09
 +
\end{matrix} \right] </math><br\>
 +
 
 +
This is the two-step transition matrix.
 +
 
 +
=== n-step transition matrix ===
 +
The elements of matrix P<sub>n</sub> (i.e. the ij<sub>th</sub> entry P<sub>ij</sub>) is the probability of moving to state j from state i in n steps
 +
 
 +
In general <math>P_n = P^n</math> with <math>P_n(i,j) \geq 0</math> and <math>\sum_{j} P_n(i,j) = 1</math><br />
 +
Note: <math>P_2 = P_1\times P_1; P_n = P^n</math><br />
 +
The equation above is a special case of the Chapman-Kolmogorov equations.<br />
 +
It is true because of the Markov property or the memoryless property of Markov chains, where the probabilities of going forward to the next state <br />
 +
only depends on your current state, not your previous states. By intuition, we can multiply the 1-step transition <br />
 +
matrix n-times to get a n-step transition matrix.<br />
 +
 +
Example: We can see how <math>P_n = P^n</math> from the following:
 +
<br/>
 +
<math>\vec{\mu_1}=\vec{\mu_0}\cdot P</math> <br/>
 +
<math>\vec{\mu_2}=\vec{\mu_1}\cdot P</math> <br/>
 +
<math>\vec{\mu_3}=\vec{\mu_2}\cdot P</math> <br/>
 +
Therefore,
 +
<br/>
 +
<math>\vec{\mu_3}=\vec{\mu_0}\cdot P^3
 +
</math> <br/>
 +
 
 +
<math>P_n(i,j)</math> is called n-steps transition probability. <br>
 +
<math>\vec{\mu_0} </math> is called the '''initial distribution'''. <br>
 +
<math>\vec{\mu_n} = \vec{\mu_0}* P^n </math> <br />
 +
 
 +
Example with Markov Chain:
 +
Consider a two-state Markov chain {<math>X_t; t = 0, 1, 2,...</math>} with states {1,2} and transition probability matrix
 +
 
 +
<math> P= \left [ \begin{matrix}
 +
1/2 & 1/2 \\
 +
1/3 & 2/3
 
\end{matrix} \right] </math>
 
\end{matrix} \right] </math>
  
Line 4,263: Line 4,351:
 
b)<math> P(X_2=1, X_1=1 |X_0=1) = P(X_2=1|X_1=1)*P(X_1=1|X_0=1)= 1/2 * 1/2 = 1/4 </math>
 
b)<math> P(X_2=1, X_1=1 |X_0=1) = P(X_2=1|X_1=1)*P(X_1=1|X_0=1)= 1/2 * 1/2 = 1/4 </math>
  
c)<math> P(X_2=1|X_0=1)= P_2(1,1) = 5/12 </math> (Start from state 1 and then get back to it in 3 steps) (1 ->1->1 = (1/2)*(1/2) or 1->2->1 = (1/2)*(1/3)) 
+
c)<math> P(X_2=1|X_0=1)= P_2(1,1) = 5/12 </math>
  
 
d)<math> P^2=P*P= \left [ \begin{matrix}
 
d)<math> P^2=P*P= \left [ \begin{matrix}
Line 4,269: Line 4,357:
 
7/18 & 11/18
 
7/18 & 11/18
 
\end{matrix} \right] </math>
 
\end{matrix} \right] </math>
 
'''Example 2:''' <br>
 
Consider a 3-state Markov chain {<math>X_t; t = 0, 1, 2,...</math> with states {1,2,3} and transition probability matrix
 
 
<math> P= \left [\begin{matrix}
 
1/4 & 1/2 & 1/4 \\
 
1/3 & 1/3 & 1/3 \\
 
1/7 & 2/7 & 4/7
 
\end{matrix} \right] </math>
 
 
Given <math> X_0=1 </math>. Compute the following:
 
 
a)<math> P(X_2=1, X_1=3 | X_0 = 2) = P(X_2=1|X_1=3)*P(X_1=3|X_0=2)= 2/7 * 1/4 = 1/14 </math>
 
 
b)<math>P(X_2=3|X_0=1) = P(X_2=3|X_1=1)*P(X_1=1|X_0=1)+P(X_2=3|X_1=2)*P(X_1=2|X_0=1)+P(X_2=3|X_1=3)*P(X_1=3|X_0=1)=(1/4)*
 
(1/7)+(1/3)*(2/7)+(1/7)*(4/7)=125/588 </math>
 
  
 
=== Marginal Distribution of Markov Chain ===
 
=== Marginal Distribution of Markov Chain ===
<span style="text-shadow: 0px 2px 3px hsl(310,15%,65%);margin-right:1em;font-family: 'Nobile', Helvetica, Arial, sans-serif;font-size:16px;line-height:25px;color:3399CC">We represent the probability of all states at time t with a vector <math>\underline{\mu_t}</math><br/>
+
We represent the probability of all states at time t with a vector <math>\underline{\mu_t}</math><br/>
<math>\underline{\mu_t}~=(\mu_t(1), \mu_t(2),...\mu_t(n))</math> where <math>\mu_t(1)</math> is the probability of being on state 1 at time t.<br/>
+
<math>\underline{\mu_t}~=(\mu_t(1), \mu_t(2),...\mu_t(n))</math> where <math>\underline{\mu_t(1)}</math> is the probability of being on state 1 at time t.<br/>
and in general, <math>\mu_t(i)</math> shows  the probability of being on state i at time t.<br/>
+
and in general, <math>\underline{\mu_t(i)}</math> shows  the probability of being on state i at time t.<br/>
 
For example, if there are two states a and b, then <math>\underline{\mu_5}</math>=(0.1, 0.9) means that the chance of being in state a at time 5 is 0.1 and the chance of being on state b at time 5 is 0.9. <br/>
 
For example, if there are two states a and b, then <math>\underline{\mu_5}</math>=(0.1, 0.9) means that the chance of being in state a at time 5 is 0.1 and the chance of being on state b at time 5 is 0.9. <br/>
 
If we generate a chain for many times, the frequency of states at each time shows marginal distribution of the chain at that time. <br/>
 
If we generate a chain for many times, the frequency of states at each time shows marginal distribution of the chain at that time. <br/>
 
The vector <math>\underline{\mu_0}</math> is called the initial distribution. <br/>
 
The vector <math>\underline{\mu_0}</math> is called the initial distribution. <br/>
</span>
+
 
<math> P_2~=P_1 P_1 </math> (as verified above)  
+
<math> P^2~=P\cdot P </math> (as verified above)  
  
 
In general,
 
In general,
<math> P_n~=(P_1)^n </math> **Note that <math>P_1</math> is equal to the matrix P <br/>
+
<math> P^n~= \Pi_{i=1}^{n} P</math> (P multiplied n times)<br/>
<math>\mu_n~=\mu_0 P_n</math><br/>
+
<math>\mu_n~=\mu_0 P^n</math><br/>
 
where <math>\mu_0</math> is the initial distribution,
 
where <math>\mu_0</math> is the initial distribution,
and <math>\mu_{m+n}~=\mu_m P_n</math><br/>
+
and <math>\mu_{m+n}~=\mu_m P^n</math><br/>
 
N can be negative, if P is invertible.
 
N can be negative, if P is invertible.
  
Line 4,321: Line 4,393:
 
<math>\mu_1~ = \mu_0P</math> <br>
 
<math>\mu_1~ = \mu_0P</math> <br>
 
<math>\mu_2~ = \mu_1P = \mu_0PP = \mu_0P^2</math> <br>
 
<math>\mu_2~ = \mu_1P = \mu_0PP = \mu_0P^2</math> <br>
Given the marginal distribustion at time n-1, we can compute the distribution at time n:<br>
+
 
<math>\mu_n~ = \mu_{n-1}~P</math> <br>
 
After repeating the algorithm starting at time 0, we can have the following relationship:<br>
 
 
In general, <math>\mu_n~ = \mu_0P^n</math><br />
 
In general, <math>\mu_n~ = \mu_0P^n</math><br />
 
Property: If <math>\mu_n~\neq\mu_t~</math>(for any t less than n), then we say P does not converge. <br />
 
Property: If <math>\mu_n~\neq\mu_t~</math>(for any t less than n), then we say P does not converge. <br />
Line 4,335: Line 4,405:
  
  
<math>\pi</math> is stationary distribution of the transition matrix P   if <math>\pi \cdot </math> P = <math>\pi</math>
+
<math>\pi</math> is stationary distribution of the chain if <math>\pi</math>P = <math>\pi</math> In other words, a stationary distribution is when the markov process that have equal probability of moving to other states as its previous move.
 +
 
 +
where <math>\pi</math> is a probability vector <math>\pi</math>=(<math>\pi</math><sub>i</sub> | <math>i \in X</math>) such that all the entries are nonnegative and sum to 1. It is the eigenvector in this case.
 +
 
 +
In other words, if X''<sub>0</sub>'' is draw from <math>\pi</math>. Then marginally, X''<sub>n</sub>'' is also drawn from the same distribution <math>\pi</math> for every n≥0.
 +
 
 +
The above conditions are used to find the stationary distribution
 +
In matlab, we could use <math>P^n</math> to find the stationary distribution.(n is usually larger than 100)<br/>
  
where <math>\pi</math> is a probability vector <math>\pi</math>=(<math>\pi</math><sub>i</sub> | <math>i \in X</math>) such that, all the entries are nonnegative and sum to 1.
 
  
In other words, if X''<sub>0</sub>'' is drawn from <math>\pi</math>. Generally, X''<sub>n</sub>'' is drawn from the same distribution as <math>\pi</math> for every n≥0.
+
'''Comments:'''<br/>
 +
As n gets bigger and bigger, <math>\mu_n</math> will possibly stop changing, so the quantity <math>\pi</math> <sub>i</sub> can also be interpreted as the limiting probability that the chain is in the state <math>j</math>
  
Example: consider the following transition matrix<br>
+
Comments: <br/>
 +
1. <math>\pi</math> may not exist and even if it exists, it may not always be unique. <br/>
 +
2. If <math>\pi</math> exists and is unique, then <math>\pi</math><sub>i</sub> is called the long-run proportion of the process in state i and the stationary distribution is also the limiting distribution of the process.<br/>
  
<math> P= \left [ \begin{matrix}
+
How long do you have to wait until you reach a steady sate?
 +
Ans: There is not clear way to find that out
  
0 & 1 \\
+
How do you increase the time it takes to reach the steady state?
1 & 0
+
Ans: Make the probabilities of transition much smaller, to reach from state 0 to state 1 and vice-versa p=0.005. And make the probabilities of staying in the same state extremely high. To stay in state 0 or state 1 p=0.995, then the matrix is in a "sticky state"
  
\end{matrix} \right] </math>
 
  
<span> so how to compute it: use pi = pi * P</span>
+
EXAMPLE : Random Walk on the cycle S={0,1,2}
Intuitively, the chain spends half of the time in each of the states, so <math>\pi</math> = (1/2,1/2)
 
  
Comments:<br/>
+
<math>P^2 = \left[ \begin{array}{ccc}
1. As n gets bigger and bigger, <math>\mu_n</math> will possibly stop changing, so the quantity <math>\pi</math> <sub>i</sub> can also be interpreted as the limiting probability that the chain is in the state <math>j</math> <br>
+
2pq & q^2 & p^2 \\
2. <math>\pi</math> may not exist and even if it exists, it may not always be unique. <br/>
+
p^2 & 2pq & q^2 \\
3. If <math>\pi</math> exists and is unique, then <math>\pi</math><sub>i</sub> is called the long-run proportion of the process in state i and the stationary distribution is also the limiting distribution of the process.<br/>
+
q^2 & p^2 & 2pq \end{array} \right]</math>
  
 +
Suppose<br/>
 +
<math>P(x_0=0)=\frac{1}{4}</math><br/>
 +
<math>P(x_0=1)=\frac{1}{2}</math><br/>
 +
<math>P(x_0=2)=\frac{1}{4}</math><br/>
 +
Thus<br/>
 +
<math>\pi_0 = \left[ \begin{array}{c} \frac{1}{4} \\ \frac{1}{2} \\ \frac{1}{4} \end{array} \right]</math><br/>
 +
so<br/>
 +
<math>\,\pi^2 = \pi_0 * P^2 </math>
 +
<math>= \left[ \begin{array}{c} \frac{1}{4} \\ \frac{1}{2} \\ \frac{1}{4} \end{array} \right] * \left[ \begin{array}{ccc}
 +
2pq & q^2 & p^2 \\
 +
p^2 & 2pq & q^2 \\
 +
q^2 & p^2 & 2pq \end{array} \right]</math>
 +
<math>= \left[ \begin{array}{c} \frac{1}{2}pq + \frac{1}{2}p^2+\frac{1}{4}q^2 \\ \frac{1}{4}q^2+pq+\frac{1}{4}p^2 \\ \frac{1}{4}p^2+\frac{1}{2}q^2+\frac{1}{2}pq\end{array} \right]</math>
  
'''MatLab Code'''
+
==== MatLab Code ====
 
<pre style='font-size:14px'>
 
<pre style='font-size:14px'>
  
Line 4,409: Line 4,500:
 
     0.4000    0.6000
 
     0.4000    0.6000
 
     0.4000    0.6000
 
     0.4000    0.6000
 +
  
 
</pre>
 
</pre>
 +
The definition of stationary distribution is that <math>\pi</math> is the stationary distribution of the chain if <math>\pi=\pi~P</math>, where <math>\pi</math> is a probability vector. For every n<math>>=</math>0.
  
 +
However, just because X<sub>''n''</sub> ~ <math>\pi</math> for every n<math>>=</math>0 does ''not'' mean every state is independently identically distributed.
  
'''An alternate Method of Computing the Stationary Distribution''' <br>
+
'''Limiting distribution''' of the chain refers the transition matrix that reaches the stationary state. If the lim(n-> infinite)P^n -> c, where c is a constant, then, we say this Markov chain is coverage;  otherwise, it's not coverage.
  
Recall that if <math>\lambda v = A v</math>, then <math>\lambda</math> is the eigenvalue of <math>A</math> corresponding to the eigenvector <math>v</math><br>
+
Example: Find the stationary distribution of P= <math>\left[ {\begin{array}{ccc}
 +
1/3 & 1/3 & 1/3 \\
 +
1/4 & 3/4 & 0 \\
 +
1/2 & 0 & 1/2 \end{array} } \right]</math>
 +
 
 +
Solution:
 +
<math>\pi=\left[ {\begin{array}{ccc}
 +
\pi_0 & \pi_1 & \pi_2 \end{array} } \right]</math>
 +
 
 +
Using the stationary distribution property <math>\pi=\pi~P</math> we get, <br>
 +
<math>\pi_0=\frac{1}{3}\pi_0+\frac{1}{4}\pi_1+\frac{1}{2}\pi_2 </math><br>
 +
<math>\pi_1=\frac{1}{3}\pi_0+\frac{3}{4}\pi_1+0\pi_2 </math><br>
 +
<math>\pi_2=\frac{1}{3}\pi_0+0\pi_1+\frac{1}{2}\pi_2 </math><br>
 +
 
 +
And since <math>\pi</math> is a probability vector, <br>
 +
<math> \pi_{0}~ + \pi_{1} + \pi_{2} = 1 </math>
 +
 
 +
Solving the 4 equations for the 3 unknowns gets, <br>
 +
<math>\pi_{0}~=1/3</math>, <math>\pi_{1}~=4/9</math>, and <math>\pi_{2}~=2/9</math> <br>
 +
Therefore <math>\pi=\left[ {\begin{array}{ccc}
 +
1/3 & 4/9 & 2/9 \end{array} } \right]</math>
 +
 
 +
Example 2: Find the stationary distribution of P= <math>\left[ {\begin{array}{ccc}
 +
1/3 & 1/3 & 1/3 \\
 +
1/4 & 1/2 & 1/4 \\
 +
1/6 & 1/3 & 1/2 \end{array} } \right]</math>
 +
 
 +
Solution:
 +
<math>\pi=\left[ {\begin{array}{ccc}
 +
\pi_0 & \pi_1 & \pi_2 \end{array} } \right]</math>
 +
 
 +
Using the stationary distribution property <math>\pi=\pi~P</math> we get, <br>
 +
<math>\pi_0=\frac{1}{3}\pi_0+\frac{1}{4}\pi_1+\frac{1}{6}\pi_2 </math><br>
 +
<math>\pi_1=\frac{1}{3}\pi_0+\frac{1}{2}\pi_1+\frac{1}{3}\pi_2 </math><br>
 +
<math>\pi_2=\frac{1}{3}\pi_0+\frac{1}{4}\pi_1+\frac{1}{2}\pi_2 </math><br>
 +
 
 +
And since <math>\pi</math> is a probability vector, <br>
 +
<math> \pi_{0}~ + \pi_{1} + \pi_{2} = 1 </math>
 +
 
 +
Solving the 4 equations for the 3 unknowns gets, <br>
 +
<math>\pi_{0}=\frac {6}{25}</math>, <math>\pi_{1}~=\frac {2}{5}</math>, and <math>\pi_{2}~=\frac {9}{25}</math> <br>
 +
Therefore <math>\pi=\left[ {\begin{array}{ccc}
 +
\frac {6}{25} & \frac {2}{5} & \frac {9}{25} \end{array} } \right]</math>
 +
 
 +
The above two examples are designed to solve for the stationary distribution of the matrix P however they also give us the limiting distribution of the matrices as we have mentioned earlier that the stationary distribution is equivalent to the limiting distribution.
 +
 
 +
'''Alternate Method of Computing the Stationary Distribution''' <br>
 +
 
 +
Recall that if <math>\lambda v = A v</math>, then <math>\lambda</math> is the eigenvalue of <math>A</math> corresponding to the eigenvector <math>v</math><br>
  
 
By definition of stationary distribution,  <math>\pi = \pi  P</math><br>
 
By definition of stationary distribution,  <math>\pi = \pi  P</math><br>
Line 4,425: Line 4,567:
  
 
It is thus possible to compute the stationary distribution by taking the eigenvector of the transpose of the transition matrix corresponding to 1, and normalize it such that all elements are non-negative and sum to one so that the elements satisfy the definition of a stationary distribution. The transformed vector is still an eigenvector since a linear transformation of an eigenvector is still within the eigenspace. Taking the transpose of this transformed eigenvector gives the stationary distribution. <br>  
 
It is thus possible to compute the stationary distribution by taking the eigenvector of the transpose of the transition matrix corresponding to 1, and normalize it such that all elements are non-negative and sum to one so that the elements satisfy the definition of a stationary distribution. The transformed vector is still an eigenvector since a linear transformation of an eigenvector is still within the eigenspace. Taking the transpose of this transformed eigenvector gives the stationary distribution. <br>  
 +
 +
 +
  
 
<span style="background:#F5F5DC">
 
<span style="background:#F5F5DC">
 
Generating Random Initial distribution<br>
 
Generating Random Initial distribution<br>
 
<math>\mu~=rand(1,n)</math><br>
 
<math>\mu~=rand(1,n)</math><br>
<math>\mu~=\mu/\Sigma(\mu)</math></span>
+
<math>\mu~=\frac{\mu}{\Sigma(\mu)}</math></span>
  
 
<span style="background:#F5F5DC">
 
<span style="background:#F5F5DC">
 
Doubly Stochastic Matrices<br></span>
 
Doubly Stochastic Matrices<br></span>
 
We say that the transition matrix <math>\, P=(p_{ij})</math> is doubly stochastic if both rows and columns sum to 1, i.e.,<br>
 
We say that the transition matrix <math>\, P=(p_{ij})</math> is doubly stochastic if both rows and columns sum to 1, i.e.,<br>
<math>\, \sum_{i} p_{ij} = \sum_{j} p_{ij} = 1 </math><br>
+
<math>\, \sum_{i} p_{ji} = \sum_{j} p_{ij} = 1 </math><br>
 
It is easy to show that the stationary distribution of an nxn doubly stochastic matrix P is:<br>
 
It is easy to show that the stationary distribution of an nxn doubly stochastic matrix P is:<br>
 
<math> (\frac{1}{n}, \ldots , \frac{1}{n}) </math>
 
<math> (\frac{1}{n}, \ldots , \frac{1}{n}) </math>
Line 4,441: Line 4,586:
  
 
A Markov chain is a random process usually characterized as '''memoryless''': the next state depends only on the current state and not on the sequence of events that preceded it. This specific kind of "memorylessness" is called the Markov property. Markov chains have many applications as statistical models of real-world processes.
 
A Markov chain is a random process usually characterized as '''memoryless''': the next state depends only on the current state and not on the sequence of events that preceded it. This specific kind of "memorylessness" is called the Markov property. Markov chains have many applications as statistical models of real-world processes.
 
'''Formal Definition:''' <br>
 
A ''Markov Chain'' consists of a countable (possibly finite) set ''S'' (called the state space) together with a countable family of random variables X<sub>0</sub>,X<sub>1</sub>,X<sub>2</sub>,X<sub>3</sub>,... with values in S such that:<br>
 
  <math>P(X_{t+1}=s \vert X_t=s_t,X_{t-1}=s_{t-1},...,X_0=s_0)=P(X_{t+1}=s \vert X_t=s_t)</math><br>
 
and we refer to this fundamental equation as the ''Markov property''. Note the sufficient property here is that the closest future state is independent of the past states. It is not necessary to be dependent on the current state. Here are some more properties of Markov Chain:<br>
 
  
 
1. Reducibility <br>
 
1. Reducibility <br>
Line 4,461: Line 4,601:
 
<math>T_i = \min\{n \geq 1:X_n=i \vert X_0=i\}</math><br />
 
<math>T_i = \min\{n \geq 1:X_n=i \vert X_0=i\}</math><br />
  
== Class 14 - Thursday June 20th 2013 ==
+
(The properties are from
 +
http://www2.math.uu.se/~takis/L/McRw/mcrw.pdf)
  
== Properties of Markov Chain (continued) ==
+
CHAPMAN-KOLMOGOROV EQUATION
<span style="text-shadow: 0px 2px 3px hsl(310,15%,65%);margin-right:1em;font-family: 'Nobile', Helvetica, Arial, sans-serif;font-size:16px;line-height:25px;color:3399CC">4. Ergodicity<br>
+
For all <math>n</math> and <math>m</math>, and any state <math>i</math> and <math>j</math>,
If state i is aperiodic and positive recurrent, state i is said to be ergodic. In other words, state i is ergodic if it has a period of 1 and has finite mean recurrence time. Consequently, an irreducible Markov Chain is said to be ergodic if all states in the chain are ergodic. <br>
+
<math>P^{n+m}(X_n+m = j \vert X_0 =i)= \sum_{k} P^n(X_1 = k \vert X_0 = i)*P^m(X_1 = j \vert X_0 =k)</math>
  
 +
== Class 14 - Thursday June 20th 2013 ==
  
In statistics, the term describes a random process for which the time average of one sequence of events is the same as the ensemble average.
+
Example: Find the stationary distribution of <math> P= \left[ {\begin{array}{ccc}
source: wikipedia
+
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\[6pt]
An extra note here is that if a finite state irreducible Markov chain has an aperiodic state, then it is ergodic. If there is a finite number N such that any state can be reached from any other state in exactly N steps, a model is said to have the ergodic property. For example, if we have a fully connected transition matrix where all transitions have a non-zero probability, this condition is fulfilled with N=1. A model with more than one state and just one out-going transition per state cannot be ergodic. We will have more examples to follow later on.<br>
+
\frac{1}{4} & \frac{3}{4} & 0 \\[6pt]
</span>
+
\frac{1}{2} & 0 & \frac{1}{2} \end{array} } \right]</math>
5. Steady-state analysis and limiting distributions<br>
 
Definition: If a Markov Chain is time-homogeneous, then the vector  <math>\boldsymbol{\pi}</math>is called a '''stationary distribution''' if <math>\forall j \in S</math> it satisfies:<br>
 
    1)<math>0\leq\pi_j\leq1</math><br>
 
  
    2)<math>\sum_{j \in S}\pi_j = 1</math><br>
+
<math>\displaystyle \pi=\pi  p</math>
  
    3)<math>\pi_j = \sum_{i \in S} \pi_i p_{ij}</math><br>
+
Solve the system of linear equations to find a stationary distribution
  
If all states of an irreducible chain are positive recurrent, it has a stationary distribution. In this case,  <math>\boldsymbol{\pi}</math> is unique. It is also related to the expected return time:<br>
+
<math>\displaystyle \pi=(\frac{1}{3},\frac{4}{9}, \frac{2}{9})</math>
  
    <math>\pi_j = \frac{C}{M_j}\,, </math><br>
+
Note that <math>\displaystyle \pi=\pi  p</math> looks similar to eigenvectors/values <math>\displaystyle \lambda vec{u}=A vec{u}</math>
  
where C is the normalizing constant. <br>
+
<math>\pi</math> can be considered as an eigenvector of P with eigenvalue = 1. But note that the vector <math>vec{u}</math> is a column vector and o we need to transform our <math>\pi</math> into a column vector.
  
If the chain is both irreducible and aperiodic, then for any i and j,<br>
+
<math>=> \pi</math><sup>T</sup>= P<sup>T</sup><math>\pi</math><sup>T</sup><br/>
 +
Then <math>\pi</math><sup>T</sup> is an eigenvector of P<sup>T</sup> with eigenvalue = 1. <br />
 +
MatLab tips:[V D]=eig(A), where D is a diagonal matrix of eigenvalues and V is a matrix of eigenvectors of matrix A<br />
 +
==== MatLab Code ====
 +
<pre style='font-size:14px'>
  
    <math>\lim_{n \rarr \infty} p_{ij}^{(n)} = \frac{C}{M_j} </math><br>
+
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2]
  
'''Final result:''' <br>
+
pii = [1/3 4/9 2/9]
<math>\boldsymbol{\pi}</math> is called the equilibrium distribution of the chain if the chain converges to the stationary distribution regardless of where it begins.<br>
 
  
Source: https://en.wikipedia.org/wiki/Markov_chain
+
[vec val] = eig(P')            %% P' is the transpose of matrix P
 +
 +
vec(:,1) = [-0.5571 -0.7428 -0.3714]      %% this is in column form
  
== Examples of finding stationary distribution ==
+
a = -vec(:,1)
 
 
We will have examples for this later. <br>
 
Example: Find the stationary distribution of <math> P= \left[ {\begin{array}{ccc}
 
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\[6pt]
 
\frac{1}{4} & \frac{3}{4} & 0 \\[6pt]
 
\frac{1}{2} & 0 & \frac{1}{2} \end{array} } \right]</math>
 
  
Solve the system of linear equations to find the Stationary Distribution<br>
+
>> a =  
<br>
+
[0.5571 0.7428 0.3714]   
<math>\pi_0=\frac{1}{3}\pi_0+\frac{1}{4}\pi_1+\frac{1}{2}\pi_2 </math><br>
 
<math>\pi_1=\frac{1}{3}\pi_0+\frac{3}{4}\pi_1+0\pi_2 </math><br>
 
<math>\pi_2=\frac{1}{3}\pi_0+0\pi_1+\frac{1}{2}\pi_2 </math><br>
 
<math>\pi_{0}~ + \pi_{1} + \pi_{2} = 1 </math><br>
 
<br>
 
Solving the 4 equations, <br>
 
<math>\pi_{0}=\frac {1}{3}</math>, <math>\pi_{1}~=\frac {4}{9}</math>, and <math>\pi_{2}~=\frac {2}{9}</math> <br>
 
  
Therefore
+
%% a is in column form
<math>\pi=(\frac{1}{3},\frac{4}{9}, \frac{2}{9})</math>
 
  
Similarly, this can be achieved by calculating <br/>
+
%% Since we want this vector a to sum to 1, we have to scale it
<math>P^{30}=\left[ {\begin{array}{ccc}
 
\frac{1}{3} & \frac{4}{9} & \frac{2}{9} \\[6pt]
 
\frac{1}{3} & \frac{4}{9} & \frac{2}{9} \\[6pt]
 
\frac{1}{3} & \frac{4}{9} & \frac{2}{9} \end{array} } \right]</math><br/>
 
Which produces the same result as solving the systems of equations.
 
  
Alternatively, the system of equations can also be solved using matrix simplification. Namely, when the system of equations is more complicated and doesn't simplify easily. Continuing from the above example, we would have [-2/3 1/4 1/2 | 0]
+
b = a/sum(a)
[1/3 -1/4 0 | 0]
 
[1/3 0 -1/2 | 0]
 
[1 1 1 | 1]
 
  
Simplifying this matrix (from math 136), we can easily obtain the results for <math>\pi_0</math>, <math>\pi_1</math>, and <math>\pi_2</math>.
+
>> b =
 +
[0.3333 0.4444 0.2222] 
  
<math>\lambda u=A u</math>
+
%% b is also in column form
  
<math>\pi</math> can be considered as an eigenvector of P with eigenvalue = 1.
+
%% Observe that b' = pii
  
But the vector u here needs to be a column vector. So we need to transform <math>\pi</math> into a column vector.
+
</pre>
 
+
</br>
<math>\pi</math><sup>T</sup>= P<sup>T</sup><math>\pi</math><sup>T</sup>
+
==== Limiting distribution ====
 
+
A Markov chain has limiting distribution <math>\pi</math> if
Then <math>\pi</math><sup>T</sup> is an eigenvector of P<sup>T</sup> with eigenvalue = 1. <br />
 
MatLab tips:[V D]=eig(A), where D is a diagonal matrix of eigenvalues and V is a matrix of eigenvectors of matrix A<br />
 
 
 
 
 
==== Limiting Distribution ====
 
A Markov chain has '''Limiting Distribution''' <math>\pi</math> if
 
  
 
<math>\lim_{n\to \infty} P^n= \left[ {\begin{array}{ccc}
 
<math>\lim_{n\to \infty} P^n= \left[ {\begin{array}{ccc}
\pi \\
+
\pi_1 \\
 
\vdots \\
 
\vdots \\
\pi \\
+
\pi_n \\
 
\end{array} } \right]</math>
 
\end{array} } \right]</math>
  
 
That is <math>\pi_j=\lim[P^n]_{ij}</math> exists and is independent of i.<br/>  
 
That is <math>\pi_j=\lim[P^n]_{ij}</math> exists and is independent of i.<br/>  
  
A Markov Chain is convergent if and only if its limiting distribution exists. <br/>
+
A Markov Chain is convergent if and only if its limiting distribution exists. <br/>
 
 
If the Limiting Distribution <math>\pi</math> exists, it must be equal to the Stationary Distribution.<br/>
 
  
 +
If the limiting distribution <math>\pi</math> exists, it must be equal to the stationary distribution.<br/>
  
In general, there are chains with Stationary distributions that don't converge, which means that they have Stationary Distributions but are not limiting. Converge means the the limit extend to a certain number when n -> infinite.<br/>
+
This convergence means that,in the long run(n to infinity),the probability of finding the <br/>
 
+
Markov chain in state j is approximately <math>\pi_j</math> no matter in which state <br/>
 +
the chain began at time 0. <br/>
  
 
'''Example:'''
 
'''Example:'''
 
+
For a transition matrix <math> P= \left [ \begin{matrix}
<math> P= \left [ \begin{matrix}
+
0 & 1 & 0 \\[6pt]
\frac{4}{5} & \frac{1}{5} & 0 & 0 \\[6pt]
+
0 & 0 & 1 \\[6pt]
\frac{1}{5} & \frac{4}{5} & 0 & 0 \\[6pt]
+
1 & 0 & 0 \\[6pt]
0 & 0 & \frac{4}{5} & \frac{1}{5} \\[6pt]
 
0 & 0 & \frac{1}{10} & \frac{9}{10} \\[6pt]
 
 
\end{matrix} \right] </math>
 
\end{matrix} \right] </math>
 +
, find stationary distribution.<br/>
 +
We have:<br/>
 +
<math>0\times \pi_0+0\times \pi_1+1\times \pi_2=\pi_0</math><br/>
 +
<math>1\times \pi_0+0\times \pi_1+0\times \pi_2=\pi_1</math><br/>
 +
<math>0\times \pi_0+1\times \pi_1+0\times \pi_2=\pi_2</math><br/>
 +
<math>\,\pi_0+\pi_1+\pi_2=1</math><br/>
 +
this gives <math>\pi = \left [ \begin{matrix}
 +
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\[6pt]
 +
\end{matrix} \right] </math> <br/>
 +
However, there does not exist a limiting distribution. <math> \pi </math> is stationary but is not limiting.<br/>
 +
<br/>
 +
In general, there are chains with stationery distributions that don't converge, this means that they have stationary distribution but are not limiting.<br/>
  
=== MatLab Code ===
<pre style='font-size:14px'>
MATLAB
>> P=[0, 1, 0;0, 0, 1; 1, 0, 0]

P =

     0     1     0
     0     0     1
     1     0     0

>> pii=[1/3, 1/3, 1/3]

pii =

    0.3333    0.3333    0.3333

>> pii*P

ans =

    0.3333    0.3333    0.3333

>> P^1000

ans =

     0     1     0
     0     0     1
     1     0     0

>> P^10000

ans =

     0     1     0
     0     0     1
     1     0     0

>> P^10002

ans =

     1     0     0
     0     1     0
     0     0     1

>> P^10003

ans =

     0     1     0
     0     0     1
     1     0     0

>> % P^10000 = P^10003
>> % This chain does not have a limiting distribution; it has a stationary distribution.

This chain does not converge; it has a cycle.
</pre>

The first condition of the limiting distribution is satisfied; however, the second condition, that <math>\pi</math><sub>j</sub> has to be independent of i (i.e. all rows of the matrix are the same), is not met.<br>

This example shows the distinction between having a stationary distribution and convergence (having a limiting distribution). Note: <math>\pi=(1/3,1/3,1/3)</math> is the stationary distribution, since <math>\pi=\pi P</math>. However, upon repeatedly multiplying P by itself (computing <math>P^n</math> as n goes to infinity), the results cycle (with period 3) through the same sequence of matrices. The chain has a stationary distribution but does not converge to it; thus, there is no limiting distribution.<br>
'''Example:'''

<math> P= \left [ \begin{matrix}
\frac{4}{5} & \frac{1}{5} & 0 & 0 \\[6pt]
\frac{1}{5} & \frac{4}{5} & 0 & 0 \\[6pt]
0 & 0 & \frac{4}{5} & \frac{1}{5} \\[6pt]
0 & 0 & \frac{1}{10} & \frac{9}{10} \\[6pt]
\end{matrix} \right] </math>

This chain converges (the powers <math>P^n</math> settle down), but it has no limiting distribution: the rows of the limit are not all the same, so the limit depends on the starting state and the chain does not converge to a single stationary distribution.<br />
<br />
'''Doubly stochastic matrix:''' a doubly stochastic matrix is a matrix in which all columns sum to 1 and all rows sum to 1.<br />
If a given transition matrix is an n by n doubly stochastic matrix, then the stationary distribution has all elements equal to 1/n.<br/>
<br/>
Example:<br/>
For a transition matrix <math> P= \left [ \begin{matrix}
0 & \frac{1}{2} & \frac{1}{2} \\[6pt]
\frac{1}{2} & 0 & \frac{1}{2} \\[6pt]
\frac{1}{2} & \frac{1}{2} & 0 \\[6pt]
\end{matrix} \right] </math>,<br/>
we have:<br/>
<math>0\times \pi_0+\frac{1}{2}\times \pi_1+\frac{1}{2}\times \pi_2=\pi_0</math><br/>
<math>\frac{1}{2}\times \pi_0+0\times \pi_1+\frac{1}{2}\times \pi_2=\pi_1</math><br/>
<math>\frac{1}{2}\times \pi_0+\frac{1}{2}\times \pi_1+0\times \pi_2=\pi_2</math><br/>
<math>\pi_0+\pi_1+\pi_2=1</math><br/>
The stationary distribution is <math>\pi = \left [ \begin{matrix}
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\[6pt]
\end{matrix} \right] </math>, i.e. 1/n with n = 3, as expected for a doubly stochastic matrix. <br/>
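A quick MATLAB check of this fact (a sketch, not from the lecture) verifies that the uniform vector is stationary for this doubly stochastic matrix:

<pre style='font-size:14px'>
% Verify that the uniform distribution is stationary for a doubly
% stochastic transition matrix.
P = [0, 1/2, 1/2; 1/2, 0, 1/2; 1/2, 1/2, 0];

sum(P,1)    % every column sums to 1
sum(P,2)    % every row sums to 1

pii = [1/3, 1/3, 1/3];
pii*P       % returns [0.3333 0.3333 0.3333], i.e. pii itself
</pre>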
  
<span style="font-size:20px;color:red">The following contents are problematic. Please correct it if possible.</span><br />
Suppose we're given that the limiting distribution <math> \pi </math> exists for the stochastic matrix P, that is, <math> \pi = \pi P </math>. <br>

WLOG assume P is diagonalizable (if not, we can always consider the Jordan form, and the computation below is exactly the same). <br>

Let <math> P^T = U \Sigma U^{-1} </math> be the eigenvalue decomposition of <math> P^T </math>, where <math>\Sigma = diag(\lambda_1,\ldots,\lambda_n)</math> with <math> |\lambda_i| > |\lambda_j|, \forall i < j </math>, so that <math>\lambda_1 = 1</math> dominates.<br>

Write the initial distribution in the eigenbasis: <math> \pi^T = \sum_i a_i u_i </math>, where <math> a_i \in \mathbb{R} </math> and the <math> u_i </math> are eigenvectors of <math> P^T </math> for <math> i = 1,\ldots,n </math>. <br>

After k steps, the distribution is <math> (\pi P^k)^T = (P^T)^k \pi^T = \sum_i a_i \lambda_i^k u_i </math>. <br>

Therefore <math> \lim_{k \rightarrow \infty} (\pi P^k)^T = \lim_{k \rightarrow \infty} \sum_i a_i \lambda_i^k u_i = a_1 u_1 </math>, since every other <math>|\lambda_i| < 1</math>; the limit is proportional to <math>u_1</math>, the eigenvector corresponding to eigenvalue 1.
  
=== MatLab Code ===
<pre style='font-size:14px'>
>> P=[1/3, 1/3, 1/3; 1/4, 3/4, 0; 1/2, 0, 1/2]      % We input a matrix P. This is the same matrix as last class.

P =

    0.3333    0.3333    0.3333
    0.2500    0.7500         0
    0.5000         0    0.5000

>> P^2

ans =

    0.3611    0.3611    0.2778
    0.2708    0.6458    0.0833
    0.4167    0.1667    0.4167

>> P^3

ans =

    0.3495    0.3912    0.2593
    0.2934    0.5747    0.1319
    0.3889    0.2639    0.3472

>> P^10

ans =

    0.3341    0.4419    0.2240
    0.3314    0.4507    0.2179
    0.3360    0.4358    0.2282

>> P^100                                            % The stationary distribution is [0.3333 0.4444 0.2222], since the values no longer change.

ans =

    0.3333    0.4444    0.2222
    0.3333    0.4444    0.2222
    0.3333    0.4444    0.2222

>> [vec val]=eigs(P')                               % We can find the eigenvalues and eigenvectors from the transpose of matrix P.

vec =

   -0.5571    0.2447    0.8121
   -0.7428   -0.7969   -0.3324
   -0.3714    0.5523   -0.4797

val =

    1.0000         0         0
         0    0.6477         0
         0         0   -0.0643

>> a=-vec(:,1)                                      % The eigenvectors can be multiplied by (-1), since λV=AV can be written as λ(-V)=A(-V)

a =

    0.5571
    0.7428
    0.3714

>> sum(a)

ans =

    1.6713

>> a/sum(a)                                         % Normalize the eigenvector so that its entries sum to 1.

ans =

    0.3333
    0.4444
    0.2222
</pre>

Here <math>\pi_j = \lim_{n \to \infty}[P^n]_{ij}</math> exists and is independent of i: every row of <math>P^{100}</math> equals the stationary distribution.
 
  
 
Another example:

So, <math>\pi = [\frac{1}{2}, \frac{1}{4}, \frac{1}{4}]</math> <br>
  
The definition of a stationary distribution: <math>\pi</math> is the stationary distribution of the chain if <math>\pi=\pi P</math>, where <math>\pi</math> is a probability vector, for every n <math>\geq</math> 0.

However, just because X<sub>''n''</sub> ~ <math>\pi</math> for every n <math>\geq</math> 0 does ''not'' mean the X<sub>''n''</sub> are independently identically distributed.

The '''limiting distribution''' of the chain describes where the transition matrix settles in the long run: if <math>\lim_{n\to\infty}P^n</math> exists (each entry tends to a constant), we say the Markov chain is convergent; otherwise, it is not convergent.
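Numerically, convergence can be checked by comparing successive powers of the transition matrix; a small MATLAB sketch (not from the lecture):

<pre style='font-size:14px'>
% Declare a chain convergent (numerically) when P^n stops changing.
P = [1/3, 1/3, 1/3; 1/4, 3/4, 0; 1/2, 0, 1/2];
n = 100;
norm(P^(n+1) - P^n)   % ~0: this chain is convergent

Q = [0, 1, 0; 0, 0, 1; 1, 0, 0];   % the cyclic chain from earlier
norm(Q^(n+1) - Q^n)   % stays large for every n: Q is not convergent
</pre>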
 
  
Example: Find the stationary distribution of P= <math>\left[ {\begin{array}{ccc}
1/3 & 1/3 & 1/3 \\
1/4 & 3/4 & 0 \\
1/2 & 0 & 1/2 \end{array} } \right]</math>

Solution:
<math>\pi=\left[ {\begin{array}{ccc}
\pi_0 & \pi_1 & \pi_2 \end{array} } \right]</math>

Using the stationary distribution property <math>\pi=\pi P</math> we get: <br>
<math>\pi_0=\frac{1}{3}\pi_0+\frac{1}{4}\pi_1+\frac{1}{2}\pi_2 </math><br>
<math>\pi_1=\frac{1}{3}\pi_0+\frac{3}{4}\pi_1+0\pi_2 </math><br>
<math>\pi_2=\frac{1}{3}\pi_0+0\pi_1+\frac{1}{2}\pi_2 </math><br>

And since <math>\pi</math> is a probability vector, <br>
<math> \pi_{0}~ + \pi_{1} + \pi_{2} = 1 </math>

Solving the 4 equations for the 3 unknowns gives <br>
<math>\pi_{0}~=1/3</math>, <math>\pi_{1}~=4/9</math>, and <math>\pi_{2}~=2/9</math> <br>
Therefore <math>\pi=\left[ {\begin{array}{ccc}
1/3 & 4/9 & 2/9 \end{array} } \right]</math>

Example 2: Find the stationary distribution of P= <math>\left[ {\begin{array}{ccc}
1/3 & 1/3 & 1/3 \\
1/4 & 1/2 & 1/4 \\
1/6 & 1/3 & 1/2 \end{array} } \right]</math>

Solution:
<math>\pi=\left[ {\begin{array}{ccc}
\pi_0 & \pi_1 & \pi_2 \end{array} } \right]</math>

Using the stationary distribution property <math>\pi=\pi P</math> we get: <br>
<math>\pi_0=\frac{1}{3}\pi_0+\frac{1}{4}\pi_1+\frac{1}{6}\pi_2 </math><br>
<math>\pi_1=\frac{1}{3}\pi_0+\frac{1}{2}\pi_1+\frac{1}{3}\pi_2 </math><br>
<math>\pi_2=\frac{1}{3}\pi_0+\frac{1}{4}\pi_1+\frac{1}{2}\pi_2 </math><br>

And since <math>\pi</math> is a probability vector, <br>
<math> \pi_{0}~ + \pi_{1} + \pi_{2} = 1 </math>

Solving the 4 equations for the 3 unknowns gives <br>
<math>\pi_{0}=\frac {6}{25}</math>, <math>\pi_{1}~=\frac {2}{5}</math>, and <math>\pi_{2}~=\frac {9}{25}</math> <br>

Therefore <math>\pi=\left[ {\begin{array}{ccc}
\frac {6}{25} & \frac {2}{5} & \frac {9}{25} \end{array} } \right]</math>

The above two examples solve for the stationary distribution of the matrix P. As mentioned earlier, when the chain converges they also give the limiting distribution, since the limiting distribution is then equal to the stationary distribution.

=== Ergodic Chain ===

A Markov chain is called an ergodic chain if it is possible to go from every state to every state (not necessarily in one move). For instance, we can claim a Markov chain is ergodic if it is possible to somehow start at any state i and end at any state j. We could have a chain with states 0, 1, 2, 3, 4 where it is not possible to go from state 0 to state 4 in just one step. However, if it is possible to go from 0 to 1, then from 1 to 2, then from 2 to 3, and finally from 3 to 4, then it is possible to go from 0 to 4, and this satisfies the requirement of an ergodic chain. The example below will further explain this concept.

'''Note:''' If there is a finite number N, then every other state can be reached in N steps.<br/>
'''Note:''' An ergodic chain is irreducible (all states communicate) and aperiodic (d = 1). An ergodic chain is guaranteed to have a stationary and a limiting distribution.<br/>
'''Ergodicity:''' A state i is said to be ergodic if it is aperiodic and positive recurrent. In other words, a state i is ergodic if it is recurrent, has a period of 1, and has a finite mean recurrence time. If all states in an irreducible Markov chain are ergodic, then the chain is said to be ergodic.<br/>
'''Some more:''' It can be shown that a finite-state irreducible Markov chain is ergodic if it has an aperiodic state. A model has the ergodic property if there is a finite number N such that any state can be reached from any other state in exactly N steps. In the case of a fully connected transition matrix, where all transitions have a non-zero probability, this condition is fulfilled with N=1.<br/>

====Example====
<math> P= \left[ \begin{matrix}
\frac{1}{3} \; & \frac{1}{3} \; & \frac{1}{3} \\ \\
\frac{1}{4} \; & \frac{3}{4} \; & 0 \\ \\
\frac{1}{2} \; & 0 \; & \frac{1}{2}
\end{matrix} \right] </math><br />

Its stationary distribution, found in the first example above, is

<math> \pi=\left[ \begin{matrix}
\frac{1}{3} & \frac{4}{9} & \frac{2}{9}
\end{matrix} \right] </math><br />

There are three states in this example.

[[File:ab.png]]

In this case, state a can go to state a, b, or c; state b can go to state a or b; and state c can go to state a or c. Hence it is possible to go from every state to every state. (Although state b cannot go directly to c in one move, it can go to a first, and then to c.)

A k-by-k transition matrix indicates that the chain has k states.

- Ergodic Markov chains are irreducible. (Recall that a Markov chain is irreducible if all the states communicate with each other.)

- A Markov chain is called a '''regular''' chain if some power of the transition matrix has only positive elements.<br />
*Any transition matrix that has no zeros determines a regular Markov chain.
*However, it is also possible for a regular Markov chain to have a transition matrix that has zeros.
<br />
For example, recall the matrix of the Land of Oz:

<math>P = \left[ \begin{matrix}
& R & N & S \\
R & 1/2 & 1/4 & 1/4 \\
N & 1/2 & 0 & 1/2 \\
S & 1/4 & 1/4 & 1/2 \\
\end{matrix} \right]</math><br />
  
===Example===
The following chain is <b>not</b> an ergodic chain.

[[File:Notergodic.jpg]]
<br>Note that this is because it is not possible to get to every state from any state: you cannot get to states C or D from states A or B, and vice versa.

The adjacency matrix looks like this:

<math>L=
\left[ {\begin{matrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{matrix} } \right]</math>

Obviously, you cannot get from A or B to C or D.

=== Theorem ===
An ergodic Markov chain has a unique stationary distribution <math>\pi</math>. The limiting distribution exists and is equal to <math>\pi</math>.<br/>
*Note 1: An ergodic Markov chain is irreducible, aperiodic and positive recurrent, meaning all states have to be positive recurrent.<br>
*Note 2: A state is positive recurrent if the expected amount of time between recurrences is finite, i.e. E(T<sub>i</sub>) < <math>\infty</math>. <br>
*Note 3: The existence of a limiting distribution for a Markov chain does '''not''' always imply that the chain is ergodic.
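A short MATLAB check (a sketch, not from the lecture) of regularity, and hence ergodicity, for the two chains above: a finite chain is regular if some power of its transition matrix has only positive entries.

<pre style='font-size:14px'>
% Regularity check: does some power of P have all positive entries?
P = [1/3, 1/3, 1/3; 1/4, 3/4, 0; 1/2, 0, 1/2];
all(all(P^2 > 0))   % returns 1: P is regular, hence the chain is ergodic

Q = [0, 1, 0; 0, 0, 1; 1, 0, 0];   % the cyclic chain from earlier
all(all(Q^2 > 0))   % returns 0, and no power of Q is all-positive,
                    % since the chain is periodic and hence not ergodic
</pre>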
 
  
'''Example'''

Consider the Markov chain with transition matrix <math>\left[\begin{matrix}0 & 1 \\ 1 & 0\end{matrix}\right]</math>. (For a time-homogeneous Markov chain, described by a single time-independent matrix, a vector <math>\pi</math> is called a stationary distribution, or invariant measure, if it satisfies <math>\pi P = \pi</math>.) The stationary distribution is obtained by solving <math>\pi P = \pi</math>, giving <math>\pi=[0.5, 0.5]</math>. But from the assignment we know that the chain does not converge, i.e. there is no limiting distribution, because the Markov chain is not aperiodic and the cycle repeats: <math>P^2=\left[\begin{matrix}1 & 0 \\ 0 & 1\end{matrix}\right]</math> and <math>P^3=\left[\begin{matrix}0 & 1 \\ 1 & 0\end{matrix}\right]</math>

'''Another Example'''

<math>P=\left[ {\begin{array}{cc}
\frac{1}{4} & \frac{3}{4} \\[6pt]
\frac{1}{5} & \frac{4}{5} \end{array} } \right]</math> <br>

[[File:Untitled*.jpg]]

This matrix means that there are two points in the space; let's call them a and b.<br/>
Starting from a, the probability of staying in a is 1/4. <br/>
Starting from a, the probability of going from a to b is 3/4. <br/>
Starting from b, the probability of going from b to a is 1/5. <br/>
Starting from b, the probability of staying in b is 4/5. <br/>

Solve the equation <math> \pi = \pi P </math>: <br>
<math> \pi_0 = .25 \pi_0 + .2 \pi_1 </math> <br>
<math> \pi_1 = .75 \pi_0 + .8 \pi_1 </math> <br>
<math> \pi_0 + \pi_1 = 1 </math> <br>
Solving this system of equations we get: <br>
<math> \pi_0 = \frac{4}{15} \pi_1 </math> <br>
<math> \pi_1 = \frac{15}{19} </math> <br>
<math> \pi_0 = \frac{4}{19} </math> <br>
<math> \pi = [\frac{4}{19}, \frac{15}{19}] </math> <br>
<math> \pi </math> is the long run distribution, and this is also a limiting distribution.

We can use the stationary distribution to compute the expected waiting time to return to state 'a', given that we start at state 'a', and so on. The formula for this is <math> E[T_{i,i}]=\frac{1}{\pi_i}</math>.<br/>
In the example above, this means that the expected waiting time for the Markov process to return to state 'a', given that it starts at state 'a', is 19/4.<br/>
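A simulation sketch (not from the lecture) that estimates this expected return time; the empirical mean should be close to 1/<math>\pi_0</math> = 19/4 = 4.75:

<pre style='font-size:14px'>
% Estimate the expected return time to state a by simulating the chain.
P = [1/4, 3/4; 1/5, 4/5];
nTrips = 10000;
total = 0;
for trip = 1:nTrips
    state = 1;                          % start at state a
    steps = 0;
    while true
        u = rand;
        if state == 1
            state = 1 + (u > P(1,1));   % leave a with probability 3/4
        else
            state = 1 + (u > P(2,1));   % stay in b with probability 4/5
        end
        steps = steps + 1;
        if state == 1, break; end       % returned to a
    end
    total = total + steps;
end
total/nTrips                            % approximately 4.75 = 19/4
</pre>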
Definition of limiting distribution: when the stationary distribution is convergent (the chain converges to it), it is the limiting distribution.<br/>

Remark: if the chain satisfies detailed balance, <math>\pi_i P_{ij} = P_{ji} \pi_j</math>, there is another way to calculate the stationary probabilities (see below).

=== MatLab Code ===
In the following, P is the transition matrix, eye(n) refers to the n by n identity matrix, and L is the Laplacian matrix, L = (I - P). The Laplacian matrix will have at least one zero eigenvalue. For every 0 on the diagonal of the eigenvalue matrix, there is a component: the number of zero eigenvalues is the number of parts (components) in your graph/process. If there is exactly one zero eigenvalue, then the matrix is connected and has only one component; if there is more than one zero, there is a disconnect in the graph.

<pre style='font-size:14px'>
>> P=[1/3, 1/3, 1/3; 1/4, 3/4, 0; 1/2, 0, 1/2]

P =

    0.3333    0.3333    0.3333
    0.2500    0.7500         0
    0.5000         0    0.5000

>> eye(3) %%returns 3x3 identity matrix

ans =

     1     0     0
     0     1     0
     0     0     1

>> L=(eye(3)-P)

L =

    0.6667   -0.3333   -0.3333
   -0.2500    0.2500         0
   -0.5000         0    0.5000

>> [vec val]=eigs(L)

vec =

   -0.7295    0.2329    0.5774
    0.2239   -0.5690    0.5774
    0.6463    0.7887    0.5774

val =

    1.0643         0         0
         0    0.3523         0
         0         0   -0.0000

%% Only one value of zero on the diagonal means the chain is connected

>> P=[0.8, 0.2, 0, 0;0.2, 0.8, 0, 0; 0, 0, 0.8, 0.2; 0, 0, 0.1, 0.9]

P =

    0.8000    0.2000         0         0
    0.2000    0.8000         0         0
         0         0    0.8000    0.2000
         0         0    0.1000    0.9000

>> eye(4)

ans =

     1     0     0     0
     0     1     0     0
     0     0     1     0
     0     0     0     1

>> L=(eye(4)-P)

L =

    0.2000   -0.2000         0         0
   -0.2000    0.2000         0         0
         0         0    0.2000   -0.2000
         0         0   -0.1000    0.1000

>> [vec val]=eigs(L)

vec =

    0.7071         0    0.7071         0
   -0.7071         0    0.7071         0
         0    0.8944         0    0.7071
         0   -0.4472         0    0.7071

val =

    0.4000         0         0         0
         0    0.3000         0         0
         0         0   -0.0000         0
         0         0         0   -0.0000

%% Two values of zero on the diagonal means there are two 'islands' of chains

</pre>

<math>\Pi</math> satisfies detailed balance if <math>\Pi_i P_{ij}=P_{ji} \Pi_j</math>. Detailed balance guarantees that <math>\Pi</math> is a stationary distribution.<br />

'''Adjacency matrix''' - a matrix <math>A</math> that dictates which states are connected; it portrays which vertices in the graph are adjacent. Two vertices are adjacent if there exists a path of length 1 between them. If we compute <math>A^2</math>, we can tell which states are connected by paths of length 2.<br />

A Markov chain is called an '''irreducible''' chain if it is possible to go from every state to every state (not necessarily in one move).<br />

Theorem: An '''ergodic''' Markov chain has a unique stationary distribution <math>\pi</math>. The limiting distribution exists and is equal to <math>\pi</math>. <br />

A Markov process satisfies detailed balance if and only if it is a '''reversible''' Markov process, where P is the matrix of Markov transition probabilities.<br />

Satisfying the detailed balance condition guarantees that <math>\pi</math> is the stationary distribution: <math> \pi </math> satisfies detailed balance if <math> \pi_i P_{ij} = P_{ji} \pi_j </math>, which is the balance equation of the reversed Markov process.

Example in the class:
<math>P= \left[ {\begin{array}{ccc}
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\[6pt]
\frac{1}{4} & \frac{3}{4} & 0 \\[6pt]
\frac{1}{2} & 0 & \frac{1}{2} \end{array} } \right]</math>

and <math>\pi=(\frac{1}{3},\frac{4}{9}, \frac{2}{9})</math>

<math>\pi_1 P_{1,2} = 1/3 \times 1/3 = 1/9,\, P_{2,1} \pi_2 = 1/4 \times 4/9 = 1/9 \Rightarrow \pi_1 P_{1,2} = P_{2,1} \pi_2 </math><br>

<math>\pi_2 P_{2,3} = 4/9 \times 0 = 0,\, P_{3,2} \pi_3 = 0 \times 2/9 = 0 \Rightarrow \pi_2 P_{2,3} = P_{3,2} \pi_3</math><br>

Remark: with detailed balance, <math> \pi_i P_{ij} = P_{ji} \pi_j</math>, so there is another way to calculate the stationary probabilities.<br />
Detailed balance implies that <math>\pi</math> = <math>\pi</math>P, as shown in the proof below, and guarantees that <math>\pi</math> is a stationary distribution.
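A compact MATLAB check (a sketch, not from the lecture) of the detailed balance condition for this class example: the matrix B with B(i,j) = <math>\pi_i P_{ij}</math> must be symmetric.

<pre style='font-size:14px'>
% Check detailed balance: pi_i * P(i,j) must equal pi_j * P(j,i).
P   = [1/3, 1/3, 1/3; 1/4, 3/4, 0; 1/2, 0, 1/2];
pii = [1/3, 4/9, 2/9];

B = diag(pii)*P;       % B(i,j) = pi_i * P(i,j)
max(max(abs(B - B')))  % 0 (up to rounding): detailed balance holds
</pre>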
== Class 15 - Tuesday June 25th 2013 ==
=== Announcement ===
Note to all students: the first half of today's lecture will cover the midterm's solution; however, please do not post the solution on the Wikicoursenote.<br />

====Detailed balance====
<div style="border:2px solid black">
<b>Definition (from wikipedia)</b>
The principle of detailed balance is formulated for kinetic systems which are decomposed into elementary processes (collisions, or steps, or elementary reactions): At equilibrium, each elementary process should be equilibrated by its reverse process.
</div>
Let <math>P</math> be the transition probability matrix of a Markov chain. If there exists a distribution vector <math>\pi = [\pi_1 \, \pi_2 \, \ldots \, \pi_n]</math> such that <math>\pi_i \cdot P_{ij}=P_{ji} \cdot \pi_j, \; \forall i,j</math>, then the Markov chain is said to have '''detailed balance'''. A detailed balanced Markov chain must have <math>\pi</math> given above as a stationary distribution, that is <math>\pi=\pi P</math>, where <math>\pi</math> is a 1 by n vector and P is an n by n matrix.<br>

'''Proof''' (worth remembering): <br>
<math>\; [\pi P]_j = \sum_i \pi_i P_{ij} =\sum_i P_{ji}\pi_j =\pi_j\sum_i P_{ji} =\pi_j  ,\forall j</math>

:Note: The first step uses detailed balance; the last step uses the fact that each row of P sums to 1, i.e. <math>\sum_i P_{ji}=1</math>. Doing this for every element j proves <math>\pi=\pi P</math>. <br>
:Note: Another pre-condition is that <math>\sum_i \pi_i = 1</math>, i.e. <math>\pi</math> is a probability vector.

Hence <math>\pi</math> is always a stationary distribution of <math>P(X_{n+1}=j|X_n=i)</math>, for every n.

In other terms, <math> P_{ij} = P(X_n = j| X_{n-1} = i) </math>, where <math>\pi_j</math> is the equilibrium probability of being in state j and <math>\pi_i</math> is the equilibrium probability of being in state i. Detailed balance is equivalent to the joint probability <math>P(X_{n-1} = i,  X_n = j)</math> being symmetric in i and j.

Keep in mind that detailed balance is a sufficient but not necessary condition for a distribution to be stationary: a distribution satisfying detailed balance is stationary, but a stationary distribution does not necessarily satisfy detailed balance.
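To see that the converse fails, consider the cyclic chain from earlier; a minimal MATLAB sketch (not from the lecture):

<pre style='font-size:14px'>
% The cyclic chain has stationary distribution [1/3 1/3 1/3] but does
% not satisfy detailed balance: sufficiency is not necessity.
P   = [0, 1, 0; 0, 0, 1; 1, 0, 0];
pii = [1/3, 1/3, 1/3];

pii*P                  % [0.3333 0.3333 0.3333]: pii is stationary
B = diag(pii)*P;
max(max(abs(B - B')))  % 0.3333, nonzero: detailed balance fails
</pre>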
=== PageRank (http://en.wikipedia.org/wiki/PageRank) ===

*PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size.
*PageRank is a link-analysis algorithm developed by (and named after) Larry Page from Google; it is used for measuring a website's importance, relevance and popularity.
*PageRank operates on a graph containing web pages and their links to each other.
*Many social media sites use this (such as Facebook and Twitter).
*It can also be used to find criminals (i.e. thieves, hackers, terrorists, etc.) by analyzing the links.

This is what made Google the search engine of choice over Yahoo, Bing, etc. What made Google's search engine a huge success is not its search function, but rather the algorithm it used to rank the pages. (For example, if a query gives 100 million search results, how do you list them by relevance and importance so the users can easily find what they are looking for? Most users will not go past the first 3 or so search pages. It is this ability to rank pages that allows Google to remain more popular than Yahoo, Bing, AskJeeves, etc.) It should be noted that after using the PageRank algorithm, Google applies other processes to filter the results.<br/>

<br />'''The order of importance'''<br />
1. A web page is more important if many other pages point to it<br />
2. The more important a web page is, the more weight should be assigned to its outgoing links<br/ >
3. If a webpage has many outgoing links, then its links have less value (ex: if a page links to everyone, like 411, it is not as important as pages that have incoming links)<br />

<br />
[[File:diagram.jpg]]
<math>L=
\left[ {\begin{matrix}
0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 \end{matrix} } \right]</math>

The first row indicates who gives a link to page 1. As shown in the diagram, nothing links to page 1, so the row is all zeros. The second row indicates who gives a link to page 2: only page 1 links to page 2, so column 1 of row 2 is 1 and the rest are zero.

i.e. According to the above example: <br/ >
Page 3 is the most important, since it has the most links pointing to it (3 links); therefore more weight should be placed on its outgoing links.<br/ >
Page 4 comes after page 3, since it has the second most links (2) pointing to it.<br/ >
Page 2 comes after page 4, since it has the third most links (1) pointing to it.<br/>
Page 1 and page 5 are the least important, since no links point to them.<br/ >
As pages 1 and 2 have the most outgoing links, their links have less value compared to those of the other pages. <br/ >

:<math>
L_{ij} = \begin{cases}
1, & \text{if j has a link to i} \\
0, & \text{otherwise}
\end{cases}</math>

<br />
<math>C_j=</math> the number of outgoing links of page <math>j</math>:
<math>C_j=\sum_i L_{ij}</math>
(i.e. the sum of the entries in column j)<br />
<br />
<math>P_j</math> is the rank of page <math>j</math>.<br />
Suppose we have <math>N</math> pages; <math>P</math> is a vector containing the ranks of all pages.<br />
- <math>P</math> is a <math>N \times 1</math> vector.

- <math>P_i</math> counts the number of incoming links of page <math>i</math>:
<math>P_i=\sum_j L_{ij}</math> <br />(i.e. the sum of the entries in row i)

For each row i of <math>L</math>, a 1 in the third column means that page 3 points to page i.
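These counts are easy to compute; a small MATLAB sketch (not from the lecture), using the 5-page matrix L above:

<pre style='font-size:14px'>
% Outgoing-link counts C_j are column sums of L; raw incoming-link
% counts are row sums of L.
L = [0 0 0 0 0;
     1 0 0 0 0;
     1 1 0 1 0;
     0 1 0 0 1;
     0 0 0 0 0];

outgoing = sum(L,1)   % [2 2 0 1 1]: pages 1 and 2 have the most outgoing links
incoming = sum(L,2)'  % [0 1 3 2 0]: page 3 has the most incoming links
</pre>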
 
However, we should not define the rank of a page this way, because links should not all be treated the same: the weight of a link depends on several factors, one of which is the importance of the page the link is coming from. For example, there are two links going to Page 4: one from Page 2 and one from Page 5. So far, both links have been treated equally with the same weight 1, but we should re-weight them based on the importance of the pages they come from.

===Alternate Example===
[[File:pagerank.jpg]]<br>
In this case, Page 5 does not have any pointers to or from the cluster of pages on its left. When we build the algorithm to conduct page rank in the next lecture, we will ensure that Page 5 is not ignored in the ranking system.

Obviously the rank is Page 3, Page 4, Page 2, Page 1, Page 5. Without an additional term (<math>d</math>), poor old Page 5 would not come up in the search. However, we will make sure that it does.

===Explanation===
A PageRank results from a mathematical algorithm based on the webgraph, created by all World Wide Web pages as nodes and hyperlinks as edges, taking into consideration authority hubs such as cnn.com or usa.gov. The rank value indicates the importance of a particular page. A hyperlink to a page counts as a vote of support. (This would be represented in our diagram as an arrow pointing towards the page; hence in our example, Page 3 is the most important, since it has the most votes of support.) The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). <br />

A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page, then there is no support for that page (in our example, this would be Page 1 and Page 5).
(source: http://en.wikipedia.org/wiki/PageRank#Description)

For those interested in PageRank, here is the original paper by Google co-founders Brin and Page: http://infolab.stanford.edu/pub/papers/google.pdf

Notice: Page and Brin confused a formula in the above paper.

=== Example of Page Rank Application in Real Life ===

'''Page Rank checker'''
- This is a free service to check Google™ page rank instantly, via an online PR checker or by adding a PageRank checking button to web pages (http://www.prchecker.info/check_page_rank.php)

GoogleMatrix G = d * [ (Hyperlink Matrix H) + (Dangling Nodes Matrix A) ] + ((1-d)/N) * (NxN Matrix U of all 1's)

[[File:Google matrix.png]]

(source: https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/)
 
  
== Class 16 - Thursday June 27th 2013 ==

=== Page Rank ===
*<math>
L_{ij} = \begin{cases}
1, & \text{if j has a link to i }  \\
0, & \text{otherwise} \end{cases} </math> <br/>

*<math>C_j</math>: the number of outgoing links for page j, where <math>c_j=\sum_i L_{ij}</math>

P is an N by 1 vector that contains the rank of all N pages; for page i, the rank is <math>P_i</math>:

<math>P_i= (1-d) + d\cdot \sum_j \frac {L_{ij}P_j}{c_j}</math>

where 0 < d < 1 is a constant (in the original page rank algorithm, d = 0.8), and <math>L_{ij}</math> is 1 if j has a link to i, and 0 otherwise. Without the (1-d) term, the rank of a newly created page (that no one links to) would be 0, since <math>L_{ij}=0</math> for all j.

Note: the rank of page i is <br>
1) proportional to the importance of each page that links to it, and <br>
2) inversely proportional to the total number of links coming from each of those pages.

Interpretation of the formula:<br/>
1) The sum over <math>L_{ij}</math> runs over the incoming links of page i.<br/>
2) The sum is weighted by the page rank of the pages that contain the link to i (P<sub>j</sub>), i.e. if a high-rank page points to page i, then this link carries more weight than links from lower-rank pages.<br/>
3) The sum is also weighted by the inverse of the number of outgoing links from the pages that contain links to i (c<sub>j</sub>), i.e. if a page has more outgoing links than other pages, then its links carry less weight.<br/>
4) Finally, we take a linear combination of the page rank obtained from above and a constant 1. This ensures that every page has a rank greater than zero.<br/>
5) d is the damping factor. It represents the probability that a user, at any page, will continue clicking to another page.<br/>
If there is no damping (i.e. d=1), then there are no assumed outgoing links for nodes with no links. However, if there is damping (e.g. d=0.8), then these nodes are assumed to have links to all pages in the web.

Note: We do not want a page with rank 0 to occur (to give new websites an opportunity to be clicked), so we use the damping factor d. Multiplying the sum <math>\sum_j \frac {L_{ij}P_j}{c_j}</math> by d also avoids the extreme cases where one of the two terms dominates the other.

Note that this is a system of N equations with N unknowns: if we were given all the other P<sub>j</sub>, we could easily calculate P<sub>i</sub>; however, we don't know any of them, so we need to solve the N equations simultaneously.<br/>
<math>c_j</math> is the number of outgoing links; the fewer outgoing links a page has, the more each of its links is worth.<br/>
 
  
 +
d=0.8
 +
N=3
 +
A=(1-d)*ones(N)/N+d*L*pinv(D) #pinv: Moore-Penrose inverse (pseudoinverse) of symbolic matrix
 +
We use the pinv(D) function [pseudo-inverse] instead of the inv(D) function because in
 +
the case of a non-invertible matrix, it would not crash the program. 
 +
[vec val]=eigs(A) (eigen-decomposition)
 +
a=-vec(:,1) (find the eigenvector equals to 1)
 +
a=a/sum(a) (normalize a)
 +
or to show that A transpose is a stationary transition matrix
 +
(transpose(A))^200 will be the same as a=a/sum(a)
 +
</pre>
  
Interpretation of the formula:<br/>
+
'''NOTE:''' Changing the value of d, does not change the ranking order of the pages.  
1) Sum of L<sub>ij</sub> is the total number of incoming links.<br/>
 
2) The above sum is weighted by page rank of the pages that contain the link to i (P<sub>j</sub>) i.e. if a high-rank page points to page i, then this link carries more weight than links from lower-rank pages.<br/>
 
3) The sum is then weighted by the inverse of the number of outgoing links from the pages that contain links to i (c<sub>j</sub>). i.e. if a page has more outgoing links than other pages then its links carry less weight.<br/>
 
4) Finally, we take a linear combination of the page rank obtained from above and a constant 1. This ensures that every page has a rank greater than zero.<br/>
 
  
 +
By looking at each entry after normalizing a, we can tell the ranking order of each page.<br>
 +
<span style="background:#F5F5DC">
  
Note that this is a system of N equations with N unknowns.<br/>
+
c = [1 1 1] since there are 3 pages, each page is one way recurrent to each other and there is only one outgoing for each page. Hence, D is a 3x3 standard diagonal matrix.
  
<math>c_j</math> is the number of outgoing links, the less outgoing links a page has, the more important the page is.<br/>
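The N equations can be solved by simple fixed-point iteration of the formula itself (a minimal sketch, reusing the small link matrix from the illustration above; the iteration count of 100 is an arbitrary choice that is more than enough here, since d < 1 makes the update a contraction):

<pre style='font-size:14px'>
% Sketch: iterating P_i = (1-d) + d * sum_j L(i,j)*P(j)/c(j) to a fixed point.
L = [0 1 1; 1 0 0; 1 0 0];
c = sum(L);                  % c = [2 1 1], outgoing links per page
d = 0.8;  N = 3;
P = ones(N,1);               % initial guess: every page has rank 1
for sweep = 1:100
    for i = 1:N
        P(i) = (1-d) + d * sum(L(i,:) .* P' ./ c);
    end
end
disp(P')                     % the fixed point; note the entries sum to N
</pre>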
Let D be a diagonal N×N matrix such that <math> D_{ii} = c_i</math>:

<math>D= 
\left[ {\begin{matrix}
c_1 & 0 & \cdots & 0  \\
0 & c_2 & \cdots & 0  \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & c_N \end{matrix} } \right]</math>

Then <math>P=~(1-d)e+dLD^{-1}P</math>, where <math>e=\begin{bmatrix}
1\\
1\\
\vdots\\
1
\end{bmatrix}</math> is an N×1 vector of ones.

The relative values of the page ranks are what matter; the absolute values are meaningless. <br/>
P is a vector of ranks that could contain any arbitrary numbers; we only care whether one page is more important than another (whether P<sub>i</sub> <math>\geq</math> P<sub>j</sub>).

We assume that the ranks of all N pages sum to N. The sum could be any constant, since only the proportions between the ranks matter. <br/>
That is, e<sup>T</sup>P = N, so <math>~\frac{e^{T}P}{N} = 1</math>.

D<sup>-1</sup> is then:

D<sup>-1</sup><math>=
\left[ {\begin{matrix}
\frac {1}{c_1} & 0 & \cdots & 0 \\
0 & \frac {1}{c_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac {1}{c_N} \end{matrix} } \right]</math>

Starting from <math>P=~(1-d)e+dLD^{-1}P</math> and using <math>e=\frac{ee^{T}P}{N}</math> (which holds because <math>\frac{e^{T}P}{N}=1</math>):

<math>P=(1-d)~\frac{ee^{T}P}{N}+dLD^{-1}P</math>

<math>P=[(1-d)~\frac{ee^T}{N}+dLD^{-1}]P</math>

<math>P=AP</math>, where <math>A=(1-d)~\frac{ee^T}{N}+dLD^{-1}</math>

<br> P is the stationary distribution of A; it is an N×1 column vector.
<br> A is an N<math>\times</math>N transition probability matrix: all its entries are non-negative and each of its columns sums to 1.
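To tie the derivation together (a minimal sketch, reusing the small link matrix L = [0 1 1; 1 0 0; 1 0 0] from the illustration above): assemble A, confirm it is a transition matrix, and read P off its eigenvalue-1 eigenvector.

<pre style='font-size:14px'>
% Sketch: A = (1-d)*e*e'/N + d*L*D^(-1); its columns sum to 1, and its
% eigenvalue-1 eigenvector, scaled to sum to N, is the rank vector P.
L = [0 1 1; 1 0 0; 1 0 0];
D = diag(sum(L));               % D = diag(2, 1, 1)
N = 3;  d = 0.8;  e = ones(N,1);
A = (1-d)*(e*e')/N + d*L*pinv(D);
disp(sum(A))                    % [1 1 1]: A is a transition matrix
[V, E] = eig(A);
[~, idx] = min(abs(diag(E) - 1));
P = V(:,idx);  P = N*P/sum(P);  % scale so that e'*P = N
disp(P')                        % same fixed point as the sketch above
</pre>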
 
  
<b>The following variables are necessary to calculate the page rank:</b>
<math>N</math> - the number of pages <br>
<math>L</math> - an N×N binary link matrix <br>
<math>D</math> - an N×N diagonal matrix (with <math>D^{-1}</math> its inverse) <br>
<math>P</math> - an N×1 column vector of ranks <br>
<math>d</math> - a constant between 0 and 1 (in the original algorithm, d = 0.8) <br>

===Examples===

==== Example 1 ====

[[File:eg1.jpg]]
<br />

<math>L= 
\left[ {\begin{matrix}
0 & 0 & 1  \\
1 & 0 & 0  \\
0 & 1 & 0 \end{matrix} } \right]\;c=
\left[ {\begin{matrix}
1 & 1 & 1 \end{matrix} } \right]\;D=
\left[ {\begin{matrix}
1 & 0 & 0 \\
0 & 1 & 0  \\
0 & 0 & 1 \end{matrix} } \right]</math>

<span style="background:#F5F5DC">
c = [1 1 1] since the 3 pages form a single cycle and each page has exactly one outgoing link; hence D is the 3×3 identity matrix.
</span>

<pre style='font-size:14px'>
MATLAB Code

L=[0 0 1;1 0 0;0 1 0];
D=diag(sum(L));                 % D_ii = c_i, the number of outgoing links of page i
d=0.8;
N=3;
A=(1-d)*ones(N)/N+d*L*pinv(D);  % pinv: Moore-Penrose inverse (pseudoinverse)
% We use pinv(D) instead of inv(D) because, if D were non-invertible (a page
% with no outgoing links), pinv would not crash the program.
[vec val]=eigs(A);              % eigen-decomposition
a=-vec(:,1);                    % the eigenvector corresponding to the eigenvalue 1
a=a/sum(a)                      % normalize a
% Alternatively, to see that A' is the transition matrix of a Markov chain with
% stationary distribution a: every row of (A')^200 is approximately equal to a'.
</pre>

'''NOTE:''' Changing the value of d does not change the ranking order of the pages.

By looking at each entry of a after normalizing, we can read off the ranking order of the pages.<br>

==== Example 2 ====

[[File:Screen_shot_2013-07-02_at_3.43.04_AM.png]]

<math>L=
\left[ {\begin{matrix}
0 & 0 & 1  \\
1 & 0 & 1  \\
0 & 1 & 0 \end{matrix} } \right]\;
c=
\left[ {\begin{matrix}
1 & 1 & 2 \end{matrix} } \right]\;
D= 
\left[ {\begin{matrix}
1 & 0 & 0  \\
0 & 1 & 0  \\
0 & 0 & 2 \end{matrix} } \right]</math>

<pre style='font-size:14px'>
Matlab code

>> L=[0 0 1;1 0 1;0 1 0];
>> C=sum(L);
>> D=diag(C);
>> d=0.8;
>> N=3;
>> A=(1-d)*ones(N)/N+d*L*pinv(D);
>> [vec val]=eigs(A)

vec =

  -0.3707            -0.3536 + 0.3536i  -0.3536 - 0.3536i
  -0.6672            -0.3536 - 0.3536i  -0.3536 + 0.3536i
  -0.6461             0.7071             0.7071

val =

   1.0000                  0                  0
        0            -0.4000 - 0.4000i        0
        0                  0            -0.4000 + 0.4000i

>> a=-vec(:,1)

a =

    0.3707
    0.6672
    0.6461

>> a=a/sum(a)

a =

    0.2201
    0.3962
    0.3836
</pre>

'''NOTE:''' Page 2 is the most important page because it has 2 incoming links. Page 3 is more important than page 1 because page 3 receives its incoming link from page 2, the highest-ranked page.

This example is similar to Example 1, but here page 3 also links back to page 2, so page 3 has two outgoing links and the third diagonal entry of D is 2. Running the same code to solve P = AP gives the order of importance 2, 3, 1.

==== Example 3 ====

[[File:eg 3.jpg]]<br>

Given that

<math>L= 
\left[ {\begin{matrix}
0 & 1 & 0  \\
1 & 0 & 1  \\