stat341 / CM 361: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 9: Line 9:
===Lecture of May 12th, 2009===
===Lecture of May 12th, 2009===


In order to study statistics computationally, we need a good way to generate random numbers from various distributions using computational methods, or at least numbers whose distribution appears to be random (pseudo-random). Outside a computational setting, this is fairly easy (at least for the uniform distribution). Rolling a die, for example, produces numbers with a uniform distribution very well.
In order to study statistics computationally, we need a good way to generate random numbers from various distributions using computational methods, or at least numbers whose distribution appears to be random (pseudo-random). Outside a computational setting, this is fairly easy (at least for the uniform distribution). Rolling a fair die repetitively, for example, produces a series of random numbers (from 1 to 6) with a uniform distribution.


We begin by considering the simplest case: the uniform distribution.
We begin by considering the simplest case: the uniform distribution.
Line 15: Line 15:
====Multiplicative Congruential Method====
====Multiplicative Congruential Method====


One way to generate pseudorandom numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''n'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically (based on the seed) generates a sequence of numbers with a seemingly random distribution (with some caveats). It proceeds as follows:
One way to generate pseudo random numbers from the uniform distribution is using the '''Multiplicative Congruential Method'''. This involves three integer parameters ''a'', ''b'', and ''m'', and a '''seed''' variable ''x<sub>0</sub>''. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:


:<math>x_{i+1} = ax_{i} + b \mod{n}</math>
:<math>x_{i+1} = (ax_{i} + b) \mod{m}</math>


For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:
For example, with ''a'' = 13, ''b'' = 0, ''m'' = 31, ''x<sub>0</sub>'' = 1, we have:
Line 27: Line 27:
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math>
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math>
:<math>\begin{align}
:<math>\begin{align}
x_{1} &{}= 13 \times 1 + 0 \mod{31} \\
x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\
      &{}= 13
\end{align}</math>
\end{align}</math>
:<math>\begin{align}
:<math>\begin{align}
x_{2} &{}= 13 \times 13 + 0 \mod{31} \\
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\
      &{}= 14
\end{align}</math>
:<math>\begin{align}
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\
\end{align}</math>
\end{align}</math>


etc.
etc.


This method generates numbers between 0 and ''m'' - 1, and by scaling this output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1. For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1.
For correctly chosen values of ''a'', ''b'', and ''m'', this method will generate a sequence of integers including all integers between 0 and ''m'' - 1. Scaling the output by dividing the terms of the resulting sequence by ''m - 1'', we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.
 
Of course, not all values of ''a'', ''b'', and ''m'' will behave in this way, and will not be suitable for use in generating pseudo random numbers.
 
For example, with ''a'' = 3, ''b'' = 2, ''m'' = 4, ''x<sub>0</sub>'' = 1, we have:
 
:<math>x_{i+1} = (3x_{i} + 2) \mod{4}</math>
 
So,
 
:<math>\begin{align} x_{0} &{}= 1 \end{align}</math>
:<math>\begin{align}
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\
\end{align}</math>
:<math>\begin{align}
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\
\end{align}</math>
 
etc.


Of course, not all values of ''a'', ''b'', and ''m'' will behave in this way, and will not be suitable for use in generating pseudorandom numbers. In practice, it has been found by a paper published in 1988 by Park and Miller, that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum size of a signed integer in a 32-bit system) are good values for the Multiplicative Congruential Method.
In practice, it has been found by a paper published in 1988 by Park and Miller, that ''a'' = 7<sup>5</sup>, ''b'' = 0, and ''m'' = 2<sup>31</sup> - 1 = 2147483647 (the maximum size of a signed integer in a 32-bit system) are good values for the Multiplicative Congruential Method.


====General Methods====
====General Methods====


Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudorandom numbers from other distributions.
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.


=====Inverse Transform Method=====
=====Inverse Transform Method=====


This method uses the fact that when the inverse of a distribution cumulative density function(cdf) for a given distribution is applied to the uniform distribution, the resulting distribution is the distribution of the cdf. This is shown by this theorem:
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of a cumulative density function (cdf) of some distribution, the result is a random sample of that distribution. This is shown by this theorem:


'''Theorem''':
'''Theorem''':


If <math>U \sim~ Unif[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, and is the cumulative density function (cdf) for some distribution, then the distribution of the random variable X is given by F(X).
If <math>U \sim~ Unif[0, 1]</math> is a random variable and <math>X = F^{-1}(U)</math>, where F is continuous, monotonic, and is the cumulative density function (cdf) for some distribution, then the distribution of the random variable X is given by F(X).


'''Proof''':
'''Proof''':
Line 75: Line 94:
'''Procedure (Continuous Case)'''
'''Procedure (Continuous Case)'''


This method then gives us the following procedure for finding pseudorandom numbers from a continuous distribution:
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:


*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.
*Step 1: Draw <math>U \sim~ Unif [0, 1] </math>.
Line 88: Line 107:
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math>
:<math> F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} </math>


:<math> F^{-1}(x) = \frac{-log(1-y)}{\theta} </math>
:<math> F^{-1}(x) = \frac{-log(1-y)}{\theta} = \frac{-log(u)}{\theta} </math>


Now we can generate our random sample <math>i=1\dots n</math> from <math>f(x)</math> by:
Now we can generate our random sample <math>i=1\dots n</math> from <math>f(x)</math> by:


:<math>1)\ u_i \sim Unif[0, 1]</math>
:<math>1)\ u_i \sim Unif[0, 1]</math>
:<math>2)\ x_i = \frac{-log(1-u_i)}{\theta}</math>
:<math>2)\ x_i = \frac{-log(u_i)}{\theta}</math>


The <math>x_i</math> are now a random sample from <math>f(x)</math>.
The <math>x_i</math> are now a random sample from <math>f(x)</math>.
Line 119: Line 138:
:<math>\sum p_i = 1</math>
:<math>\sum p_i = 1</math>


Thus we can define the following method to find pseudorandom numbers in the discrete case (note that the less-than signs from class have been changed to less-than-or-equal-to signs by me, since otherwise the case of <math>U = 1</math> is missed):
Thus we can define the following method to find pseudo random numbers in the discrete case (note that the less-than signs from class have been changed to less-than-or-equal-to signs by me, since otherwise the case of <math>U = 1</math> is missed):


*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.
*Step 1: Draw <math> U~ \sim~ Unif [0,1] </math>.
*Step 2:
*Step 2:
**If <math>U \leq p_0</math>, return <math>X = x_0</math>
**If <math>U < p_0</math>, return <math>X = x_0</math>
**If <math>U \leq p_0 + p_1</math>, return <math>X = x_1</math>
**If <math>p_0 \leq U < p_0 + p_1</math>, return <math>X = x_1</math>
** ...
** ...
**In general, if <math>U \leq p_0 + \dots + p_k</math>, return <math>X = x_k</math>
**In general, if <math>p_0+ p_1 + \dots + p_{k-1} \leq U < p_0 + \dots + p_k</math>, return <math>X = x_k</math>


'''Example''' (from class):
'''Example''' (from class):

Revision as of 22:24, 14 May 2009

Computational Statistics and Data Analysis is a course offered at the University of Waterloo
Spring 2009
Instructor: Ali Ghodsi


Sampling (Generating Random numbers)

Lecture of May 12th, 2009

In order to study statistics computationally, we need a good way to generate random numbers from various distributions using computational methods, or at least numbers whose distribution appears to be random (pseudo-random). Outside a computational setting, this is fairly easy (at least for the uniform distribution). Rolling a fair die repetitively, for example, produces a series of random numbers (from 1 to 6) with a uniform distribution.

We begin by considering the simplest case: the uniform distribution.

Multiplicative Congruential Method

One way to generate pseudo random numbers from the uniform distribution is using the Multiplicative Congruential Method. This involves three integer parameters a, b, and m, and a seed variable x0. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:

[math]\displaystyle{ x_{i+1} = (ax_{i} + b) \mod{m} }[/math]

For example, with a = 13, b = 0, m = 31, x0 = 1, we have:

[math]\displaystyle{ x_{i+1} = 13x_{i} \mod{31} }[/math]

So,

[math]\displaystyle{ \begin{align} x_{0} &{}= 1 \end{align} }[/math]
[math]\displaystyle{ \begin{align} x_{1} &{}= 13 \times 1 + 0 \mod{31} = 13 \\ \end{align} }[/math]
[math]\displaystyle{ \begin{align} x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\ \end{align} }[/math]
[math]\displaystyle{ \begin{align} x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\ \end{align} }[/math]

etc.

For correctly chosen values of a, b, and m, this method will generate a sequence of integers including all integers between 0 and m - 1. Scaling the output by dividing the terms of the resulting sequence by m - 1, we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.

Of course, not all values of a, b, and m will behave in this way, and will not be suitable for use in generating pseudo random numbers.

For example, with a = 3, b = 2, m = 4, x0 = 1, we have:

[math]\displaystyle{ x_{i+1} = (3x_{i} + 2) \mod{4} }[/math]

So,

[math]\displaystyle{ \begin{align} x_{0} &{}= 1 \end{align} }[/math]
[math]\displaystyle{ \begin{align} x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\ \end{align} }[/math]
[math]\displaystyle{ \begin{align} x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\ \end{align} }[/math]

etc.

In practice, it has been found by a paper published in 1988 by Park and Miller, that a = 75, b = 0, and m = 231 - 1 = 2147483647 (the maximum size of a signed integer in a 32-bit system) are good values for the Multiplicative Congruential Method.

General Methods

Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.

Inverse Transform Method

This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of a cumulative density function (cdf) of some distribution, the result is a random sample of that distribution. This is shown by this theorem:

Theorem:

If [math]\displaystyle{ U \sim~ Unif[0, 1] }[/math] is a random variable and [math]\displaystyle{ X = F^{-1}(U) }[/math], where F is continuous, monotonic, and is the cumulative density function (cdf) for some distribution, then the distribution of the random variable X is given by F(X).

Proof:

Recall that, if f is the pdf corresponding to F, then,

[math]\displaystyle{ F(x) = P(X \leq x) = \int_{-\infty}^x f(x) }[/math]

So F is monotonically increasing, since the probability that X is less than a greater number must be greater than the probability that X is less than a lesser number.

Note also that in the uniform distribution on [0, 1], we have for all a within [0, 1], [math]\displaystyle{ P(U \leq a) = a }[/math].

So,

[math]\displaystyle{ \begin{align} P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\ &{}= P(U \leq F(x)) \\ &{}= F(x) \end{align} }[/math]

Completing the proof.

Procedure (Continuous Case)

This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:

  • Step 1: Draw [math]\displaystyle{ U \sim~ Unif [0, 1] }[/math].
  • Step 2: Compute [math]\displaystyle{ X = F^{-1}(U) }[/math].

Example:

Suppose we want to draw a sample from [math]\displaystyle{ f(x) = \lambda e^{-\lambda x} }[/math] where [math]\displaystyle{ x \gt 0 }[/math] (the exponential distribution).

We need to first find [math]\displaystyle{ F(x) }[/math] and then its inverse, [math]\displaystyle{ F^{-1} }[/math].

[math]\displaystyle{ F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x} }[/math]
[math]\displaystyle{ F^{-1}(x) = \frac{-log(1-y)}{\theta} = \frac{-log(u)}{\theta} }[/math]

Now we can generate our random sample [math]\displaystyle{ i=1\dots n }[/math] from [math]\displaystyle{ f(x) }[/math] by:

[math]\displaystyle{ 1)\ u_i \sim Unif[0, 1] }[/math]
[math]\displaystyle{ 2)\ x_i = \frac{-log(u_i)}{\theta} }[/math]

The [math]\displaystyle{ x_i }[/math] are now a random sample from [math]\displaystyle{ f(x) }[/math].


This example can be illustrated in Matlab using the code below. Generate [math]\displaystyle{ u_i }[/math], calculate [math]\displaystyle{ x_i }[/math] using the above formula and letting [math]\displaystyle{ \theta=1 }[/math], plot the histogram of [math]\displaystyle{ x_i }[/math]'s for [math]\displaystyle{ i=1,...,100,000 }[/math].

u=rand(1,100000);
x=-log(1-u)/1;
hist(x)

The histogram shows exponential pattern as expected.


The major problem with this approach is that we have to find [math]\displaystyle{ F^{-1} }[/math] and for many distributions it is too difficult (or impossible) to find the inverse of [math]\displaystyle{ F(x) }[/math]. Further, for some distributions it is not even possible to find [math]\displaystyle{ F(x) }[/math]

Procedure (Discrete Case)

The above method can be easily adapted to work on discrete distributions as well.

In general in the discrete case, we have [math]\displaystyle{ x_0, \dots , x_n }[/math] where:

[math]\displaystyle{ \begin{align}P(X = x_i) &{}= p_i \end{align} }[/math]
[math]\displaystyle{ x_0 \leq x_1 \leq x_2 \dots \leq x_n }[/math]
[math]\displaystyle{ \sum p_i = 1 }[/math]

Thus we can define the following method to find pseudo random numbers in the discrete case (note that the less-than signs from class have been changed to less-than-or-equal-to signs by me, since otherwise the case of [math]\displaystyle{ U = 1 }[/math] is missed):

  • Step 1: Draw [math]\displaystyle{ U~ \sim~ Unif [0,1] }[/math].
  • Step 2:
    • If [math]\displaystyle{ U \lt p_0 }[/math], return [math]\displaystyle{ X = x_0 }[/math]
    • If [math]\displaystyle{ p_0 \leq U \lt p_0 + p_1 }[/math], return [math]\displaystyle{ X = x_1 }[/math]
    • ...
    • In general, if [math]\displaystyle{ p_0+ p_1 + \dots + p_{k-1} \leq U \lt p_0 + \dots + p_k }[/math], return [math]\displaystyle{ X = x_k }[/math]

Example (from class):

Suppose we have the following discrete distribution:

[math]\displaystyle{ \begin{align} P(X = 0) &{}= 0.3 \\ P(X = 1) &{}= 0.2 \\ P(X = 2) &{}= 0.5 \end{align} }[/math]

The cumulative density function (cdf) for this distribution is then:

[math]\displaystyle{ F(x) = \begin{cases} 0, & \text{if } x \lt 0 \\ 0.3, & \text{if } 0 \leq x \lt 1 \\ 0.5, & \text{if } 1 \leq x \lt 2 \\ 1, & \text{if } 2 \leq x \end{cases} }[/math]

Then we can generate numbers from this distribution like this, given [math]\displaystyle{ u_0, \dots, u_n }[/math] from [math]\displaystyle{ U \sim~ Unif[0, 1] }[/math]:

[math]\displaystyle{ x_i = \begin{cases} 0, & \text{if } u_i \leq 0.3 \\ 1, & \text{if } 0.3 \lt u_i \leq 0.5 \\ 2, & \text{if } 0.5 \lt u_i \leq 1 \end{cases} }[/math]