stat341f11

Notation

The following guidelines on notation were posted on the Wiki Course Note page for STAT 946. Add to them as necessary for consistent notation on this page.

Capital letters will be used to denote random variables and lower case letters denote observations for those random variables:

• $\{X_1,\ X_2,\ \dots,\ X_n\}$ random variables
• $\{x_1,\ x_2,\ \dots,\ x_n\}$ observations of the random variables

The joint probability mass function can be written as:

$P( X_1 = x_1, X_2 = x_2, \dots, X_n = x_n )$

or as shorthand, we can write this as $p( x_1, x_2, \dots, x_n )$. In these notes both types of notation will be used. We can also define a set of random variables $X_Q$ where $Q$ represents a set of subscripts.

Sampling - September 20, 2011

Sampling means generating data points or numbers that follow a given distribution.
i.e. From $x \sim~f(x)$ sample $\,x_{1}, x_{2}, ..., x_{1000}$

In practice, it may be difficult to find the joint distribution of random variables. We will explore different methods for simulating random variables and for drawing conclusions from the simulated data.

Sampling from Uniform Distribution

Computers cannot generate truly random numbers because they are deterministic; however, they can produce pseudorandom numbers using algorithms. The generated numbers mimic the properties of random numbers but are never truly random. One well-known algorithm that is considered highly reliable is the Mersenne twister[1], which generates numbers that are very nearly uniformly distributed.

Multiplicative Congruential

• involves four parameters: integers $\,a, b, m$, and an initial value $\,x_0$ which we call the seed
• a sequence of integers is defined as
$x_{k+1} \equiv (ax_{k} + b) \mod{m}$

Example: $\,a=13, b=0, m=31, x_0=1$ creates a uniform histogram.

MATLAB code for generating 1000 random numbers using the multiplicative congruential method:

a = 13;
b = 0;
m = 31;
x(1) = 1;

for ii = 2:1000
x(ii) = mod(a*x(ii-1)+b, m);
end


MATLAB code for displaying the values of x generated:

x


MATLAB code for plotting the histogram of x:

hist(x)


Histogram Output:

• In this example, the first 30 terms in the sequence are a permutation of the integers from 1 to 30, after which the sequence repeats itself. In general, with $b = 0$ and prime $m$, the period is at most m-1, and it equals m-1 only for suitable choices of $a$ (such as $a = 13$ with $m = 31$ here).
• Values are between 0 and m-1, inclusive. (With $b = 0$, prime $m$ and a nonzero seed, the value 0 never occurs.)
• Dividing the numbers by m-1 yields numbers in the interval [0,1].
• MATLAB's rand function once used this algorithm with $a = 7^5$, $b = 0$, $m = 2^{31}-1$, for reasons described in Park and Miller's 1988 paper "Random Number Generators: Good Ones are Hard to Find" (available online).
• Visual Basic's RND function also used this algorithm with $a = 1140671485$, $b = 12820163$, $m = 2^{24}$. (Reference)
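
The period of the example generator can be checked directly. The following is a minimal Python sketch (Python is used only for illustration; the function names `lcg` and `period` are hypothetical helpers, not part of the course material). It assumes the seed lies on a cycle, which holds for $b = 0$, prime $m$ and a nonzero seed.

```python
def lcg(a, b, m, seed, n):
    """Return the first n terms x_0, ..., x_{n-1} of x_{k+1} = (a*x_k + b) mod m."""
    xs = [seed]
    for _ in range(n - 1):
        xs.append((a * xs[-1] + b) % m)
    return xs


def period(a, b, m, seed):
    """Number of steps until the sequence first returns to the seed.

    Assumes the seed lies on a cycle (true for b = 0, prime m, nonzero seed).
    """
    x = (a * seed + b) % m
    steps = 1
    while x != seed:
        x = (a * x + b) % m
        steps += 1
    return steps


first_30 = lcg(13, 0, 31, 1, 30)
print(sorted(first_30) == list(range(1, 31)))  # are the first 30 terms a permutation of 1..30?
print(period(13, 0, 31, 1))                    # length of the cycle through the seed
```

Trying other multipliers with the same modulus shows that not every choice of $a$ reaches the maximal period, which is why the parameters have to be chosen with care.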

Inverse Transform Method

This is a basic method for sampling. In theory, this method lets us generate random samples from any probability distribution once we know its cumulative distribution function (cdf). It is computationally very efficient if the cdf can be inverted analytically.

Theorem

Take $U \sim~ \mathrm{Unif}[0, 1]$ and let $\ X = F^{-1}(U)$, where $F^{-1}(\cdot)$ is the inverse of $F(\cdot)$. Then $\ X$ has cumulative distribution function $F(\cdot)$, i.e. $F(x)=P(X \leq x)$.

Proof

Recall that

$P(a \leq X\lt b)=\int_a^{b} f(x) dx$
$cdf=F(x)=P(X \leq x)=\int_{-\infty}^{x} f(t) dt$

Note that if $U \sim~ \mathrm{Unif}[0, 1]$, we have $P(U \leq a)=a$

\begin{align} P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\ &{}= P(U \leq F(x)) \\ &{}= F(x) \end{align}

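The first equality holds because $F$ is monotonically increasing, so applying $F$ to both sides preserves the inequality. This completes the proof.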

Continuous Case

Generally it takes two steps to get random numbers using this method.

• Step 1. Draw $U \sim~ \mathrm{Unif}[0, 1]$
• Step 2. $X=F^{-1}(U)$

Example

Take the exponential distribution for example

$\,f(x)={\lambda}e^{-{\lambda}x}$
$\,F(x)=\int_0^x {\lambda}e^{-{\lambda}u} du=[-e^{-{\lambda}u}]_0^x=1-e^{-{\lambda}x}$

Let: $\,F(x)=y$

$\,y=1-e^{-{\lambda}x}$
$\,ln(1-y)={-{\lambda}x}$
$\,x=\frac{ln(1-y)}{-\lambda}$
$\,F^{-1}(x)=\frac{-ln(1-x)}{\lambda}$

Therefore, generating an exponential random variable from a uniform random variable takes two steps.

• Step 1. Draw $U \sim~ \mathrm{Unif}[0, 1]$
• Step 2. $x=\frac{-ln(1-U)}{\lambda}$

Note: If U~Unif[0, 1], then (1 - U) and U have the same distribution. This allows us to slightly simplify step 2 into an alternate form:

• Alternate Step 2. $x=\frac{-ln(U)}{\lambda}$

MATLAB code

For the exponential distribution case, assuming $\lambda=0.5$:

for ii = 1:1000
u = rand;
x(ii) = -log(1-u)/0.5;
end
hist(x)


MATLAB result

Discrete Case - September 22, 2011

This same technique can be applied to the discrete case. Suppose we want to generate a discrete random variable $\,X$ that has probability mass function $\,P(X=x_i)=P_i$, where $\,x_0\lt x_1\lt x_2 \lt \dots$ and $\,\sum_i P_i=1$.

• Step 1. Draw $u \sim~ \mathrm{Unif}[0, 1]$
• Step 2. $\,x=x_i$ if $\,F(x_{i-1})\lt u \leq F(x_i)$

Example

Let x be a discrete random variable with the following probability mass function:

\begin{align} P(X=0) = 0.3 \\ P(X=1) = 0.2 \\ P(X=2) = 0.5 \end{align}

Given the pmf, we now need to find the cdf.

We have:

$F(x) = \begin{cases} 0 & x \lt 0 \\ 0.3 & 0 \leq x \lt 1 \\ 0.5 & 1 \leq x \lt 2 \\ 1 & x \geq 2 \end{cases}$

We can apply the inverse transform method to obtain our random numbers from this distribution.

Pseudo Code for generating the random numbers:

Draw U ~ Unif[0,1]
if U <= 0.3
return 0
else if 0.3 < U <= 0.5
return 1
else if 0.5 < U <= 1
return 2


MATLAB code for generating 1000 random numbers in the discrete case:

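x = zeros(1, 1000);   % preallocate the samples
for ii = 1:1000
    u = rand;         % u ~ Unif[0,1]

    if u <= 0.3
        x(ii) = 0;
    elseif u <= 0.5   % elseif (one word) keeps this a single if-block
        x(ii) = 1;
    else
        x(ii) = 2;
    end
end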


Matlab Output:

Pseudo code for the Discrete Case:

1. Draw U ~ Unif [0,1]

2. If $U \leq P_0$, deliver $X = x_0$

3. Else if $U \leq P_0 + P_1$, deliver $X = x_1$

4. In general, else if $U \leq P_0 + \dots + P_k$, deliver $X = x_k$
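
The pseudo code above can be written once for an arbitrary probability mass function. The following is a minimal Python sketch, not a reference implementation; the function name `sample_discrete` and its arguments are hypothetical, and the pmf used in the demonstration is the three-point example from earlier in this section.

```python
import random
from itertools import accumulate


def sample_discrete(values, probs, rng):
    """Draw one value by inverting the cdf: return x_k for the first k
    such that U <= P_0 + ... + P_k."""
    u = rng.random()
    for x, cum in zip(values, accumulate(probs)):
        if u <= cum:
            return x
    # Guard against floating-point round-off in the final cumulative sum.
    return values[-1]


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_discrete([0, 1, 2], [0.3, 0.2, 0.5], rng) for _ in range(20000)]
freq = {v: draws.count(v) / len(draws) for v in (0, 1, 2)}
print(freq)  # relative frequencies, expected to be near 0.3, 0.2 and 0.5
```

The loop scans the cumulative sums in order, so the expected cost grows with the number of values; this is one reason the method becomes cumbersome for distributions with many possible values.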

Limitations

This method is useful, but it's not practical in many cases because we can't always obtain $F$ or $F^{-1}$ (some functions are not integrable or invertible), and sometimes $f(x)$ cannot be obtained in closed form. Let's look at some examples:

• Continuous case

If we want to use this method to sample from the normal distribution, we get stuck when trying to find its cdf. The simplest case, the standard normal distribution, has pdf $f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$, whose cdf is $F(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}{e^{-\frac{u^2}{2}}}du$. This integral cannot be expressed in terms of elementary functions, so evaluating it and then finding the inverse is a very difficult task.

• Discrete case

It is easy to simulate a random variable that takes only a few values, as in the example above. It is also easy to simulate the binomial distribution $X \sim~ \mathrm{B}(n,p)$ when the parameter n is not too large. But when n is large, say 50, the number of cases to check makes this approach cumbersome.

Acceptance/Rejection Method

The aforementioned difficulties of the inverse transform method motivate a sampling method that does not require analytically calculating cdf's and their inverses: the acceptance/rejection sampling method. Here, $\displaystyle f(x)$ is approximated by another function, say $\displaystyle g(x)$, with the idea being that $\displaystyle g(x)$ is a "nicer" function to work with than $\displaystyle f(x)$.

Suppose we assume the following:

1. There exists another distribution $\displaystyle g(x)$ that is easier to work with and that you know how to sample from, and

2. There exists a constant c such that $f(x) \leq c \cdot g(x)$ for all x

Under these assumptions, we can sample from $\displaystyle f(x)$ by sampling from $\displaystyle g(x)$

General Idea

Looking at the image below we have graphed $c \cdot g(x)$ and $\displaystyle f(x)$.

Using the acceptance/rejection method we will accept some of the points drawn from $\displaystyle g(x)$ and reject others. The accepted points will be distributed according to $\displaystyle f(x)$. We can see from the image that values around $\displaystyle x_1$ will be sampled more often under $c \cdot g(x)$ than under $\displaystyle f(x)$, so we will have to reject more of the samples taken near $\displaystyle x_1$. Around $\displaystyle x_2$ the number of samples that are drawn and the number of samples we need are much closer, so we accept more of the samples that we get near $\displaystyle x_2$. These considerations matter when evaluating the efficiency of a given acceptance/rejection scheme: rejecting a high proportion of samples means it takes longer to obtain the desired number of samples, so we should ask whether a better function $g(x)$ can be chosen.

Procedure

1. Draw y ~ g

2. Draw U ~ Unif [0,1]

3. If $U \leq \frac{f(y)}{c \cdot g(y)}$ then x=y; else return to 1

Note that the choice of $c$ plays an important role in the efficiency of the algorithm. We want $c \cdot g(x)$ to be "tightly fit" over $f(x)$ to increase the probability of accepting points, and therefore reducing the number of sampling attempts. Mathematically, we want to minimize $c$ such that $f(x) \leq c \cdot g(x) \ \forall x$. We do this by setting

$\frac{d}{dx}(\frac{f(x)}{g(x)}) = 0$, solving for a maximum point $x_0$ and setting $c = \frac{f(x_0)}{g(x_0)}.$

Proof

Mathematically, we need to show that the sample points given that they are accepted have a distribution of f(x).

\begin{align} P(y|accepted) &= \frac{P(y, accepted)}{P(accepted)} \\ &= \frac{P(accepted|y) P(y)}{P(accepted)}\end{align} (Bayes' Rule)

$\displaystyle P(y) = g(y)$

$P(accepted|y) =P(u\leq \frac{f(y)}{c \cdot g(y)}) =\frac{f(y)}{c \cdot g(y)}$, where u ~ Unif [0,1]

$P(accepted) = \int P(accepted|s)\cdot g(s) ds=\int \frac{f(s)}{c \cdot g(s)}g(s) ds=\int \frac{f(s)}{c} ds=\frac{1}{c} \cdot\int f(s) ds=\frac{1}{c}$

So,

$P(y|accepted) = \frac{ \frac {f(y)}{c \cdot g(y)} \cdot g(y)}{\frac{1}{c}} =f(y)$

Continuous Case

Example

Sample from Beta(2,1)

In general:

$Beta(\alpha, \beta) = \frac{\Gamma (\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}$ $\displaystyle x^{\alpha-1}$ $\displaystyle(1-x)^{\beta-1}$, $\displaystyle 0\lt x\lt 1$

Note: $\!\Gamma(n) = (n-1)!$ if n is a positive integer

\begin{align} f(x) &= Beta(2,1) \\ &= \frac{\Gamma(3)}{\Gamma(2)\Gamma(1)} x^1(1-x)^0 \\ &= \frac{2!}{1! 0!}\cdot (1) x \\ &= 2x \end{align}

We want to choose $\displaystyle g(x)$ that is easy to sample from. So we choose $\displaystyle g(x)$ to be the uniform distribution on $\ (0,1)$, since that is the support of the Beta distribution.

We now want a constant c such that $f(x) \leq c \cdot g(x)$ for all x in (0,1)

So,

$c \geq \frac{f(x)}{g(x)}$, for all x from (0,1)

\begin{align}c &\geq max (\frac {f(x)}{g(x)}, 0\lt x\lt 1) \\ &= max (\frac {2x}{1},0\lt x\lt 1) \\ &= 2 \end{align}

Now that we have c =2,

1. Draw y ~ g(x) => Draw y ~ Unif [0,1]

2. Draw u ~ Unif [0,1]

3. if $u \leq \frac{2y}{2 \cdot 1}$ then x=y; else return to 1

MATLAB code for generating 1000 samples following Beta(2,1):

close all
clear all
ii=1;
while ii <= 1000
y = rand;
u = rand;

if u <= y
x(ii)=y;
ii=ii+1;
end
end
hist(x)


MATLAB result

Discrete Example

Generate random variables according to the p.m.f:

\begin{align} P(Y=1) = 0.15 \\ P(Y=2) = 0.22 \\ P(Y=3) = 0.33 \\ P(Y=4) = 0.10 \\ P(Y=5) = 0.20 \end{align}

Choose g(y) to be the discrete uniform distribution on 1 to 5, so that g(y) = 0.2 for all y.

$c \geq \frac{P(y)}{g(y)}$
$c = \max \left(\frac{P(y)}{g(y)} \right)$
$c = \max \left(\frac{0.33}{0.2} \right) = 1.65$ Since P(Y=3) is the max of P(Y) and g(y) = 0.2 for all y.

1. Generate Y according to the discrete uniform between 1 - 5

2. U ~ unif[0,1]

3. If $U \leq \frac{P(y)}{1.65 \times 0.2} = \frac{P(y)}{0.33}$, then x = y; else return to 1.

In MATLAB, the code would be:

py = [0.15 0.22 0.33 0.1 0.2];
ii =1;
while ii <= 1000
y = unidrnd(5);
u = rand;
if u <= py(y)/0.33
x(ii) = y;
ii = ii+1;
end
end
hist(x);


MATLAB result

Limitations

We usually have to draw more points from g(x) than the number of samples from f(x) that we want, so this method may not be computationally efficient; the efficiency depends on our choice of g(x). Since P(accepted) = 1/c, on average c draws from g(x) are needed per accepted sample. For example, to sample from Beta(2,1) as above, we need roughly 2000 samples from g(x) to get 1000 accepted samples from f(x).
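
The cost of rejection can be measured by counting proposals. The following is a minimal Python sketch of the acceptance/rejection procedure applied to the Beta(2,1) example with a Unif(0,1) proposal and c = 2; the function name `rejection_sample` and its argument names are hypothetical, chosen only for this illustration.

```python
import random


def rejection_sample(f, g_sample, g_pdf, c, n, rng):
    """Draw n samples from density f using proposals from g, where f(x) <= c*g(x).

    Returns the accepted samples and the total number of proposals drawn.
    """
    samples, proposals = [], 0
    while len(samples) < n:
        y = g_sample(rng)          # step 1: draw y ~ g
        u = rng.random()           # step 2: draw u ~ Unif[0,1]
        proposals += 1
        if u <= f(y) / (c * g_pdf(y)):   # step 3: accept or reject
            samples.append(y)
    return samples, proposals


rng = random.Random(1)  # fixed seed so the run is repeatable
samples, proposals = rejection_sample(
    f=lambda x: 2 * x,               # Beta(2,1) density on (0,1)
    g_sample=lambda r: r.random(),   # proposal: Unif(0,1)
    g_pdf=lambda x: 1.0,
    c=2.0,
    n=10000,
    rng=rng,
)
print(proposals / len(samples))     # average proposals per accepted sample, expected near c = 2
print(sum(samples) / len(samples))  # sample mean, expected near 2/3 for Beta(2,1)
```

Passing a different `g_sample`, `g_pdf` and `c` shows how a tighter envelope lowers the number of proposals per accepted sample.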

In addition, a mismatch between the shapes of f(x) and g(x) can make this method impractical. For example, if g(x) is a normal density and f(x) has a "fat" mid-section and "thin" tails, most of the points drawn near the two ends will be rejected, so a very large number of points must be sampled because of the high rejection rate.

Sampling From Gamma and Normal Distribution - September 27, 2011

Sampling From Gamma

Gamma Distribution

A random variable with a Gamma distribution is written as $X \sim~ Gamma (t, \lambda)$. For a positive integer $t$, its cdf is

$F(x) = \int_{0}^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy$

If you have t independent samples from the exponential distribution,

\begin{align} X_1 \sim~ Exp(\lambda)\\ \vdots \\ X_t \sim~ Exp(\lambda) \end{align}

The sum of these t samples has a gamma distribution,

$X_1+X_2+ ... + X_t \sim~ Gamma (t, \lambda)$
$\sum_{i=1}^{t} X_i \sim~ Gamma (t, \lambda)$ where $X_i \sim~Exp(\lambda)$

Method
Suppose we want to sample $\ k$ points from $\ Gamma (t, \lambda)$.
We can sample the exponential distribution using the inverse transform method from the previous class,

$\,f(x)={\lambda}e^{-{\lambda}x}$
$\,F^{-1}(u)=\frac{-ln(1-u)}{\lambda}$
$\,F^{-1}(u)=\frac{-ln(u)}{\lambda}$

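The second form follows because $1 - u$ has the same distribution as $u$ when $U \sim~ unif [0,1]$.

Each of the $k$ samples is the sum of $t$ exponential samples, using a fresh set of uniforms $u_1, \dots, u_t$ each time:

\begin{align} \frac{-ln(u_1)}{\lambda} - \frac{ln(u_2)}{\lambda} - ... - \frac{ln(u_t)}{\lambda} = x_1\\ \vdots \\ \frac{-ln(u_1)}{\lambda} - \frac{ln(u_2)}{\lambda} - ... - \frac{ln(u_t)}{\lambda} = x_k \end{align}

That is, each sample is computed as
$\frac {-\sum_{i=1}^{t} ln(u_i)}{\lambda} = x$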
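
The sum formula above can be sketched for a general rate $\lambda$ as follows. This is a minimal Python illustration, assuming integer $t$; the function name `sample_gamma` is hypothetical. It checks the sample mean and variance against the standard Gamma moments $t/\lambda$ and $t/\lambda^2$.

```python
import math
import random


def sample_gamma(t, lam, rng):
    """One Gamma(t, lam) draw (integer t) as a sum of t Exp(lam) draws,
    each obtained by the inverse transform -ln(u)/lam."""
    # 1 - rng.random() lies in (0, 1], which avoids log(0).
    return -sum(math.log(1.0 - rng.random()) for _ in range(t)) / lam


rng = random.Random(2)  # fixed seed so the run is repeatable
xs = [sample_gamma(3, 2.0, rng) for _ in range(20000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
print(mean)  # expected near t/lam = 1.5
print(var)   # expected near t/lam^2 = 0.75
```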

MATLAB code for a Gamma(3,1) is

x = sum(-log(rand(1000,3)),2);
hist(x)


The histogram of x follows a Gamma distribution with a long right tail:

We can improve the quality of histogram by adjusting the Matlab code to include the number of bins we want: hist(x, number_of_bins)

x = sum(-log(rand(20000,3)),2);
hist(x,40)


R code for a Gamma(3,1) is

a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);
hist(a);


Histogram:

Here is another histogram of the Gamma samples in R, with a density estimate overlaid:

a<-apply(-log(matrix(runif(3000),nrow=1000)),1,sum);
hist(a,freq=F);
lines(density(a),col="blue");
rug(jitter(a));


Sampling from Normal Distribution using Box-Muller Transform - September 29, 2011

Procedure

1. Generate $\displaystyle u_1$ and $\displaystyle u_2$, two independent values sampled from a uniform distribution between 0 and 1.
2. Set $\displaystyle R^2 = -2\log(u_1)$ so that $\displaystyle R^2$ is exponential with mean 2. Set $\!\theta = 2\pi u_2$ so that $\!\theta$ ~ $\ Unif[0, 2\pi]$.
3. Set $\displaystyle X = R \cos(\theta)$ and $\displaystyle Y = R \sin(\theta)$.

Justification

Suppose we have X ~ N(0, 1) and Y ~ N(0, 1) where X and Y are independent normal random variables. The joint probability density function of these two random variables using Cartesian coordinates is:

$f(x, y) dxdy= f(x) f(y) dxdy= \frac{1}{\sqrt{2\pi}}e^{-x^2/2} \frac{1}{\sqrt{2\pi}}e^{-y^2/2} dxdy= \frac{1}{2\pi}e^{-(x^2+y^2)/2}dxdy$

In polar coordinates $\displaystyle R^2 = x^2 + y^2$, so the joint density of these two random variables, written in terms of $R$, is:

$f(R, \theta) = \frac{1}{2\pi}e^{-R^2/2}$

If we have $\displaystyle R^2 \sim exp(1/2)$ and $\!\theta \sim unif[0, 2\pi]$ we get an equivalent joint probability density function. Notice that when changing variables from Cartesian to polar coordinates, the determinant of the Jacobian needs to be included in the change of variable formula, where: $|J|=\left|\frac{\partial(x,y)}{\partial(R,\theta)}\right|= \left|\begin{matrix}\frac{\partial x}{\partial R}&\frac{\partial x}{\partial \theta}\\\frac{\partial y}{\partial R}&\frac{\partial y}{\partial \theta}\end{matrix}\right|=R$

$f(x, y) dxdy = f(R, \theta)|J|dRd\theta = \frac{1}{2\pi}e^{-R^2/2}R dRd\theta= \frac{1}{4\pi}e^{-\frac{S}{2}} dSd\theta$
where $S=R^2$, so that $dS = 2R\,dR$. The last expression is the product of the exponential density $\frac{1}{2}e^{-S/2}$ in $S$ and the uniform density $\frac{1}{2\pi}$ in $\theta$.

Therefore we can generate a point in polar coordinates using the uniform and exponential distributions, then convert the point to Cartesian coordinates and the resulting X and Y values will be equivalent to samples generated from N(0, 1).
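
The procedure can be checked numerically by comparing sample moments with those of N(0, 1). The following is a minimal Python sketch, assuming only the standard library; the function name `box_muller` is hypothetical.

```python
import math
import random


def box_muller(rng):
    """Return one pair (x, y) of independent N(0, 1) draws."""
    u1 = 1.0 - rng.random()   # lies in (0, 1], which avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))   # R, where R^2 is exponential with mean 2
    theta = 2.0 * math.pi * u2           # theta ~ Unif[0, 2*pi)
    return r * math.cos(theta), r * math.sin(theta)


rng = random.Random(3)  # fixed seed so the run is repeatable
pairs = [box_muller(rng) for _ in range(20000)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]
n = len(pairs)
mean_x = sum(xs) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
cross = sum(x * y for x, y in pairs) / n
print(mean_x, var_x)  # expected near 0 and 1
print(cross)          # E[XY], expected near 0 since X and Y are independent
```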

MATLAB code

In MatLab this algorithm can be implemented with the following code, which generates 20,000 samples from N(0, 1):

x = zeros(10000, 1);
y = zeros(10000, 1);
for ii = 1:10000
u1 = rand;
u2 = rand;
R2 = -2 * log(u1);
theta = 2 * pi * u2;
x(ii) = sqrt(R2) * cos(theta);
y(ii) = sqrt(R2) * sin(theta);
end
hist(x)


In one execution of this script, the following histogram for x was generated:

Non-Standard Normal Distributions

Example 1: Single-variate Normal

If X ~ Norm(0, 1) then (a + bX) has a normal distribution with a mean of $\displaystyle a$ and a standard deviation of $\displaystyle b$ (which is equivalent to a variance of $\displaystyle b^2$). Using this information with the Box-Muller transform, we can generate values sampled from some random variable $\displaystyle Y\sim N(a,b^2)$ for arbitrary values of $\displaystyle a,b$.

1. Generate a sample u from Norm(0, 1) using the Box-Muller transform.
2. Set v = a + bu.

The values for v generated in this way will be equivalent to samples from a $\displaystyle N(a, b^2)$ distribution. We can modify the MATLAB code used in the last section to demonstrate this. We just need to add one line before we generate the histogram:

v = a + b * x;


For instance, this is the histogram generated when b = 15, a = 125:

Example 2: Multi-variate Normal

The Box-Muller method can be extended to higher dimensions to generate multivariate normals. The objects generated will be nx1 vectors, and their variance will be described by nxn covariance matrices.

$\mathbf{z} \sim N(\mathbf{u}, \Sigma)$ defines the n by 1 vector $\mathbf{z}$ such that:

• $\displaystyle u_i$ is the average of $\displaystyle z_i$
• $\!\Sigma_{ii}$ is the variance of $\displaystyle z_i$
• $\!\Sigma_{ij}$ is the co-variance of $\displaystyle z_i$ and $\displaystyle z_j$

If $\displaystyle z_1, z_2, ..., z_d$ are independent normal variables with mean 0 and variance 1, then the vector $\displaystyle (z_1, z_2,..., z_d)$ has mean 0 and covariance matrix $\!I$, where 0 is the zero vector and $\!I$ is the identity matrix. This fact suggests that a standard multivariate normal can be generated by generating each component individually as a univariate standard normal variable.

The mean and the covariance matrix of a multivariate normal distribution can be adjusted in ways analogous to the single variable case. If $\mathbf{z} \sim N(0,I)$, then $\Sigma^{1/2}\mathbf{z}+\mu \sim N(\mu,\Sigma)$. Note here that the covariance matrix is symmetric and positive semidefinite, so its square root always exists.

We can compute $\mathbf{z}$ in the following way:

1. Generate an n by 1 vector $\mathbf{x} = \begin{bmatrix}x_{1} & x_{2} & ... & x_{n}\end{bmatrix}^T$ where $x_{i}$ ~ Norm(0, 1) using the Box-Muller transform.
2. Calculate $\!\Sigma^{1/2}$ using singular value decomposition.
3. Set $\mathbf{z} = \Sigma^{1/2} \mathbf{x} + \mathbf{u}$.
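
The three steps can be sketched for the two-dimensional case as follows. This is a minimal Python illustration under stated assumptions: it uses a hand-written Cholesky factor in place of the singular value decomposition (any matrix $A$ with $AA^T = \Sigma$ can play the role of $\Sigma^{1/2}$), it requires a positive definite $\Sigma$, and it draws the standard normals with Python's `random.gauss` rather than Box-Muller. The function names `cholesky_2x2` and `sample_mvn2` are hypothetical.

```python
import math
import random


def cholesky_2x2(s11, s12, s22):
    """Entries (l11, l21, l22) of the lower-triangular L with L L^T = [[s11, s12], [s12, s22]]."""
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 * l21)
    return l11, l21, l22


def sample_mvn2(mu, sigma, rng):
    """One draw z = L x + mu, where x is a pair of independent N(0, 1) values."""
    l11, l21, l22 = cholesky_2x2(sigma[0][0], sigma[0][1], sigma[1][1])
    x1 = rng.gauss(0.0, 1.0)
    x2 = rng.gauss(0.0, 1.0)
    return mu[0] + l11 * x1, mu[1] + l21 * x1 + l22 * x2


rng = random.Random(4)  # fixed seed so the run is repeatable
zs = [sample_mvn2((0.0, -3.0), [[1.0, 0.9], [0.9, 1.0]], rng) for _ in range(20000)]
n = len(zs)
m1 = sum(z[0] for z in zs) / n
m2 = sum(z[1] for z in zs) / n
cov = sum((z[0] - m1) * (z[1] - m2) for z in zs) / (n - 1)
print(m1, m2)  # expected near 0 and -3
print(cov)     # expected near 0.9
```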

The following MATLAB code provides an example, where a scatter plot of 10000 random points is generated. In this case the two components have a covariance of 0.9; since both variances are 1, this is also their correlation, which is a very strong positive correlation.

x = zeros(10000, 1);
y = zeros(10000, 1);
for ii = 1:10000
u1 = rand;
u2 = rand;
R2 = -2 * log(u1);
theta = 2 * pi * u2;
x(ii) = sqrt(R2) * cos(theta);
y(ii) = sqrt(R2) * sin(theta);
end

E = [1, 0.9; 0.9, 1];
[u s v] = svd(E);
root_E = u * (s ^ (1 / 2));

z = (root_E * [x y]');
z(1,:) = z(1,:) + 0;
z(2,:) = z(2,:) + -3;

scatter(z(1,:), z(2,:))


This code generated the following scatter plot:

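In MATLAB, we can also use the function "sqrtm()" or "chol()" (Cholesky decomposition) to calculate a square root of a matrix directly. Note that the resulting root matrices may be different, but this does not materially affect the simulation. (chol(E) returns an upper-triangular factor R with R'*R = E, so its transpose R' is the factor that multiplies the standard normal vector.) Here is an example: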

E = [1, 0.9; 0.9, 1];
r1 = sqrtm(E);
r2 = chol(E);


R code for a multivariate normal distribution:

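n=10000;
# Box-Muller: two independent N(0,1) vectors x and y
r2 <- -2*log(runif(n));
theta <- 2*pi*(runif(n));
x <- sqrt(r2)*cos(theta);
y <- sqrt(r2)*sin(theta);
a <- matrix(c(x,y),nrow=n,byrow=F);
# covariance matrix with unit variances and covariance 0.9
e <- matrix(c(1,.9,.9,1),nrow=2,byrow=T);
svde <- svd(e);
# square root of e: scale the columns of u by the square roots of the singular values
root_e <- svde$u %*% diag(sqrt(svde$d));
z <- t(root_e %*% t(a));
# shift the means to 5 and -8
z[,1]=z[,1]+5;
z[,2]=z[,2]+ -8;
par(pch=19);
plot(z,col=rgb(1,0,0,alpha=0.06))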

Remarks

MATLAB's randn function uses the ziggurat method to generate normally distributed samples. It is an efficient rejection method based on covering the probability density function with a set of horizontal rectangles so as to obtain points within each rectangle. It is reported that an 800 MHz Pentium III laptop can generate over 10 million random numbers from the normal distribution in less than one second. (Reference)

Sampling From Binomial Distributions

In order to generate a sample x from $\displaystyle X \sim Bin(n, p)$, we can use the following procedure:

1. Generate n uniform random numbers sampled from $\displaystyle Unif [0, 1]$: $\displaystyle u_1, u_2, ..., u_n$.

2. Set x to be the number of indices $\displaystyle i$, $\displaystyle 1 \leq i \leq n$, for which $\displaystyle u_i \leq p$.

In MatLab this can be coded with a single line. The following generates a sample from $\displaystyle X \sim Bin(n, p)$

sum(rand(n, 1) <= p, 1)
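
The same procedure, repeated to produce many binomial samples, can be sketched in Python as follows. This is an illustrative sketch only; the function name `sample_binomial` is hypothetical. The sample mean is compared with the binomial mean $np$.

```python
import random


def sample_binomial(n, p, rng):
    """One Bin(n, p) draw: count how many of n uniform draws are <= p."""
    return sum(1 for _ in range(n) if rng.random() <= p)


rng = random.Random(5)  # fixed seed so the run is repeatable
draws = [sample_binomial(20, 0.3, rng) for _ in range(5000)]
print(sum(draws) / len(draws))  # sample mean, expected near n*p = 6
```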


Bayesian Inference and Frequentist Inference - October 4, 2011

Bayesian inference vs Frequentist inference

The Bayesian method has become popular in the last few decades as simulation and computer technology have made it more applicable. For more information about its history and application, please refer to http://en.wikipedia.org/wiki/Bayesian_inference. As for frequentist inference, please refer to http://en.wikipedia.org/wiki/Frequentist_inference.

Example

Consider the 'probability' that a person drinks a cup of coffee on a specific day. The interpretations of this for a frequentist and a bayesian are as follows:

Frequentist: There is no interpretation of this expression. It is essentially meaningless because the event occurs only once, so no long-run frequency exists. Therefore, it is not a probability.
Bayesian: Probability captures not only the frequency of occurrences but also one's degree of belief about the random component of a proposition. Therefore it is a valid probability.

Example of face identification

Consider a picture of a face that is associated with an identity (person). Take the face as input x and the person as output y. The person can be either Ali or Tom. We have y=1 if it is Ali and y=0 if it is Tom. We can divide the picture into 100*100 pixels and insert them into a 10,000*1 column vector, which captures x.

If you are a frequentist, you would compare $\Pr(X=x|y=1)$ with $\Pr(X=x|y=0)$ and see which one is higher.
If you are Bayesian, you would compare $\Pr(y=1|X=x)$ with $\Pr(y=0|X=x)$.

Summary of differences between two schools

• Frequentist: Probability refers to limiting relative frequency. (objective)
• Bayesian: Probability describes degree of belief not frequency. (subjective)

e.g. The statement "the probability that you drank a cup of tea on May 20, 2001 is 0.62" does not refer to any frequency.

• Frequentist: Parameters are fixed, unknown constants.
• Bayesian: Parameters are random variables and we can make probabilistic statement about them.

• Frequentist: Statistical procedures should be designed to have long run frequency properties.

e.g. a 95% confidence interval should trap the true value of the parameter with limiting frequency at least 95%.

• Bayesian: It makes inferences about $\theta$ by producing a probability distribution for $\theta$. Inference (e.g. point estimates and interval estimates) will be extracted from this distribution :$f(\theta|X) = \frac{f(X | \theta)\, f(\theta)}{f(X)}.$

Bayesian inference

Bayesian inference is usually carried out in the following way:

1. Choose a prior probability density function of $\!\theta$ which is $f(\!\theta)$. This is our belief about $\theta$ before we see any data.

2. Choose a statistical model $\displaystyle f(x|\theta)$ that reflects our beliefs about X.

3. After observing data $\displaystyle x_1,...,x_n$, we update our beliefs and calculate the posterior probability.

$f(\theta|x) = \frac{f(\theta,x)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{f(x)}=\frac{f(x|\theta) \cdot f(\theta)}{\int^{}_\theta f(x|\theta) \cdot f(\theta) d\theta}$, where $\displaystyle f(\theta|x)$ is the posterior probability, $\displaystyle f(\theta)$ is the prior probability, $\displaystyle f(x|\theta)$ is the likelihood of observing X=x given $\!\theta$ and f(x) is the marginal probability of X=x.

If we have i.i.d. observations $\displaystyle x_1,...,x_n$, we can replace $\displaystyle f(x|\theta)$ with $f({x_1,...,x_n}|\theta)=\prod_{i=1}^n f(x_i|\theta)$ because of independence.

We denote $\displaystyle f({x_1,...,x_n}|\theta)$ as $\displaystyle L_n(\theta)$ which is called likelihood. And we use $\displaystyle x^n$ to denote $\displaystyle (x_1,...,x_n)$.

$f(\theta|x^n) = \frac{f(x^n|\theta) \cdot f(\theta)}{f(x^n)}=\frac{f(x^n|\theta) \cdot f(\theta)}{\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta}$ , where $\int^{}_\theta f(x^n|\theta) \cdot f(\theta) d\theta$ is a constant $\displaystyle c_n$ which does not depend on $\displaystyle \theta$. So $f(\theta|x^n) \propto f(x^n|\theta) \cdot f(\theta)$. The posterior probability is proportional to the likelihood times the prior probability. Note that it does not matter if we throw away $\displaystyle c_n$, because we can always recover it by requiring the posterior to integrate to 1.

What do we do with the posterior distribution?

• Point estimate

$\bar{\theta}=\int\theta \cdot f(\theta|x^n) d\theta=\frac{\int\theta \cdot L_n(\theta)\cdot f(\theta) d\theta}{c_n}$

• Bayesian interval estimate

Choose $a$ and $b$ such that $\int^{a}_{-\infty} f(\theta|x^n) d\theta=\int^{\infty}_{b} f(\theta|x^n) d\theta=\alpha/2$

Let C=(a,b); Then $P(\theta\in C|x^n)=\int^{b}_{a} f(\theta|x^n)d\theta=1-\alpha$. C is a $\displaystyle 1-\alpha$ posterior interval.

Let $\tilde{\theta}=(\theta_1,...,\theta_d)^T$, then $f(\theta_1|x^n) = \int^{} \int^{} \dots \int^{}f(\tilde{\theta}|x^n)d\theta_2d\theta_3 \dots d\theta_d$ and $E(\theta_1|x^n)=\int^{}\theta_1 \cdot f(\theta_1|x^n) d\theta_1$

Example 1: Estimating parameters of a univariate Gaussian distribution

Suppose X follows a univariate Gaussian distribution (i.e. a Normal distribution) with parameters $\!\mu$ and $\displaystyle {\sigma^2}$.

(a) For Frequentists:

$f(x|\theta)= \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$

$L_n(\theta)= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2}$

$\ln L_n(\theta) = l(\theta) = \sum_{i=1}^n[ -\frac{1}{2}\ln 2\pi-\ln \sigma-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2]$

To get the maximum likelihood estimator of $\!\mu$ (mle), we find the $\hat{\mu}$ which maximizes $\displaystyle L_n(\theta)$:

$\frac{\partial l(\theta)}{\partial \mu}= \sum_{i=1}^n \frac{1}{\sigma}(\frac{x_i-\mu}{\sigma})=0 \Rightarrow \sum_{i=1}^n x_i = n\mu \Rightarrow \hat{\mu}_{mle}=\bar{x}$

(b) For Bayesians:

$f(\theta|x) \propto f(x|\theta) \cdot f(\theta)$

We assume that the mean of the above normal distribution is itself distributed normally with mean $\!\mu_0$ and variance $\!\Gamma^2$.

Suppose $\!\mu\sim N(\mu_0, \!\Gamma^2)$,

so $f(\mu) = \frac{1}{\sqrt{2\pi}\Gamma} \cdot e^{-\frac{1}{2}(\frac{\mu-\mu_0}{\Gamma})^2}$

$f(x|\mu) = \frac{1}{\sqrt{2\pi}\sigma} \cdot e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$

Multiplying the likelihood of the $n$ observations by the prior and completing the square in $\!\mu$ shows that the posterior is again normal, with mean

$\tilde{\mu} = \frac{\frac{n}{\sigma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\bar{x}+\frac{\frac{1}{\Gamma^2}}{\frac{n}{\sigma^2}+\frac{1}{\Gamma^2}}\mu_0$, where $\tilde{\mu}$ is the estimator of $\!\mu$.

• If prior belief about $\!\mu_0$ is strong, then $\!\Gamma$ is small and $\frac{1}{\Gamma^2}$ is large. $\tilde{\mu}$ is close to $\!\mu_0$ and the observations will not affect too much. On the contrary, if prior belief about $\!\mu_0$ is weak, $\!\Gamma$ is large and $\frac{1}{\Gamma^2}$ is small. $\tilde{\mu}$ depends more on observations. (This is intuitive, when our original belief is reliable, then the sample is not important in improving the result; when the belief is not reliable, then we depend a lot on the sample.)
• When the sample is large (i.e. n $\to \infty$), $\tilde{\mu} \to \bar{x}$ and the impact of prior belief about $\!\mu$ is weakened.
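
The weighting between the sample mean and the prior mean can be explored numerically. The following is a minimal Python sketch of the formula for $\tilde{\mu}$ above; the function name `posterior_mean` and its argument names are hypothetical, with `sigma2` standing for $\sigma^2$ and `gamma2` for $\Gamma^2$.

```python
def posterior_mean(xbar, n, sigma2, mu0, gamma2):
    """Weighted average of the sample mean xbar and the prior mean mu0."""
    w_data = n / sigma2      # weight on the data, n / sigma^2
    w_prior = 1.0 / gamma2   # weight on the prior, 1 / Gamma^2
    return (w_data * xbar + w_prior * mu0) / (w_data + w_prior)


# Strong prior (small Gamma^2): the estimate stays close to mu0 = 0.
print(posterior_mean(xbar=2.0, n=10, sigma2=1.0, mu0=0.0, gamma2=0.001))
# Weak prior (large Gamma^2): the estimate is close to the sample mean 2.
print(posterior_mean(xbar=2.0, n=10, sigma2=1.0, mu0=0.0, gamma2=1000.0))
# Large sample: the estimate approaches the sample mean regardless of the prior.
print(posterior_mean(xbar=2.0, n=10**6, sigma2=1.0, mu0=0.0, gamma2=0.001))
```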

Basic Monte Carlo Integration - October 6th, 2011

Three integration methods will be covered in this course:

• Basic Monte Carlo Integration
• Importance Sampling
• Markov Chain Monte Carlo (MCMC)

The first, and most basic, method of numerical integration we will see is Monte Carlo Integration. We use this to solve an integral of the form: $I = \int_{a}^{b} h(x) dx$

Note the following derivation:

\begin{align} \displaystyle I & = \int_{a}^{b} h(x)dx \\ & = \int_{a}^{b} h(x)((b-a)/(b-a))dx \\ & = \int_{a}^{b} (h(x)(b-a))(1/(b-a))dx \\ & = \int_{a}^{b} w(x)f(x)dx \\ & = E[w(x)] \\ \end{align}

$\approx \frac{1}{n} \sum_{i=1}^{n} w(x_i)$

where $w(x) = h(x)(b-a)$ and $f(x)$ is the probability density function of a uniform random variable on the interval [a,b]. The expectation of w with respect to f is approximated by the average of w over n samples of x.

General Procedure

i) Draw n samples $x_i \sim~ Unif[a,b]$

ii) Compute $\ w(x_i)$ for every sample

iii) Obtain an estimate of the integral, $\hat{I}$, as follows:

$\hat{I} = \frac{1}{n} \sum_{i=1}^{n} w(x_i)$ . Clearly, this is just the average of the simulation results.

By the strong law of large numbers $\hat{I}$ converges to $\ I$ as $\ n \rightarrow \infty$. Because of this, we can compute all sorts of useful information, such as variance, standard error, and confidence intervals.

Standard Error: $SE = \frac{\text{Standard Deviation}} {\sqrt{n}}$, where the standard deviation is $\sqrt{V}$

Variance: $V = \frac{\sum_{i=1}^{n} (w(x_i)-\hat{I})^2}{n-1}$

Confidence Interval: $\hat{I} \pm t_{(\alpha/2)} SE$
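
The estimate, its standard error and an approximate confidence interval can be computed together. The following is a minimal Python sketch; the function name `mc_integrate` is hypothetical, the integrand $x^3$ on [0,1] is chosen only as an illustration, and the critical value 1.96 is an assumption (the large-sample normal approximation for a 95% interval, used here in place of the t quantile).

```python
import math
import random


def mc_integrate(h, a, b, n, rng):
    """Monte Carlo estimate of the integral of h over [a, b] and its standard error."""
    ws = [h(rng.uniform(a, b)) * (b - a) for _ in range(n)]   # w(x_i) = h(x_i)(b - a)
    i_hat = sum(ws) / n
    var = sum((w - i_hat) ** 2 for w in ws) / (n - 1)         # sample variance V
    se = math.sqrt(var / n)                                   # SE = sqrt(V) / sqrt(n)
    return i_hat, se


rng = random.Random(6)  # fixed seed so the run is repeatable
i_hat, se = mc_integrate(lambda x: x ** 3, 0.0, 1.0, 10000, rng)
z = 1.96  # assumed critical value for an approximate 95% interval
print(i_hat, se)
print(i_hat - z * se, i_hat + z * se)
```

Because the standard error shrinks like $1/\sqrt{n}$, halving the width of the interval requires about four times as many samples.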

Example: Uniform Distribution

Consider the integral, $\int_{0}^{1} x^3dx$, which is easily solved through standard analytical integration methods, and is equal to .25. Now, let us check this answer with a numerical approximation using Monte Carlo Integration.

We generate a 1 by 10000 vector of uniform (on the interval [0,1]) random variables and call that vector 'u'. Since $b - a = 1$ here, our 'w' is simply $h(x) = x^3$, so we set $w = u^3$. Our $\hat{I}$ is equal to the mean of w.

In Matlab, we can solve this integration problem with the following code:

u = rand(1,10000);
w = u.^3;
mean(w)
ans = 0.2475


Note the '.' after 'u' in the second line of code, indicating that each entry in the matrix is cubed. Also, our approximation is close to the actual value of .25. Now let's try to get an even better approximation by generating more sample points.

u= rand(1,100000);
w= u.^3;
mean(w)
ans = .2503


We see that when the number of sample points is increased, our approximation improves, as one would expect.

Generalization

Up to this point we have seen how to numerically approximate an integral when the distribution of f is uniform. Now we will see how to generalize this to other distributions.

$I = \int h(x)f(x)dx$

If f is a probability density function (pdf), then $I$ can be estimated as $E_f[h(x)]$. This means taking the expectation of h with respect to the distribution f. Our previous example is the case where f is the uniform distribution on [a,b].

Procedure for the General Case

i) Draw n samples from f

ii) Compute $h(x_i)$ for every sample

iii) $\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)$

Example: Exponential Distribution

Find $E[\sqrt{x}]$ for $\displaystyle f = e^{-x}$, which is the exponential distribution with mean 1.

$I = \int_{0}^{\infty} \sqrt{x} e^{-x}dx$

We can see that we must draw samples from f, the exponential distribution.

To find a numerical solution using Monte Carlo Integration we see that:

u= rand(1,10000)
X= -log(u)
h= $\sqrt{X}$
I= mean(h)

To implement this procedure in Matlab, use the following code:

u = rand(1,10000);
X = -log(u);
h = X.^.5;  % MATLAB is case-sensitive, so use X as defined on the previous line
mean(h)
ans = .8841


An easy way to check whether your approximation is correct is to use the built in Matlab function 'quadl' which takes a function and bounds for the integral and returns a solution for the definite integral of that function. For this specific example, we can enter:

f = @(x) sqrt(x).*exp(-x);
% quadl runs into computational problems when the upper bound is "inf" or an extremely large number,
% so choose just a moderately large number (100 here, as an illustrative choice).
quadl(f,0,100)
ans =
0.8862


From the above result, we see that our approximation was quite close.

Example: Normal Distribution

Let $f(x) = (1/(2 \pi)^{1/2}) e^{(-x^2)/2}$. Compute the cumulative distribution function at some point x.

$F(x)= \int_{-\infty}^{x} f(s)ds = \int_{-\infty}^{x}(1)(1/(2 \pi)^{1/2}) e^{(-s^2)/2}ds$. The (1) is inserted to illustrate that our h(x) will be the constant function 1, and our f(x) is the normal distribution. To take into account the upper bound of integration, x, any values sampled that are greater than x will be set to zero.

This is the Matlab code for solving F(2):


u = randn(1,10000);
h = u < 2;
mean(h)
ans = .9756



We generate a 1 by 10000 vector of standard normal random variables and we return a value of 1 if u is less than 2, and 0 otherwise.

We can also build the function F(x) in matlab in the following way:

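function p = F(x)
% Monte Carlo estimate of the standard normal cdf at x
u = randn(1,1000000);  % standard normal samples (randn, not rand)
h = u < x;
p = mean(h);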
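As a check on the Monte Carlo estimate, the standard normal cdf can also be written in terms of the error function, $F(x) = \frac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))$. A short sketch comparing the two at x = 2 (the variable names are illustrative):

```matlab
x = 2;
u = randn(1, 1000000);                   % standard normal samples
F_mc = mean(u < x);                      % Monte Carlo estimate of F(2)
F_exact = 0.5 * (1 + erf(x / sqrt(2)));  % closed form via the error function
fprintf('Monte Carlo: %.4f, erf formula: %.4f\n', F_mc, F_exact);
```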


Example: Binomial Distribution

In this example we will see Bayesian inference for two binomial distributions.

Let $X \sim Bin(n,p)$ and $Y \sim Bin(m,q)$, and let $\!\delta = p-q$.

The frequentist approach estimates $\!\delta$ by $\displaystyle \hat{\delta} = x/n - y/m$.

The Bayesian approach works with the posterior $\displaystyle f(p,q|x,y) = f(x,y|p,q)f(p,q)/f(x,y)$, where $f(x,y)=\iint\limits_{\!\theta} f(x,y|p,q)f(p,q)\,dp\,dq$ is a constant.

Thus, $\displaystyle f(p,q|x,y)\propto f(x,y|p,q)f(p,q)$. Now we assume independent uniform priors for p and q, so that $\displaystyle f(p,q) = f(p)f(q) = 1$.

Therefore, $\displaystyle f(p,q|x,y)\propto p^x(1-p)^{n-x}q^y(1-q)^{m-y}$.

$E[\delta|x,y] = \int_{0}^{1} \int_{0}^{1} (p-q)f(p,q|x,y)dpdq$.

As you can see this is much tougher than the frequentist approach.

Importance Sampling and Basic Monte Carlo Integration - October 11th, 2011

Example: Binomial Distribution (Continued)

Suppose we are given two independent Binomial Distributions $\displaystyle X \sim Bin(n, p_1)$, $\displaystyle Y \sim Bin(m, p_2)$. We would like to give a Monte Carlo estimate of $\displaystyle \delta = p_1 - p_2$

Frequentist approach:

$\displaystyle \hat{p_1} = \frac{X}{n}$ ; $\displaystyle \hat{p_2} = \frac{Y}{m}$

$\displaystyle \hat{\delta} = \hat{p_1} - \hat{p_2} = \frac{X}{n} - \frac{Y}{m}$

Bayesian approach to compute the expected value of $\displaystyle \delta$:

$\displaystyle E(\delta|X,Y) = \int\int(p_1-p_2) f(p_1,p_2|X,Y)\,dp_1dp_2$

Assume that $\displaystyle n = 100, m = 100, p_1 = 0.5, p_2 = 0.8$ and the sample size is 1000.
MATLAB code of the above example:

n = 100;
m = 100;
p_1 = 0.5;
p_2 = 0.8;
p1 = mean(rand(n,1000)<p_1,1);
p2 = mean(rand(m,1000)<p_2,1);
delta = p2 - p1;  % note the sign: this is p_2 - p_1, the negative of delta as defined above
hist(delta)
mean(delta)


In one execution of the code, the mean of delta was 0.3017. The histogram of delta generated was:

Through Monte Carlo simulation, we can obtain an empirical distribution of delta and carry out inference on the data obtained, such as computing the mean, maximum, variance, standard deviation and the standard error of delta.
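Under the uniform priors assumed earlier, the posterior factorizes into two independent Beta densities, $p_1|X \sim Beta(X+1, n-X+1)$ and $p_2|Y \sim Beta(Y+1, m-Y+1)$, so $E(\delta|X,Y)$ can be approximated by sampling from them. The sketch below is one possible way to do this. The observed counts X = 50 and Y = 80 are hypothetical, and the Beta draws rely on the fact that the k-th smallest of N independent uniforms follows a Beta(k, N-k+1) distribution, which avoids any toolbox function.

```matlab
n = 100; m = 100;            % numbers of trials
X = 50;  Y = 80;             % hypothetical observed successes
N = 10000;                   % number of posterior draws

% (X+1)-th smallest of n+1 uniforms ~ Beta(X+1, n-X+1)
U1 = sort(rand(n+1, N), 1);
p1 = U1(X+1, :);
% (Y+1)-th smallest of m+1 uniforms ~ Beta(Y+1, m-Y+1)
U2 = sort(rand(m+1, N), 1);
p2 = U2(Y+1, :);

delta = p1 - p2;             % draws from the posterior of delta = p_1 - p_2
post_mean = mean(delta);     % Monte Carlo estimate of E(delta | X, Y)
exact = (X+1)/(n+2) - (Y+1)/(m+2);   % closed-form posterior mean
fprintf('Monte Carlo: %.4f, exact: %.4f\n', post_mean, exact);
```

The same draws in `delta` can be used for other posterior summaries, such as a histogram or the posterior probability that $\delta < 0$.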

Importance Sampling

Motivation

Consider the integral $\displaystyle I = \int h(x)f(x)\,dx$

According to basic Monte Carlo Integration, if we can sample from the probability density function $\displaystyle f(x)$ and evaluate $\displaystyle h(x)$ at those samples, $\displaystyle I$ can be estimated as the average of $\displaystyle h(x)$ (i.e. $\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(x_i)$).
However, this method only works when we know how to sample from $\displaystyle f(x)$. In the case where it is difficult to sample from $\displaystyle f(x)$, importance sampling is a technique that we can apply. Importance Sampling relies on another density $\displaystyle g(x)$ which we know how to sample from.

The above integral can be rewritten as follows:
\begin{align} \displaystyle I & = \int h(x)f(x)\,dx \\ & = \int h(x)f(x)\frac{g(x)}{g(x)}\,dx \\ & = \int \frac{h(x)f(x)}{g(x)}g(x)\,dx \\ & = \int y(x)g(x)\,dx \\ & = E_g(y(x)) \end{align}
where $y(x) = \frac{h(x)f(x)}{g(x)}$.

The integral can thus be estimated as $\displaystyle \hat{I} = \frac{1}{n} \sum_{i=1}^{n} Y_i$, where $Y_i = \frac{h(x_i)f(x_i)}{g(x_i)}$ and $x_1, \dots, x_n$ are samples from $g$.

Procedure

Suppose we know how to sample from $\displaystyle g(x)$

1. Choose a suitable $\displaystyle g(x)$ and draw n samples $x_1,x_2,\dots,x_n \sim g(x)$
2. Set $Y_i =\frac{h(x_i)f(x_i)}{g(x_i)}$
3. Compute $\hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i$

By the Law of Large Numbers, $\displaystyle \hat{I} \rightarrow I$ as the sample size $n \rightarrow \infty$.

Remarks: One can think of $\frac{f(x)}{g(x)}$ as a weight to $\displaystyle h(x)$ in the computation of $\hat{I}$

$\displaystyle i.e. \ \hat{I} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n}\sum_{i=1}^{n} (\frac{f(x_i)}{g(x_i)})h(x_i)$

Therefore, $\displaystyle \hat{I}$ is a weighted average of $\displaystyle h(x_i)$
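As an illustration of the procedure and of the weights, the sketch below re-estimates $E[\sqrt{X}]$ for $f(x) = e^{-x}$ from the earlier example, this time sampling from a different density. The choice $g(x) = \frac{1}{2}e^{-x/2}$ (an exponential with mean 2) is an assumption made for this illustration.

```matlab
n = 100000;
x = -2 * log(rand(1, n));          % samples from g(x) = 0.5*exp(-x/2)
f = exp(-x);                       % target density at the samples
g = 0.5 * exp(-x / 2);             % sampling density at the samples
w = f ./ g;                        % importance weights f(x_i)/g(x_i)
Y = sqrt(x) .* w;                  % Y_i = h(x_i) f(x_i) / g(x_i)
I_hat = mean(Y);                   % should be close to sqrt(pi)/2
fprintf('importance sampling estimate: %.4f\n', I_hat);
```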

Problem

If $\displaystyle g(x)$ is not chosen appropriately, then the variance of the estimate $\hat{I}$ may be very large. The problem here is actually similar to what we encounter with the Rejection-Acceptance method. Consider the second moment of $y(x)$:

\begin{align} E_g((y(x))^2) & = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx \\ & = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx \\ & = \int \frac{h^2(x)f^2(x)}{g(x)} dx \end{align}

When $\displaystyle g(x)$ is very small, then the above integral could be very large, hence the variance can be very large when g is not chosen appropriately. This occurs when $\displaystyle g(x)$ has a thinner tail than $\displaystyle f(x)$ such that the quantity $\displaystyle \frac{h^2(x)f^2(x)}{g(x)}$ is large.

Remarks:

1. We can actually compute the form of $\displaystyle g(x)$ that gives the minimum variance.
Mathematically, we look for the $\displaystyle g(x)$ that attains $\displaystyle \min_g [\ E_g([y(x)]^2) - (E_g[y(x)])^2\ ]$
It can be shown that the optimal $\displaystyle g(x)$ is proportional to $\displaystyle {|h(x)|f(x)}$, that is, $\displaystyle g^*(x) = \frac{|h(x)|f(x)}{\int |h(s)|f(s)\,ds}$. Using the optimal $\displaystyle g(x)$ minimizes the variance of estimation in Importance Sampling. This is of theoretical interest but not useful in practice: to write down $\displaystyle g^*(x)$ we must first know the value of the normalizing integral, which (for non-negative h) is the very quantity we wanted to compute in the first place.

2. In practice, we shall choose $\displaystyle g(x)$ which has similar shape as $\displaystyle f(x)$ but with a thicker tail than $\displaystyle f(x)$ in order to avoid the problem mentioned above.
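To see the effect of the tails, the following sketch estimates $E_f[x^2] = 1$ for the standard normal $f$ using two hypothetical choices of $g$: $N(0, 0.5^2)$, whose tails are thinner than those of $f$, and $N(0, 2^2)$, whose tails are thicker. For the thin-tailed choice the integrand $h^2 f^2 / g$ grows like $x^4 e^{x^2}$, so the second moment is infinite and the estimate is unreliable from run to run.

```matlab
n = 100000;
h = @(x) x.^2;
f = @(x) exp(-x.^2 / 2) / sqrt(2*pi);                   % target: N(0,1)
g = @(x, s) exp(-x.^2 / (2*s^2)) / (sqrt(2*pi) * s);    % proposal: N(0,s^2)

s_thin = 0.5;                         % thinner tails than f
x = s_thin * randn(1, n);
I_thin = mean(h(x) .* f(x) ./ g(x, s_thin));

s_thick = 2;                          % thicker tails than f
x = s_thick * randn(1, n);
I_thick = mean(h(x) .* f(x) ./ g(x, s_thick));

fprintf('thin-tailed g: %.4f, thick-tailed g: %.4f (true value 1)\n', I_thin, I_thick);
```

Running the script several times should show the thick-tailed estimate staying near 1 while the thin-tailed one moves around noticeably.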

Example

Estimate $\displaystyle I = Pr(Z\gt 3),\ where\ Z \sim N(0,1)$

Method 1: Basic Monte Carlo

\begin{align} Pr(Z\gt 3) & = \int^\infty_3 f(x)\,dx \\ & = \int^\infty_{-\infty} h(x)f(x)\,dx \end{align}
$where \ h(x) = \begin{cases} 0, & \text{if } x \le 3 \\ 1, & \text{if } x \gt 3 \end{cases}$ $\ ,\ f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}$

MATLAB code to compute $\displaystyle I$ from 100 samples of standard normal distribution:

h = randn(100,1) > 3;
I = mean(h)


In one execution of the code, it returns a value of 0 for $\displaystyle I$, which differs significantly from the true value of $\displaystyle I \approx 0.0013$. The problem of using Basic Monte Carlo in this example is that $\displaystyle Pr(Z\gt 3)$ has a small value, and hence many points sampled from the standard normal distribution will be wasted. Therefore, although Basic Monte Carlo is a feasible method to compute $\displaystyle I$, it gives a poor estimation.

Method 2: Importance Sampling

$\displaystyle I = Pr(Z\gt 3)= \int^\infty_3 f(x)\,dx$

To apply importance sampling, we have to choose a $\displaystyle g(x)$ which we will sample from. In this example, we can choose $\displaystyle g(x)$ to be the probability density function of an exponential distribution, a normal distribution with mean 0 and variance greater than 1, a normal distribution with mean greater than 0 and variance 1, etc. The goal is to place most of the samples in the region $x \gt 3$, where $\displaystyle h(x)$ is nonzero, so that few samples are wasted and the estimate is more accurate. For the following, we take $\displaystyle g(x)$ to be the pdf of $\displaystyle N(4,1)$.

Procedure:

1. Draw n samples $x_1,x_2,\dots,x_n \sim g(x)$
2. Calculate \begin{align} \frac{f(x)}{g(x)} & = \frac{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2} }{ \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x-4)^2} } \\ & = e^{8-4x} \end{align}
3. Set $Y_i = h(x_i)e^{8-4x_i}\ with\ h(x) = \begin{cases} 0, & \text{if } x \le 3 \\ 1, & \text{if } x \gt 3 \end{cases}$
4. Compute $\hat{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$

The above procedure from 100 samples of $\displaystyle g(x)$ can be implemented in MATLAB as follows:

for ii = 1:100
x = randn + 4 ;
h = x > 3 ;
y(ii) = h * exp(8-4*x) ;
end
mean(y)


In one execution of the code, it returns a value of 0.001271 for $\hat{Y}$, which is much closer to the true value of $\displaystyle I \approx 0.0013$. From many executions of the code, the variance of the Basic Monte Carlo estimate is approximately 150 times that of the importance sampling estimate. This demonstrates that importance sampling can provide a better estimate than the Basic Monte Carlo method.
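The variance comparison can be reproduced by repeating both estimators many times. In this sketch the number of repetitions (1000) is an arbitrary choice, and the two estimators are written in vectorized form rather than with a loop.

```matlab
runs = 1000;                 % number of repetitions (arbitrary)
n = 100;                     % samples per estimate, as above

% Basic Monte Carlo: each column holds one run of n standard normal draws
I_mc = mean(randn(n, runs) > 3, 1);

% Importance sampling with g = N(4,1): weight exp(8 - 4x) on samples above 3
x = randn(n, runs) + 4;
I_is = mean((x > 3) .* exp(8 - 4*x), 1);

fprintf('variance, basic MC: %.3g\n', var(I_mc));
fprintf('variance, importance sampling: %.3g\n', var(I_is));
fprintf('ratio: %.1f\n', var(I_mc) / var(I_is));
```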

Importance Sampling with Normalized Weight and Markov Chain Monte Carlo - October 13th, 2011

Importance Sampling with Normalized Weight

Recall that we can think of $\displaystyle b(x) = \frac{f(x)}{g(x)}$ as a weight applied to the samples $\displaystyle h(x)$. If the form of $\displaystyle f(x)$ is known only up to a constant, we can use an alternate, normalized form of the weight, $\displaystyle b^*(x)$. (This situation arises in Bayesian inference.) Importance sampling with normalized or standard weight is also called indirect importance sampling.

We derive the normalized weight as follows:
\begin{align} \displaystyle I & = \int h(x)f(x)\,dx \\ &= \int h(x)\frac{f(x)}{g(x)}g(x)\,dx \\ &= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int f(x) dx} \\ &= \frac{\int h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int \frac{f(x)}{g(x)}g(x) dx} \\ &= \frac{\int h(x)b(x)g(x)\,dx}{\int\ b(x)g(x) dx} \end{align}

$\hat{I}= \frac{\sum_{i=1}^{n} h(x_i)b(x_i)}{\sum_{i=1}^{n} b(x_i)}$

Then, the normalized weight is $b^*(x_i) = \displaystyle \frac{b(x_i)}{\sum_{j=1}^{n} b(x_j)}$

Note that $\int f(x) dx = \int b(x)g(x) dx = 1$

We can also determine the associated Monte Carlo variance of this estimate by

$Var(\hat{I})= \frac{\sum_{i=1}^{n} b(x_i)(h(x_i) - \hat{I})^2}{\sum_{i=1}^{n} b(x_i)}$
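A minimal sketch of the normalized-weight estimator, assuming for illustration that the target is known only as $f(x) \propto e^{-x^2/2}$ (a standard normal without its constant), that $g$ is the $N(0, 2^2)$ density, and that $h(x) = x^2$, so the true value is 1:

```matlab
n = 100000;
x = 2 * randn(1, n);                        % samples from g = N(0, 2^2)
f_unnorm = exp(-x.^2 / 2);                  % target known up to a constant
g = exp(-x.^2 / 8) / (2 * sqrt(2*pi));      % density of N(0, 4) at the samples
b = f_unnorm ./ g;                          % unnormalized weights b(x_i)
b_star = b / sum(b);                        % normalized weights, sum to 1
h = x.^2;
I_hat = sum(h .* b_star);                   % weighted average of h(x_i)
fprintf('normalized-weight estimate: %.4f\n', I_hat);
```

Because the weights are divided by their sum, the unknown constant of $f$ cancels and never has to be computed.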

Markov Chain Monte Carlo

We still want to solve $I = \displaystyle\int h(x)f(x)\,dx$

Stochastic Process

A stochastic process $\{ x_t : t \in T \}$ is a collection of random variables. Variables $\displaystyle x_t$ take values in some set $\displaystyle X$ called the state space. The set $\displaystyle T$ is called the index set.

Markov Chain

A Markov Chain is a stochastic process for which the distribution of $\displaystyle x_t$ depends only on $\displaystyle x_{t-1}$. It is a random process characterized as being memoryless, meaning that the next occurrence of a defined event depends only on the current event, not on the preceding sequence of events.

  Formal Definition: The process $\{ x_t : t \in T \}$ is a Markov Chain if $\displaystyle Pr(x_t|x_0, x_1,..., x_{t-1})= Pr(x_t|x_{t-1})$ for all $\{t \in T \}$ and for all $\{x \in X \}$


For a Markov Chain, $\displaystyle f(x_1,...x_n)= f(x_1)f(x_2|x_1)f(x_3|x_2)...f(x_n|x_{n-1})$

Real Life Example:
When going for an interview, the employer only looks at your highest education achieved. The employer would not look at the past education received (elementary school, high school etc.) because the employer believes that the highest education achieved summarizes your previous education. Therefore, anything before your most recent education is irrelevant. In other words, we can say that $\! x_t$ is regarded as the summary of $\!x_{t-1},...,x_2,x_1$, so when we need to determine $\!x_{t+1}$, we only need to pay attention to $\!x_{t}$.

Transition Probabilities

The transition probability is the probability of jumping from one state to another state in a Markov Chain.

Formally, let us define $\displaystyle P_{ij} = Pr(x_{t+1}=j|x_t=i)$ to be the transition probability.

That is, $\displaystyle P_{ij}$ is the probability of transitioning from state i to state j in a single step. Then the matrix $\displaystyle P$ whose (i,j) element is $\displaystyle P_{ij}$ is called the transition matrix.

Properties of P:

1) $\displaystyle P_{ij} \ge 0$ (The probability of going to another state cannot be negative)
2) $\displaystyle \sum_{\forall j}P_{ij} = 1$ (The probability of transitioning to some state from state i (including remaining in state i) is one)

Random Walk

Example: Start at one point and flip a coin where $\displaystyle Pr(H)=p$ and $\displaystyle Pr(T)=1-p=q$. Take one step right if heads and one step left if tails. If at an endpoint, stay there. The transition matrix is $P=\left(\begin{matrix}1&0&0&\dots&\dots&0\\ q&0&p&0&\dots&0\\ 0&q&0&p&\dots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&\dots&\dots&\dots&1 \end{matrix}\right)$

Let $\displaystyle P_n$ be the matrix whose (i,j) element is $\displaystyle P_{ij}(n) = Pr(x_{m+n}=j|x_m=i)$, the probability of moving from state i to state j in n steps. This is called the n-step transition probability.

$\displaystyle P_n = P^n$
$\displaystyle P_1 = P$
$\displaystyle P_2 = P^2$
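The random walk transition matrix and its n-step version can be built directly. The sketch assumes 5 states and p = 0.6, both chosen only for illustration.

```matlab
N = 5;  p = 0.6;  q = 1 - p;     % illustrative number of states and Pr(H)
P = zeros(N);
P(1,1) = 1;                      % left endpoint: stay there
P(N,N) = 1;                      % right endpoint: stay there
for i = 2:N-1
    P(i, i-1) = q;               % tails: one step left
    P(i, i+1) = p;               % heads: one step right
end
P2 = P^2;                        % two-step transition probabilities
disp(P2);
disp(sum(P2, 2)');               % each row still sums to 1
```

For instance, the entry `P2(2,4)` equals $p^2$, since the only way to go from state 2 to state 4 in two steps is two heads in a row.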

Markov Chain Properties and Page Rank - October 18th, 2011

Summary of Terminology

Transition Matrix

A matrix $\!P$ that defines a Markov Chain has the form:

$P = \begin{bmatrix} P_{11} & \cdots & P_{1N} \\ \vdots & \ddots & \vdots \\ P_{N1} & \cdots & P_{NN} \end{bmatrix}$

where $\!P(i,j) = P_{ij} = Pr(x_{t+1} = j | x_t = i)$ is the probability of transitioning from state i to state j in the Markov Chain in a single step. Note that this implies that all rows add up to one.

n-step Transition matrix

A matrix $\!P_n$ whose (i,j)th entry is the probability of moving from state i to state j after n transitions:

$\!P_n(i,j) = Pr(x_{m+n}=j|x_m = i)$

This probability is called the n-step transition probability. A nice property of this matrix is that

$\!P_n = P^n$

for all $n \ge 0$, where P is the transition matrix. Note that the rows of $\!P_n$ still add up to one.

Marginal distribution of a Markov Chain

We represent the distribution over the states at time t as a row vector:

$\mu_t = (\mu_t(1) \; \mu_t(2) \; ... \; \mu_t(N))$

where $\mu_t(i)$ is the probability of being in state i at time t.

Consider this Markov Chain:

$\mu_t = (A \; B)$, where A is the probability of being in state a at time t, and B is the probability of being in state b at time t.

For example if $\mu_t = (0.1 \; 0.9)$, we have a 10% chance of being in state a at time t, and a 90% chance of being in state b at time t.

Suppose we run this Markov chain many times, and record the state at each step.

In this example, we run 4 trials, up until t=5.

t Trial 1 Trial 2 Trial 3 Trial 4 Observed $\mu$
1 a b b a (0.5, 0.5)
2 b a a a (0.75, 0.25)
3 a a b a (0.75, 0.25)
4 b b a b (0.25, 0.75)
5 b b b a (0.25, 0.75)

Imagine simulating the chain many times. If we collect all the outcomes at time t from all the chains, the histogram of this data would look like $\!\mu_t$.

We can find the marginal probabilities as $\!\mu_n = \mu_0 P^n$
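The relation $\mu_n = \mu_0 P^n$ can be checked against a simulation of many independent chains. The two-state transition matrix, the starting distribution and the number of chains below are hypothetical values chosen for illustration; states a and b are coded as 1 and 2.

```matlab
P = [0.2 0.8; 0.6 0.4];         % hypothetical transition matrix
mu0 = [0.1 0.9];                % hypothetical starting distribution
M = 10000;                      % number of independent chains
T = 5;                          % number of steps

state = 1 + (rand(M,1) >= mu0(1));      % initial states drawn from mu0
for t = 1:T
    to_a = P(state, 1);                 % Pr(next state = 1) for each chain
    state = 1 + (rand(M,1) >= to_a);    % move every chain one step
end
empirical = [mean(state == 1), mean(state == 2)];
exact = mu0 * P^T;
disp([empirical; exact]);               % first row: simulation, second row: formula
```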

Stationary Distribution

Let $\pi = (\pi_i \mid i \in \chi)$ be a vector of non-negative numbers that sum to 1. (i.e. $\!\pi$ is a pmf)

If $\!\pi = \pi P$, then $\!\pi$ is a stationary distribution, also known as an invariant distribution.

Limiting Distribution

A Markov chain has limiting distribution $\!\pi$ if $\lim_{n \to \infty} P^n = \begin{bmatrix} \pi \\ \vdots \\ \pi \end{bmatrix}$

That is, $\!\pi_j = \lim_{n \to \infty}\left [ P^n \right ]_{ij}$ exists and is independent of i.

Here is an example:

Suppose we want to find the stationary distribution of $P=\left(\begin{matrix} 1/3&1/3&1/3\\ 1/4&3/4&0\\ 1/2&0&1/2 \end{matrix}\right)$

We want to solve $\pi=\pi P$ and we want $\displaystyle \pi_0 + \pi_1 + \pi_2 = 1$

$\displaystyle \pi_0 = \frac{1}{3}\pi_0 + \frac{1}{4}\pi_1 + \frac{1}{2}\pi_2$
$\displaystyle \pi_1 = \frac{1}{3}\pi_0 + \frac{3}{4}\pi_1$
$\displaystyle \pi_2 = \frac{1}{3}\pi_0 + \frac{1}{2}\pi_2$

Solving the system of equations, we get
$\displaystyle \pi_1 = \frac{4}{3}\pi_0$
$\displaystyle \pi_2 = \frac{2}{3}\pi_0$

So using our condition above, we have $\displaystyle \pi_0 + \frac{4}{3}\pi_0 + \frac{2}{3}\pi_0 = 1$ and by solving we get $\displaystyle \pi_0 = \frac{1}{3}$

Using this in our system of equations, we obtain:
$\displaystyle \pi_1 = \frac{4}{9}$
$\displaystyle \pi_2 = \frac{2}{9}$

Thus, the stationary distribution is $\displaystyle \pi = (\frac{1}{3}, \frac{4}{9}, \frac{2}{9})$
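The same system can be solved numerically by stacking the equations $\pi(P - I) = 0$ with the normalization constraint. This is a sketch of one possible approach; the variable name `piHat` is ours.

```matlab
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];
A = [P' - eye(3); ones(1,3)];    % (P' - I) pi' = 0 together with sum(pi) = 1
b = [zeros(3,1); 1];
piHat = (A \ b)';                % least-squares solution of the stacked system
disp(piHat);                     % compare with (1/3, 4/9, 2/9)
```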

Detailed Balance

$\!\pi$ has the detailed balance property if $\!\pi_iP_{ij} = P_{ji}\pi_j$

Theorem

If $\!\pi$ satisfies detailed balance, then $\!\pi$ is a stationary distribution.

In other words, if $\!\pi_iP_{ij} = P_{ji}\pi_j$, then $\!\pi = \pi P$

Proof:

$\!\pi P = \begin{bmatrix}\pi_1 & \pi_2 & \cdots & \pi_N\end{bmatrix} \begin{bmatrix}P_{11} & \cdots & P_{1N} \\ \vdots & \ddots & \vdots \\ P_{N1} & \cdots & P_{NN}\end{bmatrix}$

Observe that the jth element of $\!\pi P$ is

$\!\left [ \pi P \right ]_j = \pi_1 P_{1j} + \pi_2 P_{2j} + \dots + \pi_N P_{Nj}$

$\! = \sum_{i=1}^N \pi_i P_{ij}$
$\! = \sum_{i=1}^N P_{ji} \pi_j$, by the definition of detailed balance.
$\! = \pi_j \sum_{i=1}^N P_{ji}$
$\! = \pi_j$, as the entries in each row of P sum to 1.

So $\!\pi = \pi P$.
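The theorem can be illustrated numerically with the chain from the stationary distribution example above, whose $\pi = (\frac{1}{3}, \frac{4}{9}, \frac{2}{9})$ happens to satisfy detailed balance. In the sketch, the matrix `F` (our name) holds the flows $\pi_i P_{ij}$, so detailed balance is the statement that `F` is symmetric.

```matlab
P = [1/3 1/3 1/3; 1/4 3/4 0; 1/2 0 1/2];
piVec = [1/3 4/9 2/9];
F = diag(piVec) * P;                           % F(i,j) = pi_i * P_ij
balanced = max(max(abs(F - F'))) < 1e-12;      % detailed balance: F symmetric
stationary = norm(piVec * P - piVec) < 1e-12;  % stationarity: pi = pi P
fprintf('detailed balance: %d, stationary: %d\n', balanced, stationary);
```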

Example

Find the marginal distribution of a two-state Markov chain.

Start by generating the matrix P.

$\!P = \begin{pmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{pmatrix}$

We must assume some starting value for $\mu_0$

$\!\mu_0 = \begin{pmatrix} 0.1 & 0.9 \end{pmatrix}$

For t = 1, the marginal distribution is

$\!\mu_1 = \mu_0 P$

Notice that this $\mu$ converges.

If you repeatedly run:

$\!\mu_{i+1} = \mu_i P$

It converges to $\mu = \begin{pmatrix} 0.4286 & 0.5714 \end{pmatrix}$

This can be seen by running the following Matlab code:

P = [0.2 0.8; 0.6 0.4];
mu = [0.1 0.9];
while 1
mu_old = mu;
mu = mu * P;
if norm(mu - mu_old) < 1e-12  % stop once the change is negligible; exact equality of floats may never occur
disp(mu);
break;
end
end


Another way to look at this is to simulate a single long run of the chain and check whether the empirical pmf of the visited states converges:

Let $\hat{p_n}(1)=\frac{1}{n}\sum_{k=1}^n I(X_k=1)$ denote the estimator of the stationary probability of state 1 and $\hat{p_n}(2)=\frac{1}{n}\sum_{k=1}^n I(X_k=2)$ denote the estimator of the stationary probability of state 2, where $\displaystyle I(X_k=1)$ and $\displaystyle I(X_k=2)$ are indicator variables which equal 1 if $X_k=1$ (or $X_k=2$ for the latter one). In the code below, state 2 is coded as 0.

The Matlab code for this is:

n=1;
if rand<0.1
x(1)=1;
else
x(1)=0;
end
p1(1)=sum(x)/n;
p2(1)=1-p1(1);
for i=2:10000
n=n+1;
if (x(i-1)==1&rand<0.2)|(x(i-1)==0&rand<0.6)
x(i)=1;
else
x(i)=0;
end
p1(i)=sum(x)/n;
p2(i)=1-p1(i);
end
plot(p1,'red');
hold on;
plot(p2)


The results can be easily seen from the graph below:

Additionally, we can plot the marginal distribution as it converges without estimating it. The following Matlab code shows this:

%transition matrix
P=[0.2 0.8; 0.6 0.4];
%mu at time 0
mu=[0.1 0.9];
%number of points for simulation
n=20;
for i=1:n
mu_a(i)=mu(1);
mu_b(i)=mu(2);
mu=mu*P;
end
t=[1:n];
plot(t, mu_a, t, mu_b);
hleg1=legend('state a', 'state b');


Note that there are chains with stationary distributions that don't converge (the chain might not naturally reach the stationary distribution, and isn't limiting). An example of this is:

$P = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \mu_0 = \begin{pmatrix} 1/3 & 1/3 & 1/3 \end{pmatrix}$

$\!\mu_0$ is a stationary distribution, so $\!\mu P$ is the same for all iterations.

But,

$P^{1000} = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \ne \begin{pmatrix} \mu \\ \mu \\ \mu \end{pmatrix}$

So $\!\mu$ is not a limiting distribution. Also, if

$\mu = \begin{pmatrix} 0.2 & 0.1 & 0.7 \end{pmatrix}$

Then $\!\mu \neq \mu P$.

This can be observed through the following Matlab code.

P = [0 0 1; 1 0 0; 0 1 0];
mu = [0.2 0.1 0.7];
for i= 1:4
mu = mu * P;
disp(mu);
end


Output:

0.1000    0.7000    0.2000
0.7000    0.2000    0.1000
0.2000    0.1000    0.7000
0.1000    0.7000    0.2000


Note that $\!\mu_1 = \!\mu_4$, which indicates that $\!\mu$ will cycle forever.

This means that this chain has a stationary distribution but no limiting distribution. An irreducible chain has a limiting distribution if it is ergodic, that is, aperiodic and positive recurrent; the chain above is periodic with period 3, so $P^n$ cycles instead of converging. Although cycling breaks detailed balance and no limiting distribution exists on the whole state space, the cycling behavior itself describes the long-run behavior of the chain. Likewise, when a chain has several closed classes, each class can have its own limiting distribution, which is useful in analyzing long-term behavior within that class.

Page Rank

Page Rank was the original ranking algorithm used by Google's search engine to rank web pages.<ref> http://ilpubs.stanford.edu:8090/422/ </ref> The algorithm was created by the founders of Google, Larry Page and Sergey Brin, as part of Page's PhD thesis. When a query is entered in a search engine, there is a set of web pages matched by this query, but this set of pages must be ordered by their "importance" in order to identify the most meaningful results first. Page Rank is an algorithm which assigns importance to every web page based on the links in each page.

Intuition

We can represent web pages by a set of nodes, where web links are represented as edges connecting these nodes. Based on our intuition, there are three main factors in deciding whether a web page is important or not.

1. A web page is important if many other pages point to it.
2. The more important a webpage is, the more weight is placed on its links.
3. The more links a webpage has, the less weight is placed on its links.

Modelling

We can model the set of links as a N-by-N matrix L, where N is the number of web pages we are interested in:

$L_{ij} = \left\{ \begin{array}{lr} 1 : \text{if page j points to i}\\ 0 : \text{otherwise} \end{array} \right.$

The number of outgoing links from page j is

$c_j = \sum_{i=1}^N L_{ij}$

For example, consider the following set of links between web pages:

According to the factors relating to importance of links, we can consider two possible rankings :

$\displaystyle 3 \gt 2 \gt 1 \gt 4$

or

$\displaystyle 3\gt 1\gt 2\gt 4$ if we consider that the high importance of the link from 3 to 1 is more influential than the fact that there are two outgoing links from page 1 and only one from page 2.

We have $L = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}$, and $c = \begin{pmatrix}2 & 1 & 1 & 1\end{pmatrix}$

We can represent the ranks of web pages as the vector P, where the ith element is the rank of page i:

$P_i = (1-d) + d\sum_j \frac{L_{ij}}{c_j} P_j$

Here we take the sum of the weights of the incoming links, where a link is reduced in weight if the linking page has a lot of outgoing links, and increased in weight if the linking page is itself highly ranked (for instance because it has a lot of incoming links).

We don't want to completely ignore pages with no incoming links, which is why we add the constant (1 - d).

If

$L = \begin{bmatrix} L_{11} & \cdots & L_{1N} \\ \vdots & \ddots & \vdots \\ L_{N1} & \cdots & L_{NN} \end{bmatrix}$

$D = \begin{bmatrix} c_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & c_N \end{bmatrix}$

Then $D^{-1} = \begin{bmatrix} c_1^{-1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & c_N^{-1} \end{bmatrix}$

$\!P = (1-d)e + dLD^{-1}P$

where $\!e = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$ is the vector with all 1's

To simplify the problem, we let $\!e^T P = N \Rightarrow \frac{e^T P}{N} = 1$. This means that the average importance of all pages on the internet is 1.

Then $\!P = (1-d)\frac{ee^TP}{N} + dLD^{-1}P$

$\! = \left [ (1-d)\frac{ee^T}{N} + dLD^{-1} \right ] P$
$\! = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ] P$, where $E$ is an NxN matrix filled with ones.

Let $\!A = \left [ \left ( \frac{1-d}{N} \right ) E + dLD^{-1} \right ]$

Then $\!P = AP$.

Note that P is a stationary distribution and, more importantly, P is an eigenvector of A with eigenvalue 1. Therefore, we can find the ranks of all web pages by solving this equation for P.

We can find the vector P for the example above, using the following Matlab code:

L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];
D = [2 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1];
d = 0.8 ;% pages with no links get a weight of 0.2
N = 4 ;

A = ((1-d)/N) * ones(N) + d * L * inv(D);
[EigenVectors, EigenValues] = eigs(A)
s=sum(EigenVectors(:,1));% we should note that the average entry of P should be 1 according to our assumption
P=(EigenVectors(:,1))/s*N


This outputs:

EigenVectors =
-0.6363             0.7071             0.7071            -0.0000
-0.3421            -0.3536 + 0.3536i  -0.3536 - 0.3536i  -0.7071
-0.6859            -0.3536 - 0.3536i  -0.3536 + 0.3536i   0.0000
-0.0876             0.0000 + 0.0000i   0.0000 - 0.0000i   0.7071


EigenValues =
1.0000                  0                  0                  0
0            -0.4000 - 0.4000i        0                  0
0                  0            -0.4000 + 0.4000i        0
0                  0                  0             0.0000


P =

   1.4528
0.7811
1.5660
0.2000


Note that there is an eigenvector with eigenvalue 1. The reason why there always exists a 1-eigenvector is that A is a column-stochastic matrix: each of its columns sums to 1.

Thus our vector P is $\begin{bmatrix}1.4528 \\ 0.7811 \\ 1.5660\\ 0.2000 \end{bmatrix}$

However, this method is not practical, because there are simply too many web pages on the internet. So instead Google uses a method to approximate an eigenvector with eigenvalue 1.

Note that page three has the rank with highest magnitude and page four has the rank with lowest magnitude, as expected.
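One standard way to approximate the eigenvector with eigenvalue 1 without a full eigendecomposition is power iteration: repeatedly multiply a starting vector by A. The sketch below applies it to the four-page example. It is meant as an illustration of the idea, not as a description of the method Google actually uses, and the iteration count of 100 is an arbitrary choice.

```matlab
L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0];
N = 4;  d = 0.8;
c = sum(L, 1);                               % outgoing links per page
A = ((1-d)/N) * ones(N) + d * L * diag(1 ./ c);

P = ones(N, 1);                              % start with every rank equal to 1
for k = 1:100
    P = A * P;                               % columns of A sum to 1, so sum(P) stays N
end
disp(P');                                    % compare with the eigenvector result above
```

Each iteration only needs a matrix-vector product, which is what makes this kind of approach attractive when the number of pages is very large.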

Markov Chain Monte Carlo - Metropolis-Hastings - October 25th, 2011

We want to find $\int h(x)f(x)\, \mathrm dx$, but we don't know how to sample from $\,f$.

We have seen simple techniques earlier in the course; here is an example of a real life application. It consists of the search for a Markov Chain such that its stationary distribution is $\,f$.

Main procedure

Let us suppose that $\,q(y|x)$ is a friendly distribution: we can sample from this function.

1. Initialize the chain with $\,x_{0}$ and set $\,i=0$.

2. Draw a point from $\,q(y|x_{i})$ i.e. $\,Y \sim q(y|x_{i})$.

3. Evaluate $\,r(x_{i},y)=min\left\{\frac{f(y)}{f(x_{i})}\frac{q(x_{i}|y)}{q(y|x_{i})},1\right\}$

4. Draw a point $\,U \sim Unif[0,1]$.

5. $\,x_{i+1}=\begin{cases}y & \text{ if } U\lt r \\x_{i} & \text{ otherwise } \end{cases}$.

6. $\,i=i+1$. Go back to 2.

Remark 1

A very common choice for $\,q(y|x)$ is $\,N(y;x,b^{2})$, a normal distribution centered at the current point.

Note : In this case $\,q(y|x)$ is symmetric i.e. $\,q(y|x)=q(x|y)$.

(Because $\,q(y|x)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}$ and $\,(y-x)^{2}=(x-y)^{2}$).

Thus we have $\,\frac{q(x|y)}{q(y|x)}=1$, which implies :

$\,r(x,y)=min\left\{\frac{f(y)}{f(x)},1\right\}$.

In general, if $\,q(x|y)$ is symmetric then the algorithm is called Metropolis in reference to the original algorithm (made in 1953)<ref>http://en.wikipedia.org/wiki/Equations_of_State_Calculations_by_Fast_Computing_Machines</ref>.

Remark 2

The value y is accepted if $\,u\lt min\left\{\frac{f(y)}{f(x)},1\right\}$ so it is accepted with the probability $\,min\left\{\frac{f(y)}{f(x)},1\right\}$.

Thus, if $\,f(y)\gt f(x)$, then $\,y$ is always accepted.

The higher the value of the pdf is in the vicinity of a point $\,y_1$, the more likely it is that a random variable will take on values around $\,y_1$. As a result it makes sense that we would want a high probability of acceptance for points generated near $\,y_1$.

Remark 3

One strength of the Metropolis-Hastings algorithm is that normalizing constants, which are often quite difficult to determine, can be cancelled out in the ratio $r$. For example, consider the case where we want to sample from the beta distribution, which has the pdf:

\begin{align} f(x;\alpha,\beta)& = \frac{1}{\mathrm{B}(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}\end{align}

The beta function, B, appears as a normalizing constant, but it cancels in the ratio $r$, so it never needs to be computed.
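To illustrate, the sketch below runs a random-walk Metropolis sampler on the unnormalized Beta density, so $\mathrm{B}(\alpha,\beta)$ is never evaluated. The parameter values $\alpha = 2$, $\beta = 5$, the proposal standard deviation 0.3, the chain length and the burn-in length are all assumptions made for this example.

```matlab
a = 2;  bt = 5;                  % assumed Beta parameters alpha and beta
s = 0.3;                         % assumed proposal standard deviation
% unnormalized density: zero outside (0,1), no Beta function needed
fu = @(x) (x > 0 & x < 1) .* max(x, 0).^(a-1) .* max(1 - x, 0).^(bt-1);

nIter = 20000;
x = zeros(1, nIter);
x(1) = 0.5;                      % start inside the support
for i = 2:nIter
    y = x(i-1) + s * randn;      % symmetric proposal centered at the current point
    r = min(fu(y) / fu(x(i-1)), 1);
    if rand < r
        x(i) = y;                % accept
    else
        x(i) = x(i-1);           % reject: stay at the current point
    end
end
sample_mean = mean(x(2001:end)); % discard the first 2000 points as burn-in
fprintf('sample mean %.4f, true mean %.4f\n', sample_mean, a/(a+bt));
```

Proposals that fall outside (0,1) have density zero, so they are always rejected and the chain never leaves the support.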

Example

$\,f(x)=\frac{1}{\pi}\frac{1}{1+x^{2}}$

Then, we have $\,f(x)\propto\frac{1}{1+x^{2}}$.

And let us take $\,q(x|y)=\frac{1}{\sqrt{2\pi}b}e^{-\frac{1}{2b^{2}}(y-x)^{2}}$.

Then $\,q(x|y)$ is symmetric.

Therefore the acceptance probability r can be simplified.

We get:

\begin{align} \displaystyle r(x,y) & =min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} \\ & =min\left\{\frac{f(y)}{f(x)},1\right\} \\ & =min\left\{ \frac{ \frac{1}{1+y^{2}} }{ \frac{1}{1+x^{2}} },1\right\}\\ & =min\left\{ \frac{1+x^{2}}{1+y^{2}},1\right\} \end{align}

The Matlab code of the algorithm is the following :

clear all
close all
clc
b=2;
x(1)=randn;
for i=2:10000
y=b*randn+x(i-1);
r=min((1+x(i-1)^2)/(1+y^2),1);
u=rand;
if u<r
x(i)=y;
else
x(i)=x(i-1);
end

end
hist(x(5000:end));
%The Markov Chain usually takes some time to converge and this is known as the "burn-in" period.
%Therefore, we don't display the first 5000 points because they don't show the limiting behaviour of the Markov Chain.


As we can see, the choice of the value of b is made by us.

Changing this value has a significant impact on the results we obtain. There is a pitfall when b is too big or too small.

Example with $\,b=0.1$ (the trace plot is obtained by running j=5000:10000; plot(j,x(5000:10000))):

With $\,b=0.1$, the chain takes small steps so the chain doesn't explore enough of the sample space. It doesn't give an accurate report of the function we want to sample.

Example with $\,b=10$ :

With $\,b=10$, jumps are very unlikely to be accepted because the proposals land far from the current point, where the density is low (i.e. $\,y$ is rejected because $\ u\gt r$, so $\,x(i)=x(i-1)$ most of the time); hence most sample points stay fairly close to the origin. A trace plot that resembles white noise (as in the case of $\,b=2$) indicates better sampling, as more points are covered and accepted. For $\,b=0.1$, we have lots of jumps and few repeated values, but the steps are so small that the stationary distribution is less obvious; whereas in the $\,b=10$ case, many points remain around 0. With $\,b=10$, approximately 73% of the values were selected as x(i-1), i.e. the proposal was rejected.

Example with $\,b=2$ :

With $\,b=2$, we get a more accurate result as we avoid these extremes. Approximately 37% of the values were selected as x(i-1), i.e. the proposal was rejected.

If the sample from the Markov Chain starts to look like the target distribution quickly, we say the chain is mixing well.
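The effect of b can be quantified by recording how often proposals are accepted. The following sketch reruns the sampler above for the three values of b and reports the acceptance rate; the exact rates vary from run to run.

```matlab
bs = [0.1 2 10];                 % the three proposal standard deviations
nIter = 10000;
acc = zeros(size(bs));
for k = 1:numel(bs)
    b = bs(k);
    x = randn;                   % starting point
    nAcc = 0;
    for i = 2:nIter
        y = b * randn + x;       % proposal centered at the current point
        r = min((1 + x^2) / (1 + y^2), 1);
        if rand < r
            x = y;               % accept the proposal
            nAcc = nAcc + 1;
        end
    end
    acc(k) = nAcc / (nIter - 1); % fraction of accepted proposals
    fprintf('b = %4.1f: acceptance rate %.2f\n', b, acc(k));
end
```

A very high acceptance rate (small b) and a very low one (large b) are both warning signs; neither extreme explores the target efficiently.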

Theory and Applications of Metropolis-Hastings - October 27th, 2011

As mentioned in the previous section, the idea of the Metropolis-Hastings (MH) algorithm is to produce a Markov chain that converges to a stationary distribution $\!f$ which we are interested in sampling from.

Convergence

One important fact to check is that $\displaystyle f$ is indeed a stationary distribution in the MH scheme. For this, we can appeal to the implications of the detailed balance property:

Given a probability vector $\!\pi$ and a transition matrix $\displaystyle P$, $\!\pi$ has the detailed balance property if $\!\pi_iP_{ij} = P_{ji}\pi_j$

If $\!\pi$ satisfies detailed balance, then it is a stationary distribution.

The above definition applies to the case where the states are discrete. In the continuous case, $\displaystyle f$ satisfies detailed balance if $\displaystyle f(x)p(x,y)=f(y)p(y,x)$, where $\displaystyle p(x,y)$ and $\displaystyle p(y,x)$ are the transition densities from x to y and from y to x respectively. If we can show that $\displaystyle f$ has the detailed balance property, we can conclude that it is a stationary distribution, because $\int_y f(y)p(y,x)dy=\int_y f(x)p(x,y)dy=f(x)\int_y p(x,y)dy=f(x)$.

In the MH algorithm, we use a proposal distribution to generate $\, y \sim q(y|x)$, and accept y with probability $\min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\}$

Suppose, without loss of generality, that $\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)} \le 1$. This implies that $\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)} \ge 1$

Let $\,r(x,y)$ be the chance of accepting point y given that we are at point x.

So $\,r(x,y) = min\left\{\frac{f(y)}{f(x)}\frac{q(x|y)}{q(y|x)},1\right\} = \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)}$

Let $\,r(y,x)$ be the chance of accepting point x given that we are at point y.

So $\,r(y,x) = min\left\{\frac{f(x)}{f(y)}\frac{q(y|x)}{q(x|y)},1\right\} = 1$

$\,p(x,y)$ is the probability of generating and accepting y, while at point x.

So $\,p(x,y) = q(y|x)r(x,y) = q(y|x) \frac{f(y)}{f(x)} \frac{q(x|y)}{q(y|x)} = \frac{f(y)q(x|y)}{f(x)}$

$\,p(y,x)$ is the probability of generating and accepting x, while at point y.

So $\,p(y,x) = q(x|y)r(y,x) = q(x|y)$

$\,f(x)p(x,y) = f(x)\frac{f(y)q(x|y)}{f(x)} = f(y)q(x|y) = f(y)p(y,x)$

Thus, detailed balance holds, i.e. $\,f(x)$ is a stationary distribution.

It can be shown (although not here) that $\!f$ is a limiting distribution as well. Therefore, the MH algorithm generates a sequence whose distribution converges to $\!f$, the target.
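The detailed balance argument can also be checked numerically on a small discrete state space. The sketch below builds the MH transition matrix for a hypothetical four-state target `f` and a hypothetical non-symmetric proposal matrix `Q` (both made up for illustration), then measures how far $f_iP_{ij}$ is from $f_jP_{ji}$ and how far $fP$ is from $f$.

```matlab
% Sketch: detailed balance of the MH kernel on a discrete state space.
f = [0.1 0.2 0.3 0.4];              % hypothetical target distribution
Q = [0.10 0.40 0.30 0.20;           % hypothetical proposal matrix, rows sum to 1
     0.30 0.20 0.10 0.40;
     0.25 0.25 0.25 0.25;
     0.40 0.30 0.20 0.10];
m = numel(f);
P = zeros(m);
for i = 1:m
    for j = 1:m
        if i ~= j
            % acceptance probability r(i,j) = min{ f(j) q(i|j) / (f(i) q(j|i)), 1 }
            r = min(f(j) * Q(j,i) / (f(i) * Q(i,j)), 1);
            P(i,j) = Q(i,j) * r;    % propose j, then accept it
        end
    end
    P(i,i) = 1 - sum(P(i,:));       % rejected proposals leave the chain at i
end
D = diag(f) * P;                    % D(i,j) = f(i) * P(i,j)
balanceErr = max(max(abs(D - D'))); % zero when detailed balance holds
statErr = max(abs(f * P - f));      % zero when f is stationary
disp([balanceErr statErr]);
```

Both errors should be zero up to rounding, whatever valid `f` and strictly positive `Q` are used.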

Implementation

In the implementation of MH, the proposal distribution is commonly chosen to be symmetric, which simplifies the calculations and makes the algorithm more intuitively understandable. The MH algorithm can usually be regarded as a random walk along the distribution we want to sample from. Suppose we have a distribution $\!f$:

Suppose we start the walk at point $\!x$. The point $\!y_{1}$ is in a denser region than $\!x$, therefore, the walk will always progress from $\!x$ to $\!y_{1}$. On the other hand, $\!y_{2}$ is in a less dense region, so it is not certain that the walk will progress from $\!x$ to $\!y_{2}$. In terms of the MH algorithm:

$r(x,y_{1})=min(\frac{f(y_{1})}{f(x)},1)=1$ since $\!f(y_{1})\gt f(x)$. Thus, any generated value with a higher density will be accepted.

$r(x,y_{2})=\frac{f(y_{2})}{f(x)}$. The lower the density of $\!y_{2}$ is, the less chance it will have of being accepted.

A certain class of proposal distributions can be written in the form:

$\,y|x_i = x_i + \epsilon_i$

where the step $\,\epsilon_i$ has density $\,g$, so that $\,q(y|x) = g(|x-y|)$

The density depends only on the distance between the current point and the next one (which can be seen as the "step" being taken). These proposal distributions give the Markov chain the random walk nature. The normal distribution that we frequently use in our examples satisfies the above definition.

In actual implementations of the MH algorithm, the proposal distribution needs to be chosen judiciously, because not all proposals will work well with all target distributions we want to sample from. Take a trimodal distribution for example:

If we choose the proposal distribution to be a standard normal as we have done before, problems will arise. The low densities between the peaks mean that the MH algorithm will almost never walk to any points generated in these regions, so it will get stuck at one peak. One way to address this issue is to increase the variance, so that the steps will be large enough to cross the gaps. Of course, in this case, it would probably be beneficial to come up with a different proposal function. As a rule of thumb, such functions should result in an approximately 50% acceptance rate for generated points.
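The following sketch illustrates the problem on a hypothetical trimodal target: an equal mixture of $N(-20,1)$, $N(0,1)$ and $N(20,1)$, chosen here (not taken from the notes) so that the gaps are wide. The chain starts in the middle mode, and we record the fraction of samples that land in the two outer modes for a narrow and a wide normal proposal.

```matlab
% Sketch: a narrow random-walk proposal gets stuck in one mode of a trimodal target.
rng(2);
g = @(x, m) exp(-0.5 * (x - m).^2) / sqrt(2*pi);   % N(m,1) density
f = @(x) (g(x, -20) + g(x, 0) + g(x, 20)) / 3;     % hypothetical trimodal target
n = 20000;
bs = [1 20];                        % proposal standard deviations to compare
outerFrac = zeros(size(bs));
for k = 1:numel(bs)
    x = zeros(1, n);                % start in the middle mode, x(1) = 0
    for i = 2:n
        y = x(i-1) + bs(k) * randn;
        if rand < min(f(y) / f(x(i-1)), 1)
            x(i) = y;
        else
            x(i) = x(i-1);
        end
    end
    outerFrac(k) = mean(abs(x) > 10);   % share of samples in the outer modes
end
disp(outerFrac);
```

Under the target, two thirds of the mass lies in the outer modes. The narrow proposal never leaves the middle mode, while the wide one visits all three at the cost of a low acceptance rate.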

Simulated Annealing

Metropolis-Hastings is very useful in simulation methods for solving optimization problems. One such application is simulated annealing, which addresses the problem of minimizing a function $\!h(x)$. This method will not always produce the global solution, but it is intuitively simple and easy to implement.

Consider $e^{\frac{-h(x)}{T}}$ with $\!T \gt 0$; maximizing this expression is equivalent to minimizing $\!h(x)$. For example, if $h(x)=(x-\mu)^2$, whose minimizer is $\mu$, then the function to maximize is proportional to a Gaussian density, $\!e^{-\frac{(x-\mu)^2}{T}}$, with mean $\mu$ and variance $T/2$. When many samples are taken from this distribution, their mean will converge to the desired value $\mu$. The annealing comes into play by lowering T (the temperature) as the sampling progresses, making the distribution narrower. The steps of simulated annealing are outlined below:

1. start with a random $\!x$ and set T to a large number

2. generate $\!y$ from a proposal distribution $\!q(y|x)$, which should be symmetric

3. accept $\!y$ with probability $min(\frac{f(y)}{f(x)},1)$ . (Note: $f(x) = e^{\frac{-h(x)}{T}}$)

4. decrease T, and then go to step 2

To decrease T in Step 4, a variety of functions can be used. A common choice is geometric decline: given an initial temperature $\! T_0$, a final temperature $\! T_f$, and the number of steps n, the temperature at step t is $\ T(t) = T_0(\frac{T_f}{T_0})^{t/n}$
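As a small illustration, the geometric schedule can be tabulated directly. The values of `T0`, `Tf` and `n` below are hypothetical.

```matlab
% Sketch: geometric cooling schedule T(t) = T0 * (Tf/T0)^(t/n).
T0 = 10;                        % hypothetical initial temperature
Tf = 0.01;                      % hypothetical final temperature
n  = 100;                       % number of steps
t  = 0:n;
T  = T0 * (Tf / T0) .^ (t / n); % T(1) is T0 and T(end) is Tf
ratio = T(2:end) ./ T(1:end-1); % constant factor between consecutive steps
disp([T(1) T(end) ratio(1)]);
```

Each step multiplies the temperature by the same factor $(T_f/T_0)^{1/n}$, so the temperature drops quickly at first and more slowly near $T_f$.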

The following plot and Matlab code illustrate the simulated annealing procedure as the temperature T decreases for a Gaussian distribution with zero mean; in this illustration T is passed to normpdf as the standard deviation. Starting off with a large value for the temperature T allows the Metropolis-Hastings component of the procedure to explore widely and locate the region around the mean, before T is gradually decreased in order to converge to the mean.

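x=-10:0.1:10;
mu=0;
T=5;
colour = ['b', 'g', 'm', 'r', 'k'];
hold on
for i=1:5
% normpdf takes the standard deviation as its third argument, so T acts as sigma here
pdfNormal=normpdf(x, mu, T);
plot(x, pdfNormal, colour(i));
% lower the temperature: the density narrows around mu
T=T-1;
end
hleg1=legend('T=5', 'T=4', 'T=3', 'T=2', 'T=1');
title('Simulated Annealing Illustration');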


<references/>

Simulated Annealing and Gibbs Sampling - November 1, 2011

continued from previous lecture...

Recall that $\ r(x,y)=\min(\frac{f(y)}{f(x)},1)$ is the probability of accepting $\ y$, where $\frac{f(y)}{f(x)} = \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} = e^{\frac{h(x)-h(y)}{T}}$.

We will now look at a couple cases where $\displaystyle h(y) \gt h(x)$ or $\displaystyle h(y) \lt h(x)$, and explore whether to accept or reject $\ y$.

Cases

Case a) Suppose $\displaystyle h(y) \lt h(x)$. Since we want to find the minimum value for $\displaystyle h(x)$, and the point $\displaystyle y$ creates a lower value than our previous point, we accept the new point. Mathematically, $\displaystyle h(y) \lt h(x)$ implies that:

$\frac{f(y)}{f(x)} \gt 1$. Therefore, $\displaystyle r = 1$. So, we will always accept $\displaystyle y$.

Case b) Suppose $\displaystyle h(y) \gt h(x)$. This is bad, since our goal is to minimize $\displaystyle h(x)$. However, we may still accept $\displaystyle y$ with some chance:

$\frac{f(y)}{f(x)} \lt 1$. Therefore, $\displaystyle r \lt 1$. So, we may accept $\displaystyle y$ with probability $\displaystyle r$.

Next, we will look at these cases as $\displaystyle T\to0$.

As $\displaystyle T\to0$ and case a) happens, $e^{\frac{h(x)-h(y)}{T}} \to \infty$, so we will always accept $\displaystyle y$.

As $\displaystyle T\to0$ and case b) happens, $e^{\frac{h(x)-h(y)}{T}} \to \ 0$, so the probability that $\displaystyle y$ will be accepted gets extremely small.

It is worth noting that if we simply start with a small value of T, we may end up rejecting nearly all the generated points (due to case b)), and hence get stuck at a point that minimizes $\displaystyle h$ only over some interval (a local minimum of $\displaystyle h$, i.e. a local maximum of $\displaystyle f$) but not over the whole domain (the global minimum). It is therefore necessary to start with a large value of T in order to explore the whole function. At the same time, a good initial estimate $\ x_0$ is needed (at least one that does not differ too much from the optimal point).

Example

Let $\displaystyle h(x) = (x-2)^2$. The graph of it is:

Then, $e^{\frac{-h(x)}{T}} = e^{\frac{-(x-2)^2}{T}}$ . Take an initial value of T = 20. A graph of this is:

In comparison, we look at a graph with T = 0.2:

One can see that with a low T value the curve is sharply peaked, so for most pairs of points the ratio $f(y)/f(x)$ is either close to 0 or much greater than 1, while a bigger T value gives smoother transitions in the graph.

The MATLAB code for the above graphs is:

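ezplot('(x-2)^2',[-6,10])                % h(x)
figure                                   % new window so the previous plot is kept
ezplot('exp((-(x-2)^2)/20)',[-6,10])     % exp(-h(x)/T) with T = 20
figure
ezplot('exp((-(x-2)^2)/0.2)',[-6,10])    % exp(-h(x)/T) with T = 0.2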
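The whole procedure can be sketched for this example, $h(x) = (x-2)^2$. The starting point, the cooling schedule (geometric, from a hypothetical $T_0 = 20$ down to $T_f = 0.001$) and the unit-variance normal proposal are assumptions made for illustration.

```matlab
% Sketch: simulated annealing for h(x) = (x-2)^2 with a geometric cooling schedule.
rng(3);
h  = @(x) (x - 2).^2;
T0 = 20; Tf = 0.001; n = 5000;          % hypothetical schedule parameters
x  = -5;                                % arbitrary starting point
for t = 1:n
    T = T0 * (Tf / T0)^(t / n);         % current temperature
    y = x + randn;                      % symmetric proposal
    r = min(exp((h(x) - h(y)) / T), 1); % f(y)/f(x) with f = exp(-h/T)
    if rand < r
        x = y;                          % accept the proposed point
    end
end
disp(x);                                % should end up close to the minimizer 2
```

Early on, the large T lets the chain accept uphill moves and roam freely; as T shrinks, uphill moves are almost never accepted and the chain settles near the minimum.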

Travelling Salesman Problem

The simulated annealing method can be applied to compute a solution to the travelling salesman problem. Suppose there are N cities and the salesman has to visit each city exactly once. The objective is to find the shortest path (i.e. the shortest total length of journey) connecting the cities. An algorithm using simulated annealing on the problem can be found here (Reference).
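A minimal sketch of this idea is given below. It is not the referenced algorithm: the cities (random points on the unit circle), the proposal (reversing a randomly chosen segment of the tour, which is symmetric) and the cooling schedule are all assumptions chosen to keep the example short. Here $h$ is the length of the closed tour.

```matlab
% Sketch: simulated annealing for a small travelling salesman instance.
rng(4);
N = 10;
theta = 2*pi*rand(1, N);
C = [cos(theta); sin(theta)];            % hypothetical cities on the unit circle
% length of the closed tour that visits the cities in the order p
tourLen = @(p) sum(sqrt(sum((C(:, p) - C(:, [p(2:end) p(1)])).^2, 1)));
tour = randperm(N);                      % random initial tour
initLen = tourLen(tour);
best = tour; bestLen = initLen;
T0 = 1; Tf = 1e-3; n = 20000;            % hypothetical cooling schedule
for t = 1:n
    T = T0 * (Tf / T0)^(t / n);
    ij = sort(randperm(N, 2));           % two distinct positions in the tour
    cand = tour;
    cand(ij(1):ij(2)) = tour(ij(2):-1:ij(1));   % reverse the segment between them
    d = tourLen(cand) - tourLen(tour);   % change in h
    if rand < min(exp(-d / T), 1)
        tour = cand;
        if tourLen(tour) < bestLen       % remember the best tour seen so far
            best = tour; bestLen = tourLen(tour);
        end
    end
end
disp([initLen bestLen]);
```

Keeping track of the best tour seen is a common practical addition, since the chain itself may wander away from a good tour while the temperature is still high.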

Gibbs Sampling

Gibbs sampling is another Markov chain Monte Carlo method, similar to Metropolis-Hastings. There are two main differences between Metropolis-Hastings and Gibbs sampling. First, the candidate state is always accepted as the next state in Gibbs sampling. Second, it is assumed that the full conditional distributions are known, i.e. $P(X_i=x|X_j=x_j, \forall j\neq i)$ for all $\displaystyle i$. The idea is that it is easier to sample from conditional distributions which are sets of one dimensional distributions than to sample from a joint distribution which is a higher dimensional distribution. Gibbs is a way to turn the joint distribution into multiple conditional distributions.

- Advantage: sampling from conditional distributions may be easier than sampling from joint distributions

- Disadvantage: we do not necessarily know the conditional distributions

For example, if we want to sample from $\, f_{X,Y}(x,y)$, we need to know how to sample from $\, f_{X|Y}(x|y)$ and $\, f_{Y|X}(y|x)$. Suppose the chain starts with $\,(X_0,Y_0)$ and $(X_1,Y_1), \dots , (X_n,Y_n)$ have been sampled. Then,

$\, X_{n+1} \sim f_{X|Y}(x|Y_n), \quad Y_{n+1} \sim f_{Y|X}(y|X_{n+1})$

Gibbs sampling turns a multi-dimensional distribution into a set of one-dimensional distributions. If we want to sample from

$P_{X^1,\dots ,X^p}(x^1,\dots ,x^p)$

and the full conditionals are known, then:

$X^1_{n+1}\sim f(X^1|X^2_n,\dots ,X^p_n)$

$X^2_{n+1}\sim f(X^2|X^1_{n+1},X^3_n,\dots ,X^p_n)$

$\vdots$

$X^{p-1}_{n+1}\sim f(X^{p-1}|X^1_{n+1},\dots ,X^{p-2}_{n+1},X^p_n)$

$X^p_{n+1}\sim f(X^p|X^1_{n+1},\dots ,X^{p-1}_{n+1})$

With Gibbs sampling, we can simulate $\displaystyle p$ random variables sequentially from $\displaystyle p$ univariate conditionals rather than generating one $\, p$-dimensional vector using the full joint distribution, which could be a lot more complicated.

Computational inference deals with probabilistic graphical models. Gibbs sampling is useful here: graphical models show the dependence relations among random variables. For instance, Bayesian networks are graphical models represented using directed acyclic graphs. Looking at such a graphical model tells us on which random variable the distribution of a certain random variable depends (i.e. its parent). The model can be used to "factor" a joint distribution into conditional distributions.

Figure: Sample graphical model of five RVs

For example, consider the five random variables A, B, C, D, and E. Without making any assumptions about dependence relations among them, all we know is

$\, P(A,B,C,D,E)=P(A|B,C,D,E) P(B|C,D,E) P(C|D,E) P(D|E) P(E)$

However, if we know the relation between the random variables, e.g. given the graphical model in the figure above, we can simplify this expression:

$\, P(A,B,C,D,E)=P(A) P(B|A) P(C|A) P(D|C) P(E|C)$

Although the joint distribution may be very complicated, the conditional distributions may not be.


Example of Gibbs sampling: Multi-variate normal

We'd like to generate samples from a bivariate normal with parameters

$\mu = \begin{bmatrix}1\\ 2 \end{bmatrix} = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}$ and $\Sigma = \begin{bmatrix}1 & 0.9 \\ 0.9 & 1 \end{bmatrix}= \begin{bmatrix}1 & \rho \\ \rho & 1 \end{bmatrix}$

The conditional distributions of multi-variate normal random variables are also normal:

$\, f(x_1|x_2)=N(\mu_1 + \rho(x_2-\mu_2), 1-\rho^2)$

$\, f(x_2|x_1)=N(\mu_2 + \rho(x_1-\mu_1), 1-\rho^2)$

In general, if the joint distribution has parameters

$\mu = \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}$ and $\Sigma = \begin{bmatrix} \Sigma _{1,1} & \Sigma _{1,2} \\ \Sigma _{2,1} & \Sigma _{2,2} \end{bmatrix}$

then the conditional distribution $\, f(x_1|x_2)$ has mean $\, \mu_{1|2} = \mu_1 + \Sigma _{1,2}(\Sigma _{2,2})^{-1}(x_2-\mu_2)$ and variance $\, \Sigma _{1|2} = \Sigma _{1,1}-\Sigma _{1,2}(\Sigma _{2,2})^{-1}\Sigma _{2,1}$.
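As a quick check that the general formulas reduce to the bivariate expressions above, the sketch below evaluates them for the example parameters at a hypothetical observed value `x2 = 3`.

```matlab
% Sketch: conditional mean and variance of X_1 given X_2 = x2 for the example above.
mu    = [1; 2];
Sigma = [1 0.9; 0.9 1];
x2    = 3;                                                      % hypothetical value of X_2
condMean = mu(1) + Sigma(1,2) / Sigma(2,2) * (x2 - mu(2));      % 1 + 0.9*(3-2) = 1.9
condVar  = Sigma(1,1) - Sigma(1,2) / Sigma(2,2) * Sigma(2,1);   % 1 - 0.9^2 = 0.19
disp([condMean condVar]);
```

With unit variances, $\Sigma_{1,2}(\Sigma_{2,2})^{-1} = \rho$, which gives the mean $\mu_1 + \rho(x_2-\mu_2)$ and the variance $1-\rho^2$.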

Thus, the algorithm for Gibbs sampling is:

• 1) Choose an initial value $X^{(0)}$ and set i = 1
• 2) Draw $X_1^{(i)}$ ~ $N(\mu_1 + \rho(X_2^{(i-1)}-\mu_2), 1-\rho^2)$
• 3) Draw $X_2^{(i)}$ ~ $N(\mu_2 + \rho(X_1^{(i)}-\mu_1), 1-\rho^2)$
• 4) Set $X^{(i)} = [X_1^{(i)}, X_2^{(i)}]^T$
• 5) Increment i and return to step 2

The Matlab code implementation of this algorithm is as follows:

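mu = [1; 2];
sigma = [1 0.9; 0.9 1];
X = zeros(2,2000);          % preallocate the chain
X(:,1) = [1; 2];            % start the chain at the mean
r = 1 - 0.9^2;              % conditional variance 1 - rho^2
for i = 2:2000
% randn has unit variance, so scale by the standard deviation sqrt(r)
X(1,i) = mu(1) + 0.9*(X(2,i-1) - mu(2)) + sqrt(r)*randn;
X(2,i) = mu(2) + 0.9*(X(1,i) - mu(1)) + sqrt(r)*randn;
end
plot(X(1,:),X(2,:),'.')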


Which gives the following plot:
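One way to check such a sampler, sketched below, is to compare the sample mean and sample covariance of the chain with the target parameters. The chain length and the burn-in of 1000 draws are arbitrary choices; note that the noise is scaled by the conditional standard deviation $\sqrt{1-\rho^2}$.

```matlab
% Sketch: Gibbs sampler for the bivariate normal, followed by a moment check.
rng(5);
mu = [1; 2]; rho = 0.9;
n = 20000; burn = 1000;            % hypothetical chain length and burn-in
s = sqrt(1 - rho^2);               % conditional standard deviation
X = zeros(2, n);
X(:, 1) = mu;                      % start at the mean
for i = 2:n
    X(1, i) = mu(1) + rho * (X(2, i-1) - mu(2)) + s * randn;
    X(2, i) = mu(2) + rho * (X(1, i)   - mu(1)) + s * randn;
end
Xs = X(:, burn+1:end);             % discard the burn-in draws
sampleMean = mean(Xs, 2);          % should be close to [1; 2]
sampleCov  = cov(Xs');             % should be close to [1 0.9; 0.9 1]
disp(sampleMean'); disp(sampleCov);
```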

Principal Component Analysis (PCA) - November 8, 2011

Principal Component Analysis is an algorithm, more than 100 years old, used for reducing the dimensionality of data. As the number of dimensions increases, the number of data points needed to sample accurately increases exponentially.

$\, x\in \mathbb{R}^D \rarr y\in \mathbb{R}^d$

$\ d \le D$

We want to transform $\, x$ to $\, y$ such that we reduce the dimensionality yet lose little information. Generally, variation in the data provides information. Thus we would like to reduce the dimensionality but keep as much variation, or information, as the original set of data. Also, covariance amongst the data reduces the amount of information we can infer from the data. Therefore, we would like to also reduce covariance when we reduce dimensionality.

For example, consider dots in a three dimensional space. By unrolling the 2D manifold that they are on, we can reduce the data to 2D while losing little information. Note: This is not an application of PCA, but it simply illustrates one way we can reduce dimensionality.

Principal Component Analysis allows us to reduce data to a linear subspace of its original space. It works best when the data lie in, or close to, a lower dimensional subspace of the original space.

Probabilistic View

We can see a data set $\, x$ as a high dimensional random variable governed by a low dimensional random variable $\, y$. Given $\, x$, we are trying to estimate $\, y$.

We can see this in 2D linear regression, as the locations of data points in a scatter plot are governed by its approximate linear regression. The subspace that we have reduced the data to here is in the direction of variation in the data.

Principal Component Analysis

Principal component analysis is an orthogonal linear transformation on a data set. It associates the data coordinates with a new set of orthogonal vectors, each representing the direction of the maximum variance of the data. That is, the first principal component is the direction of the maximum variance, the second principal component is the direction of the maximum variance orthogonal to the first principal component, the third principal component is the direction of the maximum variance orthogonal to the first and second principal component and so on, until we have D principal components, where D is the dimension of the original data.

Suppose we have data represented by $\, X \in \mathbb{R}^{D \times n}$

Note that we are assuming that the data is mean centered. In other words, the average of every row is zero. If it isn't, we shift the data to have a mean of zero by subtracting the mean of every row from each $\ x$ value in that row. This pre-processing step essentially alters the data such that each row in $\ X$ includes only how the data differs from the mean of the sample, hence ensuring that the first PC describes the direction of maximum variance.

To find the first principal component, we want to find a unit vector $\ W \in \mathbb{R}^{D}$ that maximizes the variance of $\,W^TX$. We restrict $\,W$ to unit vectors since we are only looking for the direction of maximum variation; its scale is irrelevant. So $\,W^TW = 1$.

The variance of $\,W^TX$ is $\,W^TSW$ where $\,S$ is the covariance matrix of X.

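$\, S = (X-\mu)(X-\mu)^T = XX^T$, since $\ \mu$ is just the zero vector after we center the data around the mean. (Strictly, the sample covariance also carries a constant factor such as $\frac{1}{n}$; we drop it because a constant factor does not change the eigenvectors.)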

So we have to solve the problem

$\, \text {Max } W^TSW \text{ such that } W^TW = 1$

Using the method of Lagrange multipliers, we have

$\,L(W, \lambda) = W^TSW - \lambda(W^TW - 1)$

We set

$\, \frac{\partial L}{\partial W} = 0$

Note that $\, W^TSW$ is a quadratic form. So we have

$\, \frac{\partial L}{\partial W} = 2SW - 2\lambda W = 0$

$\, SW = \lambda W$

Since S is a matrix and $\lambda$ is a scalar, W is an eigenvector of S and $\lambda$ is its corresponding eigenvalue.

Suppose that

$\, \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D$ are eigenvalues of S and $\, u_1, u_2, \cdots u_D$ are their corresponding eigenvectors.

Suppose we choose $\, W = u$, where $\,u$ is a unit eigenvector of S with eigenvalue $\lambda$. Then

$\,u^TSu =u^T\lambda u = \lambda u^Tu = \lambda$

So to maximize $\, u^TSu$, choose the eigenvector corresponding to the largest eigenvalue, i.e. $\, u_1$.

So we let $\, W = u_1$ be the first principal component.

The principal components decompose the total variance in the data.

$\, \sum_{i=1}^D \text{Var}(u_i^TX) = \sum_{i=1}^D \lambda_i = \text{Tr}(S) = \sum_{i=1}^D \text{Var}(x_i)$
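These facts can be checked numerically. The sketch below uses hypothetical three-dimensional data generated from a made-up mixing matrix `A`, finds the eigenvectors of $S = XX^T$, and compares $W^TSW$ for the first principal component with the largest eigenvalue and the sum of eigenvalues with $\text{Tr}(S)$.

```matlab
% Sketch: first principal component from the eigen-decomposition of S = X*X'.
rng(6);
D = 3; n = 500;
A = [2 0 0; 1 1 0; 0.5 0.2 0.1];          % hypothetical mixing matrix
X = A * randn(D, n);                      % D-by-n data, one point per column
X = X - repmat(mean(X, 2), 1, n);         % centre each row
S = X * X';                               % covariance up to a constant factor
S = (S + S') / 2;                         % guard against rounding asymmetry
[U, L] = eig(S);
[lambda, idx] = sort(diag(L), 'descend'); % eigenvalues in decreasing order
U = U(:, idx);
w = U(:, 1);                              % first principal component
varW = w' * S * w;                        % equals lambda(1)
disp([varW lambda(1) sum(lambda) trace(S)]);
```

Any other unit vector gives a value of $W^TSW$ no larger than `lambda(1)`.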

Singular Value Decomposition

Singular value decomposition is a "generalization" of eigenvalue decomposition "to rectangular matrices of size mxn."<ref name="Abdel_SVD">Abdel-Rahman, E. (2011). Singular Value Decomposition [Lecture notes]. Retrieved from http://uwace.uwaterloo.ca</ref> Singular value decomposition solves:

$\ A_{m\times n}\ v_{n\times 1}=s\ u_{m\times 1}$

"for the right singular vector v, the singular value s, and the left singular vector u. There are n singular values $s_i$ and n right and left singular vectors that must satisfy the following conditions"<ref name="Abdel_SVD"/>:

1. "All singular values are non-negative"<ref name="Abdel_SVD"/>,
$\ s_i \ge 0.$
2. All "right singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>,
$\ v_i^{T}v_j=\delta_{i,j}.$
3. All "left singular vectors are pairwise orthonormal"<ref name="Abdel_SVD"/>,
$\ u_i^{T}u_j=\delta_{i,j}.$

where

$\delta_{i,j}=\left\{\begin{matrix}1 & \mathrm{if}\ i=j \\ 0 & \mathrm{if}\ i\neq j\end{matrix}\right.$

Procedure to find the singular values and vectors

Observe the following about the eigenvalue decomposition of a real symmetric square matrix A, where v is a unit eigenvector (symmetry, $A^T=A$, is what gives $A^Tv=\lambda v$ in the last step):

\begin{align} & Av=\lambda v \\ & A^TAv=\lambda A^Tv \\ & A^TAv=\lambda^2v \end{align}

As a result:

1. "The matrices A and $A^TA$ have the same eigenvectors."<ref name="Abdel_SVD"/>
2. "The eigenvalues of matrix $A^TA$ are the square of the eigenvalues of matrix A."<ref name="Abdel_SVD"/>
3. Since matrix $A^TA$ is symmetric for any matrix A,
   1. "all the eigenvalues of matrix $A^TA$ are real and distinct."<ref name="Abdel_SVD"/>
   2. "the eigenvectors of matrix $A^TA$ are orthogonal and can be chosen to be orthonormal."<ref name="Abdel_SVD"/>
4. "The eigenvalues of matrix $A^TA$ are non-negative"<ref name="Abdel_SVD"/> since $\ \lambda^2_i \ge 0.$

Conclusions 3 and 4 are "true even for a rectangular matrix A since $A^TA$ is still a square symmetric matrix"<ref name="Abdel_SVD"/> and its eigenvalues and eigenvectors can be found.

Therefore, for a rectangular matrix A, assuming m>n, the singular values and vectors can be found by:

1. "Form the $n\times n$ symmetric matrix $A^TA$."<ref name="Abdel_SVD"/>
2. Perform an eigenvalue decomposition to get n eigenvalues and their "corresponding eigenvectors, ordered such that"<ref name="Abdel_SVD"/>
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$ and $\{v_1, v_2, \dots, v_n\}.$
3. "The singular values are"<ref name="Abdel_SVD"/>:
$s_1=\sqrt{\lambda_1} \ge s_2=\sqrt{\lambda_2} \ge \dots \ge s_n=\sqrt{\lambda_n} \ge 0.$
"The non-zero singular values are distinct; the equal sign applies only to the singular values that are equal to zero."<ref name="Abdel_SVD"/>
4. "The n-dimensional right singular vectors are"<ref name="Abdel_SVD"/>
$\{v_1, v_2, \dots, v_n\}.$
5. "For the first $r \le n$ singular values such that $s_i \gt 0$, the left singular vectors are obtained as unit vectors"<ref name="Abdel_SVD"/> by $\tfrac{1}{s_i}Av_i=u_i.$
6. Select "the $\ m-r$ left singular vectors corresponding to the zero singular values such that they are unit vectors orthogonal to each other and to the first r left singular vectors"<ref name="Abdel_SVD"/> $\{u_1, u_2, \dots, u_r\}.$

Finding the Singular Value Decomposition Using MATLAB Code
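The procedure above can be sketched in MATLAB as follows. The matrix `A` is a hypothetical 3-by-2 example with two non-zero singular values, so only steps 1 to 5 are needed and the result is the "economy-size" decomposition (the remaining $m-r$ left singular vectors of step 6 are not computed). The built-in `svd` is used only for comparison.

```matlab
% Sketch: singular values and vectors from the eigen-decomposition of A'*A.
A = [3 1; 1 3; 0 2];                          % hypothetical 3-by-2 matrix (m > n)
[V, L] = eig(A' * A);                         % steps 1-2: eigen-decomposition of A'*A
[lambda, idx] = sort(diag(L), 'descend');     % order the eigenvalues decreasingly
V = V(:, idx);                                % step 4: right singular vectors
s = sqrt(lambda);                             % step 3: singular values
U = (A * V) ./ repmat(s', size(A, 1), 1);     % step 5: u_i = A*v_i / s_i
reconErr = norm(A - U * diag(s) * V');        % should be zero up to rounding
sBuiltin = svd(A);                            % singular values from the built-in
disp([s sBuiltin]);
```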

Formal definition

"We can now decompose the rectangular matrix A in terms of singular values and vectors as follows"<ref name="Abdel_SVD"/>:

$A_{m\times n} \begin{bmatrix} v_1 & | & \cdots & | & v_n \end{bmatrix}_{n\times n} = \begin{bmatrix} u_1 & | & \cdots & | & u_n & | I_{n+1} & | & \cdots & | & I_m \end{bmatrix}_{m\times m} \begin{bmatrix} s_1 & 0 & \cdots & 0 \\ 0 & s_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & s_n \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}_{m\times n}$

Where $\ AV=US$

Since "the matrices V and U are orthogonal"<ref name="Abdel_SVD"/>, $V^{-1}=V^T$ and $U^{-1}=U^T$:

$\ A=USV^T$

"which is the formal definition of the singular value decomposition."<ref name="Abdel_SVD"/>

Relevance to PCA

In order to perform PCA, one needs to do eigenvalue decomposition on the covariance matrix. By transforming the mean for all attributes to zero, the covariance matrix can be simplified to:

$\ S=XX^T$

Since the eigenvectors of $XX^T$ are the left singular vectors of X, an additional and more consistent method for performing PCA is through the singular value decomposition of X (an eigenvalue decomposition does not exist if the matrix of eigenvectors is not invertible, whereas a singular value decomposition always exists).

The following MATLAB code uses singular value decomposition for performing PCA; 20 principal components, and thus the top 20 maximum variation directions, are selected for reconstructing facial images that have had noise applied to them:

load noisy.mat
%first noisy image; each image has a resolution of 20x28
imagesc(reshape(X(:,1),20,28)')
%to grayscale
colormap gray
%singular value decomposition
[u s v]=svd(X);
%reduced feature space: 20 principal components
Xh=u(:,1:20)*s(1:20,1:20)*v(:,1:20)';
figure
imagesc(reshape(Xh(:,1),20,28)')
colormap gray


The image reconstructed from the reduced feature space is visibly less noisy, which suggests that the added noise has less variation than the directions captured by the top 20 principal components.

<references/>

PCA and Introduction to Kernel Function - November 10, 2011

(Continued from the last lecture)

Some notation:

Let $\displaystyle X$ be a d by n matrix.

Let $\displaystyle X_j\in\R^d$ be the j th column of $\displaystyle X, \forall j=1,2,...,n$.

Let $\displaystyle Q$ be the covariance matrix of $\displaystyle X$. So, up to a constant factor, $\displaystyle Q=\sum_{j=1}^n(X_j-\bar{X})(X_j-\bar{X})^T$, where $\bar{X}=\frac{1}{n}\sum_{j=1}^n X_j$. Strictly speaking, $\displaystyle Q$ is only the observed covariance matrix from the data collected as the real covariance matrix is unknown.

But we are assuming that we have already centered the data, which means that $\bar{X}=0$.

So $\displaystyle Q=\sum_{j=1}^n X_jX_j^T=X X^T$.

Now the principal components of $\displaystyle X$ are the eigenvectors of $\displaystyle Q$. If we do the singular value decomposition, setting $\ [u\ s\ v] = svd(Q)$, then the columns of $\ u$ are eigenvectors of $\displaystyle Q=X X^T$.

Let $\displaystyle u$ be a $\displaystyle d\times p$ matrix, where $\displaystyle p \lt d$, that is composed of the $\ p$ eigenvectors of $\displaystyle Q$ corresponding to the largest eigenvalues.

To map the data in a lower dimensional space we project $\displaystyle X$ onto the $\ p$ dimensional subspace defined by the columns of $\displaystyle u$, which are the first $\ p$ principal components of $\displaystyle X$.

Let $\displaystyle Y_{p\times n}={u^T}_{p\times d} X_{d\times n}$. So $\displaystyle Y_{p\times n}$ is a lower dimensional approximation of our original data $\displaystyle X_{d\times n}$

We can also approximately reconstruct the original data using the dimension-reduced data. However, we will lose some information because when we map those points into lower dimension, we throw away the last $\ (d - p)$ eigenvectors which contain some of the original information.

$\displaystyle X'_{d\times n} = {u}_{d\times p} Y_{p\times n}$, where $\displaystyle X'_{d\times n}$ is an approximate reconstruction of $\displaystyle X_{d\times n}$.
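The projection and reconstruction steps can be sketched on hypothetical data that lie close to a two-dimensional subspace of $\R^5$. The sizes `d`, `n`, `p` and the noise level are made up for illustration.

```matlab
% Sketch: project onto the first p principal components and reconstruct.
rng(7);
d = 5; n = 200; p = 2;
X = randn(d, p) * randn(p, n) + 0.01 * randn(d, n);   % hypothetical near-rank-2 data
X = X - repmat(mean(X, 2), 1, n);                     % centre each row
[u, s, v] = svd(X);
up   = u(:, 1:p);                  % d-by-p matrix of the first p principal components
Y    = up' * X;                    % p-by-n low dimensional representation
Xrec = up * Y;                     % d-by-n approximate reconstruction
relErr = norm(X - Xrec, 'fro') / norm(X, 'fro');
disp(relErr);
```

The squared reconstruction error equals the sum of the squared singular values that were discarded, so it is small exactly when the last $(d-p)$ directions carry little variation.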

Example Using Handwritten 2s and 3s

The data X is a 64 by 400 matrix. Every column represents an 8 by 8 image of either a handwritten "2" or "3". The first 200 columns are 2s and the last 200 columns are 3s. First we center the data, and then we find the first 2 eigenvectors of $\displaystyle XX^T$ using singular value decomposition. Finally we calculate $\displaystyle Y = u^TX$ and plot the 400 data points in $\displaystyle Y$

MATLAB CODE:

MU = repmat(mean(X,2),1,400);
% mean(X,2) is the average of each row, stored in a column vector.
% In order to center the data, we extend mean(X,2), which is a 64 by 1 matrix, into a 64 by 400 matrix.

Xt = X-MU;
% We have modified the data to zero mean data by subtracting the average of each row from every entry in that row.

[u s v] = svd(Xt);
% Note that size(u) is 64 by 64, and the columns of u are eigenvectors of the covariance matrix Xt*Xt'.

Y = u(:,1:2)'*Xt;
% We project the centered data Xt onto the subspace defined by the first two PCs to get Y.
% This transforms the high dimensional points to lower 2 dimensional ones.

plot(Y(1,1:200)',Y(2,1:200)','d')
hold on
plot(Y(1,201:400)',Y(2,201:400)','ro')
% We now plot the lower dimensional projection of X.
% Essentially we are plotting each point based on the magnitude of Principal Component #1
% and Principal Component #2 in that point.
% Note that the first 200 columns represent 2s and are recorded by blue diamonds.
% Note that the next 200 columns represent 3s and are recorded with red "o"s.


The result is as follows. We can see clearly that there are two classes: the 2s and 3s are generally divided into two sections.

In order to analyze the projection in more detail we can plot the original images on the graph.

image = reshape(X,8,8,400);
plotdigits(image,Y,.1,1);


The result can now be seen more clearly from the following picture:

By examining this plot we can infer the approximate "meaning" of the first two principal components in this data. For instance, the points at the top of the plot tend to be slanted to the right, while the ones at the bottom are slanted to the left. So the second principal component quantifies the amount of slant in the number.

Introduction to Kernel Function

PCA is useful when the data points lie in or close to a plane (more generally, a linear subspace). This means that PCA is powerful when dealing with linear problems. But when the data points lie on a nonlinear manifold, PCA performs poorly. There is a solution to this problem: we can use a method that changes the linear algorithm into a nonlinear one. This is called the "Kernel Trick".

An intuitive example

From the picture, we can see the red circles are in the middle of the blue Xs. It is hard to separate those two classes by using any linear function (lines in the two dimensional space). But we can use a mapping $\ \phi$ to project the points onto a three dimensional space. Once the blue Xs and red circles have been mapped to a three dimensional space in this way it is easy to separate them using a linear function.

The significance of a kernel function is that, through the mapping $\displaystyle \phi$, we can implicitly change the data into a high dimension. Let's look at how this is possible:

$Z_1= \begin{bmatrix} x_1\\ y_1 \end{bmatrix}\xrightarrow{\phi}$ $\phi(Z_1)= \begin{bmatrix} x_1^2\\ y_1^2\\ \sqrt2x_1y_1 \end{bmatrix}$

$Z_2= \begin{bmatrix} x_2\\ y_2 \end{bmatrix}\xrightarrow{\phi}$ $\phi(Z_2)= \begin{bmatrix} x_2^2\\ y_2^2\\ \sqrt2x_2y_2 \end{bmatrix}$

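The inner product of $\displaystyle \phi(Z_1)$ and $\displaystyle\phi(Z_2)$, which is denoted as $\displaystyle\phi(Z_1)^T\phi(Z_2)$, is equal to:

$\begin{bmatrix} x_1^2&y_1^2&\sqrt2x_1y_1 \end{bmatrix} \begin{bmatrix} x_2^2\\ y_2^2\\ \sqrt2x_2y_2 \end{bmatrix}=$ $\displaystyle x_1^2x_2^2+y_1^2y_2^2+2x_1y_1x_2y_2=(x_1x_2+y_1y_2)^2=(Z_1^TZ_2)^2=\langle Z_1,Z_2\rangle ^2=K(Z_1,Z_2)$.

So the kernel $\displaystyle K$ can be evaluated directly from the original two dimensional points, without ever computing $\displaystyle \phi$.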
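The identity can be verified numerically for any pair of points. The two points below are hypothetical.

```matlab
% Sketch: the explicit feature map and the kernel give the same inner product.
phi = @(z) [z(1)^2; z(2)^2; sqrt(2) * z(1) * z(2)];   % feature map from R^2 to R^3
Z1 = [1; 2];                    % hypothetical points
Z2 = [3; -1];
lhs = phi(Z1)' * phi(Z2);       % inner product in the three dimensional space
rhs = (Z1' * Z2)^2;             % kernel evaluated in the original space
disp([lhs rhs]);                % both equal 1 up to rounding
```

The right-hand side needs only one inner product in $\R^2$, which is why the mapped points never have to be formed explicitly.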

The most common Kernel functions are as follows:

• Linear: $\displaystyle K_{ij}=\langle X_i,X_j\rangle$
• Polynomial: $\displaystyle K_{ij}=(1+\langle X_i,X_j\rangle )^p$
• Gaussian: $\displaystyle K_{ij}=e^\frac{-{\left\| X_i-X_j\right\|}^2}{2\sigma^2}$
• Exponential: $\displaystyle K_{ij}=e^\frac{-{\left\| X_i-X_j\right\|}}{2\sigma^2}$

Note: ${\left\Vert X_i-X_j\right\|}$ denotes the distance between $\displaystyle X_i$ and $\displaystyle X_j$.
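The sketch below computes the four kernel matrices for a small hypothetical data set with one point per column. The values of `sigma` and `p` are arbitrary, and the squared distances are obtained from the inner products via $\|X_i-X_j\|^2 = \langle X_i,X_i\rangle + \langle X_j,X_j\rangle - 2\langle X_i,X_j\rangle$.

```matlab
% Sketch: kernel matrices for data stored as columns of X.
X = [0 1 0 2; 0 0 1 2];            % hypothetical data: 4 points in R^2
sigma = 1; p = 2;                  % hypothetical kernel parameters
n  = size(X, 2);
G  = X' * X;                       % G(i,j) = <X_i, X_j>
sq = diag(G);                      % squared norms of the points
D2 = repmat(sq, 1, n) + repmat(sq', n, 1) - 2 * G;   % squared distances
D2 = max(D2, 0);                   % remove tiny negative values caused by rounding
Klin   = G;                                  % linear
Kpoly  = (1 + G).^p;                         % polynomial
Kgauss = exp(-D2 / (2 * sigma^2));           % Gaussian
Kexp   = exp(-sqrt(D2) / (2 * sigma^2));     % exponential
disp(Kgauss);
```

Each matrix is n-by-n and symmetric, regardless of the dimension of the space that the kernel implicitly maps to.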

Since the kernel trick brings a lower dimensional problem to a higher dimensional space, it is affected by the curse of dimensionality. The curse of dimensionality states that it takes exponentially more points to estimate a solution in higher dimensions. For example, if $\R^1$ needs 10 points, then $\R^2$ needs 100, $\R^3$ needs 1000, etc.

Kernel PCA - November 15, 2011

PCA doesn't work well when the directions of variation in our data are nonlinear. Especially in the case where the dataset has very high dimensions, the dataset lies near or on a nonlinear manifold which prevents PCA from determining principal components correctly. To deal with this problem, we apply kernels to PCA. By transforming the original data with a nonlinear mapping, we can obtain much better principal components.

First we look at the algorithm for PCA and see how we can kernelize PCA:

PCA

Find the eigenvectors of $\ XX^T$ and collect them as the columns of $\ U$

to map data points to a lower dimensional space:
$\ Y = U^{T}X$
to reconstruct points:
$\ \hat{X} = UY$
to map a new point:
$\ y = U^{T}x$
to reconstruct point:
$\ \hat{x} = Uy$

Dual PCA

Consider the singular value decomposition of the n-by-m matrix X (n features, m sample points):

\begin{align} \left[ U\ \Sigma\ V \right] & = svd(X) \\ X & = U\Sigma{V^T} \end{align}

Where:

• The columns of U are the eigenvectors of $XX^T$ corresponding to the eigenvalues in decreasing order.
• The columns of V are the eigenvectors of $X^T{X}$.
• The diagonal matrix $\Sigma$ contains the square roots of the eigenvalues of $XX^T$ in decreasing order.

Now we want to kernelize this classical version of PCA. We would like to express everything in terms of V, which can be kernelized. The reason is that if n >> m (i.e. the dimension of the feature space is much larger than the number of sample points), the original PCA algorithm is impractical because it requires the eigenvectors of the n-by-n matrix $XX^T$, whereas V comes from the much smaller m-by-m matrix $X^TX$. We want an algorithm that depends less on n. This is called Dual PCA.

U in terms of V:

\begin{align} X &= U\Sigma V^T \\ XV &= U\Sigma V^T V = U\Sigma \\ XV\Sigma^{-1} &= U\Sigma\Sigma^{-1} = U \\ U &= XV\Sigma^{-1} \end{align}

Y in terms of V:

\begin{align} X&=U \Sigma V^T \\ U^TX &= U^TU\Sigma V^T \\ U^TX &= \Sigma V^T \\ Y&=\Sigma V^T \\ \end{align}

Reconstructed points in terms of V:

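\begin{align} \hat{X}&=UY \\ &=XV\Sigma^{-1}\Sigma{V^T} \\ &= XVV^T \\ &= X \end{align}

The last equality holds when all components are kept, so that $VV^T = I$. When only the top components are kept, $\hat{X}=XVV^T$ is an approximation of $X$.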

The value of y, a single point in the sample reduced to low-dimensional space, in terms of V:

\begin{align} y &= U^Tx \\ & = (XV\Sigma^{-1})^Tx \\ & = (\Sigma^{-1})^T{V^T}{X^T}x \\ & = \Sigma^{-1}{V^T}{X^T}x \end{align}

A single reconstructed point from the sample in terms of V:

\begin{align} \hat{x} &= Uy \\ & = UU^Tx \\ & =XV\Sigma^{-1}\Sigma^{-1}V^T{X^T}x \\ &= XV\Sigma^{-2}V^T{X^T}x \end{align}
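
As a rough numerical check of the dual PCA identities, the following MATLAB sketch compares the direct route (SVD of $X$) with the dual route (eigenvectors of the small matrix $X^TX$). The data matrix, the number of kept components `p`, and all variable names are assumptions made for illustration only. The mapping of a single point uses $y = \Sigma^{-1}V^TX^Tx$.

```matlab
% Toy data: n = 5 features (rows), m = 3 sample points (columns); values are made up
X = [2 0 1; 1 3 0; 0 1 4; 1 1 1; 3 0 2];
m = size(X, 2);
X = X - repmat(mean(X, 2), 1, m);      % center each feature

p = 2;   % components kept (assumed); centered X has rank at most m - 1

% Direct PCA: left singular vectors of X are the eigenvectors of X*X'
[U, S, V] = svd(X, 'econ');
U = U(:, 1:p);

% Dual PCA: eigen-decomposition of the small m-by-m matrix X'*X
G = X' * X;
G = (G + G') / 2;                      % guard against round-off asymmetry
[Vd, L] = eig(G);
[lambda, idx] = sort(diag(L), 'descend');
Vd = Vd(:, idx(1:p));                  % top-p eigenvectors (columns of V)
Sd = diag(sqrt(lambda(1:p)));          % Sigma: square roots of the eigenvalues

Ud   = X * Vd / Sd;                    % U = X V Sigma^{-1}
Y    = Sd * Vd';                       % Y = Sigma V^T
x    = X(:, 1);                        % a point from the sample
y    = Sd \ (Vd' * (X' * x));          % y = Sigma^{-1} V^T X^T x
xhat = X * Vd * ((Sd^2) \ (Vd' * (X' * x)));   % xhat = X V Sigma^{-2} V^T X^T x
```

Because eigenvectors are only defined up to sign, the comparison between `U` and `Ud` is best made through the projectors `U*U'` and `Ud*Ud'` rather than column by column.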

Kernel PCA

The nonlinear mapping $\displaystyle\phi$ allows for very high dimensional spaces and is never calculated explicitly.

$k(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x})^T\phi(\mathbf{y})$

Here $X$ is $d\times n$, $X^TX$ is $n\times n$, and the kernel matrix $K(X,X)$ is also $n\times n$. Many different kernels could be used.

Generally, we want to replace $X^TX$ with a kernel. The idea in Kernel PCA is that instead of finding the eigenvectors of $X^TX$, we can find the eigenvectors of a kernel.

Example: the Gaussian kernel $\displaystyle k(x_1,x_2)=e^{-\frac{\|x_1-x_2\|^2}{\sigma}}$

Find the eigenvectors of the matrix $K=\left(\begin{matrix}k(x_1,x_1)&\dots&k(x_1,x_n)\\ \vdots&\ddots&\vdots\\k(x_n,x_1)&\dots&k(x_n,x_n)\\\end{matrix}\right)$

However, since $\displaystyle\phi$ is never calculated explicitly, kernel PCA cannot reconstruct points using the equation $\hat{X} = XVV^T$, because here $X$ would have to be $\phi(X)$.

In kernel PCA, we substitute the kernel matrix $K$ for $\phi(X)^T \phi(X)$. This is correct only if $\phi(X)$ has a mean of zero.

We therefore need to find a way to center $\phi(X)$.

Centralizing Kernel PCA

Recall in regular PCA, our variance-covariance matrix $\Sigma$ was defined as $\Sigma = (X-\mu)(X-\mu)^T$.

With kernel PCA, we will never explicitly define $\phi(x)$ so we need a method to calculate its mean without calculating the actual transformation of $\phi(x)$ itself.

We define:

$\tilde{\phi}(x) = \phi(x) - E_x[\phi(x)]$

And thus, our kernel function becomes:

$\tilde{k}(\mathbf{x},\mathbf{y}) = \tilde{\phi}(\mathbf{x})^T\tilde{\phi}(\mathbf{y})$
$\! = (\phi(x) - E_x[\phi(x)])^T(\phi(y) - E_y[\phi(y)])$
$\! = \phi(x)^T\phi(y) - E_x[\phi(x)^T]\phi(y) - \phi(x)^TE_y[\phi(y)] + E_x[\phi(x)^T]E_y[\phi(y)]$
$\! = k(\mathbf{x},\mathbf{y}) - E_x[k(\mathbf{x},\mathbf{y})] - E_y[k(\mathbf{x},\mathbf{y})] + E_x[E_y[k(\mathbf{x},\mathbf{y})]]$

In practice, we would do the following:

1. Start with our matrix of data: $X_{d\times n}$

2. Choose a kernel function $k$

3. Compute $K$, an nxn matrix: $K=\left(\begin{matrix}k(x_1,x_1)&k(x_1,x_2)\dots&k(x_1,x_n)\\ \vdots&\ddots&\vdots\\k(x_n,x_1)&\dots&k(x_n,x_n)\\\end{matrix}\right)$

4. Find $\tilde{K} = K - \frac{1}{n}\sum_{i=1}^{n}K(:,i) - \frac{1}{n}\sum_{j=1}^{n}K(j,:) + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}K(i,j)$

Note: On the right hand side, the second item is the average of the columns of $K$, the third item is the average of the rows of $K$, and the last item is the average of all the entries of $K$. As a result, they are of dimension $n\times 1$, $1\times n$, and $1\times 1$ (a scalar) respectively. To do the actual arithmetic, all matrix dimensions must match, so the two vectors are replicated into $n\times n$ matrices. In MATLAB, this is accomplished through the repmat() function.

5. Find the eigenvectors of $\tilde{K}$. Take the first few and combine them in a matrix $V$, and use that as the $V$ in the dual PCA mapping and reconstruction steps shown above.
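
The five steps above can be sketched in MATLAB as follows. This is only an illustrative sketch: the toy data, the Gaussian kernel, the width `sigma`, and the number of components `p` are all assumed values, not part of the original notes.

```matlab
% Step 1: toy data, n = 6 points in d = 2 dimensions stored as columns (made-up values)
X = [0 1 2 3 4 5; 0 1 0 1 0 1];
n = size(X, 2);

% Step 2: Gaussian kernel with an assumed width sigma
sigma = 2;

% Step 3: kernel matrix K(i,j) = exp(-||x_i - x_j||^2 / sigma)
sq = sum(X.^2, 1);                                   % 1 x n squared norms
D2 = repmat(sq', 1, n) + repmat(sq, n, 1) - 2 * (X' * X);
K  = exp(-D2 / sigma);

% Step 4: center K, using repmat so that all dimensions match
avgOfCols = mean(K, 2);      % n x 1: (1/n) * sum_i K(:,i)
avgOfRows = mean(K, 1);      % 1 x n: (1/n) * sum_j K(j,:)
avgAll    = mean(K(:));      % scalar
Kc = K - repmat(avgOfCols, 1, n) - repmat(avgOfRows, n, 1) + avgAll;
Kc = (Kc + Kc') / 2;         % remove round-off asymmetry

% Step 5: top-p eigenvectors of the centered kernel, then the dual PCA mapping
p = 2;
[V, L] = eig(Kc);
[lambda, idx] = sort(diag(L), 'descend');
V = V(:, idx(1:p));
Y = diag(sqrt(lambda(1:p))) * V';    % Y = Sigma V^T, a p x n embedding
```

The centered matrix `Kc` should agree with $HKH$ for $H = I - \frac{1}{n}ee^T$, which is a convenient way to verify the repmat arithmetic.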

Multidimensional Scaling (MDS)

Most of the common linear and polynomial kernels do not give very good results in practice. For real data, there is an alternate approach that can yield much better results but, as we will see later, is equivalent to a kernel PCA process.

Multidimensional Scaling (MDS) is an algorithm, much like PCA, that maps the original high dimensional space to a lower dimensional space.

Introduction to MDS

The main purpose of MDS is to map high dimensional data to a lower dimensional space while preserving the pairwise distances between points, i.e., MDS addresses the problem of constructing a configuration of $\ n$ points in Euclidean space using information about the distances between the $\ n$ patterns.

It may not be possible to preserve the exact distances between points, so in this case we need to find a representation that is as close as possible.

Definition: An $\ n\times n$ matrix $\ D$ is called a distance matrix if:

$\ D$ is symmetric and its entries $\ d_{ij}$ have the properties:

$\ d_{ii} = 0$ and $\ d_{ij} \gt 0$ $\forall i\neq j$

Given a distance matrix $\ D^{(X)}$, MDS attempts to find $\ n$ data points $\ y_1,...,y_n$ in $\ d$ dimensions, such that if $\ d_{ij}^{(Y)}$ denotes the Euclidean distance between $\ y_i, y_j$ then $\ D^{(Y)}$ is similar to $\ D^{(X)}$.

Metric MDS

One of the possible methods to preserve the pairwise distance is to use Metric MDS, which attempts to minimize: $\min_Y \sum_{i=1}^{n}\sum_{j=1}^{n}(d_{ij}^{(X)} - d_{ij}^{(Y)})^2$

where $d_{ij}^{(X)} = ||x_i-x_j||^2$ and $d_{ij}^{(Y)} = ||y_i-y_j||^2$

Continued with MDS, Isomap and Classification - November 22, 2011

The distance matrix can be converted to a kernel matrix of inner products $\ X^TX$ by $\ X^{T}X=-\frac{1}{2}HD^{(X)}H$,

where $H=I-\frac{1}{n}ee^{T}$ and $\ e$ is a column vector of all ones.
$e=\left[\begin{array}{c} 1\\ \vdots\\ 1 \end{array}\right]_{(n\times1)}$ $ee^{T}=\left[\begin{array}{ccc} 1 & \cdots & 1\\ \vdots & \ddots & \vdots\\ 1 & \cdots & 1 \end{array}\right]_{(n\times n)}$ $H=\left[\begin{array}{ccc} 1 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & 1 \end{array}\right]-\frac{1}{n}\left[\begin{array}{ccc} 1 & \cdots & 1\\ \vdots & \ddots & \vdots\\ 1 & \cdots & 1 \end{array}\right]$

Now $\min_Y \sum_{i=1}^{n}\sum_{j=1}^{n}(d_{ij}^{(X)} - d_{ij}^{(Y)})^2$ can be reduced to $\min_Y \sum_{i=1}^{n}\sum_{j=1}^{n}(x_{i}^{T}x_{j}-y_{i}^{T}y_{j})^{2}$ $\Leftrightarrow$$\min_Y \ Tr(X^{T}X-Y^{T}Y)^{2}$.

So far we have rewritten the objective function, but we must remember that our definition of $\ D$ introduces constraints on its components. We must ensure that these constraints are respected in the way we have expressed the minimization problem with traces. Luckily, we can capture these constraints with a single requirement thanks to the following theorem:

Theorem: Let $\ D$ be a distance matrix and $\ K=X^{T}X=-\frac{1}{2}HD^{(X)}H$. Then $\ D$ is Euclidean if and only if $\ K$ is a positive semi-definite matrix.

Therefore, to complete the rewriting of our original minimization problem from norms to traces, it suffices to impose that $\ K$ be $\ p.s.d.$, as this guarantees that $\ D$ is a distance matrix and the components of $\ X^{T}X$ and $\ Y^{T}Y$ satisfy the original constraints.

Proceeding with the eigendecomposition (which coincides with the singular value decomposition for symmetric positive semi-definite matrices), $\ X^{T}X$ and $\ Y^{T}Y$ can be decomposed as:

$\ X^{T}X=V\Lambda V^{T}$
$Y^{T}Y=Q\hat{\Lambda} Q^{T}$

Since $\ Y^{T}Y$ is $\ p.s.d.$ , $\hat{\Lambda}$ has no negative value and therefore: $Y=\hat{\Lambda}^{\frac{1}{2}}Q^{T}$.

The above definitions help rewrite the cost function as:

$\min_{Q,\hat{\Lambda}}Tr(V\Lambda V^{T}-Q\hat{\Lambda}Q^{T})^{2}$

Multiplying by $\ V^{T}$ on the left and $\ V$ on the right, which leaves the trace unchanged because $\ V$ is orthogonal, we get

$\min_{Q,\hat{\Lambda}}Tr(\Lambda-V^{T}Q\hat{\Lambda} Q^{T}V)^{2}$

Then let $\ G=V^{T}Q$

We can rewrite the target function as:

$\min_{G,\hat{\Lambda}}Tr(\Lambda-G\hat{\Lambda}G^{T})^{2}$ $=$$\min_{G,\hat{\Lambda}}Tr(\Lambda^{2}+G\hat{\Lambda}G^{T}G\hat{\Lambda}G^{T}-2\Lambda G\hat{\Lambda}G^{T})$

For a fixed $\hat{\Lambda}$ we can minimize for G. The result is that $\ G=I$. Then we can simplify the target function:

$\min_{\hat{\Lambda}}Tr(\Lambda^{2}+\hat{\Lambda}^{2}-2\Lambda\hat{\Lambda})$ $=$ $\min_{\hat{\Lambda}}Tr(\Lambda-\hat{\Lambda})^{2}$

Obviously, $\hat{\Lambda}=\Lambda$.

If $\ X$ is a $d\times n$ matrix with $\ d\lt n$, the rank of $\ X$ is no greater than $\ d$. Then the rank of $X^{T}X_{(n\times n)}$ is also no greater than $\ d$, since $\ rank(X^{T}X) = rank(X)$. The dimension of $\ Y$ is smaller than that of $\ X$, so the rank of $\ Y^{T}Y$ is smaller than $\ d$.

Since we want to do dimensionality reduction and make $\ \Lambda$ and $\hat{\Lambda}$ as similar as possible, we can let $\hat{\Lambda}$ be the top $\ d$ diagonal elements of $\ \Lambda$.

We also have $\ G = V^{T}Q$. Since $\ G = I$ and $\ V$ is orthogonal, it follows that $\ Q=V$.

The solution is:

$Y=\Lambda^{\frac{1}{2}}V^{T}$

where the columns of $\ V$ are the eigenvectors of $\ X^{T}X$ corresponding to the top $\ d$ eigenvalues, and $\Lambda$ is the diagonal matrix of the top $\ d$ eigenvalues of $\ X^{T}X$.

Compare this with dual PCA.

In dual PCA, the result is

$\ Y=\Sigma V^{T}$.

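Since $\Sigma$ holds the square roots of the eigenvalues, $\Sigma = \Lambda^{\frac{1}{2}}$, so the result of dual PCA is the same as that of MDS. In fact, one property of PCA is that it tries to preserve the pairwise distances between data points when mapping from the high dimensional space to the low dimensional space.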
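
To make the comparison concrete, here is a small MATLAB sketch that runs classical MDS on the squared distance matrix of a centered toy data set and compares the embedding with the dual PCA result. The data values and the target dimension `p` are assumptions chosen for illustration.

```matlab
% Toy data: d = 3, n = 5, points stored as columns (made-up values)
X = [0 1 2 3 4; 0 2 1 3 5; 1 0 1 0 2];
n = size(X, 2);
X = X - repmat(mean(X, 2), 1, n);            % center so that X*e = 0

% Squared Euclidean distance matrix D(i,j) = ||x_i - x_j||^2
s = sum(X.^2, 1)';
D = repmat(s, 1, n) + repmat(s', n, 1) - 2 * (X' * X);

% Convert distances to inner products: K = -1/2 * H * D * H
H = eye(n) - ones(n) / n;
K = -0.5 * H * D * H;
K = (K + K') / 2;                            % remove round-off asymmetry

% MDS solution: Y = Lambda^{1/2} V^T using the top-p eigenpairs of K
p = 2;
[V, L] = eig(K);
[lambda, idx] = sort(diag(L), 'descend');
Ymds = diag(sqrt(lambda(1:p))) * V(:, idx(1:p))';

% Dual PCA solution: Y = Sigma V^T from the SVD of X
[~, S, Vs] = svd(X, 'econ');
Ypca = S(1:p, 1:p) * Vs(:, 1:p)';
```

The two embeddings can differ by a sign flip of each row, so comparing `Ymds'*Ymds` with `Ypca'*Ypca` is a sign-independent way to confirm they agree.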

Now, as an appendix, we provide a short proof of the first equation $\ X^{T}X=-\frac{1}{2}HD^{(X)}H$.
Let $\ S$ be the vector with entries $S_{i}=x_i^Tx_i$, so that $D^{(X)}=Se^T+eS^T-2X^TX$.
First notice that $\ Xe=0$, as the data are centered, and that $\ e^TH=0$ and $\ He=0$, so the terms $\ HSe^TH$ and $\ HeS^TH$ vanish.
$HD^{(X)}H=(I-\frac{1}{n}ee^{T})(Se^T+eS^T-2X^TX)(I-\frac{1}{n}ee^{T})$
$=-2X^TX+\frac{1}{n}ee^{T}2X^TX+\frac{1}{n}2X^TXee^{T}-\frac{1}{n^2}ee^{T}2X^TXee^{T}$
$\ =-2X^TX$,
as all the other terms contain $\ Xe$ or $\ e^TX^T$.
Hence, $\ X^{T}X=-\frac{1}{2}HD^{(X)}H$.

Isomap (As per Handout - Section 1.6)

The Isomap algorithm is a nonlinear generalization of classical MDS. The main idea is that MDS is performed on the geodesic space of the nonlinear data manifold rather than on the input space. Isomap uses geodesic distances in the distance matrix for MDS, instead of the straight-line distances between two points, in order to find the low-dimensional mapping that preserves the pairwise distances. The geodesic distance is approximated by building a k-neighbourhood graph of all the points on the manifold and finding the shortest path to a given point. On a manifold such as the 'Swiss Roll', this gives a much better distance measurement than the Euclidean distance.

Note: Geodesic distance is the shortest path along the curved surface of the manifold, measured as if the surface were flat

Similarly to LLE, the Isomap algorithm proceeds in three steps:

1. Find the neighbours of each data point in the high-dimensional data space

2. Compute the geodesic pairwise distance between all points

3. Integrate the data via MDS in order to preserve the distances

Process:
1. Identify the k nearest neighbours or choose points from a fixed radius

2. Neighbour relations are represented by a graph G connected with edges of weights $\ d_{X}(i,j)$

3. The geodesic distances $\ d_{M}(i,j)$ between all pairs of points on the manifold M are then estimated

Note: Isomap approximates $\ d_{M}(i,j)$ as the shortest path distance $\ d_{G}(i,j)$ in the graph G. The k in the algorithm must be chosen carefully: with too small a k the graph will not be connected, but with too large a k the shortest path distances will be close to the Euclidean distances instead.

4. Isomap applies classical MDS to $\ D^{(G)}$ to generate an embedding of the data in d-dimensional Euclidean space Y
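
The process above can be sketched in MATLAB on a simple curve. The data (points on three quarters of a unit circle), the choice `k = 2`, the use of the Floyd-Warshall algorithm for shortest paths, and the one-dimensional target space are all assumptions made for this illustration.

```matlab
% Points on three quarters of a unit circle: a 1-D manifold embedded in 2-D
t = linspace(0, 1.5*pi, 25);
X = [cos(t); sin(t)];
n = size(X, 2);
k = 2;                                   % number of neighbours (assumed)

% Steps 1-2: k-nearest-neighbour graph with Euclidean edge weights d_X(i,j)
s  = sum(X.^2, 1);
DX = sqrt(max(repmat(s', 1, n) + repmat(s, n, 1) - 2 * (X' * X), 0));
G  = inf(n);
for i = 1:n
    [~, order] = sort(DX(i, :));
    nb = order(2:k+1);                   % skip the point itself
    G(i, nb) = DX(i, nb);
    G(nb, i) = DX(nb, i);                % keep the graph symmetric
end
G(1:n+1:end) = 0;                        % zero distance from a point to itself

% Step 3: shortest path distances d_G(i,j) via Floyd-Warshall
for m = 1:n
    G = min(G, repmat(G(:, m), 1, n) + repmat(G(m, :), n, 1));
end

% Step 4: classical MDS on the squared geodesic distances
H = eye(n) - ones(n) / n;
K = -0.5 * H * (G.^2) * H;
K = (K + K') / 2;
[V, L] = eig(K);
[lambda, idx] = sort(diag(L), 'descend');
Y = sqrt(lambda(1)) * V(:, idx(1))';     % 1-D embedding of the n points
```

On this example the graph distance between the two end points is close to the arc length $1.5\pi$, while their straight-line distance is only $\sqrt{2}$, which is the effect Isomap relies on.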

Classification

Classification is a technique in pattern recognition. It can yield very complex decision boundaries and is suitable for ordered data, categorical data, or a mixture of the two types. A decision or classification can represent a multi-stage decision process in which a binary decision is made at each stage.

E.g.: A hand-written object can be scanned and recognized by a classification technique. The model identifies the class the object belongs to and pairs it with the corresponding object in its library.

Mathematically, each object has a set of features $\ X$ and a corresponding label $\ Y$, which is the class it belongs to:

$\ \{(x_1,y_1),(x_2,y_2),...,(x_n,y_n)\}$

Since the training set is labelled with the correct answers, classification is called a "supervised learning" method.

In contrast, the clustering technique is used to explore a data set, where the main objective is to separate the sample into groups or to provide an understanding of the underlying structure or nature of the data. Clustering is an "unsupervised classification" method, because we do not know the groups that are in the data or any group characteristics of any unit observation. There are no labels to classify the data points; all we have is the feature set $\ X$:

$\ \{(x_1),(x_2),...,(x_n)\}$

Classification - November 24, 2011

Classification

Classification is predicting a discrete random variable Y (the label) from another random variable X. It is analogous to regression, but the difference between them is that regression uses continuous values, while classification uses discrete values (labels).

Consider iid data $\displaystyle (X_1,Y_1),(X_2,Y_2),...,(X_n,Y_n)$ where
$X_i = (X_{i1},...,X_{id}) \in \mathbb{R}^{d}$, representing an object,
$\ Y_i \in \mathcal{Y}$ is the label of the i-th object, and $\ \mathcal{Y}$ is some finite set.

We wish to determine a function $\ h : \mathbb{R}^{d} \rightarrow \mathcal{Y}$ that can predict the value of $\ Y_i$ given the value $\ X_i$. When we observe a new $\displaystyle X$, we predict $\displaystyle Y$ to be $\displaystyle h(X)$. (We use $\ h(x)$ in the following discussion.)

The difference between classification and clustering is that clustering does not have the labels $\ Y_i$ and only groups the $\ X_i$ into different classes.

Examples

Object: An image of a pepper
The features are defined to be colour, length, diameter, and weight.

Features(X): (Red,6,2,3.5)
Labels(Y): Red Pepper

The objective of classification is to use the classification function for unseen data - data for which we do not know the classification. Given the features, we want to classify the object, or in other words find the label. This is the typical classification problem.

Real Life Examples: Face Recognition - Pictures are just points in a high dimensional space; we can represent them as vectors. Sound waves can also be classified in a similar manner using Fourier Transforms. Classification is also used to find drugs which cure disease. Molecules are classified as either a good fit into the cavity of a protein, or not.

In machine learning, classification is also known as supervised learning. Often we separate the data set into two parts. One is called a training set and the other, a testing set. We use the training set to establish a classifier rule and use the testing set to test its effectiveness.

Error Rate

Definition: The true error rate of a classifier h is $L(h) = P(h(X) \neq Y)$ and the empirical error rate or training error rate is:
$\hat{L}_h = \frac{1}{n} \sum_{i=1}^{n}I(h(X_i)\neq Y_i)$

Where $\!I$ is the indicator function. That is, $\!I( ) = 1$ if the statement inside the bracket is true, and $\!I( ) = 0$ otherwise.

The empirical error rate is the proportion of points that have not been classified correctly. When it is computed on the same data that were used to construct the classifier, it tends to underestimate the true error rate, although we do not cover this in the course. A way to get a better estimate of the error is to construct the classifier with half of the given data and use the other half to calculate the error rate.
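
The following MATLAB sketch illustrates the half-and-half idea on a tiny one-dimensional data set. The data values, the family of threshold classifiers $h(x) = I(x > c)$, and the odd/even split are assumptions made purely for illustration.

```matlab
% Made-up 1-D data with binary labels
x = [0.5 1.1 1.9 2.4 2.8 3.1 3.6 4.2 4.9 5.5];
y = [0   0   0   1   0   1   0   1   1   1  ];

tr = 1:2:10;      % half of the data used to construct the classifier
te = 2:2:10;      % the other half used to estimate the error rate

% Hypothetical classifier family: h(x) = I(x > c), with c chosen on the training half
c = x(tr);
errs = zeros(size(c));
for j = 1:numel(c)
    errs(j) = mean((x(tr) > c(j)) ~= y(tr));   % empirical error rate for threshold c(j)
end
[trainErr, best] = min(errs);

% Error rate of the chosen classifier on the held-out half
testErr = mean((x(te) > c(best)) ~= y(te));
```

With these particular numbers the selected threshold makes no mistakes on the training half, yet it misclassifies two of the five held-out points, which shows how optimistic the training error rate can be.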

Bayesians vs Frequentists

Bayesians view probability as a measure of the confidence that a person holds in a proposition given certain information. They hold that a "prior probability" exists, which represents the probability of an event's occurrence given no information. As we add new information regarding this event, our belief adjusts accordingly, which gives us a posterior probability. Frequentists interpret probability as a "propensity" of some event, that is, its long-run relative frequency over repeated trials.

Bayes Classifier

Consider the special case where $y \in \{0,1\}$, then

$r(x) = P(Y=1|X=x) = \frac{P(X=x|Y=1)P(Y=1)}{P(X=x)} = \frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)}$

Definition: The Bayes classification rule h* is:
$h^*(x)=\left\{\begin{matrix}1 & \mathrm{if}\ r(x) \gt 0.5 \\ 0 & otherwise\end{matrix}\right.$

The set $\displaystyle D(h) = \{x: P(Y=1|X=x)=P(Y=0|X=x)\}$ is called the decision boundary.

$h^*(x)=\left\{\begin{matrix}1 & \mathrm{if}\ P(Y=1|X=x) \gt P(Y=0|X=x) \\ 0 & otherwise\end{matrix}\right.$

Theorem: The Bayes rule is optimal, that is, if $\ h$ is any other classification rule, then $L(h^*) \leq L(h)$

So if Bayes rule is optimal, why do we need any other method? Well, we don't always know the distributions of the data. The Bayes rule depends on unknown quantities, so we need to use the data to find some approximation to the Bayes rule.

Three Main Approaches

1. Empirical Risk Minimization: Choose a set of classifiers H and find a $h^* \in H$ that minimizes the expected value of some loss function $\ L(h)$. The distribution of (X,Y) is not known, therefore $\ E[L(h)]$ must be estimated using empirical data. For more conceptual knowledge on how this approach works, please refer to http://en.wikipedia.org/wiki/Empirical_risk_minimization

2. Regression: Find an estimate $\hat{r}(x)$ of the function $\displaystyle r$ and define:
$h^*(x)=\left\{\begin{matrix}1 & \mathrm{if}\ \hat{r}(x) \gt 0.5 \\ 0 & otherwise\end{matrix}\right.$

3. Density Estimation: Estimate $\displaystyle P(X=x|Y=0)$ from the $\displaystyle X_i's$ for which $\displaystyle Y_i=0$, estimate $\displaystyle P(X=x|Y=1)$ from the $\displaystyle X_i's$ for which $\displaystyle Y_i=1$, and let:
$\displaystyle \hat{P}(Y=1)=\frac{1}{n} \sum_{i=1}^{n}Y_i$
Define
$\hat{r}(x)=\hat{P}(Y=1|X=x)$
and define
$h(x)=\left\{\begin{matrix}1 & \mathrm{if}\ \hat{r}(x) \gt 0.5 \\ 0 & otherwise\end{matrix}\right.$
Here $\hat{P}(Y=1|X=x)$ is obtained by plugging the estimates into Bayes' rule
$P(Y=1|X=x) = \frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}$
where
$\displaystyle P(X=x) = P(X=x|Y=1)P(Y=1) + P(X=x|Y=0)P(Y=0)$
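
As an illustration of the density estimation approach, the sketch below fits a one-dimensional Gaussian to each class and plugs the estimates into Bayes' rule. The Gaussian form of the class densities and the data values are assumptions of this example; the notes do not prescribe a particular density estimator here.

```matlab
% Made-up 1-D training data with binary labels
x = [1.0 1.5 2.0 2.5 3.0 4.0 4.5 5.0 5.5 6.0 6.5];
y = [0   0   0   0   0   1   1   1   1   1   1  ];

% Prior estimate: proportion of points with label 1
pi1 = mean(y);
pi0 = 1 - pi1;

% Class densities estimated as Gaussians (an assumed parametric form)
mu0 = mean(x(y == 0));  s0 = std(x(y == 0), 1);
mu1 = mean(x(y == 1));  s1 = std(x(y == 1), 1);
gauss = @(t, mu, s) exp(-(t - mu).^2 ./ (2 * s^2)) ./ (sqrt(2*pi) * s);

% Plug the estimates into Bayes' rule to get rhat(x) = Phat(Y = 1 | X = x)
rhat = @(t) gauss(t, mu1, s1) * pi1 ./ ...
            (gauss(t, mu1, s1) * pi1 + gauss(t, mu0, s0) * pi0);

% Classifier: predict 1 when rhat(x) > 0.5
h = @(t) double(rhat(t) > 0.5);
pred = h(x);
```

Calling `h` on a new value, for example `h(3.2)`, returns the predicted label for that point.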

Multi-class Classification

We want to generalize to the case where Y takes on more than two values.
Theorem: Suppose that $Y \in \mathcal{Y} = \{1,...,K\}$. Then the optimal rule is
$\displaystyle h^*(x) = {\operatorname{arg\,max}}_k\{P(Y=k|X=x)\}$
where
$P(Y=k|X=x)=\frac{f_k(x)\pi_k}{\sum_{r=1}^{K}f_r(x)\pi_r}$
with
$\displaystyle f_k(x) = P(X=x|Y=k)$
$\displaystyle \pi_k = P(Y=k)$

A direct interpretation of this rule is that the classifier chooses the value $y$ that is most probable given the current value of $x$. In other words, $y$ can be viewed as a parameter, as in maximum likelihood estimation, and the optimal value is $\displaystyle y^* = {\operatorname{arg\,max}}_y\{f(y|x)\}$. (The "naive Bayes classifier" is the special case of this rule in which the features are additionally assumed to be conditionally independent given the class.)

LDA, QDA and FDA - November 29, 2011

Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA)

The simplest approach to classification is to use the third approach listed above and assume a parametric model for densities.

Applying Bayes rule:
$P(Y=k|X=x) = \frac{P(X=x|Y=k)P(Y=k)}{P(X=x)} = \frac{P(X=x|Y=k)P(Y=k)}{\sum_{i}P(X=x|Y=i)P(Y=i)}$

Notice that the denominator is the same for every $k$, so it cancels when the posterior probabilities of different classes are compared. Hence we only need to evaluate the numerator.

Define class conditional distribution, prior distribution and posterior distribution as follows:
$\ f_k(x)=P(X=x|Y=k),{\mathbf\pi_k}=P(Y=k), P(Y=k|X=x)$

LDA (Linear Discriminant Analysis)
LDA is a classifier which makes the 2 assumptions:

• The class conditional distribution is Gaussian
$\ f_k(x)= \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}({\mathbf x}-{\mathbf\mu_k})^T{\mathbf\Sigma_k}^{-1}({\mathbf x}-{\mathbf\mu_k}) \right),x\in \mathbb{R}^d$

where $|\mathbf\Sigma_k|$ is the determinant of $\mathbf\Sigma_k$

• The two classes share the same covariance matrix
${\mathbf\Sigma}_0={\mathbf\Sigma}_1={\mathbf\Sigma}$

Thus the class conditional distribution for $k$ in LDA is as follows:
$\ f_k(x)= \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}({\mathbf x}-{\mathbf\mu_k})^T{\mathbf\Sigma}^{-1}({\mathbf x}-{\mathbf\mu_k}) \right),x\in \mathbb{R}^d$

To demonstrate, suppose we have a two class scenario, i.e $y \in \{0,1 \}$, and we want to compute the decision boundary for this LDA. Recall that the decision boundary is a set of points such that:
$\ P(Y=0|X=x)=P(Y=1|X=x)$

where the probability is given by:
$P(Y=0|X=x) = \frac{f_0(x){\mathbf\pi_0}}{\sum_{k=0}^{1}f_k(x){\mathbf\pi_k}},P(Y=1|X=x) = \frac{f_1(x){\mathbf\pi_1}}{\sum_{k=0}^{1}f_k(x){\mathbf\pi_k}}$

Hence we want to find ${\mathbf x}$ such that
$\ P(Y=0|X=x)=P(Y=1|X=x)$

$\frac{f_0(x){\mathbf\pi_0}}{\sum_{k=0}^{1}f_k(x){\mathbf\pi_k}} = \frac{f_1(x){\mathbf\pi_1}}{\sum_{k=0}^{1}f_k(x){\mathbf\pi_k}}$

$f_0(x){\mathbf\pi_0}=f_1(x){\mathbf\pi_1}$

Upon expanding $\ f_k(x)$, we get
$\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}({\mathbf x}-{\mathbf\mu_0})^T{\mathbf\Sigma}^{-1}({\mathbf x}-{\mathbf\mu_0}) \right) * \pi_0$ $=\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}({\mathbf x}-{\mathbf\mu_1})^T{\mathbf\Sigma}^{-1}({\mathbf x}-{\mathbf\mu_1}) \right) * \pi_1$

Rearranging the equation and taking logarithm on both sides we get:
$-\frac{1}{2}({\mathbf x}-{\mathbf\mu_0})^T{\mathbf\Sigma}^{-1}({\mathbf x}-{\mathbf\mu_0})+\log({\mathbf\pi_0}) =-\frac{1}{2}({\mathbf x}-{\mathbf\mu_1})^T{\mathbf\Sigma}^{-1}({\mathbf x}-{\mathbf\mu_1})+\log({\mathbf\pi_1})$

That is, after expanding the quadratic form of our vector values,
$\frac{1}{2} \left [-{\mathbf x}^T{\mathbf\Sigma}^{-1}{\mathbf x} -{\mathbf\mu_1}^T{\mathbf\Sigma}^{-1}{\mathbf\mu_1} +2{\mathbf x}^T{\mathbf\Sigma}^{-1}{\mathbf\mu_1} +{\mathbf x}^T{\mathbf\Sigma}^{-1}{\mathbf x} +{\mathbf\mu_0}^T{\mathbf\Sigma}^{-1}{\mathbf\mu_0} -2{\mathbf x}^T{\mathbf\Sigma}^{-1}{\mathbf\mu_0} \right] +\log(\frac{\mathbf\pi_1}{\mathbf\pi_0}) =0$

Cancelling terms and noticing that ${\mathbf\mu_k}^T{\mathbf\Sigma}^{-1}{\mathbf\mu_k}$ is a constant and ${\mathbf x}^T{\mathbf\Sigma}^{-1}{\mathbf\mu_k}$ is linear in $\ x$, the final equation is simply a linear equation in $\ x$:
${\mathbf x}^T{\mathbf\Sigma}^{-1}({\mathbf\mu_1}-{\mathbf\mu_0}) +\frac{1}{2}\left(-{\mathbf\mu_1}^T{\mathbf\Sigma}^{-1}{\mathbf\mu_1} +{\mathbf\mu_0}^T{\mathbf\Sigma}^{-1}{\mathbf\mu_0}\right) +\log(\frac{\mathbf\pi_1}{\mathbf\pi_0}) =0$
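
The linear boundary can be computed directly from estimated means, priors, and a shared covariance. The MATLAB sketch below does this for a made-up two-class data set in two dimensions; the data, the pooled covariance estimate, and the names `w` and `b` (so that the boundary reads $w^Tx + b = 0$) are assumptions introduced for this example.

```matlab
% Made-up two-class data in 2-D, points stored as columns
X0 = [1 2 2 3 2; 2 1 3 2 2];          % class 0
X1 = [5 6 6 7 6; 6 5 7 6 6];          % class 1
n0 = size(X0, 2); n1 = size(X1, 2); n = n0 + n1;

% Estimated priors and class means
pi0 = n0 / n;  pi1 = n1 / n;
mu0 = mean(X0, 2);  mu1 = mean(X1, 2);

% Shared covariance: pooled estimate over both classes
Z0 = X0 - repmat(mu0, 1, n0);
Z1 = X1 - repmat(mu1, 1, n1);
Sigma = (Z0 * Z0' + Z1 * Z1') / n;

% Boundary x' * Sigma^{-1} (mu1 - mu0)
%          + 1/2 (mu0' Sigma^{-1} mu0 - mu1' Sigma^{-1} mu1) + log(pi1/pi0) = 0
w = Sigma \ (mu1 - mu0);
b = 0.5 * (mu0' * (Sigma \ mu0) - mu1' * (Sigma \ mu1)) + log(pi1 / pi0);

% Classify as 1 when the left hand side is positive, 0 otherwise
h = @(x) double(w' * x + b > 0);
```

With equal priors, as in this data set, the boundary passes through the midpoint of the two class means.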

If we relax the assumption in LDA that the covariances are identical between classes, the decision boundary becomes quadratic and we get QDA. Mathematically we mean that ${\mathbf\Sigma}_0\ne{\mathbf\Sigma}_1$. In fact, we may also drop the assumption of a bi-class problem. QDA is therefore a more generalized version of LDA.

For a K-class problem, the general form of the discriminant function for class $\ k$ is

$\delta_k (x) = -\frac{1}{2}ln|{\mathbf\Sigma}_k| -\frac{1}{2}(x-{\mathbf\mu_k})^T{\mathbf\Sigma_k}^{-1} (x-{\mathbf\mu_k}) + ln({\mathbf\pi_k})$

and the classifier will be
$\ h(x) = \underset{k}{\operatorname{arg\,max}} \, \delta_k(x)$

In other words we would assign each $\ x$ to the label $\ k$ that produces the largest $\ \delta_k (x)$.

The reason for this becomes more obvious if we consider the components of $\ \delta_k (x)$.

Notice that the middle component (without the coefficient), $\ (x-{\mathbf\mu_k})^T{\mathbf\Sigma_k}^{-1} (x-{\mathbf\mu_k})$, is expressing the squared Mahalanobis distance between $\ x$ and $\ {\mathbf\mu_k}$, the mean of class $\ k$. Therefore, by requiring that $\ \delta_k (x)$ is maximized, we are requiring this distance to be minimized. This makes sense from a classification perspective since it is intuitive to label $\ x$ with the class that it is closest to (on an average basis).

Also, to illustrate the role that $\ ln({\mathbf\pi_k})$ plays, consider the case where each class is comprised of the same number of points so that $\ ln({\mathbf\pi_i})\ =\ ln({\mathbf\pi_j})\ \forall i,j \in \{1,...,K\}$. Then this term makes no difference in determining the appropriate label. But if a particular class $\ k$ is comprised of many more points than the others, then $\ ln({\mathbf\pi_k})$ is larger and increases the probability of this class (label) being assigned. This is intuitive, as $\ \pi_k$ represents the probability of the class being $\ k$. Therefore it must contribute to the probability that a point is in the class.

In practice, we estimate the parameters from the training data:
$\hat{\mathbf\pi_k}=\frac{n_k}{n},\hat{\mathbf\mu_k}=\frac{1}{n_k}\sum_{i:y_i=k}{x_i}$

$\hat{\mathbf\Sigma}_k=\frac{1}{n_k}\sum_{i:y_i=k}({\mathbf x}_i-\hat{\mathbf\mu_k})({\mathbf x}_i-\hat{\mathbf\mu_k})^T$

If we assume ${\mathbf\Sigma}_k={\mathbf\Sigma}$ for all $\ k$, then
$\hat{\mathbf\Sigma}=\frac{\sum_{r=1}^{K}n_r\hat{\mathbf\Sigma}_r}{n}$
which is the weighted average of all $\hat{\mathbf\Sigma}_k$.
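
The estimates and the discriminant function $\delta_k(x)$ can be put together as in the following MATLAB sketch. The labelled data set, the query point `x`, and the variable names are assumptions made for illustration; each class keeps its own covariance, so this corresponds to QDA.

```matlab
% Made-up labelled data: d = 2, n = 10, K = 2 classes, points stored as columns
X = [1 2 2 3 2 5 7 6 6 6;
     2 1 3 2 2 6 6 4 8 6];
y = [1 1 1 1 1 2 2 2 2 2];
K = 2;
[d, n] = size(X);

% Parameter estimates for each class
piHat = zeros(1, K);  mu = zeros(d, K);  Sigma = zeros(d, d, K);
for k = 1:K
    Xk = X(:, y == k);
    nk = size(Xk, 2);
    piHat(k) = nk / n;                       % prior estimate n_k / n
    mu(:, k) = mean(Xk, 2);                  % class mean
    Zk = Xk - repmat(mu(:, k), 1, nk);
    Sigma(:, :, k) = Zk * Zk' / nk;          % class covariance
end

% Discriminant functions at a query point (assumed value)
x = [3; 3];
delta = zeros(1, K);
for k = 1:K
    z = x - mu(:, k);
    delta(k) = -0.5 * log(det(Sigma(:, :, k))) ...
               - 0.5 * z' * (Sigma(:, :, k) \ z) + log(piHat(k));
end
[~, label] = max(delta);                     % h(x) = argmax_k delta_k(x)
```

For LDA one would instead replace every `Sigma(:, :, k)` by the weighted average of the class covariances.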

Fisher Discriminant Analysis (optional)

Recall that PCA finds the direction of maximum variance. We can use PCA, MDS or Isomap as a pre-step to classification, as they all reduce the dimension of the data. But if the classes are not separated along the direction of maximum variance, PCA will not work well. Suppose we want to reduce a set of two-dimensional, two-class data to one dimension in order to classify them. One way to achieve this is to pull the points within each class as close together as possible (almost collapsing them into one point) while placing the classes far apart to distinguish them. This is the intuitive interpretation of FDA. FDA does not carry the assumption of LDA and QDA that the class conditional distributions are normal, which allows a more flexible approach to classifying a data set.

We want to find a vector $\!\omega$ and project every point $\!x$ to $\!{{\omega }^{T}}x$ so that maximum separation occurs between classes. Say there are two classes, 0 and 1. The points with label 0 have mean $\!{{\mu }_{0}}$ and the points with label 1 have mean $\!{{\mu }_{1}}$. Then ${{\mu }_{0}}\to {{\omega }^{T}}{{\mu }_{0}}$, and ${{\mu }_{1}}\to {{\omega }^{T}}{{\mu }_{1}}$

To make the two classes as far away as possible, we want to $\underset{\omega }{\mathop{\max }}\,||{{\omega }^{T}}{{\mu }_{0}}-{{\omega }^{T}}{{\mu }_{1}}||^2$

In other words, $\underset{\omega }{\mathop{\max }}\,{{\omega }^{T}}({{\mu }_{0}}-{{\mu }_{1}}){{({{\mu }_{0}}-{{\mu }_{1}})}^{T}}\omega$

We define ${{S}_{B}}:=({{\mu }_{0}}-{{\mu }_{1}}){{({{\mu }_{0}}-{{\mu }_{1}})}^{T}}$, which is the between-class covariance.

On the other hand, we want the variance within each class to be as small as possible, so we want to $\underset{\omega }{\mathop{\min }}\,{{\omega }^{T}}{{\Sigma }_{0}}\omega +{{\omega }^{T}}{{\Sigma }_{1}}\omega$

In other words, $\underset{\omega }{\mathop{\min }}\,{{\omega }^{T}}({{\Sigma }_{0}}+{{\Sigma }_{1}})\omega$

We define ${{S}_{W}}={{\Sigma }_{0}}+{{\Sigma }_{1}}$, the within-class covariance.

Our problem becomes $\underset{\omega }{\mathop{\max }}\,\frac{{{\omega }^{T}}{{S}_{B}}\omega }{{{\omega }^{T}}{{S}_{W}}\omega }$

Or equivalently, $\underset{\omega }{\mathop{\max }}\,{{\omega }^{T}}{{S}_{B}}\omega$ such that $\!{{\omega }^{T}}{{S}_{W}}\omega =1$

Using a Lagrange multiplier, $\!L(\omega ,\lambda )={{\omega }^{T}}{{S}_{B}}\omega -\lambda ({{\omega }^{T}}{{S}_{W}}\omega -1)$, we have

$\frac{\partial L}{\partial \omega }=0$

$\Rightarrow {{S}_{B}}\omega =\lambda {{S}_{W}}\omega$

$\Rightarrow S_{W}^{-1}{{S}_{B}}\omega =\lambda \omega$

Therefore, $\!\omega$ is the eigenvector of $S_{W}^{-1}{{S}_{B}}$ corresponding to the largest eigenvalue $\lambda$

Generally, FDA is a better choice than PCA as a pre-step to classification, because it uses the class labels when choosing the projection direction.
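
The derivation above can be tried out numerically. In the MATLAB sketch below, two elongated classes are separated only along the low-variance direction, so the direction of maximum variance is of little use for classification. The data values are made up, and solving the generalized eigenproblem $S_B\omega = \lambda S_W\omega$ with `eig(SB, SW)` is one possible way to obtain $\omega$, chosen here for convenience.

```matlab
% Made-up data: two elongated classes that differ only by a vertical shift
X0 = [0 2 4 6 8; 0 0.2 -0.2 0.1 -0.1];       % class 0, points as columns
X1 = X0 + repmat([0; 1], 1, 5);              % class 1: same shape, shifted up by 1

% Class means and centered data
mu0 = mean(X0, 2);  mu1 = mean(X1, 2);
Z0 = X0 - repmat(mu0, 1, 5);
Z1 = X1 - repmat(mu1, 1, 5);

% Between-class and within-class covariance
SB = (mu0 - mu1) * (mu0 - mu1)';
SW = Z0 * Z0' / 5 + Z1 * Z1' / 5;

% Solve S_B w = lambda S_W w and keep the eigenvector of the largest eigenvalue
[W, L] = eig(SB, SW);
[~, imax] = max(diag(L));
w = W(:, imax);

% Project both classes onto w
p0 = w' * X0;
p1 = w' * X1;
```

Since $S_B$ has rank one in the two-class case, the resulting direction is proportional to $S_W^{-1}(\mu_1 - \mu_0)$, and the projected classes `p0` and `p1` do not overlap on this data set.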

Solutions to Assignment 4 - December 1, 2011

Problem 4

Show that the Gibbs sampling algorithm satisfies detailed balance. Note: You can show it for the simple case that we sample from a bivariate distribution.
Solution
Suppose that we sample from a bivariate distribution using the Gibbs sampling method.
Let $\ (X_1, X_2)$ represent the current state, obtained in the previous iteration.
Let $\ (X'_1, X'_2)$ represent the new state to be sampled.
Using Gibbs sampling, $\ (X'_1, X'_2)$ is obtained in two steps:
1. $\ (X_1, X_2) \to (X'_1, X_2)$
2. $\ (X'_1, X_2) \to (X'_1, X'_2)$
Note: Gibbs sampling only changes one element at a time.
To prove that Gibbs sampling satisfies detailed balance, we need to show that $\ P(X_1, X_2)Q((X_1, X_2) \to (X'_1, X_2)) = Q((X'_1, X_2) \to (X_1, X_2))P(X'_1, X_2)$
and
$\ P(X'_1, X_2)Q((X'_1, X_2) \to (X'_1, X'_2)) = Q((X'_1, X'_2) \to (X'_1, X_2))P(X'_1, X'_2)$

1. $\ P(X_1, X_2)Q((X_1, X_2) \to (X'_1, X_2))$

$\ = P(X_1, X_2)P(X'_1, X_2|X_1, X_2)$ (Gibbs)

$\ = P(X_1, X_2)\frac{P(X_1, X_2|X'_1, X_2)P(X'_1, X_2)}{P(X_1, X_2)}$ (Bayes)

$\ = P(X_1, X_2|X'_1, X_2)P(X'_1, X_2)$

$\ = Q((X'_1, X_2) \to (X_1, X_2))P(X'_1, X_2)$ (Gibbs)

2. $\ P(X'_1, X_2)Q((X'_1, X_2) \to (X'_1, X'_2))$

$\ = P(X'_1, X_2)P(X'_1, X'_2|X'_1, X_2)$ (Gibbs)

$\ = P(X'_1, X_2)\frac{P(X'_1, X_2|X'_1, X'_2)P(X'_1, X'_2)}{P(X'_1, X_2)}$ (Bayes)

$\ = P(X'_1, X_2|X'_1, X'_2)P(X'_1, X'_2)$

$\ = Q((X'_1, X'_2) \to (X'_1, X_2))P(X'_1, X'_2)$ (Gibbs) as required.
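
The two detailed balance equations can also be checked numerically for a small discrete bivariate distribution. In the MATLAB sketch below, the joint probability table `P` is made up, and the Gibbs transition probabilities are taken to be the full conditionals, $Q((X_1, X_2) \to (X'_1, X_2)) = P(X'_1|X_2)$ and $Q((X'_1, X_2) \to (X'_1, X'_2)) = P(X'_2|X'_1)$, which is the usual form of the Gibbs update.

```matlab
% Made-up joint pmf: P(i,j) = P(X1 = i, X2 = j), entries sum to 1
P = [0.10 0.20 0.05;
     0.25 0.10 0.30];
[n1, n2] = size(P);

% Full conditionals used by the two Gibbs steps
condX1 = P ./ repmat(sum(P, 1), n1, 1);     % condX1(i,j) = P(X1 = i | X2 = j)
condX2 = P ./ repmat(sum(P, 2), 1, n2);     % condX2(i,j) = P(X2 = j | X1 = i)

% Largest difference between the two sides of detailed balance over all moves
maxViolation = 0;
for i = 1:n1
    for j = 1:n2
        for ip = 1:n1     % step 1: (i,j) -> (ip,j) with probability P(X1 = ip | X2 = j)
            lhs = P(i, j)  * condX1(ip, j);
            rhs = P(ip, j) * condX1(i, j);
            maxViolation = max(maxViolation, abs(lhs - rhs));
        end
        for jp = 1:n2     % step 2: (i,j) -> (i,jp) with probability P(X2 = jp | X1 = i)
            lhs = P(i, j)  * condX2(i, jp);
            rhs = P(i, jp) * condX2(i, j);
            maxViolation = max(maxViolation, abs(lhs - rhs));
        end
    end
end
```

Up to round-off, `maxViolation` is zero, in agreement with the proof.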

Problem 5

Consider the problem of

$\ \min_x V(x) = \sin(3x)\exp(-x^2/10)$

a) Plot the objective function $\ V(x)$ on $\ x \in [-10, 10]$ and the transformed version $\ \exp(-V(x)/T)$ for different values of $\ T$, showing how the global minimum will dominate as $\ T$ gets smaller.

Solution

The MATLAB code and the graph of the objective function:

ezplot('sin(3*x)*exp(-(x.^2)/10)',[-10,10])


The MATLAB code and the graphs of the transformed version of the function for different values of T:

ezplot('exp(-sin(3*x)*exp(-(x.^2)/10)/50)',[-10,10])
ezplot('exp(-sin(3*x)*exp(-(x.^2)/10)/1)',[-10,10])
ezplot('exp(-sin(3*x)*exp(-(x.^2)/10)/0.01)',[-10,10])


Figure panels, in order: $\ T = 50$, $\ T = 1$, $\ T = 0.01$

b) Implement and use simulated annealing to find the minimum of $\ V(x)$.

Solution

We use a geometrically decreasing temperature and a Gaussian jumping distribution.

Procedure

1. Set $\! x_0 \sim N(0,1), T = T_0, i = 1$

2. Draw $\! y \sim N(x_{i-1},0.5^2)$

3. Draw $\! u \sim Unif[0,1]$

4. Set $\! x_i = \begin{cases} y & \mbox{if } u \lt exp(\frac{V(x_{i-1})-V(y)}{T}) \\ x_{i-1} & \mbox{otherwise} \end{cases}$

5. Set $\! T = rate*T$

6. Set $\! i = i+1$, return to 2)

Matlab code for the above algorithm:

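    calcV = @(x) sin(3*x).*exp(-(x.^2)/10);   % Objective function V(x) from the problem statement
    n = 10000; T0 = 1; R = 0.999;   % Example settings (assumed here; not given in the original notes)
    X = zeros(1,n);         % Preallocate the chain
    X(1) = randn;           % Initial state x_0 ~ N(0,1)
    T = T0;                 % Set initial temperature
    for j = 2:n
        T = T*R;    % Decrease temperature by a constant rate R
        y = 0.5*randn+X(j-1);   % Propose y ~ N(X(j-1),0.5^2) (note: the proposal is symmetric)
        if rand < exp((calcV(X(j-1))-calcV(y))/T)    % Accept with probability min(1, exp((V(x)-V(y))/T))
            X(j) = y;           % Accept y
        else
            X(j) = X(j-1);      % Reject y
        end
    end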


Using this algorithm, the global minimum is found to be at approximately $\ x = -0.51$

c) Run your function again for different constant values of $\ T$. Show that by choosing too small a value of $\ T$ you can get 'stuck' in a local minimum.

Solution

Matlab code for plotting time series of X:

    f = figure;
    set(f,'Position',[100,100,500,300])
    p = plot(1:n,X(:));
    set(p,'Color','b')
    title(['a4q5: Time series of x for n=',int2str(n),', T_0=', ...
        num2str(T0), ', rate = ', num2str(R), '. Result: ',num2str(X(n))])