# Bayesian and Frequentist Schools of Thought

### Bayesian and Frequentist Schools of Thought - May 21, 2009

In this lecture we will continue to discuss sampling from specific distributions, introduce Monte Carlo Integration, and talk about the differences between the Bayesian and Frequentist views on probability, along with references to Bayesian Inference.

#### Binomial Distribution

A Binomial random variable $X \sim Bin(n,p)$ is the sum of $n$ independent Bernoulli trials, each with probability of success $p$ $(0 \leq p \leq 1)$. For each trial we generate an independent uniform random variable: $U_1, \ldots, U_n \sim Unif(0,1)$. Then $X$ is the number of indices $i$ for which $U_i \leq p$. If $n$ is large enough, by the central limit theorem, the Normal distribution can be used to approximate the Binomial distribution.

Sampling from a Binomial distribution in Matlab can be done using the following code:

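    n = 3;          % number of Bernoulli trials per sample
    p = 0.5;        % probability of success
    trials = 1000;  % number of Binomial samples to draw
    % Each row of rand(trials,n) is one experiment; count the uniforms <= p.
    X = sum(rand(trials, n)' <= p);
    hist(X)         % histogram of the simulated Binomial samples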


The resulting histogram approximates the $Bin(n,p)$ distribution; for larger $n$ it would resemble a Normal distribution.
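As a rough numerical check of the Normal approximation, the sketch below compares the empirical frequencies of simulated $Bin(n,p)$ draws with a Normal density of mean $np$ and variance $np(1-p)$. The values `n = 50` and `trials = 10000`, and all variable names, are illustrative choices and not part of the lecture.

```matlab
n = 50;            % number of Bernoulli trials per draw (illustrative)
p = 0.5;           % success probability
trials = 10000;    % number of Binomial draws

% Each column holds n uniforms; count how many fall at or below p.
X = sum(rand(n, trials) <= p, 1);

% Empirical frequency of each outcome k = 0, ..., n.
k = (0:n)';
empirical = accumarray(X' + 1, 1, [n + 1, 1]) / trials;

% Normal density with matching mean and variance, evaluated at k.
mu = n * p;
sigma2 = n * p * (1 - p);
normalDensity = exp(-(k - mu).^2 / (2 * sigma2)) / sqrt(2 * pi * sigma2);

maxGap = max(abs(empirical - normalDensity));
fprintf('mean %.3f (np = %.1f), var %.3f (np(1-p) = %.2f), max gap %.4f\n', ...
        mean(X), mu, var(X), sigma2, maxGap);
```

The sample mean and variance should land close to $np$ and $np(1-p)$, and the largest gap between the two curves should be small; rerunning with a small `n` such as 3 makes the mismatch visible.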

#### Monte Carlo Integration

Monte Carlo Integration is a numerical method for approximating integrals using randomly generated numbers. In this course we will mainly look at three methods for approximating integrals:

1. Basic Monte Carlo Integration
2. Importance Sampling
3. Markov Chain Monte Carlo (MCMC)
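To give a flavour of the first method, here is a minimal sketch of basic Monte Carlo integration. It relies on the identity $\int_0^1 h(x)\,dx = E[h(U)]$ for $U \sim Unif(0,1)$, so the integral is approximated by a sample mean. The integrand $h(x) = x^2$ and the sample size `N` are assumptions made for this example only.

```matlab
N = 100000;              % number of uniform draws (illustrative)
h = @(x) x.^2;           % example integrand, chosen for this sketch
U = rand(N, 1);          % U_i ~ Unif(0,1)

estimate = mean(h(U));            % sample mean approximates the integral on [0,1]
stdError = std(h(U)) / sqrt(N);   % Monte Carlo standard error of the estimate

fprintf('estimate %.4f, exact %.4f, std. error %.4f\n', estimate, 1/3, stdError);
```

The standard error shrinks like $1/\sqrt{N}$, so each extra digit of accuracy costs roughly a hundred times more samples.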

#### Bayesian vs. Frequentist

Over the history of statistics, two major schools of thought have emerged and have been locked in an ongoing dispute over which one holds the correct view of probability. These two schools are known as the Bayesian and Frequentist schools of thought. The Bayesians and the Frequentists hold different philosophical views on what defines probability. Below are some fundamental differences between the two schools:

Frequentist

• Probability is objective and refers to the limit of an event's relative frequency in a large number of trials. For example, a coin with a 50% probability of heads will turn up heads 50% of the time in the long run.
• Parameters are fixed, unknown constants.
• Any statistical procedure is interpreted only in terms of limiting frequencies. For example, a 95% C.I. for a given parameter will contain the true value of the parameter in 95% of repeated samples.
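The frequency interpretation of a confidence interval can be checked by simulation. The sketch below repeats an experiment many times, builds a 95% interval for a Normal mean with known standard deviation each time, and counts how often the interval contains the fixed true mean. The values of `mu`, `sigma`, `n` and `reps` are made up for illustration.

```matlab
mu = 2;  sigma = 1;      % true mean (fixed, treated as unknown) and known std. dev.
n = 30;                  % sample size per experiment
reps = 10000;            % number of repeated experiments
z = 1.96;                % approximate 97.5% quantile of N(0,1)

samples = mu + sigma * randn(n, reps);   % one experiment per column
xbar = mean(samples, 1);                 % sample mean of each experiment
halfWidth = z * sigma / sqrt(n);         % half-width of each interval

covered = (xbar - halfWidth <= mu) & (mu <= xbar + halfWidth);
coverage = mean(covered);
fprintf('fraction of intervals containing mu: %.3f\n', coverage);
```

Note that the randomness is in the intervals, not in `mu`: the parameter never changes across repetitions, which is exactly the Frequentist reading.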

Bayesian

• Probability is subjective and can be applied to single events based on degree of confidence or belief. For example, a Bayesian can say that there is a 50% chance of rain tomorrow, whereas this would not make sense to a Frequentist because tomorrow is just one unique event, and cannot be referred to as a relative frequency in a large number of trials.
• Parameters are random variables that have a given distribution, and probability statements can be made about them.
• The parameters have a probability distribution, and point estimates are usually obtained by taking either the mode or the mean of that distribution.

#### Bayesian Inference

Example: Suppose we have a screen that only displays single digits from 0 to 9, and this screen is split into a $4 \times 5$ matrix of pixels. Together, the 20 pixels that make up the screen can be referred to as $\vec{X}$, which is our data. The parameter of the data in this case, which we will refer to as $\theta$, is a discrete random variable that can take on the values 0 to 9. In this example, a Bayesian would be interested in finding $Pr(\theta=a|\vec{X}=\vec{x})$, whereas a Frequentist would be more interested in finding $Pr(\vec{X}=\vec{x}|\theta=a)$.

##### Bayes' Rule
$f(\theta|X) = \frac{f(X | \theta)\, f(\theta)}{f(X)}.$

Note: In this case $f (\theta|X)$ is referred to as the posterior, $f (X | \theta)$ as the likelihood, $f (\theta)$ as the prior, and $f (X)$ as the marginal, where $\theta$ is the parameter and $X$ is the observed variable.
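For a discrete parameter, Bayes' rule reduces to a few element-wise operations. The sketch below uses a hypothetical parameter with three possible values; the prior and likelihood numbers are invented purely to show the arithmetic.

```matlab
theta = [0 1 2];                 % hypothetical parameter values
prior = [0.5 0.3 0.2];           % f(theta), made-up numbers summing to 1
likelihood = [0.1 0.4 0.7];      % f(X = x | theta) for one observed x (made up)

marginal = sum(likelihood .* prior);           % f(X = x), sums over theta
posterior = likelihood .* prior / marginal;    % f(theta | X = x)

disp([theta; posterior]);        % first row: theta, second row: posterior
```

With these numbers the value `theta = 2` has the smallest prior weight but the largest posterior weight, because the observed data are much more likely under it.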

Procedure in Bayesian Inference

• First choose a probability distribution as the prior, which represents our beliefs about the parameters.
• Then choose a probability distribution for the likelihood, which represents our beliefs about the data.
• Lastly compute the posterior, which represents an update of our beliefs about the parameters after having observed the data.
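The three steps above can be sketched on a grid for a single parameter. This is only an illustration under assumed choices: the parameter is a coin's probability of heads, the prior is taken proportional to $\theta(1-\theta)$, and the data (`x = 7` heads in `n = 10` tosses) are hypothetical.

```matlab
theta = linspace(0, 1, 1001);          % grid over the parameter

% Step 1: prior, here proportional to theta*(1-theta) (an assumed choice).
prior = theta .* (1 - theta);
prior = prior / trapz(theta, prior);   % normalize so it integrates to 1

% Step 2: Binomial likelihood for hypothetical data: x successes in n trials.
n = 10;  x = 7;
likelihood = nchoosek(n, x) * theta.^x .* (1 - theta).^(n - x);

% Step 3: posterior = likelihood * prior, normalized to integrate to 1.
unnormalized = likelihood .* prior;
posterior = unnormalized / trapz(theta, unnormalized);

[~, idx] = max(posterior);
fprintf('posterior peaks near theta = %.3f\n', theta(idx));
```

The posterior peak sits between the prior's peak at 0.5 and the observed proportion 0.7, which is the "update of beliefs" described in the last step.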

As mentioned before, for a Bayesian, finding point estimates usually involves finding the mode or the mean of the parameter's distribution.

Methods

• Mode: $\hat\theta = \arg\max_{\theta} f(\theta|X)$, the value of $\theta$ that maximizes $f(\theta|X)$
• Mean: $\bar\theta = \int_\theta \theta \cdot f(\theta|X)\,d\theta$
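Both point estimates can be computed numerically when the posterior is available on a grid. The sketch below assumes a posterior proportional to $\theta^7(1-\theta)^3$ on $[0,1]$ (a Beta(8,4) shape, chosen only because its mode and mean are known in closed form) and compares the numerical answers with the exact values 0.7 and 2/3.

```matlab
theta = linspace(0, 1, 10001);

% Assumed posterior shape: theta^7 * (1-theta)^3, i.e. a Beta(8,4) density
% up to a constant. This is an illustrative choice, not from the lecture.
post = theta.^7 .* (1 - theta).^3;
post = post / trapz(theta, post);          % normalize numerically

[~, idx] = max(post);
thetaMode = theta(idx);                    % arg max of f(theta | X)
thetaMean = trapz(theta, theta .* post);   % integral of theta * f(theta | X)

fprintf('mode %.4f (exact 0.7), mean %.4f (exact 2/3)\n', thetaMode, thetaMean);
```

The two estimates differ here because the assumed posterior is skewed; for a symmetric, single-peaked posterior they would coincide.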

If $\theta$ is high-dimensional and we are only interested in one of its components, for example $\theta_1$ from $\vec{\theta}=(\theta_1,\dots,\theta_n)$, then we would have to integrate out all the other components to obtain the marginal posterior: $f(\theta_1|X) = \int \int \dots \int f(\vec{\theta}|X)\,d\theta_2\,d\theta_3 \dots d\theta_n$

This sort of calculation is usually very difficult or infeasible to carry out analytically, and thus we would need to approximate it by simulation.
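One reason simulation helps is that marginalizing becomes trivial once joint draws are available: keeping only the first coordinate of each draw gives a sample from the marginal of $\theta_1$. The sketch below uses a correlated bivariate Normal as a stand-in for a posterior; `mu`, `Sigma` and `N` are assumed values, and in practice the joint draws would come from a sampler such as MCMC rather than from a known distribution.

```matlab
N = 100000;                   % number of joint draws (illustrative)
mu = [1 -2];                  % assumed mean of (theta_1, theta_2)
Sigma = [1 0.8; 0.8 2];       % assumed covariance; the components are correlated

% Draw from the joint distribution using a Cholesky factor of Sigma.
L = chol(Sigma, 'lower');
draws = repmat(mu, N, 1) + randn(N, 2) * L';

% Marginalizing over theta_2 amounts to ignoring the second column.
theta1 = draws(:, 1);
fprintf('theta_1: mean %.3f, variance %.3f\n', mean(theta1), var(theta1));
```

No integral is evaluated explicitly; the sample mean and variance of `theta1` approximate those of the marginal, here 1 and 1.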

Note:

1. $f(X)=\int_\theta f(X | \theta)f(\theta)\, d\theta$ is not a function of $\theta$, and is called the normalization factor.
2. Therefore, since $f(X)$ is constant with respect to $\theta$, the posterior is proportional to the likelihood times the prior: $f(\theta|X)\propto f(X | \theta)f(\theta)$
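As a worked illustration of why proportionality is useful, assume (for this example only) a Binomial likelihood $X|\theta \sim Bin(n,\theta)$ and a Beta prior $\theta \sim Beta(\alpha,\beta)$. Dropping every factor that does not depend on $\theta$:

$f(\theta|X) \propto f(X|\theta)\,f(\theta) = \binom{n}{X}\theta^{X}(1-\theta)^{n-X} \cdot \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$

$f(\theta|X) \propto \theta^{X+\alpha-1}(1-\theta)^{n-X+\beta-1}$

The last expression is the kernel of a $Beta(X+\alpha,\; n-X+\beta)$ density, so the posterior is identified without ever computing $f(X)$.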