# STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20

## Wiki Course Notes

Students will need to contribute to the wiki for 20% of their grade. Access via wikicoursenote.com Go to editor sign-up, and use your UW userid for your account name, and use your UW email.

primary (10%) Post a draft of lecture notes within 48 hours. You will need to do this 1 or 2 times, depending on class size.

secondary (10%) Make improvements to the notes for at least 60% of the lectures. More than half of your contributions should be technical rather than editorial. There will be a spreadsheet where students can indicate what they've done and when. The instructor will conduct random spot checks to ensure that students have contributed what they claim.

## Classification

### definitions

classification: Predict a discrete random variable $Y$ by using another random variable $X$ iid

$X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d$ ($d$-dimensional vector) $Y_i$ in some finite set $\mathcal{Y}$

classification rule: $h : \mathcal{X} \rightarrow \mathcal{Y}$ Take new observation $X$ and use function $h(x)$ to generate a label $Y$. In other words, If we fit the function $h(x)$ with a random varaible $X$, it generates $Y$ as the class to which $X$ belongs.

Example: Let $\mathcal{X}$ be a set of 2D images and $\mathcal{Y}$ be a finite set of people. We want to learn a classification rule $h:\mathcal{X}\rightarrow\mathcal{Y}$ that with small error predicts the person who appears in the image.

true error rate for classifier $h$

$L(h) = (P(h(X) \neq Y )$

empirical error rate (or training error rate)

$\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i)$

where $\mathbf{I}()$ is an indicator function. Indicator function is defined by

$\mathbf{I}(x) = \begin{cases} 1 & \text{if } x \text{ is true} \\ 0 & \text{if } x \text{ is false} \end{cases}$

So in this cases $\mathbf{I}(h(X_i)\neq Y_i) = 1$ when $h(X_i)\neq Y_i$ i.e. when misclassifications happens.

e.g., 100 new data points with known (true) labels

$y_1 = h(x_1)$

...

$y_{100} = h(x_{100})$

compare the known-true against the classified-true, count, divide by n=100

### Bayes Classifier

First recall Bayes' Rule, in the format $P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)}$

P(Y|X)  : posterior , probability of $Y$ given $X$

P(X|Y)  : likelihood, probability of $X$ generating $Y$ or probability of $Y$ being generated by $X$

P(Y)  : prior, probability of $Y$ being selected

P(X)  : marginal, probability of obtaining $X$

We will start with the simplest case: $\mathcal{Y} = \{0,1\}$

$r(x) = P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)} = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}$

Which of the following do we prefer to work with,

• P(Y=1|X=x) or
• P(X=x|Y=1) ?

The former reflects a Bayesian approach - degree of belief. So, Y is not a RV. The latter reflects a frequentist approach - frequency (of observation). (Ease of computation and validity were proposed and discussed by students.)

The question remains unresolved.

The Bayes Classifier uses $P(Y=1|X=x)$

$P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}$

P(Y=1) : the prior, based on belief/evidence beforehand

denominator : marginalized by summation

$h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases}$

The set $\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \}$

which defines a decision boundary.

Theorem: Bayes rule is optimal. I.e., if h is any other classification rule, then $L(h^*) \lt = L(h)$ (This is to be proved in homework.)

Therefore, Why do we need any other method? A: Because X densities are often/typically unknown. I.e., $f_k(x)$ and/or $\pi_k$ unknown.

$P(Y=k|X=x) = \frac{P(X=x|Y=k) ...} {...} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}$ f_k(x) is referred to as the class conditional distribution (~likelihood).

Therefore, we rely on some data to estimate quantities.

### Three Main Approaches

1. Empirical Risk Minimization: Choose a set of classifier H (e.g., line, neural network) and find $h^* \in H$ that minimizes (some estimate of) L(h).

2. Regression: Find an estimate ($\hat{r}$) of function $r$ and define $h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases}$

The problem is more difficult, because of restricted domain (discrete label values).

3. Density Estimation: Estimate $P(X=x|Y=0)$ from $X_i$'s for which $Y_i = 0$ Estimate $P(X=x|Y=1)$ from $X_i$'s for which $Y_i = 1$ and let $P(Y=?) = (1/n) \sum_{i=1}^{n} Y_i$

Define $\hat{r}(x) = \hat{P}(Y=1|X=x)$ and $h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases}$

Problems?

not enough data to estimate? - possibly

high error rate?

big one: not good in high-dimensional space

As the dimension of the space goes up, the learning requirements go up exponentially.

### Multi-Class Classification

Generalize to case Y takes on k>2 values.

Theorem: $Y \in \mathcal{Y} = {1,2,..., k}$ optimal rule

$h*(x) = argmax_k P$

where $P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum ...}$

## LDA and QDA

(linear discriminant analysis, and quadratic discriminant analysis)

Simplest: Use approach 3 (above) and assume a parametric model for densities. Assume class conditional is Gaussian.

LDA (also known as FDA (Fisher's), which is in fact not really the same thing)

$\mathcal{Y} = \{ 0,1 \}$ assumed (i.e., 2 labels)

$h(x) = \begin{cases} 1 \ \ P(Y=1|X=x) \gt P(Y=0|X=x) \\ 0 \ \ otherwise \end{cases}$

$P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k \pi_k} \ \$ (denom = P(x))

1) Assume Gaussian distributions

$f_k(x) = [(2\pi)^{-d/2} |\Sigma_k|^{-1/2}] exp(-(1/2)(\mathbf{x} - \mathbf{\mu_k}) \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) )$

must compare $\frac{f_1(x) \pi_1} {p(x)} with \frac{f_0(x) \pi_0} {p(x)}$ Note that the p(x) denom can be ignored: $f_1(x) \pi_1$ with $f_0(x) \pi_0$

To find the decision boundary, set $f_1(x) \pi_1 = f_0(x) \pi_0$

Because we are assuming $\Sigma_1 = \Sigma_0$, we can use $\Sigma = \Sigma_0 = \Sigma_1$.

Cancel $(2\pi)^{-d/2} |\Sigma_k|^{-1/2}$ from both sides.

Take log of both sides.

Subtract one side from both sides, leaving zero on one side.

$-(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)] = 0$

$(1/2)[-\mathbf{x}^T \Sigma^{-1} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x} + \mathbf{x}^T \Sigma^{-1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ] + log(\pi_1/\pi_0) = 0$

Which reduces to $(1/2)[\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} + (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_1}^T \Sigma^{-1}) \mathbf{x}] + log(\pi_1/\pi_0) = 0$

And we see that the first pair of terms is constant, and the second pair is linear on x. So that we end up with something of the form $ax + b = 0$.

Jgpitt - 2011/09/21