stat841f11

From statwiki
Revision as of 16:41, 21 September 2011 by Jgpitt (talk | contribs) (841 lecture - 2011/09/20 - initial submission)
Jump to navigation Jump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20

Wiki Course Notes

Students will need to contribute to the wiki for 20% of their grade. Access via wikicoursenote.com Go to editor sign-up, and use your UW userid for your account name, and use your UW email.

primary (10%) Post a draft of lecture notes within 48 hours. You will need to do this 1 or 2 times, depending on class size.

secondary (10%) Make improvements to the notes for at least 60% of the lectures. More than half of your contributions should be technical rather than editorial. There will be a spreadsheet where students can indicate what they've done and when. The instructor will conduct random spot checks to ensure that students have contributed what they claim.


Classification

definitions

classification: Predict a discrete random variable [math]\displaystyle{ Y }[/math] by using another random variable [math]\displaystyle{ X }[/math] iid

[math]\displaystyle{ X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d }[/math] ([math]\displaystyle{ d }[/math]-dimensional vector) [math]\displaystyle{ Y_i }[/math] in some finite set [math]\displaystyle{ \mathcal{Y} }[/math]


classification rule: [math]\displaystyle{ h : \mathcal{X} \rightarrow \mathcal{Y} }[/math] Take new observation [math]\displaystyle{ X }[/math] and make new [math]\displaystyle{ Y }[/math] prediction [math]\displaystyle{ h(X) }[/math].

Example: an image (2-d), which can be represented by a long vector [math]\displaystyle{ \mathbf{X} }[/math]


true error rate for classifier [math]\displaystyle{ h }[/math]

[math]\displaystyle{ L(h) = (P(h(X) \neq Y ) }[/math]


empirical error rate (or training error rate)

[math]\displaystyle{ \hat{L}_n(h) = (1/n) \sum_{i=1}^{n} I(h(X_i) \neq Y_i) }[/math]

where [math]\displaystyle{ I() }[/math] indicates misclassification, i.e.,

[math]\displaystyle{ I = 1 \ \rightarrow \ h() }[/math] misclassifies this [math]\displaystyle{ X }[/math]

[math]\displaystyle{ I = 0 \ \rightarrow \ h() }[/math] does not misclassify this [math]\displaystyle{ X }[/math]

e.g., 100 new data points with known (true) labels

[math]\displaystyle{ y_1 = h(x_1) }[/math]

...

[math]\displaystyle{ y_{100} = h(x_{100}) }[/math]

compare the known-true against the classified-true, count, divide by n=100


Bayes Classifier

First recall Bayes' Rule, in the format [math]\displaystyle{ P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} }[/math]

P(Y|X)  : posterior

P(X|Y)  : likelihood

P(Y)  : prior

P(X)  : marginal

We will start with the simplest case: [math]\displaystyle{ \mathcal{Y} = \{0,1\} }[/math]

[math]\displaystyle{ r(x) = P(Y=1|X=x) }[/math] [math]\displaystyle{ = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)} }[/math] [math]\displaystyle{ = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)} }[/math]

Which of the following do we prefer to work with,

  • P(Y=1|X=x) or
  • P(X=x|Y=1) ?

The former reflects a Bayesian approach - degree of belief. So, Y is not a RV. The latter reflects a frequentist approach - frequency (of observation). (Ease of computation and validity were proposed and discussed by students.)

The question remains unresolved.


The Bayes Classifier uses [math]\displaystyle{ P(Y=1|X=x) }[/math]

[math]\displaystyle{ P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)} }[/math]

P(Y=1) : the prior, based on belief/evidence beforehand

denominator : marginalized by summation

[math]\displaystyle{ h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} }[/math]

The set [math]\displaystyle{ \mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \} }[/math]

which defines a decision boundary.


Theorem: Bayes rule is optimal. I.e., if h is any other classification rule, then [math]\displaystyle{ L(h^*) \lt = L(h) }[/math] (This is to be proved in homework.)

Therefore, Why do we need any other method? A: Because X densities are often/typically unknown. I.e., [math]\displaystyle{ f_k(x) }[/math] and/or [math]\displaystyle{ \pi_k }[/math] unknown.

[math]\displaystyle{ P(Y=k|X=x) = \frac{P(X=x|Y=k) ...} {...} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k} }[/math] f_k(x) is referred to as the class conditional distribution (~likelihood).

Therefore, we rely on some data to estimate quantities.


Three Main Approaches

1. Empirical Risk Minimization: Choose a set of classifier H (e.g., line, neural network) and find [math]\displaystyle{ h^* \in H }[/math] that minimizes (some estimate of) L(h).

2. Regression: Find an estimate ([math]\displaystyle{ \hat{r} }[/math]) of function [math]\displaystyle{ r }[/math] and define [math]\displaystyle{ h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} }[/math]

The problem is more difficult, because of restricted domain (discrete label values).

3. Density Estimation: Estimate [math]\displaystyle{ P(X=x|Y=0) }[/math] from [math]\displaystyle{ X_i }[/math]'s for which [math]\displaystyle{ Y_i = 0 }[/math] Estimate [math]\displaystyle{ P(X=x|Y=1) }[/math] from [math]\displaystyle{ X_i }[/math]'s for which [math]\displaystyle{ Y_i = 1 }[/math] and let [math]\displaystyle{ P(Y=?) = (1/n) \sum_{i=1}^{n} Y_i }[/math]

Define [math]\displaystyle{ \hat{r}(x) = \hat{P}(Y=1|X=x) }[/math] and [math]\displaystyle{ h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} }[/math]

Problems?

not enough data to estimate? - possibly

high error rate?

big one: not good in high-dimensional space


As the dimension of the space goes up, the learning requirements go up exponentially.


Multi-Class Classification

Generalize to case Y takes on k>2 values.


Theorem: [math]\displaystyle{ Y \in \mathcal{Y} = {1,2,..., k} }[/math] optimal rule

[math]\displaystyle{ h*(x) = argmax_k P }[/math]

where [math]\displaystyle{ P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum ...} }[/math]


LDA and QDA

(linear discriminant analysis, and quadratic discriminant analysis)

Simplest: Use approach 3 (above) and assume a parametric model for densities. Assume class conditional is Gaussian.

LDA (also known as FDA (Fisher's), which is in fact not really the same thing)

[math]\displaystyle{ \mathcal{Y} = \{ 0,1 \} }[/math] assumed (i.e., 2 labels)

[math]\displaystyle{ h(x) = \begin{cases} 1 \ \ P(Y=1|X=x) \gt P(Y=0|X=x) \\ 0 \ \ otherwise \end{cases} }[/math]

[math]\displaystyle{ P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k \pi_k} \ \ }[/math] (denom = P(x))

1) Assume Gaussian distributions

[math]\displaystyle{ f_k(x) = [(2\pi)^{-d/2} |\Sigma_k|^{-1/2}] exp(-(1/2)(\mathbf{x} - \mathbf{\mu_k}) \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) ) }[/math]

must compare [math]\displaystyle{ \frac{f_1(x) \pi_1} {p(x)} with \frac{f_0(x) \pi_0} {p(x)} }[/math] Note that the p(x) denom can be ignored: [math]\displaystyle{ f_1(x) \pi_1 }[/math] with [math]\displaystyle{ f_0(x) \pi_0 }[/math]

To find the decision boundary, set [math]\displaystyle{ f_1(x) \pi_1 = f_0(x) \pi_0 }[/math]

Because we are assuming [math]\displaystyle{ \Sigma_1 = \Sigma_0 }[/math], we can use [math]\displaystyle{ \Sigma = \Sigma_0 = \Sigma_1 }[/math].

Cancel [math]\displaystyle{ (2\pi)^{-d/2} |\Sigma_k|^{-1/2} }[/math] from both sides.

Take log of both sides.

Subtract one side from both sides, leaving zero on one side.


[math]\displaystyle{ -(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)] = 0 }[/math]


[math]\displaystyle{ (1/2)[-\mathbf{x}^T \Sigma^{-1} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x} + \mathbf{x}^T \Sigma^{-1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ] + log(\pi_1/\pi_0) = 0 }[/math]


Which reduces to [math]\displaystyle{ (1/2)[\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} + (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_1}^T \Sigma^{-1}) \mathbf{x}] + log(\pi_1/\pi_0) = 0 }[/math]


And we see that the first pair of terms is constant, and the second pair is linear on x. So that we end up with something of the form [math]\displaystyle{ ax + b = 0 }[/math].

Jgpitt - 2011/09/21


Editor Sign Up