Difference between revisions of "stat841f11"
(Created page with "== Editor Sign Up==") |
(841 lecture - 2011/09/20 - initial submission) |
||
Line 1: | Line 1: | ||
+ | = STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20 = | ||
+ | == Wiki Course Notes == | ||
+ | Students will need to contribute to the wiki for 20% of their grade. | ||
+ | Access via wikicoursenote.com | ||
+ | Go to editor sign-up, and use your UW userid for your account name, and use your UW email. | ||
+ | |||
+ | primary (10%) | ||
+ | Post a draft of lecture notes within 48 hours. | ||
+ | You will need to do this 1 or 2 times, depending on class size. | ||
+ | |||
+ | secondary (10%) | ||
+ | Make improvements to the notes for at least 60% of the lectures. | ||
+ | More than half of your contributions should be technical rather than editorial. | ||
+ | There will be a spreadsheet where students can indicate what they've done and when. | ||
+ | The instructor will conduct random spot checks to ensure that students have contributed what they claim. | ||
+ | |||
+ | |||
+ | == Classification == | ||
+ | === definitions === | ||
+ | '''classification''': Predict a discrete random variable <math>Y</math> by using another random variable <math>X</math> | ||
+ | iid | ||
+ | |||
+ | <math>X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d</math> (<math>d</math>-dimensional vector) | ||
+ | <math>Y_i</math> in some finite set <math>\mathcal{Y}</math> | ||
+ | |||
+ | |||
+ | '''classification rule''': | ||
+ | <math>h : \mathcal{X} \rightarrow \mathcal{Y}</math> | ||
+ | Take new observation <math>X</math> and make new <math>Y</math> prediction <math>h(X)</math>. | ||
+ | |||
+ | Example: an image (2-d), which can be represented by a long vector <math>\mathbf{X}</math> | ||
+ | |||
+ | |||
+ | '''true error rate''' for classifier <math>h</math> | ||
+ | |||
+ | <math>L(h) = (P(h(X) \neq Y )</math> | ||
+ | |||
+ | |||
+ | '''empirical error rate''' (or training error rate) | ||
+ | |||
+ | <math>\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} I(h(X_i) \neq Y_i)</math> | ||
+ | |||
+ | where <math>I()</math> indicates misclassification, i.e., | ||
+ | |||
+ | <math>I = 1 \ \rightarrow \ h()</math> misclassifies this <math>X</math> | ||
+ | |||
+ | <math>I = 0 \ \rightarrow \ h()</math> does not misclassify this <math>X</math> | ||
+ | |||
+ | e.g., 100 new data points with known (true) labels | ||
+ | |||
+ | <math>y_1 = h(x_1)</math> | ||
+ | |||
+ | ... | ||
+ | |||
+ | <math>y_{100} = h(x_{100})</math> | ||
+ | |||
+ | compare the known-true against the classified-true, count, divide by n=100 | ||
+ | |||
+ | |||
+ | === Bayes Classifier === | ||
+ | First recall Bayes' Rule, in the format | ||
+ | <math>P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} </math> | ||
+ | |||
+ | P(Y|X) : ''posterior'' | ||
+ | |||
+ | P(X|Y) : ''likelihood'' | ||
+ | |||
+ | P(Y) : ''prior'' | ||
+ | |||
+ | P(X) : ''marginal'' | ||
+ | |||
+ | We will start with the simplest case: <math>\mathcal{Y} = \{0,1\}</math> | ||
+ | |||
+ | <math>r(x) = P(Y=1|X=x)</math> | ||
+ | <math> = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)}</math> | ||
+ | <math> = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math> | ||
+ | |||
+ | Which of the following do we prefer to work with, | ||
+ | * P(Y=1|X=x) or | ||
+ | * P(X=x|Y=1) ? | ||
+ | |||
+ | The former reflects a ''Bayesian'' approach - degree of belief. So, Y is not a RV. | ||
+ | The latter reflects a ''frequentist'' approach - frequency (of observation). | ||
+ | (Ease of computation and validity were proposed and discussed by students.) | ||
+ | |||
+ | The question remains unresolved. | ||
+ | |||
+ | |||
+ | The Bayes Classifier uses <math>P(Y=1|X=x)</math> | ||
+ | |||
+ | <math> P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math> | ||
+ | |||
+ | P(Y=1) : the prior, based on belief/evidence beforehand | ||
+ | |||
+ | denominator : marginalized by summation | ||
+ | |||
+ | <math>h(x) = | ||
+ | \begin{cases} | ||
+ | 1 \ \ \hat{r}(x) > 1/2 \\ | ||
+ | 0 \ \ otherwise | ||
+ | \end{cases} | ||
+ | </math> | ||
+ | |||
+ | The set <math>\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \} </math> | ||
+ | |||
+ | which defines a ''decision boundary''. | ||
+ | |||
+ | |||
+ | ''Theorem'': Bayes rule is optimal. I.e., if h is any other classification rule, | ||
+ | then <math>L(h^*) <= L(h)</math> | ||
+ | (This is to be proved in homework.) | ||
+ | |||
+ | Therefore, Why do we need any other method? | ||
+ | A: Because X densities are often/typically unknown. I.e., <math>f_k(x)</math> and/or <math>\pi_k</math> unknown. | ||
+ | |||
+ | <math>P(Y=k|X=x) = \frac{P(X=x|Y=k) ...} {...} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}</math> | ||
+ | f_k(x) is referred to as the class conditional distribution (~likelihood). | ||
+ | |||
+ | Therefore, we rely on some data to estimate quantities. | ||
+ | |||
+ | |||
+ | === Three Main Approaches === | ||
+ | |||
+ | '''1. Empirical Risk Minimization''': | ||
+ | Choose a set of classifier H (e.g., line, neural network) and find <math>h^* \in H</math> | ||
+ | that minimizes (some estimate of) L(h). | ||
+ | |||
+ | '''2. Regression''': | ||
+ | Find an estimate (<math>\hat{r}</math>) of function <math>r</math> and define | ||
+ | <math>h(x) = | ||
+ | \begin{cases} | ||
+ | 1 \ \ \hat{r}(x) > 1/2 \\ | ||
+ | 0 \ \ otherwise | ||
+ | \end{cases} | ||
+ | </math> | ||
+ | |||
+ | The problem is more difficult, because of restricted domain (discrete label values). | ||
+ | |||
+ | '''3. Density Estimation''': | ||
+ | Estimate <math>P(X=x|Y=0)</math> from <math>X_i</math>'s for which <math>Y_i = 0</math> | ||
+ | Estimate <math>P(X=x|Y=1)</math> from <math>X_i</math>'s for which <math>Y_i = 1</math> | ||
+ | and let <math>P(Y=?) = (1/n) \sum_{i=1}^{n} Y_i</math> | ||
+ | |||
+ | Define <math>\hat{r}(x) = \hat{P}(Y=1|X=x)</math> and | ||
+ | <math>h(x) = | ||
+ | \begin{cases} | ||
+ | 1 \ \ \hat{r}(x) > 1/2 \\ | ||
+ | 0 \ \ otherwise | ||
+ | \end{cases} | ||
+ | </math> | ||
+ | |||
+ | Problems? | ||
+ | |||
+ | not enough data to estimate? - possibly | ||
+ | |||
+ | high error rate? | ||
+ | |||
+ | big one: not good in high-dimensional space | ||
+ | |||
+ | |||
+ | As the dimension of the space goes up, the learning requirements go up exponentially. | ||
+ | |||
+ | |||
+ | === Multi-Class Classification === | ||
+ | Generalize to case Y takes on k>2 values. | ||
+ | |||
+ | |||
+ | ''Theorem'': <math>Y \in \mathcal{Y} = {1,2,..., k}</math> optimal rule | ||
+ | |||
+ | <math>h*(x) = argmax_k P</math> | ||
+ | |||
+ | where <math>P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum ...}</math> | ||
+ | |||
+ | |||
+ | == LDA and QDA == | ||
+ | (linear discriminant analysis, and quadratic discriminant analysis) | ||
+ | |||
+ | Simplest: Use approach 3 (above) and assume a parametric model for densities. | ||
+ | Assume class conditional is Gaussian. | ||
+ | |||
+ | LDA (also known as FDA (Fisher's), which is in fact not really the same thing) | ||
+ | |||
+ | <math>\mathcal{Y} = \{ 0,1 \}</math> assumed (i.e., 2 labels) | ||
+ | |||
+ | <math>h(x) = | ||
+ | \begin{cases} | ||
+ | 1 \ \ P(Y=1|X=x) > P(Y=0|X=x) \\ | ||
+ | 0 \ \ otherwise | ||
+ | \end{cases} | ||
+ | </math> | ||
+ | |||
+ | <math>P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k \pi_k} \ \ </math> (denom = P(x)) | ||
+ | |||
+ | 1) Assume Gaussian distributions | ||
+ | |||
+ | <math>f_k(x) = [(2\pi)^{-d/2} |\Sigma_k|^{-1/2}] exp(-(1/2)(\mathbf{x} - \mathbf{\mu_k}) \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) )</math> | ||
+ | |||
+ | must compare | ||
+ | <math>\frac{f_1(x) \pi_1} {p(x)} with \frac{f_0(x) \pi_0} {p(x)}</math> | ||
+ | Note that the p(x) denom can be ignored: | ||
+ | <math>f_1(x) \pi_1</math> with <math>f_0(x) \pi_0 </math> | ||
+ | |||
+ | To find the decision boundary, set | ||
+ | <math>f_1(x) \pi_1 = f_0(x) \pi_0 </math> | ||
+ | |||
+ | Because we are assuming <math>\Sigma_1 = \Sigma_0</math>, we can use <math>\Sigma = \Sigma_0 = \Sigma_1</math>. | ||
+ | |||
+ | Cancel <math>(2\pi)^{-d/2} |\Sigma_k|^{-1/2}</math> from both sides. | ||
+ | |||
+ | Take log of both sides. | ||
+ | |||
+ | Subtract one side from both sides, leaving zero on one side. | ||
+ | |||
+ | |||
+ | <math>-(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)] = 0 </math> | ||
+ | |||
+ | |||
+ | <math>(1/2)[-\mathbf{x}^T \Sigma^{-1} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x} | ||
+ | + \mathbf{x}^T \Sigma^{-1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ] | ||
+ | + log(\pi_1/\pi_0) = 0 </math> | ||
+ | |||
+ | |||
+ | Which reduces to | ||
+ | <math>(1/2)[\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} | ||
+ | + (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_1}^T \Sigma^{-1}) \mathbf{x}] | ||
+ | + log(\pi_1/\pi_0) = 0 </math> | ||
+ | |||
+ | |||
+ | And we see that the first pair of terms is constant, and the second pair is linear on x. | ||
+ | So that we end up with something of the form | ||
+ | <math>ax + b = 0</math>. | ||
+ | |||
+ | [[User:Jgpitt|Jgpitt]] - 2011/09/21 | ||
+ | |||
+ | |||
==[[f11Stat841EditorSignUp| Editor Sign Up]]== | ==[[f11Stat841EditorSignUp| Editor Sign Up]]== |
Revision as of 16:41, 21 September 2011
Contents
STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20
Wiki Course Notes
Students will need to contribute to the wiki for 20% of their grade. Access via wikicoursenote.com Go to editor sign-up, and use your UW userid for your account name, and use your UW email.
primary (10%) Post a draft of lecture notes within 48 hours. You will need to do this 1 or 2 times, depending on class size.
secondary (10%) Make improvements to the notes for at least 60% of the lectures. More than half of your contributions should be technical rather than editorial. There will be a spreadsheet where students can indicate what they've done and when. The instructor will conduct random spot checks to ensure that students have contributed what they claim.
Classification
definitions
classification: Predict a discrete random variable [math]Y[/math] by using another random variable [math]X[/math] iid
[math]X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d[/math] ([math]d[/math]-dimensional vector) [math]Y_i[/math] in some finite set [math]\mathcal{Y}[/math]
classification rule:
[math]h : \mathcal{X} \rightarrow \mathcal{Y}[/math]
Take new observation [math]X[/math] and make new [math]Y[/math] prediction [math]h(X)[/math].
Example: an image (2-d), which can be represented by a long vector [math]\mathbf{X}[/math]
true error rate for classifier [math]h[/math]
[math]L(h) = (P(h(X) \neq Y )[/math]
empirical error rate (or training error rate)
[math]\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} I(h(X_i) \neq Y_i)[/math]
where [math]I()[/math] indicates misclassification, i.e.,
[math]I = 1 \ \rightarrow \ h()[/math] misclassifies this [math]X[/math]
[math]I = 0 \ \rightarrow \ h()[/math] does not misclassify this [math]X[/math]
e.g., 100 new data points with known (true) labels
[math]y_1 = h(x_1)[/math]
...
[math]y_{100} = h(x_{100})[/math]
compare the known-true against the classified-true, count, divide by n=100
Bayes Classifier
First recall Bayes' Rule, in the format [math]P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} [/math]
P(Y|X) : posterior
P(X|Y) : likelihood
P(Y) : prior
P(X) : marginal
We will start with the simplest case: [math]\mathcal{Y} = \{0,1\}[/math]
[math]r(x) = P(Y=1|X=x)[/math] [math] = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)}[/math] [math] = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}[/math]
Which of the following do we prefer to work with,
- P(Y=1|X=x) or
- P(X=x|Y=1) ?
The former reflects a Bayesian approach - degree of belief. So, Y is not a RV. The latter reflects a frequentist approach - frequency (of observation). (Ease of computation and validity were proposed and discussed by students.)
The question remains unresolved.
The Bayes Classifier uses [math]P(Y=1|X=x)[/math]
[math] P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}[/math]
P(Y=1) : the prior, based on belief/evidence beforehand
denominator : marginalized by summation
[math]h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} [/math]
The set [math]\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \} [/math]
which defines a decision boundary.
Theorem: Bayes rule is optimal. I.e., if h is any other classification rule,
then [math]L(h^*) \lt = L(h)[/math]
(This is to be proved in homework.)
Therefore, Why do we need any other method? A: Because X densities are often/typically unknown. I.e., [math]f_k(x)[/math] and/or [math]\pi_k[/math] unknown.
[math]P(Y=k|X=x) = \frac{P(X=x|Y=k) ...} {...} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}[/math] f_k(x) is referred to as the class conditional distribution (~likelihood).
Therefore, we rely on some data to estimate quantities.
Three Main Approaches
1. Empirical Risk Minimization: Choose a set of classifier H (e.g., line, neural network) and find [math]h^* \in H[/math] that minimizes (some estimate of) L(h).
2. Regression: Find an estimate ([math]\hat{r}[/math]) of function [math]r[/math] and define [math]h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} [/math]
The problem is more difficult, because of restricted domain (discrete label values).
3. Density Estimation: Estimate [math]P(X=x|Y=0)[/math] from [math]X_i[/math]'s for which [math]Y_i = 0[/math] Estimate [math]P(X=x|Y=1)[/math] from [math]X_i[/math]'s for which [math]Y_i = 1[/math] and let [math]P(Y=?) = (1/n) \sum_{i=1}^{n} Y_i[/math]
Define [math]\hat{r}(x) = \hat{P}(Y=1|X=x)[/math] and [math]h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} [/math]
Problems?
not enough data to estimate? - possibly
high error rate?
big one: not good in high-dimensional space
As the dimension of the space goes up, the learning requirements go up exponentially.
Multi-Class Classification
Generalize to case Y takes on k>2 values.
Theorem: [math]Y \in \mathcal{Y} = {1,2,..., k}[/math] optimal rule
[math]h*(x) = argmax_k P[/math]
where [math]P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum ...}[/math]
LDA and QDA
(linear discriminant analysis, and quadratic discriminant analysis)
Simplest: Use approach 3 (above) and assume a parametric model for densities. Assume class conditional is Gaussian.
LDA (also known as FDA (Fisher's), which is in fact not really the same thing)
[math]\mathcal{Y} = \{ 0,1 \}[/math] assumed (i.e., 2 labels)
[math]h(x) = \begin{cases} 1 \ \ P(Y=1|X=x) \gt P(Y=0|X=x) \\ 0 \ \ otherwise \end{cases} [/math]
[math]P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k \pi_k} \ \ [/math] (denom = P(x))
1) Assume Gaussian distributions
[math]f_k(x) = [(2\pi)^{-d/2} |\Sigma_k|^{-1/2}] exp(-(1/2)(\mathbf{x} - \mathbf{\mu_k}) \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) )[/math]
must compare [math]\frac{f_1(x) \pi_1} {p(x)} with \frac{f_0(x) \pi_0} {p(x)}[/math] Note that the p(x) denom can be ignored: [math]f_1(x) \pi_1[/math] with [math]f_0(x) \pi_0 [/math]
To find the decision boundary, set [math]f_1(x) \pi_1 = f_0(x) \pi_0 [/math]
Because we are assuming [math]\Sigma_1 = \Sigma_0[/math], we can use [math]\Sigma = \Sigma_0 = \Sigma_1[/math].
Cancel [math](2\pi)^{-d/2} |\Sigma_k|^{-1/2}[/math] from both sides.
Take log of both sides.
Subtract one side from both sides, leaving zero on one side.
[math]-(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)] = 0 [/math]
[math](1/2)[-\mathbf{x}^T \Sigma^{-1} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x}
+ \mathbf{x}^T \Sigma^{-1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ]
+ log(\pi_1/\pi_0) = 0 [/math]
Which reduces to
[math](1/2)[\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0}
+ (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_1}^T \Sigma^{-1}) \mathbf{x}]
+ log(\pi_1/\pi_0) = 0 [/math]
And we see that the first pair of terms is constant, and the second pair is linear on x.
So that we end up with something of the form
[math]ax + b = 0[/math].
Jgpitt - 2011/09/21