stat841f11: Difference between revisions
No edit summary |
No edit summary |
||
Line 18: | Line 18: | ||
== Classification == | == Classification == | ||
=== | === Definitions === | ||
'''classification''': Predict a discrete random variable <math>Y</math> by using another random variable <math>X</math> | '''classification''': Predict a discrete random variable <math>Y</math> (a label) by using another random variable <math>X</math> | ||
iid | (new data point) picked iid from a distribution | ||
<math>X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d</math> (<math>d</math>-dimensional vector) | <math>X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d</math> (<math>d</math>-dimensional vector) | ||
Line 28: | Line 28: | ||
'''classification rule''': | '''classification rule''': | ||
<math>h : \mathcal{X} \rightarrow \mathcal{Y}</math> | <math>h : \mathcal{X} \rightarrow \mathcal{Y}</math> | ||
Take new observation <math>X</math> and use function <math>h(x)</math> to generate a label <math>Y</math>. In other words, If we fit the function <math>h(x)</math> with a random | Take new observation <math>X</math> and use a classification function <math>h(x)</math> to generate a label <math>Y</math>. In other words, If we fit the function <math>h(x)</math> with a random variable <math>X</math>, it generates the label <math>Y</math> which is the class to which we predict <math>X</math> belongs. | ||
Example: Let <math> \mathcal{X}</math> be a set of 2D images and <math>\mathcal{Y}</math> be a finite set of people. We want to learn a classification rule <math>h:\mathcal{X}\rightarrow\mathcal{Y}</math> that with small error predicts the person who appears in the image. | Example: Let <math> \mathcal{X}</math> be a set of 2D images and <math>\mathcal{Y}</math> be a finite set of people. We want to learn a classification rule <math>h:\mathcal{X}\rightarrow\mathcal{Y}</math> that with small ''true'' error predicts the person who appears in the image. | ||
'''true error rate''' for classifier <math>h</math> | '''true error rate''' for classifier <math>h</math> is the error with respect to the underlying distribution (that we do not know). | ||
<math>L(h) = (P(h(X) \neq Y )</math> | <math>L(h) = (P(h(X) \neq Y )</math> | ||
'''empirical error rate''' (or training error rate) | '''empirical error rate''' (or training error rate) is the amount of error that our classification function <math>h(x)</math> makes on the training data. | ||
<math>\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i)</math> | <math>\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i)</math> | ||
Line 49: | Line 49: | ||
\end{cases}</math> | \end{cases}</math> | ||
So in this | So in this case <math>\mathbf{I}(h(X_i)\neq Y_i) = 1</math> when <math>h(X_i)\neq Y_i</math> i.e. when misclassifications happens. | ||
e.g., 100 new data points with known (true) labels | e.g., 100 new data points with known (true) labels | ||
Line 59: | Line 59: | ||
<math>y_{100} = h(x_{100})</math> | <math>y_{100} = h(x_{100})</math> | ||
To calculate the empirical error we count how many labels our function <math>h(x)</math> assigned incorrectly and divide by n=100 | |||
=== Bayes Classifier === | === Bayes Classifier === | ||
Line 199: | Line 199: | ||
1) Assume Gaussian distributions | 1) Assume Gaussian distributions | ||
<math>f_k(x) = | <math>f_k(x) = \frac{1}{(2\pi)^{-d/2} |\Sigma_k|^{-1/2}} exp(-(1/2)(\mathbf{x} - \mathbf{\mu_k}) \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) )</math> | ||
must compare | must compare | ||
<math>\frac{f_1(x) \pi_1} {p(x)} with \frac{f_0(x) \pi_0} {p(x)}</math> | <math>\frac{f_1(x) \pi_1} {p(x)}</math> with <math>\frac{f_0(x) \pi_0} {p(x)}</math> | ||
Note that the p(x) denom can be ignored: | Note that the p(x) denom can be ignored: | ||
<math>f_1(x) \pi_1</math> with <math>f_0(x) \pi_0 </math> | <math>f_1(x) \pi_1</math> with <math>f_0(x) \pi_0 </math> |
Revision as of 13:35, 22 September 2011
Editor Sign Up
STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20
Wiki Course Notes
Students will need to contribute to the wiki for 20% of their grade. Access via wikicoursenote.com Go to editor sign-up, and use your UW userid for your account name, and use your UW email.
primary (10%) Post a draft of lecture notes within 48 hours. You will need to do this 1 or 2 times, depending on class size.
secondary (10%) Make improvements to the notes for at least 60% of the lectures. More than half of your contributions should be technical rather than editorial. There will be a spreadsheet where students can indicate what they've done and when. The instructor will conduct random spot checks to ensure that students have contributed what they claim.
Classification
Definitions
classification: Predict a discrete random variable [math]\displaystyle{ Y }[/math] (a label) by using another random variable [math]\displaystyle{ X }[/math] (new data point) picked iid from a distribution
[math]\displaystyle{ X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d }[/math] ([math]\displaystyle{ d }[/math]-dimensional vector) [math]\displaystyle{ Y_i }[/math] in some finite set [math]\displaystyle{ \mathcal{Y} }[/math]
classification rule:
[math]\displaystyle{ h : \mathcal{X} \rightarrow \mathcal{Y} }[/math]
Take new observation [math]\displaystyle{ X }[/math] and use a classification function [math]\displaystyle{ h(x) }[/math] to generate a label [math]\displaystyle{ Y }[/math]. In other words, If we fit the function [math]\displaystyle{ h(x) }[/math] with a random variable [math]\displaystyle{ X }[/math], it generates the label [math]\displaystyle{ Y }[/math] which is the class to which we predict [math]\displaystyle{ X }[/math] belongs.
Example: Let [math]\displaystyle{ \mathcal{X} }[/math] be a set of 2D images and [math]\displaystyle{ \mathcal{Y} }[/math] be a finite set of people. We want to learn a classification rule [math]\displaystyle{ h:\mathcal{X}\rightarrow\mathcal{Y} }[/math] that with small true error predicts the person who appears in the image.
true error rate for classifier [math]\displaystyle{ h }[/math] is the error with respect to the underlying distribution (that we do not know).
[math]\displaystyle{ L(h) = (P(h(X) \neq Y ) }[/math]
empirical error rate (or training error rate) is the amount of error that our classification function [math]\displaystyle{ h(x) }[/math] makes on the training data.
[math]\displaystyle{ \hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i) }[/math]
where [math]\displaystyle{ \mathbf{I}() }[/math] is an indicator function. Indicator function is defined by
[math]\displaystyle{ \mathbf{I}(x) = \begin{cases} 1 & \text{if } x \text{ is true} \\ 0 & \text{if } x \text{ is false} \end{cases} }[/math]
So in this case [math]\displaystyle{ \mathbf{I}(h(X_i)\neq Y_i) = 1 }[/math] when [math]\displaystyle{ h(X_i)\neq Y_i }[/math] i.e. when misclassifications happens.
e.g., 100 new data points with known (true) labels
[math]\displaystyle{ y_1 = h(x_1) }[/math]
...
[math]\displaystyle{ y_{100} = h(x_{100}) }[/math]
To calculate the empirical error we count how many labels our function [math]\displaystyle{ h(x) }[/math] assigned incorrectly and divide by n=100
Bayes Classifier
First recall Bayes' Rule, in the format [math]\displaystyle{ P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} }[/math]
P(Y|X) : posterior , probability of [math]\displaystyle{ Y }[/math] given [math]\displaystyle{ X }[/math]
P(X|Y) : likelihood, probability of [math]\displaystyle{ X }[/math] generating [math]\displaystyle{ Y }[/math] or probability of [math]\displaystyle{ Y }[/math] being generated by [math]\displaystyle{ X }[/math]
P(Y) : prior, probability of [math]\displaystyle{ Y }[/math] being selected
P(X) : marginal, probability of obtaining [math]\displaystyle{ X }[/math]
We will start with the simplest case: [math]\displaystyle{ \mathcal{Y} = \{0,1\} }[/math]
[math]\displaystyle{ r(x) = P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)} = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)} }[/math]
Bayes' rule can be approached by computing either:
- the posterior: P(Y=1|X=x) and P(Y=0|X=x) or
- the likelihood: P(X=x|Y=1) and P(X=x|Y=0)
The former reflects a Bayesian approach - based on the degree of belief. The latter reflects a Frequentist approach - based on the frequency of observation.
There was some class discussion on which approach should be used. Both the ease of computation and the validity of both approaches were discussed. Overall, frequentists consider X to be a random variable. But they do not consider Y to be a random variable, because it has to take on one of the values from a fixed set (in the above case it would be either 0 or 1 and there is only one correct label for a given value X=x). Thus, from a frequentist's perspective it does not make sense to talk about the probability of Y.
This is actually a grey area and sometimes Bayesians and Frequentists use each others' approaches. So using Bayes' rule doesn't necessarily mean you're a Bayesian. Overall, the question remains unresolved.
The Bayes Classifier uses [math]\displaystyle{ P(Y=1|X=x) }[/math]
[math]\displaystyle{ P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)} }[/math]
P(Y=1) : the prior, based on belief/evidence beforehand
denominator : marginalized by summation
[math]\displaystyle{ h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} }[/math]
The set [math]\displaystyle{ \mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \} }[/math]
which defines a decision boundary.
Theorem: Bayes rule is optimal. I.e., if h is any other classification rule,
then [math]\displaystyle{ L(h^*) \lt = L(h) }[/math]
(This is to be proved in homework.)
Therefore, Why do we need any other method? A: Because X densities are often/typically unknown. I.e., [math]\displaystyle{ f_k(x) }[/math] and/or [math]\displaystyle{ \pi_k }[/math] unknown.
[math]\displaystyle{ P(Y=k|X=x) = \frac{P(X=x|Y=k) ...} {...} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k} }[/math] f_k(x) is referred to as the class conditional distribution (~likelihood).
Therefore, we rely on some data to estimate quantities.
Three Main Approaches
1. Empirical Risk Minimization: Choose a set of classifier H (e.g., line, neural network) and find [math]\displaystyle{ h^* \in H }[/math] that minimizes (some estimate of) L(h).
2. Regression: Find an estimate ([math]\displaystyle{ \hat{r} }[/math]) of function [math]\displaystyle{ r }[/math] and define [math]\displaystyle{ h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} }[/math]
The problem is more difficult, because of restricted domain (discrete label values).
3. Density Estimation: Estimate [math]\displaystyle{ P(X=x|Y=0) }[/math] from [math]\displaystyle{ X_i }[/math]'s for which [math]\displaystyle{ Y_i = 0 }[/math] Estimate [math]\displaystyle{ P(X=x|Y=1) }[/math] from [math]\displaystyle{ X_i }[/math]'s for which [math]\displaystyle{ Y_i = 1 }[/math] and let [math]\displaystyle{ P(Y=?) = (1/n) \sum_{i=1}^{n} Y_i }[/math]
Define [math]\displaystyle{ \hat{r}(x) = \hat{P}(Y=1|X=x) }[/math] and [math]\displaystyle{ h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} }[/math]
Problems?
not enough data to estimate? - possibly
high error rate?
big one: not good in high-dimensional space
As the dimension of the space goes up, the learning requirements go up exponentially.
Multi-Class Classification
Generalize to case Y takes on k>2 values.
Theorem: [math]\displaystyle{ Y \in \mathcal{Y} = {1,2,..., k} }[/math] optimal rule
[math]\displaystyle{ h*(x) = argmax_k P }[/math]
where [math]\displaystyle{ P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum ...} }[/math]
LDA and QDA
(linear discriminant analysis, and quadratic discriminant analysis)
Simplest: Use approach 3 (above) and assume a parametric model for densities. Assume class conditional is Gaussian.
LDA (also known as FDA (Fisher's), which is in fact not really the same thing)
[math]\displaystyle{ \mathcal{Y} = \{ 0,1 \} }[/math] assumed (i.e., 2 labels)
[math]\displaystyle{ h(x) = \begin{cases} 1 \ \ P(Y=1|X=x) \gt P(Y=0|X=x) \\ 0 \ \ otherwise \end{cases} }[/math]
[math]\displaystyle{ P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k \pi_k} \ \ }[/math] (denom = P(x))
1) Assume Gaussian distributions
[math]\displaystyle{ f_k(x) = \frac{1}{(2\pi)^{-d/2} |\Sigma_k|^{-1/2}} exp(-(1/2)(\mathbf{x} - \mathbf{\mu_k}) \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) ) }[/math]
must compare [math]\displaystyle{ \frac{f_1(x) \pi_1} {p(x)} }[/math] with [math]\displaystyle{ \frac{f_0(x) \pi_0} {p(x)} }[/math] Note that the p(x) denom can be ignored: [math]\displaystyle{ f_1(x) \pi_1 }[/math] with [math]\displaystyle{ f_0(x) \pi_0 }[/math]
To find the decision boundary, set [math]\displaystyle{ f_1(x) \pi_1 = f_0(x) \pi_0 }[/math]
Because we are assuming [math]\displaystyle{ \Sigma_1 = \Sigma_0 }[/math], we can use [math]\displaystyle{ \Sigma = \Sigma_0 = \Sigma_1 }[/math].
Cancel [math]\displaystyle{ (2\pi)^{-d/2} |\Sigma_k|^{-1/2} }[/math] from both sides.
Take log of both sides.
Subtract one side from both sides, leaving zero on one side.
[math]\displaystyle{ -(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)] = 0 }[/math]
[math]\displaystyle{ (1/2)[-\mathbf{x}^T \Sigma^{-1} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x}
+ \mathbf{x}^T \Sigma^{-1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ]
+ log(\pi_1/\pi_0) = 0 }[/math]
Which reduces to
[math]\displaystyle{ (1/2)[\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0}
+ (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_1}^T \Sigma^{-1}) \mathbf{x}]
+ log(\pi_1/\pi_0) = 0 }[/math]
And we see that the first pair of terms is constant, and the second pair is linear on x.
So that we end up with something of the form
[math]\displaystyle{ ax + b = 0 }[/math].
Jgpitt - 2011/09/21