Revision as of 16:41, 21 September 2011

STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20

Wiki Course Notes

Students will need to contribute to the wiki for 20% of their grade. Access via wikicoursenote.com Go to editor sign-up, and use your UW userid for your account name, and use your UW email.

primary (10%) Post a draft of lecture notes within 48 hours. You will need to do this 1 or 2 times, depending on class size.

secondary (10%) Make improvements to the notes for at least 60% of the lectures. More than half of your contributions should be technical rather than editorial. There will be a spreadsheet where students can indicate what they've done and when. The instructor will conduct random spot checks to ensure that students have contributed what they claim.

Classification

definitions

classification: Predict a discrete random variable [math]\displaystyle{ Y }[/math] by using another random variable [math]\displaystyle{ X }[/math] iid

[math]\displaystyle{ X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d }[/math] ([math]\displaystyle{ d }[/math]-dimensional vector) [math]\displaystyle{ Y_i }[/math] in some finite set [math]\displaystyle{ \mathcal{Y} }[/math]

classification rule: [math]\displaystyle{ h : \mathcal{X} \rightarrow \mathcal{Y} }[/math] Take new observation [math]\displaystyle{ X }[/math] and make new [math]\displaystyle{ Y }[/math] prediction [math]\displaystyle{ h(X) }[/math].

Example: an image (2-d), which can be represented by a long vector [math]\displaystyle{ \mathbf{X} }[/math]

true error rate for classifier [math]\displaystyle{ h }[/math]

[math]\displaystyle{ L(h) = (P(h(X) \neq Y ) }[/math]

empirical error rate (or training error rate)

[math]\displaystyle{ \hat{L}_n(h) = (1/n) \sum_{i=1}^{n} I(h(X_i) \neq Y_i) }[/math]

where [math]\displaystyle{ I() }[/math] indicates misclassification, i.e.,

[math]\displaystyle{ I = 1 \ \rightarrow \ h() }[/math] misclassifies this [math]\displaystyle{ X }[/math]

[math]\displaystyle{ I = 0 \ \rightarrow \ h() }[/math] does not misclassify this [math]\displaystyle{ X }[/math]

e.g., 100 new data points with known (true) labels

[math]\displaystyle{ y_1 = h(x_1) }[/math]

...

[math]\displaystyle{ y_{100} = h(x_{100}) }[/math]

compare the known-true against the classified-true, count, divide by n=100

Bayes Classifier

First recall Bayes' Rule, in the format [math]\displaystyle{ P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} }[/math]

P(Y|X) : posterior

P(X|Y) : likelihood

P(Y) : prior

P(X) : marginal

We will start with the simplest case: [math]\displaystyle{ \mathcal{Y} = \{0,1\} }[/math]

[math]\displaystyle{ r(x) = P(Y=1|X=x) }[/math] [math]\displaystyle{ = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)} }[/math] [math]\displaystyle{ = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)} }[/math]

Which of the following do we prefer to work with,

P(Y=1|X=x) or
P(X=x|Y=1) ?

The former reflects a Bayesian approach - degree of belief. So, Y is not a RV. The latter reflects a frequentist approach - frequency (of observation). (Ease of computation and validity were proposed and discussed by students.)

The question remains unresolved.

The Bayes Classifier uses [math]\displaystyle{ P(Y=1|X=x) }[/math]

[math]\displaystyle{ P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)} }[/math]

P(Y=1) : the prior, based on belief/evidence beforehand

denominator : marginalized by summation

[math]\displaystyle{ h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} }[/math]

The set [math]\displaystyle{ \mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \} }[/math]

which defines a decision boundary.

Theorem: Bayes rule is optimal. I.e., if h is any other classification rule, then [math]\displaystyle{ L(h^*) \lt = L(h) }[/math] (This is to be proved in homework.)

Therefore, Why do we need any other method? A: Because X densities are often/typically unknown. I.e., [math]\displaystyle{ f_k(x) }[/math] and/or [math]\displaystyle{ \pi_k }[/math] unknown.

[math]\displaystyle{ P(Y=k|X=x) = \frac{P(X=x|Y=k) ...} {...} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k} }[/math] f_k(x) is referred to as the class conditional distribution (~likelihood).

Therefore, we rely on some data to estimate quantities.

Three Main Approaches

1. Empirical Risk Minimization: Choose a set of classifier H (e.g., line, neural network) and find [math]\displaystyle{ h^* \in H }[/math] that minimizes (some estimate of) L(h).

2. Regression: Find an estimate ([math]\displaystyle{ \hat{r} }[/math]) of function [math]\displaystyle{ r }[/math] and define [math]\displaystyle{ h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} }[/math]

The problem is more difficult, because of restricted domain (discrete label values).

3. Density Estimation: Estimate [math]\displaystyle{ P(X=x|Y=0) }[/math] from [math]\displaystyle{ X_i }[/math]'s for which [math]\displaystyle{ Y_i = 0 }[/math] Estimate [math]\displaystyle{ P(X=x|Y=1) }[/math] from [math]\displaystyle{ X_i }[/math]'s for which [math]\displaystyle{ Y_i = 1 }[/math] and let [math]\displaystyle{ P(Y=?) = (1/n) \sum_{i=1}^{n} Y_i }[/math]

Define [math]\displaystyle{ \hat{r}(x) = \hat{P}(Y=1|X=x) }[/math] and [math]\displaystyle{ h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases} }[/math]

Problems?

not enough data to estimate? - possibly

high error rate?

big one: not good in high-dimensional space

As the dimension of the space goes up, the learning requirements go up exponentially.

Multi-Class Classification

Generalize to case Y takes on k>2 values.

Theorem: [math]\displaystyle{ Y \in \mathcal{Y} = {1,2,..., k} }[/math] optimal rule

[math]\displaystyle{ h*(x) = argmax_k P }[/math]

where [math]\displaystyle{ P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum ...} }[/math]

LDA and QDA

(linear discriminant analysis, and quadratic discriminant analysis)

Simplest: Use approach 3 (above) and assume a parametric model for densities. Assume class conditional is Gaussian.

LDA (also known as FDA (Fisher's), which is in fact not really the same thing)

[math]\displaystyle{ \mathcal{Y} = \{ 0,1 \} }[/math] assumed (i.e., 2 labels)

[math]\displaystyle{ h(x) = \begin{cases} 1 \ \ P(Y=1|X=x) \gt P(Y=0|X=x) \\ 0 \ \ otherwise \end{cases} }[/math]

[math]\displaystyle{ P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k \pi_k} \ \ }[/math] (denom = P(x))

1) Assume Gaussian distributions

[math]\displaystyle{ f_k(x) = [(2\pi)^{-d/2} |\Sigma_k|^{-1/2}] exp(-(1/2)(\mathbf{x} - \mathbf{\mu_k}) \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) ) }[/math]

must compare [math]\displaystyle{ \frac{f_1(x) \pi_1} {p(x)} with \frac{f_0(x) \pi_0} {p(x)} }[/math] Note that the p(x) denom can be ignored: [math]\displaystyle{ f_1(x) \pi_1 }[/math] with [math]\displaystyle{ f_0(x) \pi_0 }[/math]

To find the decision boundary, set [math]\displaystyle{ f_1(x) \pi_1 = f_0(x) \pi_0 }[/math]

Because we are assuming [math]\displaystyle{ \Sigma_1 = \Sigma_0 }[/math], we can use [math]\displaystyle{ \Sigma = \Sigma_0 = \Sigma_1 }[/math].

Cancel [math]\displaystyle{ (2\pi)^{-d/2} |\Sigma_k|^{-1/2} }[/math] from both sides.

Take log of both sides.

Subtract one side from both sides, leaving zero on one side.

[math]\displaystyle{ -(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)] = 0 }[/math]

[math]\displaystyle{ (1/2)[-\mathbf{x}^T \Sigma^{-1} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x} + \mathbf{x}^T \Sigma^{-1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ] + log(\pi_1/\pi_0) = 0 }[/math]

Which reduces to [math]\displaystyle{ (1/2)[\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} + (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_1}^T \Sigma^{-1}) \mathbf{x}] + log(\pi_1/\pi_0) = 0 }[/math]

And we see that the first pair of terms is constant, and the second pair is linear on x. So that we end up with something of the form [math]\displaystyle{ ax + b = 0 }[/math].

Jgpitt - 2011/09/21

stat841f11: Difference between revisions

Revision as of 16:41, 21 September 2011

Contents

STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20

Wiki Course Notes

Classification

definitions

Bayes Classifier

Three Main Approaches

Multi-Class Classification

LDA and QDA

Editor Sign Up

Navigation menu

@@ Line 1: / Line 1: @@
+= STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20 =
+== Wiki Course Notes ==
+Students will need to contribute to the wiki for 20% of their grade.
+Access via wikicoursenote.com
+Go to editor sign-up, and use your UW userid for your account name, and use your UW email.
+primary (10%)
+Post a draft of lecture notes within 48 hours.
+You will need to do this 1 or 2 times, depending on class size.
+secondary (10%)
+Make improvements to the notes for at least 60% of the lectures.
+More than half of your contributions should be technical rather than editorial.
+There will be a spreadsheet where students can indicate what they've done and when.
+The instructor will conduct random spot checks to ensure that students have contributed what they claim.
+== Classification ==
+=== definitions ===
+'''classification''': Predict a discrete random variable <math>Y</math> by using another random variable <math>X</math>
+iid
+<math>X_i = (X_{i1}, X_{i2}, ... X_{id}) \in \mathcal{X} \subset \mathbb{R}^d</math>  (<math>d</math>-dimensional vector)
+<math>Y_i</math> in some finite set <math>\mathcal{Y}</math>
+'''classification rule''':
+<math>h : \mathcal{X} \rightarrow \mathcal{Y}</math>
+Take new observation <math>X</math> and make new <math>Y</math> prediction <math>h(X)</math>.
+Example:  an image (2-d), which can be represented by a long vector <math>\mathbf{X}</math>
+'''true error rate''' for classifier <math>h</math>
+<math>L(h) = (P(h(X) \neq Y )</math>
+'''empirical error rate''' (or training error rate)
+<math>\hat{L}_n(h)  =  (1/n) \sum_{i=1}^{n} I(h(X_i) \neq Y_i)</math>
+where <math>I()</math> indicates misclassification, i.e.,
+<math>I = 1 \ \rightarrow \ h()</math> misclassifies this <math>X</math>
+<math>I = 0 \ \rightarrow \ h()</math> does not misclassify this <math>X</math>
+e.g., 100 new data points with known (true) labels
+<math>y_1 = h(x_1)</math>
+...
+<math>y_{100} = h(x_{100})</math>
+compare the known-true against the classified-true, count, divide by n=100
+=== Bayes Classifier ===
+First recall Bayes' Rule, in the format
+<math>P(Y|X)  =  \frac{P(X|Y) P(Y)} {P(X)} </math>
+P(Y|X)  :  ''posterior''
+P(X|Y)  :  ''likelihood''
+P(Y)   :  ''prior''
+P(X)   :  ''marginal''
+We will start with the simplest case:  <math>\mathcal{Y} = \{0,1\}</math>
+<math>r(x)  =  P(Y=1|X=x)</math>
+<math>  =  \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)}</math>
+<math>  =  \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1)  +  P(X=x|Y=0) P(Y=0)}</math>
+Which of the following do we prefer to work with,
+* P(Y=1|X=x)  or
+* P(X=x|Y=1) ?
+The former reflects a ''Bayesian'' approach - degree of belief.  So, Y is not a RV.
+The latter reflects a ''frequentist'' approach - frequency (of observation).
+(Ease of computation and validity were proposed and discussed by students.)
+The question remains unresolved.
+The Bayes Classifier uses  <math>P(Y=1|X=x)</math>
+<math>  P(Y=1|X=x)  =  \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1)  +  P(X=x|Y=0) P(Y=0)}</math>
+P(Y=1) : the prior, based on belief/evidence beforehand
+denominator : marginalized by summation
+<math>h(x)  =
+\begin{cases}
+  \ \ \hat{r}(x) > 1/2 \\
+  \ \ otherwise
+\end{cases}
+</math>
+The set <math>\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \} </math>
+which defines a ''decision boundary''.
+''Theorem'':  Bayes rule is optimal.  I.e., if h is any other classification rule,
+then  <math>L(h^*) <= L(h)</math>
+(This is to be proved in homework.)
+Therefore, Why do we need any other method?
+A:  Because X densities are often/typically unknown.  I.e., <math>f_k(x)</math> and/or <math>\pi_k</math> unknown.
+<math>P(Y=k|X=x)  =  \frac{P(X=x|Y=k) ...} {...}  =  \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}</math>
+f_k(x)  is referred to as the class conditional distribution (~likelihood).
+Therefore, we rely on some data to estimate quantities.
+=== Three Main Approaches ===
+'''1. Empirical Risk Minimization''':
+Choose a set of classifier H (e.g., line, neural network) and find <math>h^* \in H</math>
+that minimizes (some estimate of) L(h).
+'''2. Regression''':
+Find an estimate (<math>\hat{r}</math>) of function <math>r</math> and define
+<math>h(x)  =
+\begin{cases}
+  \ \ \hat{r}(x) > 1/2 \\
+  \ \ otherwise
+\end{cases}
+</math>
+The problem is more difficult, because of restricted domain (discrete label values).
+'''3. Density Estimation''':
+Estimate  <math>P(X=x|Y=0)</math>  from <math>X_i</math>'s for which <math>Y_i = 0</math>
+Estimate  <math>P(X=x|Y=1)</math>  from <math>X_i</math>'s for which <math>Y_i = 1</math>
+and let  <math>P(Y=?) = (1/n) \sum_{i=1}^{n} Y_i</math>
+Define <math>\hat{r}(x) = \hat{P}(Y=1|X=x)</math>  and
+<math>h(x)  =
+\begin{cases}
+  \ \ \hat{r}(x) > 1/2 \\
+  \ \ otherwise
+\end{cases}
+</math>
+Problems?
+not enough data to estimate?  -  possibly
+high error rate?
+big one: not good in high-dimensional space
+As the dimension of the space goes up, the learning requirements go up exponentially.
+=== Multi-Class Classification ===
+Generalize to case Y takes on k>2 values.
+''Theorem'':  <math>Y \in \mathcal{Y} = {1,2,..., k}</math>  optimal rule
+<math>h*(x)  =  argmax_k P</math>
+where  <math>P(Y=k|X=x)  =  \frac{f_k(x) \pi_k} {\sum ...}</math>
+== LDA and QDA ==
+(linear discriminant analysis, and quadratic discriminant analysis)
+Simplest:  Use approach 3 (above) and assume a parametric model for densities.
+Assume class conditional is Gaussian.
+LDA (also known as FDA (Fisher's), which is in fact not really the same thing)
+<math>\mathcal{Y} = \{ 0,1 \}</math>  assumed  (i.e., 2 labels)
+<math>h(x)  =
+\begin{cases}
+  \ \  P(Y=1|X=x) > P(Y=0|X=x) \\
+  \ \ otherwise
+\end{cases}
+</math>
+<math>P(Y=1|X=x)  =  \frac{f_1(x) \pi_1} {\sum_k f_k \pi_k} \ \ </math>      (denom = P(x))
+)  Assume Gaussian distributions
+<math>f_k(x)  =  [(2\pi)^{-d/2} |\Sigma_k|^{-1/2}] exp(-(1/2)(\mathbf{x} - \mathbf{\mu_k}) \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k}) )</math>
+must compare
+<math>\frac{f_1(x) \pi_1} {p(x)}  with  \frac{f_0(x) \pi_0} {p(x)}</math>
+Note that the p(x) denom can be ignored:
+<math>f_1(x) \pi_1</math>  with  <math>f_0(x) \pi_0 </math>
+To find the decision boundary, set
+<math>f_1(x) \pi_1  =  f_0(x) \pi_0 </math>
+Because we are assuming <math>\Sigma_1 = \Sigma_0</math>,  we can use <math>\Sigma = \Sigma_0 = \Sigma_1</math>.
+Cancel  <math>(2\pi)^{-d/2} |\Sigma_k|^{-1/2}</math>  from both sides.
+Take log of both sides.
+Subtract one side from both sides, leaving zero on one side.
+<math>-(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)]  =  0  </math>
+<math>(1/2)[-\mathbf{x}^T \Sigma^{-1} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1}  + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x}
+ + \mathbf{x}^T \Sigma^{-1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0}  - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ]
+ + log(\pi_1/\pi_0)  =  0  </math>
+Which reduces to
+<math>(1/2)[\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1}  +  \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0}
+ +  (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_1}^T \Sigma^{-1}) \mathbf{x}]
+ + log(\pi_1/\pi_0)  =  0  </math>
+And we see that the first pair of terms is constant, and the second pair is linear on x.
+So that we end up with something of the form
+<math>ax + b = 0</math>.
+[[User:Jgpitt|Jgpitt]] - 2011/09/21
 ==[[f11Stat841EditorSignUp| Editor Sign Up]]==

stat841f11: Difference between revisions

Revision as of 16:41, 21 September 2011

STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20

Wiki Course Notes

Classification

definitions

Bayes Classifier

Three Main Approaches

Multi-Class Classification

LDA and QDA

Editor Sign Up

Navigation menu

Search