http://wiki.math.uwaterloo.ca/statwiki/api.php?action=feedcontributions&user=Jgpitt&feedformat=atom statwiki - User contributions [US] 2024-03-28T21:50:30Z User contributions MediaWiki 1.41.0
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=stat841f11&diff=11496 stat841f11 2011-09-21T20:41:54Z<p>Jgpitt: 841 lecture - 2011/09/20 - initial submission</p>
<hr />
<div>= STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20 =<br />
== Wiki Course Notes ==<br />
Students will need to contribute to the wiki for 20% of their grade.<br />
Access via wikicoursenote.com<br />
Go to editor sign-up, and use your UW userid for your account name, and use your UW email.<br />
<br />
primary (10%)<br />
Post a draft of lecture notes within 48 hours. <br />
You will need to do this 1 or 2 times, depending on class size.<br />
<br />
secondary (10%)<br />
Make improvements to the notes for at least 60% of the lectures.<br />
More than half of your contributions should be technical rather than editorial.<br />
There will be a spreadsheet where students can indicate what they've done and when.<br />
The instructor will conduct random spot checks to ensure that students have contributed what they claim.<br />
<br />
<br />
== Classification ==<br />
=== definitions ===<br />
'''classification''': Predict a discrete random variable <math>Y</math> from another random variable <math>X</math>, given an iid sample <math>(X_1, Y_1), \ldots, (X_n, Y_n)</math>, where<br />
<br />
<math>X_i = (X_{i1}, X_{i2}, \ldots, X_{id}) \in \mathcal{X} \subset \mathbb{R}^d</math> (a <math>d</math>-dimensional vector) and<br />
<math>Y_i</math> lies in some finite set <math>\mathcal{Y}</math>.<br />
<br />
<br />
'''classification rule''':<br />
<math>h : \mathcal{X} \rightarrow \mathcal{Y}</math><br />
Take new observation <math>X</math> and make new <math>Y</math> prediction <math>h(X)</math>.<br />
<br />
Example: an image (2-d), which can be represented by a long vector <math>\mathbf{X}</math><br />
<br />
<br />
'''true error rate''' for classifier <math>h</math> <br />
<br />
<math>L(h) = P(h(X) \neq Y)</math><br />
<br />
<br />
'''empirical error rate''' (or training error rate) <br />
<br />
<math>\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} I(h(X_i) \neq Y_i)</math><br />
<br />
where <math>I()</math> indicates misclassification, i.e., <br />
<br />
<math>I = 1 \ \rightarrow \ h()</math> misclassifies this <math>X</math> <br />
<br />
<math>I = 0 \ \rightarrow \ h()</math> does not misclassify this <math>X</math><br />
<br />
e.g., 100 new data points with known (true) labels<br />
<br />
<math>\hat{y}_1 = h(x_1)</math><br />
<br />
...<br />
<br />
<math>\hat{y}_{100} = h(x_{100})</math><br />
<br />
Compare each prediction <math>\hat{y}_i = h(x_i)</math> against the known true label <math>y_i</math>, count the disagreements, and divide by <math>n = 100</math>.<br />
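As a quick sketch (in Python, with made-up data, not from the lecture), the empirical error rate is just the fraction of points on which the classifier disagrees with the true label:

```python
import numpy as np

def empirical_error_rate(h, X, y):
    """Fraction of points where the classifier's prediction differs from the true label."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Toy example: a threshold classifier on 1-d inputs (hypothetical data).
h = lambda x: 1 if x > 0.5 else 0
X = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([0, 1, 1, 1])
print(empirical_error_rate(h, X, y))  # one of four points misclassified -> 0.25
```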
<br />
<br />
=== Bayes Classifier ===<br />
First recall Bayes' Rule, in the format<br />
<math>P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)} </math> <br />
<br />
P(Y|X) : ''posterior''<br />
<br />
P(X|Y) : ''likelihood''<br />
<br />
P(Y) : ''prior''<br />
<br />
P(X) : ''marginal''<br />
<br />
We will start with the simplest case: <math>\mathcal{Y} = \{0,1\}</math><br />
<br />
<math>r(x) = P(Y=1|X=x)</math><br />
<math> = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)}</math><br />
<math> = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
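A small numeric illustration of this formula; the likelihood and prior values below are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical values for illustration (not from the lecture).
likelihood_1 = 0.6   # P(X=x | Y=1)
likelihood_0 = 0.1   # P(X=x | Y=0)
prior_1 = 0.3        # P(Y=1)
prior_0 = 0.7        # P(Y=0)

# r(x) = P(Y=1 | X=x) by Bayes' rule, with the denominator marginalized by summation.
numerator = likelihood_1 * prior_1
denominator = likelihood_1 * prior_1 + likelihood_0 * prior_0
r = numerator / denominator
print(r)  # 0.18 / 0.25 = 0.72
```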
<br />
Which of the following do we prefer to work with,<br />
* P(Y=1|X=x) or <br />
* P(X=x|Y=1) ?<br />
<br />
The former reflects a ''Bayesian'' approach: probability as a degree of belief, so <math>Y</math> need not be random in the frequency sense.<br />
The latter reflects a ''frequentist'' approach: probability as a long-run frequency of observation.<br />
(Students proposed ease of computation and validity as criteria for choosing between the two.)<br />
<br />
The question remains unresolved.<br />
<br />
<br />
The Bayes Classifier uses <math>P(Y=1|X=x)</math><br />
<br />
<math> P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}</math><br />
<br />
P(Y=1) : the prior, based on belief/evidence beforehand<br />
<br />
denominator : marginalized by summation<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The set <math>\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x) \} </math><br />
<br />
which defines a ''decision boundary''.<br />
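A minimal sketch of the Bayes classifier when the class-conditional densities are known. The Gaussian densities, means, and priors here are illustrative assumptions, not part of the lecture; with equal priors and equal variances the decision boundary falls midway between the two means:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classifier(x, pi1=0.5, mu0=0.0, mu1=2.0, sigma=1.0):
    """Return 1 if r(x) = P(Y=1 | X=x) exceeds 1/2, else 0."""
    f1 = gaussian_pdf(x, mu1, sigma)
    f0 = gaussian_pdf(x, mu0, sigma)
    r = f1 * pi1 / (f1 * pi1 + f0 * (1 - pi1))
    return 1 if r > 0.5 else 0

# Boundary sits at x = 1 for these parameters.
print(bayes_classifier(0.2), bayes_classifier(1.8))  # 0 1
```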
<br />
<br />
''Theorem'': The Bayes rule <math>h^*</math> is optimal, i.e., if <math>h</math> is any other classification rule, <br />
then <math>L(h^*) \leq L(h)</math>.<br />
(This is to be proved in homework.)<br />
<br />
So why do we need any other method?<br />
A: Because the densities of <math>X</math> are often/typically unknown, i.e., <math>f_k(x)</math> and/or <math>\pi_k</math> are unknown.<br />
<br />
<math>P(Y=k|X=x) = \frac{P(X=x|Y=k) P(Y=k)} {P(X=x)} = \frac{f_k(x) \pi_k} {\sum_j f_j(x) \pi_j}</math><br />
<math>f_k(x)</math> is referred to as the class-conditional distribution (~likelihood).<br />
<br />
Therefore, we rely on some data to estimate quantities.<br />
<br />
<br />
=== Three Main Approaches ===<br />
<br />
'''1. Empirical Risk Minimization''':<br />
Choose a set of classifiers <math>H</math> (e.g., lines, neural networks) and find the <math>h^* \in H</math><br />
that minimizes (some estimate of) <math>L(h)</math>.<br />
<br />
'''2. Regression''':<br />
Find an estimate (<math>\hat{r}</math>) of function <math>r</math> and define<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
The problem is more difficult than ordinary regression, because the response is restricted to discrete label values.<br />
<br />
'''3. Density Estimation''':<br />
Estimate <math>P(X=x|Y=0)</math> from the <math>X_i</math>'s for which <math>Y_i = 0</math>,<br />
estimate <math>P(X=x|Y=1)</math> from the <math>X_i</math>'s for which <math>Y_i = 1</math>,<br />
and let <math>\hat{P}(Y=1) = (1/n) \sum_{i=1}^{n} Y_i</math>.<br />
<br />
Define <math>\hat{r}(x) = \hat{P}(Y=1|X=x)</math> and<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ \hat{r}(x) > 1/2 \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
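A rough one-dimensional sketch of this density-estimation approach, assuming Gaussian class-conditionals; the synthetic data, parameters, and function names are all illustrative, not from the lecture:

```python
import numpy as np

def fit_density_classifier(X, Y):
    """Approach 3 in one dimension: fit a Gaussian to each class and
    estimate the prior by the sample proportion of Y = 1."""
    X0, X1 = X[Y == 0], X[Y == 1]
    mu0, s0 = X0.mean(), X0.std(ddof=1)
    mu1, s1 = X1.mean(), X1.std(ddof=1)
    pi1 = Y.mean()                      # \hat{P}(Y=1) = (1/n) sum Y_i

    def gaussian(x, mu, s):
        return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    def h(x):
        # \hat{r}(x) = estimated P(Y=1 | X=x); classify 1 when it exceeds 1/2.
        num = gaussian(x, mu1, s1) * pi1
        den = num + gaussian(x, mu0, s0) * (1 - pi1)
        return 1 if num / den > 0.5 else 0

    return h

# Synthetic training data: class 0 around 0, class 1 around 3.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])
Y = np.concatenate([np.zeros(50), np.ones(50)]).astype(int)
h = fit_density_classifier(X, Y)
print(h(-0.5), h(3.5))  # 0 1
```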
<br />
Problems?<br />
<br />
Not enough data to estimate the densities? - possibly<br />
<br />
High error rate?<br />
<br />
The big one: density estimation does not work well in high-dimensional spaces.<br />
<br />
<br />
As the dimension of the space goes up, the amount of data required for learning goes up exponentially (the curse of dimensionality).<br />
<br />
<br />
=== Multi-Class Classification ===<br />
Generalize to the case where <math>Y</math> takes on <math>k > 2</math> values.<br />
<br />
<br />
''Theorem'': For <math>Y \in \mathcal{Y} = \{1, 2, \ldots, k\}</math>, the optimal rule is<br />
<br />
<math>h^*(x) = \underset{k}{\operatorname{argmax}} \ P(Y=k|X=x)</math> <br />
<br />
where <math>P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum_j f_j(x) \pi_j}</math><br />
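Since the denominator is shared across all classes, the argmax can be taken over <math>f_k(x) \pi_k</math> alone. A small sketch with three hypothetical Gaussian classes (means and priors made up for illustration):

```python
import numpy as np

def multiclass_bayes(x, densities, priors):
    """h*(x) = argmax_k f_k(x) * pi_k; the shared denominator sum_j f_j(x) pi_j
    does not change which class attains the maximum."""
    scores = [f(x) * pi for f, pi in zip(densities, priors)]
    return int(np.argmax(scores))

# Three hypothetical unit-variance Gaussian classes with means 0, 2, 4.
def gaussian(mu):
    return lambda x: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

densities = [gaussian(0.0), gaussian(2.0), gaussian(4.0)]
priors = [1/3, 1/3, 1/3]
print(multiclass_bayes(1.9, densities, priors))  # 1 (closest to mean 2)
```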
<br />
<br />
== LDA and QDA ==<br />
(linear discriminant analysis, and quadratic discriminant analysis)<br />
<br />
Simplest: Use approach 3 (above) and assume a parametric model for densities.<br />
Assume each class-conditional density is Gaussian.<br />
<br />
LDA is also known as FDA (Fisher's discriminant analysis), which is in fact not really the same thing.<br />
<br />
<math>\mathcal{Y} = \{ 0,1 \}</math> assumed (i.e., 2 labels)<br />
<br />
<math>h(x) = <br />
\begin{cases} <br />
1 \ \ P(Y=1|X=x) > P(Y=0|X=x) \\<br />
0 \ \ otherwise<br />
\end{cases}<br />
</math><br />
<br />
<math>P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k(x) \pi_k} \ \ </math> (denominator = <math>P(x)</math>)<br />
<br />
1) Assume Gaussian distributions<br />
<br />
<math>f_k(x) = (2\pi)^{-d/2} |\Sigma_k|^{-1/2} \exp\left(-(1/2)(\mathbf{x} - \mathbf{\mu_k})^T \Sigma_k^{-1}(\mathbf{x}-\mathbf{\mu_k})\right)</math><br />
<br />
must compare <br />
<math>\frac{f_1(x) \pi_1} {p(x)}</math> with <math>\frac{f_0(x) \pi_0} {p(x)}</math>.<br />
Note that the <math>p(x)</math> denominator can be ignored, so we compare<br />
<math>f_1(x) \pi_1</math> with <math>f_0(x) \pi_0 </math><br />
<br />
To find the decision boundary, set <br />
<math>f_1(x) \pi_1 = f_0(x) \pi_0 </math><br />
<br />
Because we are assuming <math>\Sigma_1 = \Sigma_0</math> (the LDA assumption), we can write <math>\Sigma = \Sigma_0 = \Sigma_1</math>.<br />
<br />
Cancel <math>(2\pi)^{-d/2} |\Sigma_k|^{-1/2}</math> from both sides.<br />
<br />
Take log of both sides.<br />
<br />
Subtract one side from both sides, leaving zero on one side.<br />
<br />
<br />
<math>-(1/2)(\mathbf{x} - \mathbf{\mu_1})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_1}) + log(\pi_1) - [-(1/2)(\mathbf{x} - \mathbf{\mu_0})^T \Sigma^{-1} (\mathbf{x}-\mathbf{\mu_0}) + log(\pi_0)] = 0 </math><br />
<br />
<br />
<math>(1/2)[-\mathbf{x}^T \Sigma^{-1} \mathbf{x} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x}<br />
+ \mathbf{x}^T \Sigma^{-1} \mathbf{x} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ]<br />
+ log(\pi_1/\pi_0) = 0 </math><br />
<br />
<br />
Which reduces to <br />
<math>(1/2)[\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - \mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1}<br />
+ (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_0}^T \Sigma^{-1}) \mathbf{x}]<br />
+ log(\pi_1/\pi_0) = 0 </math><br />
<br />
<br />
And we see that the first pair of terms is constant, while the second pair is linear in <math>\mathbf{x}</math>.<br />
So we end up with a decision boundary of the form <br />
<math>\mathbf{a}^T \mathbf{x} + b = 0</math>.<br />
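As a numerical check of this linear form, the sketch below computes the boundary coefficients from hypothetical LDA parameters (the means, shared covariance, and priors are made up) by reading them off the log-ratio <math>log(f_1(x)\pi_1) - log(f_0(x)\pi_0)</math>, and verifies that with equal priors the midpoint of the two means lies exactly on the boundary:

```python
import numpy as np

# Hypothetical LDA parameters (shared covariance), for illustration only.
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
pi0, pi1 = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)
# Linear boundary a^T x + b = 0: the x-coefficient and constant term of the
# log-ratio log(f1 * pi1) - log(f0 * pi0), with the quadratic terms cancelled.
a = Sigma_inv @ (mu1 - mu0)
b = 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1) + np.log(pi1 / pi0)

def h(x):
    """Classify 1 when the log-ratio is positive, i.e. f1*pi1 > f0*pi0."""
    return 1 if a @ x + b > 0 else 0

# With pi0 = pi1, the midpoint of the means sits exactly on the boundary.
mid = (mu0 + mu1) / 2
print(round(a @ mid + b, 10))  # 0.0
```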
<br />
[[User:Jgpitt|Jgpitt]] - 2011/09/21<br />
<br />
<br />
==[[f11Stat841EditorSignUp| Editor Sign Up]]==</div>Jgpitt
http://wiki.math.uwaterloo.ca/statwiki/index.php?title=f11Stat841EditorSignUp&diff=11485 f11Stat841EditorSignUp 2011-09-21T14:08:38Z<p>Jgpitt: </p>
<hr />
<div>{| class="wikitable" border="1" cellpadding="3"<br />
|-<br />
|width="100pt"|Date<br />
|width="200pt"|Name (1)<br />
|width="200pt"|Name (2)<br />
|-<br />
|Sep 20 ||Greg Pitt || <br />
|-<br />
|Sep 22 || || <br />
|-<br />
|Sep 29 ||Guoting (Jane) Chang ||<br />
|-<br />
|Oct 4 || Cameron Davidson-Pilon ||<br />
|-<br />
|Oct 13 ||Zhikang Huang ||<br />
|-<br />
|Oct 27 ||Nika Haghtalab || <br />
|-<br />
|Nov 10|| ||<br />
|}</div>Jgpitt