# stat841

## Classification - 2009.9.30

### Classification

With the rise of fields such as data mining, bioinformatics, and machine learning, classification has become a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to learn from data. Potential application areas include handwritten postal code recognition, medical diagnosis, face recognition, human language processing, and so on.

Definition: The problem of predicting a discrete random variable $\mathcal{Y}$ from another random variable $\mathcal{X}$ is called classification.

In classification, we attempt to approximate a function $\,h$ by using a training data set, so that it can accurately classify new data inputs.

Given $\mathcal{X} \subset \mathbb{R}^{d}$, a subset of the $d$-dimensional real vectors, and $\mathcal{Y}$, a finite set of labels, we try to determine a 'classification rule' $\,h$ such that,

$\,h: \mathcal{X} \mapsto \mathcal{Y}$

We use $\,n$ ordered pairs of training data, drawn independently from identical distributions (i.i.d.), $\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}$ where $\,X_{i} \in \mathcal{X}$ and $\,Y_{i} \in \mathcal{Y}$, to approximate $\,h$.

Thus, given a new input, $\,X \in \mathcal{X}$ by using the classification rule we can predict a corresponding $\,\hat{Y}=h(X)$.

Example Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, color, diameter, and weight.
Let $\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})$ and $\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}$. The goal is to find a classification rule such that when a new fruit $\,X$ is presented based on its features, $(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})$, our classification rule $\,h$ can classify it as either an apple or an orange, i.e., $\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})$ is the fruit type of $\,X$.

### Error rate

The true error rate $\,L(h)$ of a classifier with classification rule $\,h$ is defined as the probability that $\,h$ does not correctly classify a new data input, i.e., it is defined as $\,L(h)=P(h(X) \neq Y)$. Here, $\,X \in \mathcal{X}$ and $\,Y \in \mathcal{Y}$ are the known feature values and the true class of that input, respectively.
The empirical error rate (or training error rate) of a classifier having classification rule $\,h$ is defined as the frequency at which $\,h$ does not correctly classify the data inputs in the training set, i.e., it is defined as

$\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})$, where $\,I$ is an indicator variable and $\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.$. Here, $\,X_{i} \in \mathcal{X}$ and $\,Y_{i} \in \mathcal{Y}$ are the known feature values and the true class of the $\,i^{th}$ training input, respectively.
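As a quick illustration, the empirical error rate can be computed directly from this definition. The sketch below is in Python rather than the Matlab used later in these notes, and the rule `h` is a toy stand-in for any trained classifier.

```python
# A minimal sketch of L_hat_n = (1/n) * sum of I(h(x_i) != y_i).
# The classification rule `h` here is a hypothetical stand-in.

def empirical_error_rate(h, X, Y):
    """Fraction of training inputs that the rule h misclassifies."""
    n = len(X)
    return sum(1 for x, y in zip(X, Y) if h(x) != y) / n

# Toy rule: classify a 1-d input as 1 if it is positive, else 0.
h = lambda x: 1 if x > 0 else 0
X = [-2.0, -0.5, 0.3, 1.7]
Y = [0, 1, 1, 1]  # the second point is misclassified

print(empirical_error_rate(h, X, Y))  # 0.25
```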

### Bayes Classifier

The principle of the Bayes classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability.<ref> http://www.wikicoursenote.com/wiki/Stat841f11#Bayes_Classifier </ref>

Intuitively speaking, to classify $\,x\in \mathcal{X}$ we find $y \in \mathcal{Y}$ such that $\,P(Y=y|X=x)$ is maximum over all the members of $\mathcal{Y}$.


Mathematically, for $\,k$ classes and given object $\,X=x$, we find $\,y\in \mathcal{Y}$ which maximizes $\,P(Y=y|X=x)$, and classify $\,X$ into class $\,y$. In order to calculate the value of $\,P(Y=y|X=x)$, we use Bayes formula

\begin{align} P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\ &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)} \end{align}

where $\,P(Y=y|X=x)$ is referred to as the posterior probability, $\,P(Y=y)$ as the prior probability, $\,P(X=x|Y=y)$ as the likelihood, and $\,P(X=x)$ as the evidence.

Consider the special case where $\,Y$ has only two classes, that is, $\, \mathcal{Y}=\{0, 1\}$. Let $\,r(x)=P(Y=1|X=x)$. Given $\,X=x$, by Bayes formula, we have

\begin{align} r(X)&=P(Y=1|X=x) \\ &=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\ &=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)} \end{align}

Definition:

The Bayes classification rule $\,h$ is

$\, h(X)= \left\{\begin{matrix} 1 & r(x)\gt \frac{1}{2} \\ 0 & \mathrm{otherwise} \end{matrix}\right.$

Three different approaches to classification:

1) Empirical Risk Minimization: Choose a set of classifiers $\mathcal{H}$ and find $\,h^*\in \mathcal{H}$ that minimizes some estimate of $\,L(h)$

2) Regression: Find an estimate $(\hat r)$ of the function $r$ and define

$\, h(X)= \left\{\begin{matrix} 1 & \hat r(x)\gt \frac{1}{2} \\ 0 & \mathrm{otherwise} \end{matrix}\right.$

3) Density Estimation: estimate $\,P(X=x|Y=0)$ and $\,P(X=x|Y=1)$ (less popular in high-dimension cases)

Bayes Classification Rule Optimality Theorem: The Bayes rule is optimal in true error rate; that is, for any other classification rule $\, \overline{h}$, we have $\,L(h) \le L(\overline{h})$. Intuitively, this theorem says we cannot do better than classifying $\,x\in \mathcal{X}$ as $\,y$ when the probability that $\,x$ is of type $\,y$ is greater than the probability of it being any other type.

Definition:

The set $\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}$ is called the decision boundary.

$\, h^*(X)= \left\{\begin{matrix} 1 & \text{if } P(Y=1|X=x)\gt P(Y=0|X=x) \\ 0 & \mathrm{otherwise} \end{matrix}\right.$

Remark:

1) The Bayes classification rule is optimal. Proof:[1]

2) We still need other methods, since in practice we usually cannot determine the prior probability.

Example:
Example:
We’re going to predict if a particular student will pass STAT441/841. We have data on past student performance. For each student we know:
• If the student’s GPA > 3.0 (G)
• If the student had a strong math background (M)
• If the student is a hard worker (H)
• If the student passed or failed the course

$\, \mathcal{Y}= \{ 0,1 \}$, where 1 refers to pass and 0 refers to fail. Assume that $\,P(Y=1)=P(Y=0)=0.5$
When a new student comes along with values $\,G=0, M=1, H=0$, we calculate $\,r(X)=P(Y=1|X=(0,1,0))$ as

$\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2\lt \frac{1}{2}$
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.
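The calculation above can be checked with a few lines of Python. The class-conditional probabilities below are assumptions chosen to reproduce the numbers in the example (in practice they would be estimated from the past-performance data): $P(X=(0,1,0)|Y=1)=0.05$ and $P(X=(0,1,0)|Y=0)=0.2$.

```python
# Hedged sketch of the r(X) calculation above; the likelihoods are
# hypothetical values consistent with the worked numbers in the text.

p_y1, p_y0 = 0.5, 0.5          # priors, as assumed in the example
lik_y1, lik_y0 = 0.05, 0.2     # assumed class-conditional probabilities

r = (lik_y1 * p_y1) / (lik_y1 * p_y1 + lik_y0 * p_y0)
print(round(r, 10))            # 0.2
print(1 if r > 0.5 else 0)     # 0 -> predict fail
```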

Notice: Although the Bayes rule is optimal, we still need other methods, since it is generally impossible to know the prior $\,P(Y=1)$ and the class conditional density $\,P(X=x|Y=1)$, and hence to calculate the value of $\,r(X)$; this makes the Bayes rule inconvenient in practice.

Currently, there are four primary classifiers based on the Bayes classifier: the naive Bayes classifier[2], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN), and the general Bayesian network (GBN).

### Bayesian vs. Frequentist

Intuitively, to solve a two-class problem, we may have the following two approaches:

1) If $\,P(Y=1|X=x)\gt P(Y=0|X=x)$, then $\,h(x)=1$, otherwise $\,h(x)=0$.

2) If $\,P(X=x|Y=1)\gt P(X=x|Y=0)$, then $\,h(x)=1$, otherwise $\,h(x)=0$.

One obvious difference between these two methods is that the first one considers probability as changing based on observation, while the second one considers probability as having objective existence. In fact, they represent two different schools of statistics.

Throughout the history of statistics, there have been two major schools of thought: Bayesian and frequentist. They represent two different ways of thinking and hold different views on how to define probability. The following are the main differences between the Bayesian and frequentist approaches.

Frequentist

1. Probability is objective.
2. Data is a repeatable random sample (there is a frequency).
3. Parameters are fixed and unknown constants.
4. Not applicable to a single event. For example, a frequentist cannot predict tomorrow's weather, because tomorrow is a unique event that cannot be referred to a frequency over many samples.

Bayesian

1. Probability is subjective.
2. Data are fixed.
3. Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them.
4. Can be applied to single events based on degree of confidence or beliefs. For example, a Bayesian can predict tomorrow's weather, such as giving a probability of 50% of rain.

Example

Suppose there is a man named Jack. In the Bayesian method, one first sees the man (object) and then judges whether his name is Jack (label). In the frequentist method, on the other hand, one doesn't see the man (object), but judges from photos (label) of this man whether he is Jack.

## Linear and Quadratic Discriminant Analysis - October 2, 2009

### Introduction

#### Notation

Let us first introduce some new notation for the following sections.

Multi-class Classification:

Y takes on more than two values.

Recall that in the discussion of the Bayes Classifier, we introduced Bayes Formula:

\begin{align} P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)} \end{align}

We will use new labels for the following equivalent formula:

\begin{align} P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k} \end{align}
• $\,f_k$ is called the class conditional density; also referred to previously as the likelihood function. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.
• $\,\pi_k$ is called the prior probability. This is a probability distribution that represents what we know (or believe we know) about a population.
• $\,\Sigma_k$ is the sum with respect to all $\,k$ classes.

#### Approaches

Although the Bayes classifier is the optimal method, it cannot be used in most practical situations, since the prior probability is usually unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.

1 Empirical Risk Minimization: Choose a set of classifiers $\mathcal{H}$ and find $\,h^* \in \mathcal{H}$ that minimizes some estimate of $\,L(h)$.

2 Regression: Find an estimate $(\hat r)$ of the function $\ r$ and define

$\, h(X)= \left\{\begin{matrix} 1 & \hat r(x)\gt \frac{1}{2} \\ 0 & \mathrm{otherwise} \end{matrix}\right.$

3 Density estimation, estimate $\ P(X = x|Y = 0)$ and $\ P(X = x|Y = 1)$

Note:
The third approach, in this form, is not popular because density estimation doesn't work very well in more than two dimensions. However, this approach is the simplest, and we can assume a parametric model for the densities. Linear Discriminant Analysis and Quadratic Discriminant Analysis are examples of the third approach, density estimation.

### LDA

#### Motivation

The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.

The simplest way to achieve this is to assume that all the class densities are approximately a multivariate normal distribution, find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.

#### History

The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, revisited in 2000, has shown an 85-90% success rate predicting bankruptcy, and is still in use today.

Purpose

1 feature selection

2 determining which classification rule best separates the classes

#### Definition

To perform LDA we make two assumptions.

• The clusters belonging to all classes each follow a multivariate normal distribution.
For $x \in \mathbb{R}^d$, $f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)$

where $\ f_k(x)$ is a class conditional density

• Simplification assumption: each class has the same covariance matrix, i.e. $\,\Sigma_k = \Sigma$ for all $\,k$.

We wish to solve for the decision boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.

So we solve $\,r_k(x)=r_l(x)$ for all the pairwise combinations of classes.

$\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)$

$\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}$ using Bayes' Theorem

$\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)$ by canceling denominators

$\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l$

$\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l$

$\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l$ Since both $\Sigma$ are equal based on the assumptions specific to LDA.

$\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)$ taking the log of both sides.

$\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0$ by expanding out

$\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0$ after canceling out like terms and factoring.

We can see that this is a linear function in $\ x$ with general form $\,a^\top x+b=0$.

Actually, this linear log function shows that the decision boundary between class $\ k$ and class $\ l$, i.e. $\ P(G=k|X=x)=P(G=l|X=x)$, is linear in $\ x$. Given any pair of classes, decision boundaries are always linear. In $\ d$ dimensions, we separate regions by hyperplanes.

In the special case where the number of samples from each class are equal ($\,\pi_k=\pi_l$), the boundary surface or line lies halfway between $\,\mu_l$ and $\,\mu_k$
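As a sketch of how the resulting classifier is used in practice, the linear discriminant $\,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log(\pi_k)$ can be evaluated for each class and the largest taken. This assumes the parameters are known; the means, covariance, and priors below are made-up toy values (Python rather than the Matlab used later in these notes).

```python
import numpy as np

# Minimal sketch of LDA classification under the shared-covariance assumption:
# delta_k(x) = x' Sigma^{-1} mu_k - 0.5 mu_k' Sigma^{-1} mu_k + log(pi_k)

def lda_classify(x, mus, Sigma, priors):
    Sinv = np.linalg.inv(Sigma)
    deltas = [x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(p)
              for mu, p in zip(mus, priors)]
    return int(np.argmax(deltas))

# Toy parameters: two well-separated class means, identity covariance.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = np.eye(2)
priors = [0.5, 0.5]

print(lda_classify(np.array([0.5, 0.2]), mus, Sigma, priors))  # 0
print(lda_classify(np.array([2.5, 2.9]), mus, Sigma, priors))  # 1
```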

#### Limitation

• LDA implicitly assumes Gaussian distribution of data.
• LDA implicitly assumes that the mean is the discriminating factor, not variance.
• LDA may overfit the data.

### QDA

QDA uses the same idea as LDA of finding a boundary where the classification error rates between classes are equal, except that the assumption that each class has the same covariance matrix $\,\Sigma$ is removed. Whether the covariances are equal can be checked with a hypothesis test of $\ H_0$: $\,\Sigma_k = \Sigma$ for all $\,k$; the standard method is the likelihood ratio test.

Following along from where QDA diverges from LDA.

$\,f_k(x)\pi_k=f_l(x)\pi_l$

$\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l$

$\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l$ by cancellation

$\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)$ by taking the log of both sides

$\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0$ by expanding out

$\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0$ this time there are no cancellations, so we can only factor

The final result is a quadratic equation specifying a curved boundary between classes, with general form $\,x^\top a x+b^\top x+c=0$.

The boundary is quadratic because, with unequal covariances, the $\,x^\top\Sigma_k^{-1}x$ and $\,x^\top\Sigma_l^{-1}x$ terms no longer cancel.

## Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009

Linear discriminant analysis[3] is a statistical method used to find the linear combination of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research.

Quadratic Discriminant Analysis[4], on the other hand, aims to find the quadratic combination of features. It is more general than Linear Discriminant Analysis. Unlike LDA, in QDA there is no assumption that the covariance of each of the classes is identical. When the assumption is true, the best possible test for the hypothesis that a given measurement is from a given class is the likelihood ratio test. Suppose the means of each class are known to be $\mu_{y=0},\mu_{y=1}$ and the covariances $\Sigma_{y=0}, \Sigma_{y=1}$. Then the likelihood ratio is given by

Likelihood ratio = $\frac{ \sqrt{2 \pi |\Sigma_{y=1}|}^{-1} \exp \left( -\frac{1}{2}(x-\mu_{y=1})^T \Sigma_{y=1}^{-1} (x-\mu_{y=1}) \right) }{ \sqrt{2 \pi |\Sigma_{y=0}|}^{-1} \exp \left( -\frac{1}{2}(x-\mu_{y=0})^T \Sigma_{y=0}^{-1} (x-\mu_{y=0}) \right)} \lt t$

for some threshold t. After some rearrangement, it can be shown that the resulting separating surface between the classes is a quadratic.

### Summarizing LDA and QDA

We can summarize what we have learned on LDA and QDA so far into the following theorem.

Theorem:

Suppose that $\,Y \in \{1,\dots,k\}$. If $\,f_k(x) = Pr(X=x|Y=k)$ is Gaussian, the Bayes classifier rule is

$\,h(X) = \arg\max_{k} \delta_k(x)$

where

$\,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k)$ (quadratic)
• Note The decision boundary between classes $k$ and $l$ is quadratic in $x$.

If the covariance of the Gaussians are the same, this becomes

$\,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k)$ (linear)
• Note: $\,\arg\max_{k} \delta_k(x)$ returns the $\,k$ for which $\,\delta_k(x)$ attains its largest value.
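A minimal Python sketch of the theorem's rule using the quadratic form of $\,\delta_k$; the means, covariances, and priors are made-up toy values, not estimates from data.

```python
import numpy as np

# delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)' Sigma_k^{-1} (x-mu_k) + log(pi_k)
# Each class supplies its own covariance, so the rule is quadratic in x.

def qda_delta(x, mu, Sigma, prior):
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.inv(Sigma) @ d
            + np.log(prior))

# Toy parameters for two classes with different covariances.
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), 4 * np.eye(2)]
priors = [0.5, 0.5]

x = np.array([0.3, -0.1])
deltas = [qda_delta(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print(int(np.argmax(deltas)))  # 0
```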

### In practice

In practice we do not know the true values of $\,\pi_k,\mu_k,\Sigma_k$, so we use their sample estimates in place of the true values, i.e.

[Figure: Estimation of the probability of belonging to either class k or l]

$\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}$

$\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i$

$\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top$

In the case where we have a common covariance matrix, the maximum likelihood estimate is

$\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)}$
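These estimates are straightforward to compute. The sketch below does so in Python on a tiny made-up data set, using the divide-by-$n_k$ (maximum likelihood) form for each class covariance and the pooled formula above.

```python
import numpy as np

# pi_hat_k = n_k / n, mu_hat_k = class mean,
# Sigma_hat_k = ML (divide-by-n_k) within-class covariance,
# pooled Sigma = sum_k n_k Sigma_hat_k / sum_k n_k.

X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y = np.array([0, 0, 1, 1])
n = len(y)

pi_hat, mu_hat, Sigma_hat = {}, {}, {}
for k in np.unique(y):
    Xk = X[y == k]
    pi_hat[k] = len(Xk) / n
    mu_hat[k] = Xk.mean(axis=0)
    centered = Xk - mu_hat[k]
    Sigma_hat[k] = centered.T @ centered / len(Xk)   # ML estimate

Sigma_pooled = sum(len(X[y == k]) * Sigma_hat[k] for k in np.unique(y)) / n
print(pi_hat[0], mu_hat[0])
print(Sigma_pooled)
```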

### Computation

Case 1: (Example) $\, \Sigma_k = I$

This means that the data is distributed symmetrically around the center $\mu$, i.e. the isocontours are all circles.

We have:

$\,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k)$

We see that the first term in the above equation, $\,- \frac{1}{2}log(|I|)$, is zero since $\ |I|$ is the determinant and $\ |I|=1$. The second term contains $\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k)$, which is the squared Euclidean distance between $\,x$ and $\,\mu_k$. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, $\,log(\pi_k)$. The class that has the minimum distance will maximize $\,\delta_k$. According to the theorem, we can then classify the point to a specific class $\,k$. In addition, $\, \Sigma_k = I$ implies that our data is spherical.

Case 2: (General Case) $\, \Sigma_k \ne I$

We can decompose this as:

$\, \Sigma_k = USV^\top = USU^\top$ (In general when $\,X=USV^\top$, $\,U$ is the eigenvectors of $\,XX^T$ and $\,V$ is the eigenvectors of $\,X^\top X$. So if $\, X$ is symmetric, we will have $\, U=V$. Here $\, \Sigma$ is symmetric)

and the inverse of $\,\Sigma_k$ is

$\, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top$ (since $\,U$ is orthonormal)

So from the formula for $\,\delta_k$, the second term is

\begin{align} (x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\ & = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\ & = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\ & = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\ & = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\ \end{align}

where we have the Euclidean distance between $\, S^{-\frac{1}{2}}U^\top x$ and $\, S^{-\frac{1}{2}}U^\top\mu_k$.

A transformation of all the data points can be done from $\,x$ to $\,x^*$ where $\, x^* \leftarrow S^{-\frac{1}{2}}U^\top x$.

It is now possible to do classification with $\,x^*$, treating it as in Case 1 above.
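The transformation can be verified numerically: applying $\,x^* = S^{-\frac{1}{2}}U^\top x$ to data with covariance $\,\Sigma$ yields data with identity covariance, so Case 1 applies. The covariance matrix below is a made-up example (Python/numpy rather than the document's Matlab).

```python
import numpy as np

# Sphering transform x* = S^{-1/2} U' x, using the decomposition
# Sigma = U S U' (for a symmetric PSD matrix the SVD gives U = V).

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
U, S, _ = np.linalg.svd(Sigma)
W = np.diag(S ** -0.5) @ U.T          # the transformation S^{-1/2} U'

# After transforming, the covariance becomes the identity:
# W Sigma W' = S^{-1/2} U' (U S U') U S^{-1/2} = I
Sigma_star = W @ Sigma @ W.T
print(np.round(Sigma_star, 10))       # identity matrix
```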

Note that when we have multiple classes, they must all share the same transformation; otherwise we would have to assume ahead of time which class a data point belongs to. All classes therefore need to have the same shape for this method of classification to be applicable. So this method works for LDA.

If the classes have different shapes, in other words different covariances $\,\Sigma_k$, can we use the same method to transform all data points $\,x$ to $\,x^*$?

The answer is NO. Consider that you have two classes with different shapes, then consider transforming them to the same shape. Given a data point, justify which class this point belongs to. The question is, which transformation can you use? For example, if you use the transformation of class A, then you have assumed that this data point belongs to class A.

Kernel QDA: In practice, QDA often fits the data better than LDA, because QDA does not make LDA's assumption that the covariance matrix of each class is identical. However, QDA still assumes that the class conditional distribution is Gaussian, which is often not the case in practice. Another method, kernel QDA, does not make the Gaussian distribution assumption and can work better.

### The Number of Parameters in LDA and QDA

Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.

LDA: Since we just need to compare the differences between one given class and the remaining $\,K-1$ classes, there are $\,K-1$ differences in total. Each difference $\,a^{T}x+b$ requires $\,d+1$ parameters. Therefore, there are $\,(K-1)\times(d+1)$ parameters.

QDA: For each of the differences, $\,x^{T}ax + b^{T}x + c$ requires $\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1$ parameters. Therefore, there are $(K-1)(\frac{d(d+3)}{2}+1)$ parameters.
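A quick sanity check of these counts (per pairwise difference; multiply by $\,K-1$ for $\,K$ classes):

```python
# Parameter counts per pairwise class difference, as functions of dimension d.

def lda_params(d):
    return d + 1                         # a'x + b: d slopes + 1 intercept

def qda_params(d):
    # symmetric d x d matrix + linear term + constant = d(d+3)/2 + 1
    return d * (d + 1) // 2 + d + 1

for d in (2, 10, 100):
    print(d, lda_params(d), qda_params(d))
# QDA grows quadratically in d while LDA grows linearly.
```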

[Figure: the number of parameters that must be estimated, in units of (K-1), against the number of dimensions d. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.]

LDA:[5]

QDA:[7]

## LDA and QDA in Matlab - October 7, 2009

We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called classify that allows us to perform LDA and QDA quickly and easily.

In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.

>> load 2_3;
>> [U, sample] = princomp(X');
>> sample = sample(:,1:2);

First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.
>> plot (sample(1:200,1), sample(1:200,2), '.');
>> hold on;
>> plot (sample(201:400,1), sample(201:400,2), 'r.');

Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.
See title and legend for information on adding the title and legend.
Before using classify we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the group vector should be an empty string or NaN. (See grouping data for more information.)
>> group = ones(400,1);
>> group(201:400) = 2;

We can now classify our data.
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');

The full details of this line can be examined in the Matlab help file linked above. What we care about are class, which contains the labels that the algorithm thinks that each data point belongs to, and coeff, which contains information about the line that algorithm created to separate the data into each class.
We can see the efficacy of the algorithm by comparing class to group.
>> sum (class==group)
ans =
369

This compares the value in class to the value in group. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an empirical error rate of 0.0775.
We can see the line produced by LDA using coeff.
>> k = coeff(1,2).const;
>> l = coeff(1,2).linear;
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);

Those familiar with the programming language C will find the sprintf line refreshingly familiar; those with no exposure to C are directed to Matlab's sprintf page. Essentially, this code sets up the equation of the line in the form 0 = a + bx + cy. We then use the ezplot function to plot the line.
The 2-3 data after LDA is performed. The line shows where the two classes are split.
Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to classify, and a more complicated procedure to plot the line.
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');
>> sum (class==group)
ans =
371
>> k = coeff(1,2).const;
>> l = coeff(1,2).linear;
>> q = coeff(1,2).quadratic;
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l(1), l(2), q(1,1), q(1,2)+q(2,1), q(2,2));
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);

The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is only correct on 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.

classify can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.

Recall: an analysis of the princomp function in Matlab.
In Assignment 1, we learned how to perform Principal Component Analysis using the SVD method. In fact, Matlab offers us a function called princomp which performs PCA conveniently. The Matlab help file on princomp gives the details about this function. Here we will analyze the code of princomp to see how it differs from the SVD method. The following is the code of princomp with explanations of some key steps.

   function [pc, score, latent, tsquare] = princomp(x);
%   PRINCOMP Principal Component Analysis (centered and scaled data).
%   [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and
%   returns the principal components in PC, the so-called Z-scores in SC
%   ORES, the eigenvalues of the covariance matrix of X in LATENT,
%   and Hotelling's T-squared statistic for each data point in TSQUARE.
%   Reference: J. Edward Jackson, A User's Guide to Principal Components
%   John Wiley & Sons, Inc. 1991 pp. 1-25.
%   B. Jones 3-17-94
%   Copyright 1993-2002 The MathWorks, Inc.
%   $Revision: 2.9$  $Date: 2002/01/17 21:31:45$
[m,n] = size(x);       %  get the length of the rows and columns of matrix x.
r = min(m-1,n);        %  max possible rank of X
avg = mean(x);         %  the mean of every column of X
centerx = (x - avg(ones(m,1),:));
%  centers X by subtracting off column means
[U,latent,pc] = svd(centerx./sqrt(m-1),0);
%  "economy size" decomposition
score = centerx*pc;
%  the representation of X in the principal component space
if nargout < 3
return;
end
latent = diag(latent).^2;
if (r < n)
   latent = [latent(1:r); zeros(n-r,1)];
   score(:,r+1:end) = 0;
end
if nargout < 4
return;
end
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';
tsquare = sum(tmp.*tmp)';


From the above code, we should pay attention to the following aspects when comparing with SVD method:

First, rows of $\,X$ correspond to observations and columns to variables. When using princomp on the 2_3 data in Assignment 1, note that we take the transpose of $\,X$.

 >> load 2_3;
>> [U, score] = princomp(X');


Second, princomp centers X by subtracting off column means.

Third, when $\,X=UdV'$, princomp uses $\,V$ as the coefficients for the principal components, rather than $\,U$.

The following is an example to perform PCA using princomp and SVD respectively to get the same results.

SVD method
 >> load 2_3
>> mn=mean(X,2);
>> X1=X-repmat(mn,1,400);
>> [s d v]=svd(X1');
>> y=X1'*v;

princomp
 >>[U score]=princomp(X');


Then we can see that y=score, v=U.
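The same equivalence can be sketched in Python with numpy: the principal component scores computed as $X_c V$ equal $U\,\mathrm{diag}(s)$ (up to floating-point error), where $X_c = U\,\mathrm{diag}(s)V^\top$ is the SVD of the centered data. The random matrix below is just a stand-in for real data.

```python
import numpy as np

# PCA scores via SVD: X_c = U diag(s) V', so X_c @ V = U diag(s).

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Xc = X - X.mean(axis=0)                  # center by subtracting column means

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = Xc @ Vt.T                   # projection onto principal components
scores_us = U * s                        # the same thing, from the SVD factors

print(np.allclose(scores_svd, scores_us))  # True
```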

Useful resources: LDA and QDA in Matlab[8],[9],[10]

## Trick: Using LDA to do QDA - October 7, 2009

There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the Kernel trick that will be discussed later in the course.

Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.

### Motivation

Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.

If we look back at the equations for LDA and QDA, we see that in LDA we must estimate $\,\mu_1$, $\,\mu_2$ and $\,\Sigma$. In QDA we must estimate all of those, plus another $\,\Sigma$; the extra $\,\frac{d(d+1)}{2}$ estimated parameters make QDA less robust with fewer data points.

### Theoretically

Suppose we can estimate some vector $\underline{w}^T$ such that

$y = \underline{w}^Tx$

where $\underline{w}$ is a $d$-dimensional column vector, and $x \in \mathbb{R}^d$ (a vector in $d$ dimensions).

We also have a non-linear function $g(x) = y = x^Tvx + \underline{w}^Tx$ (where $v$ is a diagonal matrix) that we cannot estimate directly with LDA.

Using our trick, we create two new vectors, $\,\underline{w}^*$ and $\,x^*$ such that:

$\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]$

and

$x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]$

We can then estimate a new function, $g^*(x,x^2) = y^* = \underline{w}^{*T}x^*$.

Note that we can do this for any $x$ and in any dimension; we could extend a $d \times n$ matrix to a quadratic dimension by appending another $d \times n$ matrix containing the squares of the original entries, to a cubic dimension with the cubes, or even with a different function altogether, such as a $\,\sin(x)$ dimension.
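The augmentation itself is a one-liner. The sketch below (Python, hypothetical helper names) checks that a linear function of the augmented vector $x^*$ reproduces the quadratic $g(x)$ on the original $x$:

```python
def augment_quadratic(x):
    """Map [x1, ..., xd] to [x1, ..., xd, x1^2, ..., xd^2]."""
    return list(x) + [xi ** 2 for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# A linear function of x* equals w^T x + sum_i v_i x_i^2 on the original x,
# with the augmented weight vector w* = [w_1..w_d, v_1..v_d].
x = [2.0, 3.0]
w, v = [1.0, -1.0], [0.5, 0.25]
linear_on_augmented = dot(w + v, augment_quadratic(x))
quadratic_on_original = dot(w, x) + sum(vi * xi ** 2 for vi, xi in zip(v, x))
```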

### By Example

Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.

>> load 2_3;
>> [U, sample] = princomp(X');
>> sample = sample(:,1:2);

We start off the same way, by using PCA to reduce the dimensionality of our data to 2.
>> X_star = zeros(400,4);
>> X_star(:,1:2) = sample(:,:);
>> for i=1:400
for j=1:2
X_star(i,j+2) = X_star(i,j)^2;
end
end

This projects our sample into two more dimensions by squaring our initial two dimensional data set.
>> group = ones(400,1);
>> group(201:400) = 2;
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');
>> sum (class==group)
ans =
375

We can now display our results.
>> k = coeff(1,2).const;
>> l = coeff(1,2).linear;
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);

The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.
Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with $x^4$ (i.e. we set X_star(i,j+2) = X_star(i,j)^4) we can correctly classify 376 points.

## Introduction to Fisher's Discriminant Analysis - October 7, 2009

Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis (LDA) in some sources, is a classical feature extraction technique. It was originally described in 1936 by Sir Ronald Aylmer Fisher, an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found here; a Wikipedia article summarizing the algorithm can be found here.

LDA is used for classification, while FDA is used for feature extraction.

### Contrasting FDA with PCA

The goal of FDA contrasts with that of our other main feature extraction technique, principal component analysis (PCA).

• In PCA, we map data to lower dimensions to maximize the variation in those dimensions.
• In FDA, we map data to lower dimensions to best separate data in different classes.
2 clouds of data, and the lines that might be produced by PCA and FDA.

Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.

Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.

### Intuitive Description of FDA

An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.

### Example in R

PCA and FDA primary dimension for normal multivariate data, using R.
>> library(MASS)
>> X = matrix(nrow=400,ncol=2)
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))
>> Y = c(rep("red",200),rep("blue",200))

Create 2 multivariate normal random variables with $\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)$. Create Y, an index indicating which class they belong to.
>> s <- svd(X,nu=1,nv=1)

Calculate the singular value decomposition of X. The most significant direction is in s$v[,1], and is displayed as a black line.

>> s2 <- lda(X,grouping=Y)

The lda function (from the MASS package), given the group for each item, uses Fisher's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in s2$scaling.

Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.

>> plot(X,col=Y,main="PCA vs. FDA example")

Plot the set of points, according to colours given in Y.
>> slope = s$v[2]/s$v[1]
>> intercept = mean(X[,2])-slope*mean(X[,1])
>> abline(a=intercept,b=slope)

Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.
>> slope2 = s2$scaling[2]/s2$scaling[1]
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])
>> abline(a=intercept2,b=slope2,col="red")

Plot the FLDA direction, again through the mean.
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)

Adding a legend that labels the lines makes the graph easier to interpret.

### Distance Metric Learning VS FDA

In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.

Some of the proposed algorithms are iterative and computationally expensive. In the paper "Distance Metric Learning VS FDA", written by our instructor, the authors propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs as well as or better than some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.

## Fisher's Discriminant Analysis (FDA) - October 9, 2009

The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space. We can consider two kinds of problems:

• 2-class problem
• multi-class problem
File:graph.jpg
PCA vs FDA

### Two-class problem

In the two-class problem, we have the prior knowledge that the data points belong to two classes. Intuitively speaking, the points of each class form a cloud around the mean of the class, and each class may have a different size. To be able to separate the two classes, we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.

Assume $\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}$ and $\displaystyle\Sigma_{1}$, represent the mean and covariance of the 1st class, and $\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}$ and $\displaystyle\Sigma_{2}$ represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:

1. Make the means of these two classes as far apart as possible.

In other words, the goal is to maximize the distance between the projections of class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected onto a single line; the line we seek is the one whose direction achieves maximum separation of the classes upon projection. If the original points are $\underline{x_{i}} \in \mathbb{R}^{d}$ and the projected points are $\underline{w}^T \underline{x_{i}}$, then the means of the projected points will be $\underline{w}^T \underline{\mu_{1}}$ and $\underline{w}^T \underline{\mu_{2}}$ for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between the projected means, $(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})$. The steps of this maximization are given below.

2. Collapse all data points of each class to a single point, i.e., minimize the covariance within classes.

Notice that the variances of the projected classes 1 and 2 are given by $\underline{w}^T\Sigma_{1}\underline{w}$ and $\underline{w}^T\Sigma_{2}\underline{w}$. The second goal is to minimize the sum of these two variances.

As is demonstrated below, both of these goals can be accomplished simultaneously.

Original points are $\underline{x_{i}} \in \mathbb{R}^{d}$: $\ \{ \underline{x}_1, \underline{x}_2, \dots, \underline{x}_n \}$

Projected points are $\underline{z_{i}} \in \mathbb{R}^{1}$ with $\underline{z_{i}} = \underline{w}^T \underline{x_{i}}$; each $\ z_i$ is a scalar.

#### Between class covariance

In this particular case, we want to project all the data points in one dimensional space.

We want to maximize the Euclidean distance between projected means, which is

\begin{align} (\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w}\,\underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\ &= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \end{align} which is a scalar.

The quantity $(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T$ is called between class covariance or $\,S_{B}$.

The goal is to maximize : $\underline{w}^T S_{B} \underline{w}$
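Since $\,S_{B}$ is a rank-one outer product, the quadratic form $\underline{w}^T S_{B} \underline{w}$ collapses to the squared scalar $(\underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}}))^2$, exactly as derived above. A small numeric check in Python (hypothetical helper names, pure standard library):

```python
def quad_form_rank_one(w, mu1, mu2):
    """Compute w^T S_B w by building S_B = (mu1-mu2)(mu1-mu2)^T explicitly."""
    d = [a - b for a, b in zip(mu1, mu2)]
    sb = [[di * dj for dj in d] for di in d]  # outer product (mu1-mu2)(mu1-mu2)^T
    sb_w = [sum(row[j] * w[j] for j in range(len(w))) for row in sb]
    return sum(wi * si for wi, si in zip(w, sb_w))

def quad_form_collapsed(w, mu1, mu2):
    """Same quantity via the scalar identity (w^T (mu1 - mu2))^2."""
    d = [a - b for a, b in zip(mu1, mu2)]
    return sum(wi * di for wi, di in zip(w, d)) ** 2
```

Both routines return the same number for any $\underline{w}$, $\underline{\mu_{1}}$, $\underline{\mu_{2}}$.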

#### Within class covariance

The covariance of class 1 is $\,\Sigma_{1}$ and the covariance of class 2 is $\,\Sigma_{2}$, so the covariances of the projected points are $\,\underline{w}^T \Sigma_{1} \underline{w}$ and $\underline{w}^T \Sigma_{2} \underline{w}$.

If we sum these two quantities we have

\begin{align} \underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w} \end{align}

The quantity $\,(\Sigma_{1} + \Sigma_{2})$ is called within class covariance or $\,S_{W}$

The goal is to minimize $\underline{w}^T S_{W} \underline{w}$

#### Objective Function

Instead of maximizing $\underline{w}^T S_{B} \underline{w}$ and minimizing $\underline{w}^T S_{W} \underline{w}$ we can define the following objective function:

$\underset{\underline{w}}{\max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}$

This maximization problem is equivalent to maximizing $\underline{w}^T S_{B} \underline{w}$ subject to the constraint $\underline{w}^T S_{W} \underline{w} = 1$. The constraint is needed because the ratio is invariant to rescaling of $\underline{w}$: on its own, $\ \underline w^T S_B \underline w$ has no upper bound and $\ \underline w^T S_W \underline w$ has no lower bound.

We can use the Lagrange multiplier method to solve it:

$L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)$ where $\ \lambda$ is the Lagrange multiplier

With $\frac{\part L}{\part \underline{w}} = 0$ we get:

\begin{align} &\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\ &\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\ &\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w} \end{align}

Note that $\, S_{W}=\Sigma_1+\Sigma_2$ is the sum of two positive definite matrices, and so it has an inverse.

Here $\underline{w}$ is the eigenvector of $S_{w}^{-1}\ S_{B}$ corresponding to the largest eigenvalue $\ \lambda$.

In fact, this expression can be simplified even further.

$\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}$ with $S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T$
$\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}$

The quantities $(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}$ and $\lambda$ are scalars, so $\underline{w}$ is proportional to $S_{W}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})$.
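The closed form $\underline{w} \propto S_{W}^{-1}(\underline{\mu_{1}}-\underline{\mu_{2}})$ is easy to compute directly. Below is a minimal two-dimensional sketch in Python (pure standard library; the helper name fda_direction is ours, not from any course code):

```python
def fda_direction(class1, class2):
    """Fisher direction w = Sw^{-1} (mu1 - mu2) for 2-D points (up to scale)."""
    def mean(pts):
        n = len(pts)
        return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]

    def scatter(pts, mu):
        # 2x2 covariance-style scatter matrix (normalized by n)
        n = len(pts)
        s = [[0.0, 0.0], [0.0, 0.0]]
        for p in pts:
            d = [p[0] - mu[0], p[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j] / n
        return s

    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]      # invert the 2x2 Sw
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]
```

For two axis-aligned clouds that differ only in their x-means, the returned direction points along the x-axis, as expected.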

#### FDA vs. PCA Example in Matlab

We can compare PCA and FDA through a figure produced by Matlab.

The following code produces the figure step by step, with an explanation of each step.

 >>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);
>>X=[X1;X2];

Create two multivariate normal random variables with $\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)$.
 >>plot(X(1:300,1),X(1:300,2),'.');
>>hold on
>>p1=plot(X(301:600,1),X(301:600,2),'r.');

Plot the data of the two classes respectively.
 >>[U Y]=princomp(X);
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);

Using PCA to find the principal component and plot it.
 >>sw=2*[1 1.5;1.5 3];
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';
>>g =inv(sw)*sb;
>>[v w]=eigs(g);
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')

Using FDA to find the most discriminant direction and plot it.

Now we can compare them through the figure.

PCA and FDA primary dimension for normal multivariate data, using Matlab

From the graph: when we project using the PCA direction, the two classes overlap heavily, so PCA performs poorly. When we project using the FDA direction, however, the two classes do not overlap and are separated well. Thus, FDA is better than PCA here.

#### Practical example of 2_3

In this Matlab example we explore FDA using our familiar data set 2_3, which consists of 200 handwritten "2"s and 200 handwritten "3"s.

X is a matrix of size 64*400 and each column represents an 8*8 image of a "2" or a "3". Here X1 gets all the "2"s and X2 gets all the "3"s.

>>load 2_3
>>X1 = X(:, 1:200);
>>X2 = X(:, 201:400);


Next we calculate within class covariance and between class covariance as before.

>>mu1 = mean(X1, 2);
>>mu2 = mean(X2, 2);
>>sb = (mu1 - mu2) * (mu1 - mu2)';
>>sw = cov(X1') + cov(X2');


We use the first two eigenvectors to project the data into a two-dimensional space.

>>[v d] = eigs( inv(sw) * sb );
>>w = v(:, 1:2);
>>X_hat = w'*X;


Finally we plot the data and visualize the effect of FDA.

>> scatter(ones(1,200),X_hat(1:200))
>> hold on
>> scatter(ones(1,200),X_hat(201:400),'r')

File:fda2-3.jpg
FDA projection of data 2_3, using Matlab.

The data is mapped onto a line, and the two classes are separated perfectly here.

#### An extension of Fisher's discriminant analysis for stochastic processes

A general notion of Fisher's linear discriminant analysis can extend the classical multivariate concept to situations that allow for function-valued random elements. The development uses a bijective mapping that connects a second order process to the reproducing kernel Hilbert space generated by its within class covariance kernel. This approach provides a seamless transition between Fisher's original development and infinite dimensional settings that lends itself well to computation via smoothing and regularization.

## FDA for Multi-class Problems - October 14, 2009

### FDA method for Multi-class Problems

For the $k$-class problem, we need to find a projection from $d$-dimensional space to a $(k-1)$-dimensional space.

(It is more reasonable to have at least 2 directions)

Basically, the within class covariance matrix $\mathbf{S}_{W}$ is easy to obtain:

\begin{align} \mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i} \end{align}

where $\mathbf{S}_{W,i} = \sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}$ and $\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}$. (The scatter matrices are left un-normalized here so that the decomposition below holds exactly.)

However, the between class covariance matrix $\mathbf{S}_{B}$ is not as easy to obtain. One simplification is to note that the total covariance $\mathbf{S}_{T}$ of the data is constant; since $\mathbf{S}_{W}$ is easy to compute, we can get $\mathbf{S}_{B}$ using the following relationship:

\begin{align} \mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W} \end{align}

Actually, there is another, more general derivation of $\mathbf{S}_{B}$. Denote the total mean vector $\mathbf{\mu}$ by

\begin{align} \mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} = \frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j} \end{align}

Thus the total covariance matrix $\mathbf{S}_{T}$ is

\begin{align} \mathbf{S}_{T} = \sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T} \end{align}

Thus we obtain

\begin{align} & \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} - \mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} \\& = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+ \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T} \\& = \mathbf{S}_{W} + \sum_{i=1}^{k} n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T} \end{align}

where the cross terms vanish because $\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})=0$ for each class.

Since the total covariance $\mathbf{S}_{T}$ is the sum of the within class covariance $\mathbf{S}_{W}$ and the between class covariance $\mathbf{S}_{B}$, we can denote the second term as the general between class covariance matrix $\mathbf{S}_{B}$, thus we obtain

\begin{align} \mathbf{S}_{B} = \sum_{i=1}^{k} n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T} \end{align}

Therefore,

\begin{align} \mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B} \end{align}
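For scalar data, the decomposition $\mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B}$ can be verified in a few lines. A 1-D sketch in Python (hypothetical helper name; scatter sums are left un-normalized, matching the derivation above):

```python
def scatter_decomposition(classes):
    """Return (S_T, S_W, S_B) for 1-D data, given a list of per-class samples."""
    all_x = [x for c in classes for x in c]
    n = len(all_x)
    mu = sum(all_x) / n                              # total mean
    s_t = sum((x - mu) ** 2 for x in all_x)          # total scatter
    s_w = sum(sum((x - sum(c) / len(c)) ** 2 for x in c) for c in classes)
    s_b = sum(len(c) * (sum(c) / len(c) - mu) ** 2 for c in classes)
    return s_t, s_w, s_b
```

For any grouping of the samples into classes, the returned values satisfy S_T = S_W + S_B exactly.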

Recall that in the two-class problem we had

\begin{align} \mathbf{S}_{B^{\ast}} = (\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T} \end{align}

With the total mean $\mathbf{\mu} = \frac{n_{1}\mathbf{\mu}_{1}+n_{2}\mathbf{\mu}_{2}}{n}$, we have $\mathbf{\mu}_{1}-\mathbf{\mu} = \frac{n_{2}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})$ and $\mathbf{\mu}_{2}-\mathbf{\mu} = -\frac{n_{1}}{n}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})$, so the general form gives

\begin{align} \mathbf{S}_{B} & = n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T} + n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T} \\ & = \frac{n_{1}n_{2}^{2}+n_{2}n_{1}^{2}}{n^{2}}(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T} \\ & = \frac{n_{1}n_{2}}{n}\,\mathbf{S}_{B^{\ast}} \end{align}

Thus the general between class covariance is proportional to the two-class version, so both definitions lead to the same discriminant direction.

Now, we are trying to find the optimal transformation. Basically, we have

\begin{align} \mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i}, \quad i=1,2,...,n \end{align}

where $\mathbf{z}_{i}$ is a $(k-1)\times 1$ vector, $\mathbf{W}$ is a $d\times (k-1)$ transformation matrix, i.e. $\mathbf{W} = [\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]$, and $\mathbf{x}_{i}$ is a $d\times 1$ column vector.

Thus we obtain

\begin{align} & \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T} \\ & = \sum_{i=1}^{k}\sum_{j: y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\mathbf{W} \\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}\right]\mathbf{W} \\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} \end{align}

Similarly, we obtain

\begin{align} & \mathbf{S}_{B}^{\ast} = \sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T} \\ & = \sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W} \\ & = \mathbf{W}^{T}\left[ \sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W} \\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W} \end{align}

Now, we use the determinant of the matrix, i.e. the product of the eigenvalues of the matrix, as our measure.

\begin{align} \phi(\mathbf{W}) = \frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} = \frac{|\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}|}{|\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}|} \end{align}

The solution to this problem is that the columns of the transformation matrix $\mathbf{W}$ are exactly the eigenvectors corresponding to the largest $k-1$ eigenvalues of

\begin{align} \mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} = \lambda_{i}\mathbf{w}_{i} \end{align}

Also, note that we can use

\begin{align} \sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2} \end{align}

as our measure.

Recall that

\begin{align} \|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X}) \end{align}

Thus we obtain that

\begin{align} & \sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2} \\ & = \sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}] \\ & = Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}] \\ & = Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}] \\ & = Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}] \\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] \end{align}

Similarly, we can get $Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]$. Thus we have following criterion function

\begin{align} \phi(\mathbf{W}) = \frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]} \end{align}

Similar to the two class case problem, we have:

max $Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]$ subject to $Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1$

To solve this optimization problem a Lagrange multiplier $\Lambda$, which actually is a $d \times d$ diagonal matrix, is introduced:

\begin{align} L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\} \end{align}

Differentiating with respect to $\mathbf{W}$ we obtain:

\begin{align} \frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W} \end{align}

Note that $\mathbf{S}_{B}$ and $\mathbf{S}_{W}$ are both symmetric matrices; thus, setting the first derivative to zero, we obtain:

\begin{align} \mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0 \end{align}

Thus,

\begin{align} \mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W} \end{align}

where

$\mathbf{\Lambda} = \begin{pmatrix} \lambda_{1} & & 0\\ &\ddots&\\ 0 & &\lambda_{d} \end{pmatrix}$

and $\mathbf{W} = [\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]$.

As a matter of fact, $\mathbf{\Lambda}$ has at most $\,k-1$ nonzero eigenvalues, because $rank(\mathbf{S}_{W}^{-1}\mathbf{S}_{B}) \leq k-1$.

Therefore, the solution is the same as in the previous case. The columns of the transformation matrix $\mathbf{W}$ are exactly the eigenvectors corresponding to the largest $k-1$ eigenvalues of

\begin{align} \mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} = \lambda_{i}\mathbf{w}_{i} \end{align}

### Generalization of Fisher's Linear Discriminant

Fisher's linear discriminant (Fisher, 1936) is very popular among users of discriminant analysis. Some of the reasons for this are its simplicity and the fact that it does not require strict assumptions. However, it has optimality properties only if the underlying distributions of the groups are multivariate normal. It is also easy to verify that the discriminant rule obtained can be severely harmed by even a small number of outlying observations. Outliers are very hard to detect in multivariate data sets, and even when they are detected, simply discarding them is not the most efficient way of handling the situation. Hence the need for robust procedures that can accommodate outliers without being strongly affected by them. To this end, a generalization of Fisher's linear discriminant algorithm [[12]] has been developed that leads easily to a very robust procedure.

## Linear Regression Models - October 14, 2009

Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, $\,y$, changes according to changes in $\,X$.

We will start by considering a very simple regression model, the linear regression model.

General information on linear regression can be found at the University of South Florida and this MIT lecture.

For the purpose of classification, the linear regression model assumes that the regression function $\,E(Y|X)$ is linear in the inputs $\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}$.

The simple linear regression model has the general form:

\begin{align} f(\mathbf{x}_{i}) = \beta^{T}\mathbf{x}_{i}+\beta_{0} \end{align}

where $\,\beta$ is a $d \times 1$ vector of coefficients and $\ \mathbf{x}_i$ is a $d \times 1$ vector of inputs.

Given input data $\,\mathbf{x}_{1}, ..., \mathbf{x}_{n}$ and $\,y_{1}, ..., y_{n}$, our goal is to find $\,\beta$ and $\,\beta_0$ such that the linear model fits the data while minimizing the sum of squared errors using the Least Squares method.

Note that vectors $\mathbf{x}_{i}$ could be numerical inputs, transformations of the original data, i.e. $\log \mathbf{x}_{i}$ or $\sin \mathbf{x}_{i}$, or basis expansions, i.e. $\mathbf{x}_{i}^{2}$ or $\mathbf{x}_{i}\times \mathbf{x}_{j}$.

Denote $\mathbf{X}$ as a $n\times(d+1)$ matrix with each row an input vector (with 1 in the first position), $\,\beta = (\beta_0, \beta_1,..., \beta_{d})^{T}$ and $\mathbf{y}$ as a $n \times 1$ vector of outputs. We then try to minimize the residual sum-of-squares

\begin{align} \mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)^{T}(\mathbf{y}-\mathbf{X}\beta) \end{align}

This is a quadratic function in the $\,d+1$ parameters. Differentiating with respect to $\,\beta$ we obtain

\begin{align} \frac{\partial \mathrm{RSS}}{\partial \beta} = -2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta) \end{align}
\begin{align} \frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial \beta^{T}}=2\mathbf{X}^{T}\mathbf{X} \end{align}

Setting the first derivative to zero,

\begin{align} \mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0 \end{align}

we obtain the solution

\begin{align} \hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \end{align}

Thus the fitted values at the inputs are

\begin{align} \mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X} (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \end{align}

where $\mathbf{H} = \mathbf{X} (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$ is called the hat matrix.
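For a single input, the normal equations $\hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$ reduce to the familiar slope and intercept formulas. A minimal Python sketch (hypothetical helper name, standard library only):

```python
def least_squares(x, y):
    """Closed-form (beta0, beta1) minimizing sum (y_i - beta0 - beta1*x_i)^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    beta0 = my - beta1 * mx          # the fitted line passes through (mean x, mean y)
    return beta0, beta1
```

On data generated exactly by a line, the estimates recover that line; on noisy data they give the least-squares fit.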

• Note For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:

$r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\sum_{k}f_{k}(x)\pi_{k}}$
It is clear that to make sense mathematically, $\displaystyle r(x)$ must be a value between 0 and 1. If this is estimated with the regression function $\displaystyle r(x)=E(Y|X=x)$ and $\mathbf{\hat\beta}$ is learned as above, then nothing restricts $\displaystyle r(x)$ to values between 0 and 1. Regression is nevertheless a more direct approach to classification, since it does not need to estimate $\ f_k(x)$ and $\ \pi_k$: with labels coded as 0 and 1, $\ 1 \times P(Y=1|X=x)+0 \times P(Y=0|X=x)=E(Y|X=x)$. Although this model does not confine the fitted values to [0, 1], thresholding them can still sometimes lead to a decent classifier. (An alternative label coding sometimes used is $\ y_i=\frac{1}{n_1}$ for one class and $\ y_i=\frac{-1}{n_2}$ for the other.)

#### A linear regression example in Matlab

We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.

Again, we use the data in 2_3.m.

 >>load 2_3;
>>[U, sample] = princomp(X');
>>sample = sample(:,1:2);


We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.

 >>y = zeros(400,1);
>>y(201:400) = 1;


We let y represent the set of labels coded as 0 and 1.

 >>x=[sample';ones(1,400)];


Construct x by appending a row of ones to the transposed data; x is now a 3*400 matrix.

 >>b=inv(x*x')*x*y;


Calculate b, which represents $\beta$ in the linear regression model.

 >>x1=x';
>>for i=1:400
if x1(i,:)*b>0.5
plot(x1(i,1),x1(i,2),'.')
hold on
elseif x1(i,:)*b < 0.5
plot(x1(i,1),x1(i,2),'r.')
end
end


Plot the data points, coloured according to the class predicted by the fitted values (threshold 0.5).

File:linearregression.png
The figure shows the classification of the data points in 2_3.m by the linear regression model.

The linear regression model is perhaps the simplest and most popular way to analyze the relationship between data sets. However, it has disadvantages as well as advantages, and we should be clear about both before we apply the model.

Advantages: Linear least squares regression has earned its place as the primary tool for process modeling because of its effectiveness and completeness. Though there are types of data that are better described by functions that are nonlinear in the parameters, many processes in science and engineering are well-described by linear models. This is because either the processes are inherently linear or because, over short ranges, any process can be well-approximated by a linear model. The estimates of the unknown parameters obtained from linear least squares regression are the optimal estimates from a broad class of possible parameter estimates under the usual assumptions used for process modeling. Practically speaking, linear least squares regression makes very efficient use of the data. Good results can be obtained with relatively small data sets. Finally, the theory associated with linear regression is well-understood and allows for construction of different types of easily-interpretable statistical intervals for predictions, calibrations, and optimizations. These statistical intervals can then be used to give clear answers to scientific and engineering questions.

Disadvantages: The main disadvantages of linear least squares are limitations in the shapes that linear models can assume over long ranges, possibly poor extrapolation properties, and sensitivity to outliers. Linear models with nonlinear terms in the predictor variables curve relatively slowly, so for inherently nonlinear processes it becomes increasingly difficult to find a linear model that fits the data well as the range of the data increases. As the explanatory variables become extreme, the output of the linear model will also always be more extreme. This means that linear models may not be effective for extrapolating the results of a process for which data cannot be collected in the region of interest. Of course, extrapolation is potentially dangerous regardless of the model type. Finally, while the method of least squares often gives optimal estimates of the unknown parameters, it is very sensitive to the presence of unusual data points in the data used to fit a model. One or two outliers can sometimes seriously skew the results of a least squares analysis. This makes model validation, especially with respect to outliers, critical to obtaining sound answers to the questions motivating the construction of the model.

## Logistic Regression- October 16, 2009

The logistic regression model arises from the desire to model the posterior probabilities of the $\displaystyle K$ classes via linear functions in $\displaystyle x$, while at the same time ensuring that they sum to one and remain in [0,1]. Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood $\displaystyle Pr(Y|X)$. Since $\displaystyle Pr(Y|X)$ completely specifies the conditional distribution, the multinomial distribution is appropriate. This model is widely used in biostatistical applications with two classes; for instance: people survive or die, have a disease or not, have a risk factor or not.

### Logistic Function

A logistic function or logistic curve is the most common sigmoid curve.

$y = \frac{1}{1+e^{-x}}$

1. $\frac{dy}{dx} = y(1-y)=\frac{e^{x}}{(1+e^{x})^{2}}$

2. $y(0) = \frac{1}{2}$

3. $\int y \,dx = \ln(1 + e^{x}) + C$

4. $y(x) = \frac{1}{2} + \frac{1}{4}x - \frac{1}{48}x^{3} + \frac{1}{480}x^{5} - \cdots$

5. The logistic curve shows early exponential growth for negative $x$, which slows to linear growth of slope 1/4 near $x = 0$, then approaches $y = 1$ with an exponentially decaying gap.
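Properties 1, 2, and 5 can be checked numerically. The sketch below (plain Python; all names are made up for this illustration) compares the analytic derivative $y(1-y)$ with a finite-difference approximation and confirms the slope of 1/4 at the origin:

```python
import math

def logistic(x):
    """Logistic (sigmoid) function y = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Property 1: dy/dx = y(1 - y), checked against a central finite difference.
x, h = 0.7, 1e-6
numeric = (logistic(x + h) - logistic(x - h)) / (2 * h)
analytic = logistic(x) * (1 - logistic(x))
assert abs(numeric - analytic) < 1e-8

# Property 2: y(0) = 1/2.
assert logistic(0.0) == 0.5

# Property 5: the slope at x = 0 is y(0) * (1 - y(0)) = 1/4.
slope_at_zero = logistic(0.0) * (1 - logistic(0.0))
assert abs(slope_at_zero - 0.25) < 1e-12
```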

### Intuition behind Logistic Regression

Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force $\,r(x)$ to be between 0 and 1 and sum to 1. Consider the following log odds model (for two classes):

$\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx$

Calculating $\,P(Y=1|X=x)$ leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of Generalized Linear Model (GLM).

### The Logistic Regression Model

The logistic regression model for the two class case is defined as

Class 1

$P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})$

Then we have that

Class 0

$P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}$

### Fitting a Logistic Regression

Logistic regression fits a distribution to the data. The fitting of logistic regression models is usually accomplished by maximum likelihood, using $Pr(Y|X)$. The maximum likelihood estimate of $\underline{\beta}$ maximizes the probability of obtaining the data $\displaystyle{x_{1},...,x_{n}}$ from the assumed distribution. Combining $\displaystyle P(Y=1 | X=x)$ and $\displaystyle P(Y=0 | X=x)$ as follows, we can consider the two classes at the same time:

$p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}$

Assuming the data $\displaystyle {x_{1},...,x_{n}}$ is drawn independently, the likelihood function is

\begin{align} \mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\ &=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\ &= \prod_{i=1}^n p(x_{i};\theta) \end{align}

Since it is more convenient to work with the log-likelihood function, take the log of both sides, we get

$\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)$

So,

\begin{align} l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\ &= \displaystyle\sum_{i=1}^n y_{i}\left(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)+(1-y_{i})\left(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\right)\\ &= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\ &=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\ \end{align}

To maximize the log-likelihood, set its derivative to 0.

\begin{align} \frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\ &=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right] \end{align}

These are $d+1$ nonlinear equations in $\underline{\beta}$. If the first component of each $\underline{x}_i$ is 1 (an intercept term), the first equation gives $\ \sum_{i=1}^n {y_i} =\sum_{i=1}^n p(\underline{x}_i;\underline{\beta})$, i.e. the expected number of class ones matches the observed number.

To solve this equation, the Newton-Raphson algorithm is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.
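As a concrete illustration, here is a minimal NumPy sketch of the log-likelihood $l(\underline{\beta})$ and its first derivative from the expressions above (the tiny data set and all variable names are made up for this example):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """l(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ]."""
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

def gradient(beta, X, y):
    """dl/dbeta = sum_i [ y_i - p(x_i; beta) ] x_i."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p)

# Made-up data: the first column of ones provides an intercept term.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta = np.zeros(2)

# At beta = 0 every p(x_i; beta) = 1/2, so the gradient is X^T (y - 1/2).
g = gradient(beta, X, y)
```

At the maximum likelihood estimate this gradient would be the zero vector; at $\underline{\beta}=0$ it generally is not, which is what the Newton-Raphson iteration corrects.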

Logistic regression has several advantages over discriminant analysis:

• It is more robust: the independent variables don't have to be normally distributed or have equal variance in each group.
• It does not assume a linear relationship between the independent variables (IVs) and the dependent variable (DV).
• It may handle nonlinear effects.
• Explicit interaction and power terms can be added.
• The DV need not be normally distributed.
• There is no homogeneity-of-variance assumption.
• Normally distributed error terms are not assumed.
• It does not require that the independents be interval.
• It does not require that the independents be unbounded.

With all this flexibility, you might wonder why anyone would ever use discriminant analysis or any other method of analysis. Unfortunately, the advantages of logistic regression come at a cost: it requires much more data to achieve stable, meaningful results. With standard regression and discriminant analysis, typically 20 data points per predictor is considered the lower bound. For logistic regression, at least 50 data points per predictor are necessary to achieve stable results.

some resources: [15], [16]

#### Extension

• When we are dealing with a problem with more than two classes, we need to generalize our logistic regression to a Multinomial Logit model.
• Limitations of Logistic Regression:
1. No assumptions are made about the distributions of the features of the data (i.e. the explanatory variables). However, the features should not be highly correlated with one another, because this could cause problems with estimation.
2. A large number of data points (i.e. a large sample size) is required for logistic regression to provide sufficient numbers in both classes. The more features/dimensions the data has, the larger the sample size required.

## Logistic Regression(2) - October 19, 2009

### Logistic Regression Model

Recall that in the last lecture, we learned the logistic regression model.

• $P(Y=1 | X=x)=P(\underline{x};\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}$
• $P(Y=0 | X=x)=1-P(\underline{x};\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}$

### Find $\underline{\beta}$

Criteria: find a $\underline{\beta}$ that maximizes the conditional likelihood of Y given X using the training data.

From above, we have the first derivative of the log-likelihood:

$\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right] =\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]$

Newton-Raphson algorithm:
If we want to find $\,x^*$ such that $\,f(x^*)=0$:

$\,x^{new} \leftarrow x^{old}-\frac{f(x^{old})}{f'(x^{old})}$

$\,x^{new} \rightarrow x^*$

If we want to maximize or minimize $\,f(x)$, then solve $\,f'(x)=0$:

$\,x^{new} \leftarrow x^{old}-\frac{f'(x^{old})}{f''(x^{old})}$
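A minimal one-dimensional sketch of this update (the function, starting point, and helper name are chosen arbitrarily for illustration):

```python
def newton_maximize(df, d2f, x0, tol=1e-10, max_iter=50):
    """Newton-Raphson for an optimum: x_new <- x_old - f'(x_old) / f''(x_old)."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:  # stop once the update is negligible
            break
    return x

# Maximize f(x) = -(x - 2)^2, so f'(x) = -2(x - 2) and f''(x) = -2.
# For a quadratic, Newton-Raphson lands on the optimum in one step.
x_star = newton_maximize(lambda x: -2 * (x - 2), lambda x: -2.0, x0=0.0)
```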

The Newton-Raphson algorithm requires the second-derivative or Hessian matrix.

$\frac{\partial^{2} l}{\partial \underline{\beta}\, \partial \underline{\beta}^T }= -\sum_{i=1}^n \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\, \underline{x}_i^T\,(1+\exp(\underline{\beta}^T \underline{x}_i))-\exp(\underline{\beta}^T\underline{x}_i)\exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}$

(Note: $\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T$; the Matrix Reference Manual is a very useful resource for properties of real and complex matrices and other linear algebra identities.)

$=-\sum_{i=1}^n \underline{x}_i \frac{\exp(\underline{\beta}^T\underline{x}_i)\,\underline{x}_i^T}{(1+\exp(\underline{\beta}^T \underline{x}_i))^2}$ (by cancellation)
$=-\sum_{i=1}^n \underline{x}_i \underline{x}_i^T\, P(\underline{x}_i;\underline{\beta})[1-P(\underline{x}_i;\underline{\beta})]$ (since $P(\underline{x}_i;\underline{\beta})=\frac{\exp(\underline{\beta}^T \underline{x}_i)}{1+\exp(\underline{\beta}^T \underline{x}_i)}$ and $1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}$)

The same second derivative can be obtained by first reducing the occurrences of $\underline{\beta}$ to one using the identity $\frac{a}{1+a}=1-\frac{1}{1+a}$, and then differentiating

$\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+\exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right]$

Starting with $\,\underline{\beta}^{old}$, the Newton-Raphson update is

$\,\underline{\beta}^{new}\leftarrow \underline{\beta}^{old}- \left(\frac{\partial ^2 l}{\partial \underline{\beta}\,\partial \underline{\beta}^T}\right)^{-1}\frac{\partial l}{\partial \underline{\beta}}$ where the derivatives are evaluated at $\,\underline{\beta}^{old}$.
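Putting the gradient and Hessian together, the full update can be sketched as follows (a NumPy sketch with a toy data set; the function name and data are invented for this illustration, not the course's code):

```python
import numpy as np

def logistic_newton(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson on the log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # P(x_i; beta)
        grad = X.T @ (y - p)                       # first derivative
        w = p * (1.0 - p)                          # P(x_i)(1 - P(x_i)) weights
        hess = -(X * w[:, None]).T @ X             # second derivative
        beta = beta - np.linalg.solve(hess, grad)  # beta_new <- beta_old - H^{-1} g
    return beta

# Toy non-separable data set; first column of ones is the intercept.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, -0.5],
              [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat = logistic_newton(X, y)
```

At convergence the gradient vanishes, so (as noted earlier) the fitted probabilities sum to the observed number of class ones.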