# stat841

## Classfication-2009.9.30

### Classification

With the rise of fields such as data-mining, bioinformatics, and machine learning, classification has becomes a fast-developing topic. In the age of information, vast amounts of data are generated constantly, and the goal of classification is to learn from data. Potential application areas include handwritten post codes recognition, medical diagnosis, face recognition, human language processing and so on.

Definition: The problem of Prediction a discrete random variable $\mathcal{Y}$ from another random variable $\mathcal{X}$ is called Classification.

In classification,, we attempt to approximate a function $\,h$, by using a training data set, which will then be able to accurately classify new data inputs.

Given $\mathcal{X} \subset \mathbb{R}^{d}$, a subset of the $d$-dimensional real vectors and $\mathcal{Y}$, a finite set of labels, We try to determine a 'classification rule' $\,h$ such that,

$\,h: \mathcal{X} \mapsto \mathcal{Y}$

We use $\,n$ ordered pairs of training data which are identical independent distributions, $\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}$ where $\,X_{i} \in \mathcal{X}$,$\,Y_{i} \in \mathcal{Y}$, to approximate $\,h$.

Thus, given a new input, $\,X \in \mathcal{X}$ by using the classification rule we can predict a corresponding $\,\hat{Y}=h(X)$.

Example Suppose we wish to classify fruits into apples and oranges by considering certain features of the fruit, for instance, color, diameter, and weight.
Let $\mathcal{X}= (\mathrm{colour}, \mathrm{diameter}, \mathrm{weight})$ and $\mathcal{Y}=\{\mathrm{apple}, \mathrm{orange}\}$. The goal is to find a classification rule such that when a new fruit $\,X$ is presented based on its features, $(\,X_{\mathrm{color}}, X_{\mathrm{diameter}}, X{_\mathrm{weight}})$, our classification rule $\,h$ can classify it as either an apple or an orange, i.e., $\,h(X_{\mathrm{color}}, X_{\mathrm{diameter}}, X_{\mathrm{weight}})$ be the fruit type of $\,X$.

### Error rate

'True error rate' of a classifier(h) is defined as the probability that $\,h$ does not correctly classify the points of $\,\mathcal{X}$, i.e.,
$\, L(h)=P(h(X) \neq Y)$
'Empirical error rate(training error rate)' of a classifier(h) is defined as the frequency that $\,h$ does not correctly classify the points in the training set, i.e.,
$\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})$, where $\,I$ is an indicator that $\, I= \left\{\begin{matrix} 1 & h(X_i) \neq Y_i \\ 0 & h(X_i)=Y_i \end{matrix}\right.$.

### Bayes Classifier

The principle of Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes formula, and then place the object in the class with the largest posterior probability. Intuitively speaking, to classify $\,x\in \mathcal{X}$ we find $y \in \mathcal{Y}$ such that $\,P(Y=y|X=x)$ is maximum over all the members of $\mathcal{Y}$.

Mathematically, for $\,k$ classes and given object $\,X=x$, we find $\,y\in \mathcal{Y}$ which maximizes $\,P(Y=y|X=x)$, and classify $\,X$ into class $\,y$. In order to calculate the value of $\,P(Y=y|X=x)$, we use Bayes formula

\begin{align} P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\ &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)} \end{align}

where $\,P(Y=y|X=x)$ is referred to as the posterior probability, $\,P(Y=y)$ as the prior probability, $\,P(X=x|Y=y)$ as the likelihood, and $\,P(X=x)$ as the evidence.

For the special case that $\,Y$ has only two classes, that is, $\, \mathcal{Y}=\{0, 1\}$. Consider the probability that $\,r(X)=P\{Y=1|X=x\}$. Given $\,X=x$, By Bayes formula, we have

\begin{align} r(X)&=P(Y=1|X=x) \\ &=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\ &=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)} \end{align}

Definition:

The Bayes classification rule $\,h$ is

$\, h(X)= \left\{\begin{matrix} 1 & r(x)\gt \frac{1}{2} \\ 0 & \mathrm{otherwise} \end{matrix}\right.$

3 different approaches to classification:

1) Empirical Risk Minimization: Choose a set fo classifier H and find h* H that minimizes some estimate of L(h)

2) Regression

3) Density Estimation (less popular in high-dimension cases)

Bayes Classification Rule Optimality Theorem: The Bayes rule is optimal in true error rate, that is for any other classification rule $\, \overline{h}$, we have $\,L(h) \le L(\overline{h})$. Intuitively speaking this theorem is saying we cannot do better than classifying $\,x\in \mathcal{X}$ to $\,y$ when the probability of being of type $\,y$ for $\,x$ is more than probability of being any other type.

Definition:

The set $\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}$ is called the decision boundary.

$\, h*(X)= \left\{\begin{matrix} 1 & if P(Y=1|X=x)\gt P(Y=0|X=x) \\ 0 & \mathrm{otherwise} \end{matrix}\right.$

Remark:

1)Bayes classification rule is optimal. Proof:[1]

2)We still need any other method, since we cannot define prior probability in realistic.

Example:
We’re going to predict if a particular student will pass STAT441/841. We have data on past student performance. For each student we know: If student’s GPA > 3.0 (G) If student had a strong math background (M) If student is a hard worker (H) If student passed or failed course

$\, \mathcal{Y}= \{ 0,1 \}$, where 1 refers to pass and 0 refers to fail. Assume that $\,P(Y=1)=P(Y=0)=0.5$
For a new student comes along with values $\,G=0, M=1, H=0$, we calculate $\,r(X)=P(Y=1|X=(0,1,0))$ as

$\,r(X)=P(Y=1|X=(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=1)P(Y=1)+P(X=(0,1,0)|Y=0)P(Y=0)}=\frac{0.025}{0.125}=0.2\lt \frac{1}{2}$
Thus, we classify the new student into class 0, namely, we predict him to fail in this course.

Notice: Although the Bayes rule is optimal, we still need other methods, since it is generally impossible for us to know the prior $\,P(Y=1)$, and class conditional density $\,P(X=x|Y=1)$ and ultimately calculate the value of $\,r(X)$, which makes Bayes rule inconvenient in practice.

Currently, there are four primary classifier based on Bayes Classifier: Naive Bayes classifier[2], tree-augmented naive Bayes (TAN), Bayesian network augmented naive Bayes (BAN) and general Bayesian network (GBN).

### Bayesian vs. Frequentist

Intuitively, to solve a two-class problem, we may have the following two approaches:

1) If $\,P(Y=1|X=x)\gt P(Y=0|X=x)$, then $\,h(x)=1$, otherwise $\,h(x)=0$.

2) If $\,P(X=x|Y=1)\gt P(X=x|Y=0)$, then $\,h(x)=1$, otherwise $\,h(x)=0$.

One obvious difference between these two methods is that the first one considers probability as changing based on observation while the second one considers probablity as having objective existence. Actually, they represent two different schools in statistics.

During the history of statistics, there are two major classification methods : Bayesian and frequentist. The two methods represent two different ways of thoughts and hold different view to define probability. The followings are the main differences between Bayes and Frequentist.

Frequentist

1. Probability is objective.
2. Data is a repeatable random sample(there is a frequency).
3. Parameters are fixed and unknown constant.
4. Not applicable to single event. For example, a frequentist cannot predict the weather of tomorrow because tomorrow is only one unique event, and cannot be referred to a frequency in a lot of samples.

Bayesian

1. Probability is subjective.
2. Data are fixed.
3. Parameters are unknown and random variables that have a given distribution and other probability statements can be made about them.
4. Can be applied to single events based on degree of confidence or beliefs. For example, Bayesian can predict tomorrow's weather, such as having the probability of $\,50%$ of rain.

Example

Suppose there is a man named Jack. In Bayesian method, at first, one can see this man (object), and then judge whether his name is Jack (label). On the other hand, in Frequentist method, one doesn’t see the man (object), but can see the photos (label) of this man to judge whether he is Jack.

## Linear and Quadratic Discriminant Analysis - October 2,2009

### Introduction

#### Notation

Let us first introduce some new notation for the following sections.

Recall that in the discussion of the Bayes Classifier, we introduced Bayes Formula:

\begin{align} P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall y \in \mathcal{Y}}P(X=x|Y=y)P(Y=y)} \end{align}

We will use new labels for the following equivalent formula:

\begin{align} P(Y=k|X=x) &=\frac{f_k(x)\pi_k}{\Sigma_kf_k(x)\pi_k} \end{align}
• $\,f_k$ is called the class conditional density; also referred to previously as the likelihood function. Essentially, this is the function that allows us to reason about a parameter given a certain outcome.
• $\,\pi_k$ is called the prior probability. This is a probability distribution that represents what we know (or believe we know) about a population.
• $\,\Sigma_k$ is the sum with respect to all $\,k$ classes.

#### Approaches

Representing the optima method, Bayes classifier cannot be used in most practical situations though, since usually the prior probability is unknown. Fortunately, other methods of classification have evolved. These methods fall into three general categories.

1. Empirical Risk Minimization:Choose classifiers, find $\,h^* \epsilon H$, minimize some estimate of $\,L(H)$.
2. Regression
3. Density estimation, estimate $P(X = x | Y = 0)$ and $P(X = x | Y = 1)$

The third approach, in this form, is not popular because density estimation doesn't work very well with dimension greater than 2. Linear Discriminate Analysis and Quadratic Discriminate Analysis are examples of the third approach, density estimation.

### LDA

#### Motivation

The Bayes classifier is optimal. Unfortunately, the prior and conditional density of most data is not known. Some estimation of these should be made if we want to classify some data.

The simplest way to achieve this is to assume that all the class densities are approximately a multivariate normal distribution, find the parameters of each such distribution, and use them to calculate the conditional density and prior for unknown points, thus approximating the Bayesian classifier to choose the most likely class. In addition, if the covariance of each class density is assumed to be the same, the number of unknown parameters is reduced and the model is easy to fit and use, as seen later.

#### History

The name Linear Discriminant Analysis comes from the fact that these simplifications produce a linear model, which is used to discriminate between classes. In many cases, this simple model is sufficient to provide a near optimal classification - for example, the Z-Score credit risk model, designed by Edward Altman in 1968, which is essentially a weighted LDA, revisited in 2000, has shown an 85-90% success rate predicting bankruptcy, and is still in use today.

#### Definition

To perform LDA we make two assumptions.

1. The clusters belonging to all classes each follow a multivariate normal distribution.
$x \in \mathbb{R}^d$ $f_k(x)=\frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)$
2. Each cluster has the same variance $\,\Sigma$ equal to the mean variance of $\Sigma_k \forall k$.

We wish to solve for the boundary where the error rates for classifying a point are equal, where one side of the boundary gives a lower error rate for one class and the other side gives a lower error rate for the other class.

So we solve $\,r_k(x)=r_l(x)$ for all the pairwise combinations of classes.

$\,\Rightarrow Pr(Y=k|X=x)=Pr(Y=l|X=x)$

$\,\Rightarrow \frac{Pr(X=x|Y=k)Pr(Y=k)}{Pr(X=x)}=\frac{Pr(X=x|Y=l)Pr(Y=l)}{Pr(X=x)}$ using Bayes' Theorem

$\,\Rightarrow Pr(X=x|Y=k)Pr(Y=k)=Pr(X=x|Y=l)Pr(Y=l)$ by canceling denominators

$\,\Rightarrow f_k(x)\pi_k=f_l(x)\pi_l$

$\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l$

$\,\Rightarrow \exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] \right)\pi_k=\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] \right)\pi_l$ Since both $\Sigma$ are equal based on the assumptions specific to LDA.

$\,\Rightarrow -\frac{1}{2} [x - \mu_k]^\top \Sigma^{-1} [x - \mu_k] + \log(\pi_k)=-\frac{1}{2} [x - \mu_l]^\top \Sigma^{-1} [x - \mu_l] +\log(\pi_l)$ taking the log of both sides.

$\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_k^\top\Sigma^{-1}\mu_k - 2x^\top\Sigma^{-1}\mu_k - x^\top\Sigma^{-1}x - \mu_l^\top\Sigma^{-1}\mu_l + 2x^\top\Sigma^{-1}\mu_l \right)=0$ by expanding out

$\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\left( \mu_k^\top\Sigma^{-1}\mu_k-\mu_l^\top\Sigma^{-1}\mu_l - 2x^\top\Sigma^{-1}(\mu_k-\mu_l) \right)=0$ after canceling out like terms and factoring.

We can see that this is a linear function in x with general form $\,ax+b=0$.

Actually, this linear log function shows that the decision boundary between class $k$ and class $l$, i.e. $Pr(G=k|X=x)=Pr(G=l|X=x)$, is linear in $x$. Given any pair of classes, decision boundaries are always linear. In $p$ dimensions, we separate regions by hyperplanes.

In the special case where the number of samples from each class are equal ($\,\pi_k=\pi_l$), the boundary surface or line lies halfway between $\,\mu_l$ and $\,\mu_k$

### QDA

The concept is the same idea of finding a boundary where the error rate for classification between classes are equal, except the assumption that each cluster has the same variance $\,\Sigma$ equal to the mean variance of $\Sigma_k \forall k$ is removed.

Following along from where QDA diverges from LDA.

$\,f_k(x)\pi_k=f_l(x)\pi_l$

$\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{ (2\pi)^{d/2}|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l$

$\,\Rightarrow \frac{1}{|\Sigma_k|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k] \right)\pi_k=\frac{1}{|\Sigma_l|^{1/2} }\exp\left( -\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l] \right)\pi_l$ by cancellation

$\,\Rightarrow -\frac{1}{2}\log(|\Sigma_k|)-\frac{1}{2} [x - \mu_k]^\top \Sigma_k^{-1} [x - \mu_k]+\log(\pi_k)=-\frac{1}{2}\log(|\Sigma_l|)-\frac{1}{2} [x - \mu_l]^\top \Sigma_l^{-1} [x - \mu_l]+\log(\pi_l)$ by taking the log of both sides

$\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top\Sigma_k^{-1}x + \mu_k^\top\Sigma_k^{-1}\mu_k - 2x^\top\Sigma_k^{-1}\mu_k - x^\top\Sigma_l^{-1}x - \mu_l^\top\Sigma_l^{-1}\mu_l + 2x^\top\Sigma_l^{-1}\mu_l \right)=0$ by expanding out

$\,\Rightarrow \log(\frac{\pi_k}{\pi_l})-\frac{1}{2}\log(\frac{|\Sigma_k|}{|\Sigma_l|})-\frac{1}{2}\left( x^\top(\Sigma_k^{-1}-\Sigma_l^{-1})x + \mu_k^\top\Sigma_k^{-1}\mu_k - \mu_l^\top\Sigma_l^{-1}\mu_l - 2x^\top(\Sigma_k^{-1}\mu_k-\Sigma_l^{-1}\mu_l) \right)=0$ this time there are no cancellations, so we can only factor

The final result is a quadratic equation specifying a curved boundary between classes with general form $\,ax^2+bx+c=0$.

It is quadratic because there is no boundaries.

## Linear and Quadratic Discriminant Analysis cont'd - October 5, 2009

### Summarizing LDA and QDA

We can summarize what we have learned on LDA and QDA so far into the following theorem.

Theorem:

Suppose that $\,Y \in \{1,\dots,k\}$, if $\,f_k(x) = Pr(X=x|Y=k)$ is Gaussian, the Bayes Classifier rule is

$\,h(X) = \arg\max_{k} \delta_k(x)$

where

$\,\delta_k = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k)$ (quadratic)
• Note The decision boundary between classes $k$ and $l$ is quadratic in $x$.

If the covariance of the Gaussians are the same, this becomes

$\,\delta_k = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k)$ (linear)
• Note $\,\arg\max_{k} \delta_k(x)$returns the set of k for which $\,\delta_k(x)$ attains its largest value.

### In practice

We need to estimate the prior, so in order to do this, we use the sample estimates of $\,\pi,\mu_k,\Sigma_k$ in place of the true values, i.e.

File:estimation.png
Estimation of the probability of belonging to either class k or l

$\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}$

$\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i$

$\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top$

In the case where we have a common covariance matrix, we get the ML estimate to be

$\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)}$

### Computation

Case 1: (Example) $\, \Sigma_k = I$

This means that the data is distributed symmetrically around the center $\mu$, i.e. the isocontours are all circles.

We have:

$\,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k)$

We see that the first term in the above equation, $\,\frac{1}{2}log(|I|)$, is zero. The second term contains $\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k)$, which is the squared Euclidean distance between $\,x$ and $\,\mu_k$. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, $\,log(\pi_k)$. The class that has the minimum distance will maximise $\,\delta_k$. According to the theorem, we can then classify the point to a specific class $\,k$. In addition, $\, \Sigma_k = I$ implies that our data is spherical.

Case 2: (General Case) $\, \Sigma_k \ne I$

We can decompose this as:

$\, \Sigma_k = USV^\top = USU^\top$ (In general when $\,X=USV^\top$, $\,U$ is the eigenvectors of $\,XX^T$ and $\,V$ is the eigenvectors of $\,X^\top X$. So if $\, X$ is symmetric, we will have $\, U=V$. Here $\, \Sigma$ is symmetric)

and the inverse of $\,\Sigma_k$ is

$\, \Sigma_k^{-1} = (USU^\top)^{-1} = (U^\top)^{-1}S^{-1}U^{-1} = US^{-1}U^\top$ (since $\,U$ is orthonormal)

So from the formula for $\,\delta_k$, the second term is

\begin{align} (x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top US^{-1}U^T(x-\mu_k)\\ & = (U^\top x-U^\top\mu_k)^\top S^{-1}(U^\top x-U^\top \mu_k)\\ & = (U^\top x-U^\top\mu_k)^\top S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^\top x-U^\top\mu_k) \\ & = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top I(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\ & = (S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top\mu_k)^\top(S^{-\frac{1}{2}}U^\top x-S^{-\frac{1}{2}}U^\top \mu_k) \\ \end{align}

where we have the Euclidean distance between $\, S^{-\frac{1}{2}}U^\top x$ and $\, S^{-\frac{1}{2}}U^\top\mu_k$.

A transformation of all the data points can be done from $\,x$ to $\,x^*$ where $\, x^* \leftarrow S^{-\frac{1}{2}}U^\top x$.

It is now possible to do classification with $\,x^*$, treating it as in Case 1 above.

Note that when we have multiple classes, they must all have the same transformation, else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.

If the classes have different shapes, in another word, have different covariance $\,\Sigma_k$, can we use the same method to transform all data points $\,x$ to $\,x^*$?

The answer is NO. Consider that you have two classes with different shapes, then consider transforming them to the same shape. Given a data point, justify which class this point belongs to. The question is, which transformation can you use? For example, if you use the transformation of class A, then you have assumed that this data point belongs to class A.

### The Number of Parameters in LDA and QDA

Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.

LDA: Since we just need to compare the differences between one given class and remaining $\,K-1$ classes, totally, there are $\,K-1$ differences. For each of them, $\,a^{T}x+b$ requires $\,d+1$ parameters. Therefore, there are $\,(K-1)\times(d+1)$ parameters.

QDA: For each of the differences, $\,x^{T}ax + b^{T}x + c$ requires $\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1$ parameters. Therefore, there are $(K-1)(\frac{d(d+3)}{2}+1)$ parameters.

A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.

## LDA and QDA in Matlab - October 7, 2009

We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called classify that allows us to perform LDA and QDA quickly and easily.

In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below reproduces that example, slightly modified, and explains each step.

>> load 2_3;
>> [U, sample] = princomp(X');
>> sample = sample(:,1:2);

First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.
>> plot (sample(1:200,1), sample(1:200,2), '.');
>> hold on;
>> plot (sample(201:400,1), sample(201:400,2), 'r.');

Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.
See title and legend for information on adding the title and legend.
Before using classify we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the group vector should be an empty string or NaN. (See grouping data for more information.)
>> group = ones(400,1);
>> group(201:400) = 2;

We can now classify our data.
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');

The full details of this line can be examined in the Matlab help file linked above. What we care about are class, which contains the labels that the algorithm thinks that each data point belongs to, and coeff, which contains information about the line that algorithm created to separate the data into each class.
We can see the efficacy of the algorithm by comparing class to group.
>> sum (class==group)
ans =
369

This compares the value in class to the value in group. The answer of 369 tells us that the algorithm correctly determined the class of the point 369 times, out of a possible 400 data points. This gives us an empirical error rate of 0.0775.
We can see the line produced by LDA using coeff.
>> k = coeff(1,2).const;
>> l = coeff(1,2).linear;
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);

Those familiar with the programming language C will find the sprintf line refreshingly familiar; those with no exposure to C are directed to Matlab's sprintf page. Essentially, this code sets up the equation of the line in the form 0 = a + bx + cy. We then use the ezplot function to plot the line.
The 2-3 data after LDA is performed. The line shows where the two classes are split.
Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to classify, and a more complicated procedure to plot the line.
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');
>> sum (class==group)
ans =
371
>> k = coeff(1,2).const;
>> l = coeff(1,2).linear;
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x.*y+%g*y.^2', k, l, q(1,1), q(1,2)+q(2,1), q(2,2));
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);

The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that it is only correct 2 in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve that do not lie on the correct side of the line.

classify can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.

Recall: An analysis of the function of princomp in matlab.
In our assignment 1, we have learnt that how to perform Principal Component Analysis using SVD method. In fact, the matlab offers us a function called princomp which can perform PCA conveniently. From the matlab help file on princomp, you can find the details about this function. But here we will analyze the code of the function of princomp() in matlab to find something different when comparing with SVD method. The following is the code of princomp and explanations to some emphasized steps.

   function [pc, score, latent, tsquare] = princomp(x);
%   PRINCOMP Principal Component Analysis (centered and scaled data).
%   [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and
%   returns the principal components in PC, the so-called Z-scores in SC
%   ORES, the eigenvalues of the covariance matrix of X in LATENT,
%   and Hotelling's T-squared statistic for each data point in TSQUARE.
%   Reference: J. Edward Jackson, A User's Guide to Principal Components
%   John Wiley & Sons, Inc. 1991 pp. 1-25.
%   B. Jones 3-17-94
%   Copyright 1993-2002 The MathWorks, Inc.
%   $Revision: 2.9$  $Date: 2002/01/17 21:31:45$
[m,n] = size(x);       %  get the lengh of the rows and columns of matrix x.
r = min(m-1,n);        %  max possible rank of X
avg = mean(x);         %  the mean of every column of X
centerx = (x - avg(ones(m,1),:));
%  centers X by subtracting off column means
[U,latent,pc] = svd(centerx./sqrt(m-1),0);
%  "economy size" decomposition
score = centerx*pc;
%  the representation of X in the principal component space
if nargout < 3
return;
end
latent = diag(latent).^2;
if (r   latent = [latent(1:r); zeros(n-r,1)];
score(:,r+1:end) = 0;
end
if nargout < 4
return;
end
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';
tsquare = sum(tmp.*tmp)';


From the above code, we should pay attention to the following aspects when comparing with SVD method:

First, Rows of $\,X$ correspond to observations, columns to variables. When using princomp on 2_3 data in assignment 1, note that we take the transpose of $\,X$.

 >> load 2_3;
>> [U, score] = princomp(X');


Second, princomp centers X by subtracting off column means.

The third, when $\,X=UdV'$, princomp uses $\,V$ as coefficients for principal components, rather than $\,U$.

The following is an example to perform PCA using princomp and SVD respectively to get the same results.

SVD method
 >> load 2_3
>> mn=mean(X,2);
>> X1=X-repmat(mn,1,400);
>> [s d v]=svd(X1');
>> y=X1'*v;

princomp
 >>[U score]=princomp(X');


Then we can see that y=score, v=U.

## Trick: Using LDA to do QDA - October 7, 2009

There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the Kernel trick that will be discussed later in the course.

Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.

### Motivation

Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.

If we look back at the equations for LDA and QDA, we see that in LDA we must estimate $\,\mu_1$, $\,\mu_2$ and $\,\Sigma$. In QDA we must estimate all of those, plus another $\,\Sigma$; the extra $\,\frac{d(d-1)}{2}$ estimations make QDA less robust with fewer data points.

### Theoretically

Suppose we can estimate some vector $\underline{w}^T$ such that

$y = \underline{w}^Tx$

where $\underline{w}$ is a d-dimensional column vector, and $x\ \epsilon\ \Re^d$ (vector in d dimensions).

We also have a non-linear function $g(x) = y = x^Tvx + \underline{w}^Tx$ that we cannot estimate.

Using our trick, we create two new vectors, $\,\underline{w}^*$ and $\,x^*$ such that:

$\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]$

and

$x^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]$

We can then estimate a new function, $g^*(x,x^2) = y^* = \underline{w}^{*T}x^*$.

Note that we can do this for any $x$ and in any dimension; we could extend a $D \times n$ matrix to a quadratic dimension by appending another $D \times n$ matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a $\,sin(x)$ dimension.

### By Example

Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.

>> load 2_3;
>> [U, sample] = princomp(X');
>> sample = sample(:,1:2);

We start off the same way, by using PCA to reduce the dimensionality of our data to 2.
>> X_star = zeros(400,4);
>> X_star(:,1:2) = sample(:,:);
>> for i=1:400
for j=1:2
X_star(i,j+2) = X_star(i,j)^2;
end
end

This projects our sample into two more dimensions by squaring our initial two dimensional data set.
>> group = ones(400,1);
>> group(201:400) = 2;
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');
>> sum (class==group)
ans =
375

We can now display our results.
>> k = coeff(1,2).const;
>> l = coeff(1,2).linear;
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);

The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.
Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with $x^4$ (i.e. we set X_star(i,j+2) = X_star(i,j)^4) we can correctly classify 376 points.

## Introduction to Fisher's Discriminant Analysis - October 7, 2009

Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis (LDA) in some sources, is a classical feature extraction technique. It was originally described in 1936 by Sir Ronald Aylmer Fisher, an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found here; a Wikipedia article summarizing the algorithm can be found here.

LDA is for classification and FDA is used for feature extraction.

### Contrasting FDA with PCA

The goal of FDA is in contrast to our other main feature extraction technique, principal component analysis (PCA).

• In PCA, we map data to lower dimensions to maximize the variation in those dimensions.
• In FDA, we map data to lower dimensions to best separate data in different classes.
2 clouds of data, and the lines that might be produced by PCA and FDA.

Because we are concerned with identifying which class data belongs to, FDA is often a better feature extraction algorithm for classification.

Another difference between PCA and FDA is that FDA is a supervised algorithm; that is, we know what class data belongs to, and we exploit that knowledge to find a good projection to lower dimensions.

### Intuitive Description of FDA

An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.

### Example in R

PCA and FDA primary dimension for normal multivariate data, using R.
>> X = matrix(nrow=400,ncol=2)
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))
>> Y = c(rep("red",200),rep("blue",200))

Create 2 multivariate normal random variables with $\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)$. Create Y, an index indicating which class they belong to.
>> s <- svd(X,nu=1,nv=1)

Calculate the singular value decomposition of X. The most significant direction is in s$v[,1], and is displayed as a black line. >> s2 <- lda(X,grouping=Y)  The lda function, given the group for each item, uses Fischer's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in s2$scaling.

Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.

>> plot(X,col=Y,main="PCA vs. FDA example")

Plot the set of points, according to colours given in Y.
>> slope = s$v[2]/s$v[1]
>> intercept = mean(X[,2])-slope*mean(X[,1])
>> abline(a=intercept,b=slope)

Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.
>> slope2 = s2$scaling[2]/s2$scaling[1]
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])
>> abline(a=intercept2,b=slope2,col="red")

Plot the FLDA direction, again through the mean.
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)

Labeling the lines directly on the graph makes it easier to interpret.

## Ficher's Discriminant Analysis (FDA)

The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space. We can consider two kinds of problems:

• 2-class problem
• multi-class problem

### Two-class problem - October 9, 2009

In the two-class problem, we have the pre-knowledge that data points belong to two classes. Intuitively speaking points of each class form a cloud around the mean of the class, with each class having possibly different size. To be able to separate the two classes we must determine the class whose mean is closest to a given point while also accounting for the different size of each class, which is represented by the covariance of each class.

Assume $\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}$ and $\displaystyle\Sigma_{1}$, represent the mean and covariance of the 1st class, and $\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}$ and $\displaystyle\Sigma_{2}$ represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:

1.To make the means of these two classes as far apart as possible

In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projetion. If the original points are $\underline{x_{i}} \in \mathbb{R}^{d}$and the projected points are $\underline{w}^T \underline{x_{i}}$ then the mean of the projected points will be $\underline{w}^T \underline{\mu_{1}}$ and $\underline{w}^T \underline{\mu_{2}}$ for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, $(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})$. The steps of this maximization are given below.

2.We want to collapse all data points of each class to a single point, i.e., minimize the covariance within classes

Notice that the variance of the projected classes 1 and 2 are given by $\underline{w}^T\Sigma_{1}\underline{w}$ and $\underline{w}^T\Sigma_{2}\underline{w}$. The second goal is to minimize the sum of these two covariances.

As is demonstrated below, both of these goals can be accomplished simultaneously.

Original points are $\underline{x_{i}} \in \mathbb{R}^{d}$
Projected points are $\underline{z_{i}} \in \mathbb{R}^{1}$ with $\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}$

#### Between class covariance

In this particular case, we want to project all the data points in one dimensional space.

We want to maximize the Euclidean distance between projected means, which is

\begin{align} (\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}})^T(\underline{w}^T \underline{\mu_{1}} - \underline{w}^T \underline{\mu_{2}}) &= (\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} . \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})\\ &= \underline{w}^T(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T\underline{w} \end{align}

The quantity $(\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T$ is called between class covariance or $\,S_{B}$.

The goal is to maximize : $\underline{w}^T S_{B} \underline{w}$

#### Within class covariance

Covariance of class 1 is $\,\Sigma_{1}$ Covariance of class 2 is $\,\Sigma_{2}$ So covariance of projected points will be $\,\underline{w}^T \Sigma_{1} \underline{w}$ and $\underline{w}^T \Sigma_{2} \underline{w}$

If we sum this two quantities we have

\begin{align} \underline{w}^T \Sigma_{1} \underline{w} + \underline{w}^T \Sigma_{2} \underline{w} &= \underline{w}^T(\Sigma_{1} + \Sigma_{2})\underline{w} \end{align}

The quantity $\,(\Sigma_{1} + \Sigma_{2})$ is called within class covariance or $\,S_{W}$

The goal is to minimize $\underline{w}^T S_{W} \underline{w}$

#### Objective Function

Instead of maximizing $\underline{w}^T S_{B} \underline{w}$ and minimizing $\underline{w}^T S_{W} \underline{w}$ we can define the following objective function:

$\underset{\underline{w}}{max}\ \frac{\underline{w}^T S_{B} \underline{w}}{\underline{w}^T S_{W} \underline{w}}$

This maximization problem is equivalent to $\underset{\underline{w}}{max}\ \underline{w}^T S_{B} \underline{w}$ subject to constraint $\underline{w}^T S_{W} \underline{w} = 1$. We can use the Lagrange multiplier method to solve it:

$L(\underline{w},\lambda) = \underline{w}^T S_{B} \underline{w} - \lambda(\underline{w}^T S_{W} \underline{w} - 1)$

With $\frac{\part L}{\part \underline{w}} = 0$ we get:

\begin{align} &\Rightarrow\ 2\ S_{B}\ \underline{w}\ - 2\lambda\ S_{W}\ \underline{w}\ = 0\\ &\Rightarrow\ S_{B}\ \underline{w}\ =\ \lambda\ S_{W}\ \underline{w} \\ &\Rightarrow\ S_{W}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w} \end{align}

Note that $\, S_{W}=\Sigma_1+\Sigma_2$ is sum of two positive matrices and so it has an inverse.

Here $\underline{w}$ is the eigenvector of $S_{w}^{-1}\ S_{B}$ corresponding to the largest eigenvalue.

In facts, this expression can be simplified even more.

$\Rightarrow\ S_{w}^{-1}\ S_{B}\ \underline{w}\ =\ \lambda\ \underline{w}$ with $S_{B}\ =\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T$
$\Rightarrow\ S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}\ =\ \lambda\ \underline{w}$

The quantity $(\underline{\mu_{1}}-\underline{\mu_{2}})^T \underline{w}$ and $\lambda$ are scalars.
So we can say the quantity $S_{w}^{-1}\ (\underline{\mu_{1}}-\underline{\mu_{2}})$ is proportional to $\underline{w}$

#### FDA vs. PCA Example in Matlab

We can compare PCA and FDA through the figure produced by matlab.

The following are the code to produce the figure step by step and the explanation for steps.

 >>X1=mvnrnd([1,1],[1 1.5;1.5 3],300);
>>X2=mvnrnd([5,3],[1 1.5;1.5 3],300);
>>X=[X1;X2];

Create two multivariate normal random variables with $\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)$.
 >>plot(X(1:300,1),X(1:300,2),'.');
>>hold on
>>p1=plot(X(301:600,1),X(301:600,2),'r.');

Plot the the data of the two classes respectively.
 >>[U Y]=princomp(X);
>>plot([0 U(1,1)*10],[0 U(2,1)*10]);

Using PCA to find the principal component and plot it.
 >>sw=2*[1 1.5;1.5 3];
>>sb=([1; 1]-[5 ;3])*([1; 1]-[5; 3])';
>>g =inv(sw)*sb;
>>[v w]=eigs(g);
>>plot([v(1,1)*5 0],[v(2,1)*5 0],'r')

Using FDA to find the principal component and plot it.

Now we can compare them through the figure.

PCA and FDA primary dimension for normal multivariate data, using matlab

#### Practical example of 2_3

In this matlab example we explore FDA using our familiar data set 2_3 which consists of 200 handwritten "2" and 200 handwritten "3".

X is a matrix of size 64*400 and each column represents an 8*8 image of "2" or "3". Here X1 gets all "2" and X2 gets all "3".

>>load 2_3
>>X1 = X(:, 1:200);
>>X2 = X(:, 201:400);


Next we calculate within class covariance and between class covariance as before.

>>mu1 = mean(X1, 2);
>>mu2 = mean(X2, 2);
>>sb = (mu1 - mu2) * (mu1 - mu2)';
>>sw = cov(X1') + cov(X2');


We use the first two eigenvectors to project the dato in a two-dimensional space.

>>[v d] = eigs( inv(sw) * sb );
>>w = v(:, 1:2);
>>X_hat = w'*X;


Finally we plot the data and visualize the effect of FDA.

>> scatter(ones(1,200),X_hat(1:200))
>> hold on
>> scatter(ones(1,200),X_hat(201:400),'r')

File:fda2-3.jpg
FDA projection of data 2_3, using Matlab.

## FDA for Multi-class Problems - October 14, 2009

For the $k$-class problem, we need to find a projection from $d$-dimensional space to a $(k-1)$-dimensional space.

(It is more reasonable to have at least 2 directions)

Basically, the within class covariance matrix $\mathbf{S}_{W}$ is easily to obtain:

\begin{align} \mathbf{S}_{W} = \sum_{i=1}^{k} \mathbf{S}_{W,i} \end{align}

where $\mathbf{S}_{W,i} = \frac{1}{n_{i}}\sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i})(\mathbf{x}_{j} - \mathbf{\mu}_{i})^{T}$ and $\mathbf{\mu}_{i} = \frac{\sum_{j: y_{j}=i}\mathbf{x}_{j}}{n_{i}}$.

However, the between class covariance matrix $\mathbf{S}_{B}$ is not easy to obtain. One of the simplifications is that we may assume that the total covariance $\mathbf{S}_{T}$ of the data is constant, since $\mathbf{S}_{W}$ is easy to compute, we can get $\mathbf{S}_{B}$ using the following relationship:

\begin{align} \mathbf{S}_{B} = \mathbf{S}_{T} - \mathbf{S}_{W} \end{align}

Actually, there is another generation for $\mathbf{S}_{B}$. Denote a total mean vector $\mathbf{\mu}$ by

\begin{align} \mathbf{\mu} = \frac{1}{n}\sum_{i}\mathbf{x_{i}} = \frac{1}{n}\sum_{j=1}^{k}n_{j}\mathbf{\mu}_{j} \end{align}

Thus the total covariance matrix $\mathbf{S}_{T}$ is

\begin{align} \mathbf{S}_{T} = \sum_{i}(\mathbf{x_{i}-\mu})(\mathbf{x_{i}-\mu})^{T} \end{align}

Thus we obtain

\begin{align} & \mathbf{S}_{T} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j} - \mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})(\mathbf{x}_{j} - \mathbf{\mu}_{i} + \mathbf{\mu}_{i} - \mathbf{\mu})^{T} \\& = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})^{T}+ \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T} \\& = \mathbf{S}_{W} + \sum_{i=1}^{k} n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T} \end{align}

Since the total covariance $\mathbf{S}_{T}$ is the sum of the within class covariance $\mathbf{S}_{W}$ and the between class covariance $\mathbf{S}_{B}$, we can denote the second term as the general between class covariance matrix $\mathbf{S}_{B}$, thus we obtain

\begin{align} \mathbf{S}_{B} = \sum_{i=1}^{k} n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T} \end{align}

Therefore,

\begin{align} \mathbf{S}_{T} = \mathbf{S}_{W} + \mathbf{S}_{B} \end{align}

Recall that in the two class case problem, we have

\begin{align} & \mathbf{S}_{B^{\ast}} = (\mathbf{\mu}_{1}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}_{2})^{T} \\ & = (\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})(\mathbf{\mu}_{1}-\mathbf{\mu}+\mathbf{\mu}-\mathbf{\mu}_{2})^{T} \\ & = ((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))((\mathbf{\mu}_{1}-\mathbf{\mu})-(\mathbf{\mu}_{2}-\mathbf{\mu}))^{T} \\ & = (\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T}+(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T} \end{align}

From the general form,

\begin{align} & \mathbf{S}_{B} = n_{1}(\mathbf{\mu}_{1}-\mathbf{\mu})(\mathbf{\mu}_{1}-\mathbf{\mu})^{T} + n_{2}(\mathbf{\mu}_{2}-\mathbf{\mu})(\mathbf{\mu}_{2}-\mathbf{\mu})^{T} \end{align}

Apparently, they are very similar.

Now, we are trying to find the optimal transformation. Basically, we have

\begin{align} \mathbf{z}_{i} = \mathbf{W}^{T}\mathbf{x}_{i}, i=1,2,...,k-1 \end{align}

where $\mathbf{z}_{i}$ is a $(k-1)\times 1$ vector, $\mathbf{W}$ is a $d\times (k-1)$ transformation matrix, i.e. $\mathbf{W} = [\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]$, and $\mathbf{x}_{i}$ is a $d\times 1$ column vector.

Thus we obtain

\begin{align} & \mathbf{S}_{W}^{\ast} = \sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})(\mathbf{W}^{T}\mathbf{x}_{j}-\mathbf{W}^{T}\mathbf{\mu}_{i})^{T} \\ & = \sum_{i=1}^{k}\sum_{j: y_{j}=i}\mathbf{W}^{T}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})\mathbf{W} \\ & = \mathbf{W}^{T}\left[\sum_{i=1}^{k}\sum_{j: y_{j}=i}(\mathbf{x}_{j}-\mathbf{\mu}_{i})(\mathbf{x}_{j}-\mathbf{\mu}_{i})\right]\mathbf{W} \\ & = \mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W} \end{align}

Similarly, we obtain

\begin{align} & \mathbf{S}_{B}^{\ast} = \sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T} \\ & = \sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W} \\ & = \mathbf{W}^{T}\left[ \sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\right]\mathbf{W} \\ & = \mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W} \end{align}

Now, we use the determinant of the matrix, i.e. the product of the eigenvalues of the matrix, as our measure.

\begin{align} \phi(\mathbf{W}) = \frac{|\mathbf{S}_{B}^{\ast}|}{|\mathbf{S}_{W}^{\ast}|} = \frac{\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}}{\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}} \end{align}

The solution for this question is that the columns of the transformation matrix $\mathbf{W}$ are exactly the eigenvectors that correspond to largest $k-1$ eigenvalues with respect to

\begin{align} \mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} = \lambda_{i}\mathbf{w}_{i} \end{align}

Also, note that we can use

\begin{align} \sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2} \end{align}

as our measure.

Recall that

\begin{align} \|\mathbf{X}\|^2 = Tr(\mathbf{X}^{T}\mathbf{X}) \end{align}

Thus we obtain that

\begin{align} & \sum_{i=1}^{k}n_{i}\|(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}\|^{2} \\ & = \sum_{i=1}^{k}n_{i}Tr[(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}] \\ & = Tr[\sum_{i=1}^{k}n_{i}(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})(\mathbf{W}^{T}\mathbf{\mu}_{i}-\mathbf{W}^{T}\mathbf{\mu})^{T}] \\ & = Tr[\sum_{i=1}^{k}n_{i}\mathbf{W}^{T}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}] \\ & = Tr[\mathbf{W}^{T}\sum_{i=1}^{k}n_{i}(\mathbf{\mu}_{i}-\mathbf{\mu})(\mathbf{\mu}_{i}-\mathbf{\mu})^{T}\mathbf{W}] \\ & = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] \end{align}

Similarly, we can get $Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]$. Thus we have following criterion function

\begin{align} \phi(\mathbf{W}) = \frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]} \end{align}

Similar to the two class case problem, we have:

max $Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]$ subject to $Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]=1$

To solve this optimization problem a Lagrange multiplier $\Lambda$, which actually is a $d \times d$ diagonal matrix, is introduced:

\begin{align} L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{B}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - 1 \right\} \end{align}

Differentiating with respect to $\mathbf{W}$ we obtain:

\begin{align} \frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W} \end{align}

Note that the $\mathbf{S}_{B}$ and $\mathbf{S}_{W}$ are both symmetric matrices, thus set the first derivative to zero, we obtain:

\begin{align} \mathbf{S}_{B}\mathbf{W} - \Lambda\mathbf{S}_{W}\mathbf{W}=0 \end{align}

Thus,

\begin{align} \mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W} \end{align}

where

$\mathbf{\Lambda} = \begin{pmatrix} \lambda_{1} & & 0\\ &\ddots&\\ 0 & &\lambda_{d} \end{pmatrix}$

and $\mathbf{W} = [\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]$.

As a matter of fact, $\mathbf{\Lambda}$ must have $\mathbf{k-1}$ nonzero eigenvalues, because $rank({S}_{W}^{-1}\mathbf{S}_{B})=k-1$.

Therefore, the solution for this question is as same as the previous case. The columns of the transformation matrix $\mathbf{W}$ are exactly the eigenvectors that correspond to largest $k-1$ eigenvalues with respect to

\begin{align} \mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} = \lambda_{i}\mathbf{w}_{i} \end{align}

## Linear Regression Models - October 14, 2009

Regression analysis is a general statistical technique for modelling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, $\,y$, changes according to changes in $\,X$.

We will start by considering a very simple regression model, the linear regression model.

General information on linear regression can be found at the University of South Florida and this MIT lecture.

For the purpose of classification, the linear regression model assumes that the regression function $\,E(Y|X)$ is linear in the inputs $\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}$.

The linear regression model has the general form:

\begin{align} f(x) = \beta^{T}\mathbf{x}_{i}+\beta_{0} \end{align}

where $\,\beta$ is a $1 \times d$ vector.

Given input data $\,\mathbf{x}_{1}, ..., \mathbf{x}_{p}$ and $\,y_{1}, ..., y_{p}$ our goal is to find $\,\beta$ and $\,\beta_0$ such that the linear model fits the data while minimizing sum of squared errors using the Least Squares method.

Note that vectors $\mathbf{x}_{i}$ could be numerical inputs, transformations of the original data, i.e. $\log \mathbf{x}_{i}$ or $\sin \mathbf{x}_{i}$, or basis expansions, i.e. $\mathbf{x}_{i}^{2}$ or $\mathbf{x}_{i}\times \mathbf{x}_{j}$.

Denote $\mathbf{X}$ as a $n\times(d+1)$ matrix with each row an input vector (with 1 in the first position), $\,\beta = (\beta_0, \beta_1,..., \beta_{d})^{T}$ and $\mathbf{y}$ as a $n \times 1$ vector of outputs. We then try to minimize the residual sum-of-squares

\begin{align} \mathrm{RSS}(\beta)=(\mathbf{y}-\mathbf{X}\beta)(\mathbf{y}-\mathbf{X}\beta)^{T} \end{align}

This is a quadratic function in the $\,d+1$ parameters. Differentiating with respect to $\,\beta$ we obtain

\begin{align} \frac{\partial \mathrm{RSS}}{\partial \beta} = -2\mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta) \end{align}
\begin{align} \frac{\partial^{2}\mathrm{RSS}}{\partial \beta \partial \beta^{T}}=2\mathbf{X}^{T}\mathbf{X} \end{align}

Set the first derivative to zero

\begin{align} \mathbf{X}^{T}(\mathbf{y}-\mathbf{X}\beta)=0 \end{align}

we obtain the solution

\begin{align} \hat \beta = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \end{align}

Thus the fitted values at the inputs are

\begin{align} \mathbf{\hat y} = \mathbf{X}\hat\beta = \mathbf{X} (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y} \end{align}

where $\mathbf{H} = \mathbf{X} (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$ is called the hat matrix.

• Note For classification purposes, this is not a correct model. Recall the following application of Bayes classifier:

$r(x)= P( Y=k | X=x )= \frac{f_{k}(x)\pi_{k}}{\Sigma_{k}f_{k}(x)\pi_{k}}$
It is clear that to make sense mathematically, $\displaystyle r(x)$ must be a value between 0 and 1. If this is estimated with the regression function $\displaystyle r(x)=E(Y|X=x)$ and $\mathbf{\hat\beta}$ is learned as above, then there is nothing that would restrict $\displaystyle r(x)$ to taking values between 0 and 1.

#### A linear regression example in Matlab

We can see how linear regression works through the following example in Matlab. The following is the code and the explanation for each step.

Again, we use the data in 2_3.m.

 >>load 2_3;
>>[U, sample] = princomp(X');
>>sample = sample(:,1:2);


We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.

 >>y = zeros(400,1);
>>y(201:400) = 1;


We let y represent the set of labels coded as 0 and 1.

 >>x=[sample;ones(1,400)];


Construct x by adding a row of vector 1 to data.

 >>b=inv(x*x')*x*y;


Calculate b, which represents $\beta$ in the linear regression model.

 >>x1=x';
>>for i=1:400
if x1(i,:)*b>0.5
plot(x1(i,1),x1(i,2),'.')
hold on
elseif x1(i,:)*b < 0.5
plot(x1(i,1),x1(i,2),'r.')
end
end


Plot the fitted y values.

File:linearregression.png
the figure shows that the classification of the data points in 2_3.m by the linear regression model

## Logistic Regression- October 16, 2009

### Intuition behind Logistic Regression

Recall that, for classification purposes, the linear regression model presented in the above section is not correct because it does not force $\,r(x)$ to be between 0 and 1 and sum to 1. Consider the following log odds model (for two classes):

$\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)=\beta^Tx$

Calculating $\,P(Y=1|X=x)$ leads us to the logistic regression model, which as opposed to the linear regression model, allows the modelling of the posterior probabilities of the classes through linear methods and at the same time ensures that they sum to one and are between 0 and 1. It is a type of Generalized Linear Model (GLM).

### The Logistic Regression Model

The logistic regression model for the two class case is defined as

Class 1

$P(Y=1 | X=x)$
$P(Y=1 | X=x) =\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=P(x;\underline{\beta})$

Then we have that

Class 0

$P(Y=0 | X=x)$
$P(Y=0 | X=x) = 1-P(Y=1 | X=x)=1-\frac{\exp(\underline{\beta}^T \underline{x})}{1+\exp(\underline{\beta}^T \underline{x})}=\frac{1}{1+\exp(\underline{\beta}^T \underline{x})}$

### Fitting a Logistic Regression

Logistic regression tries to fit a distribution. The fitting of logistic regression models is usually accomplished by maximum likelihood. The maximum likelihood of $\underline\beta$ maximizes the probability of obtaining the data $\displaystyle{x_{1},...,x_{n}}$ from the known distribution. Combining $\displaystyle P(Y=1 | X=x)$ and $\displaystyle P(Y=0 | X=x)$ as follows, we can consider the two classes at the same time:

$p(\underline{x_{i}};\underline{\beta}) = \left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{y_i} \left(\frac{1}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)^{1-y_i}$

Assuming the data $\displaystyle {x_{1},...,x_{n}}$ is drawn independently, the likelihood function is

\begin{align} \mathcal{L}(\theta)&=p({x_{1},...,x_{n}};\theta)\\ &=\displaystyle p(x_{1};\theta) p(x_{2};\theta)... p(x_{n};\theta) \quad \mbox{(by independence)}\\ &= \prod_{i=1}^n p(x_{i};\theta) \end{align}

Since it is more convenient to work with the log-likelihood function, take the log of both sides, we get

$\displaystyle l(\theta)=\displaystyle \sum_{i=1}^n \log p(x_{i};\theta)$

So,

\begin{align} l(\underline\beta)&=\displaystyle\sum_{i=1}^n y_{i}\log\left(\frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x_i})}\right)+(1-y_{i})\log\left(\frac{1}{1+\exp(\underline{\beta}^T\underline{x_i})}\right)\\ &= \displaystyle\sum_{i=1}^n y_{i}(\underline{\beta}^T\underline{x_i}-\log(1+\exp(\underline{\beta}^T\underline{x_i}))+(1-y_{i})(-\log(1+\exp(\underline{\beta}^T\underline{x_i}))\\ &= \displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}-y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))- \log(1+\exp(\underline{\beta}^T\underline{x_i}))+y_{i} \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\ &=\displaystyle\sum_{i=1}^n y_{i}\underline{\beta}^T\underline{x_i}- \log(1+\exp(\underline{\beta}^T\underline{x_i}))\\ \end{align}

To maximize the log-likelihood, set its derivative to 0.

\begin{align} \frac{\partial l}{\partial \underline{\beta}} &= \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{\exp(\underline{\beta}^T \underline{x_i})}{1+\exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]\\ &=\sum_{i=1}^n \left[{y_i} \underline{x}_i - p(\underline{x}_i;\underline{\beta})\underline{x}_i\right] \end{align}

To solve this equation, the Newton-Raphson algorithm is used which requires the second derivative in addition to the first derivative. This is demonstrated in the next section.

## Logistic Regression(2) - October 19, 2009

### Logistic Regression Model

Recall that in the last lecture, we learned the logistic regression model.

• $P(Y=1 | X=x)=P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}$
• $P(Y=0 | X=x)=1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}$

### Find $\underline{\beta}$

Criteria: find a $\underline{\beta}$ that maximizes the conditional likelihood of Y given X using the training data.

From above, we have the first derivative of the log-likelihood:

$\frac{\partial l}{\partial \underline{\beta}} = \sum_{i=1}^n \left[{y_i} \underline{x}_i- \frac{exp(\underline{\beta}^T \underline{x_i})}{1+exp(\underline{\beta}^T \underline{x}_i)}\underline{x}_i\right]$ $=\sum_{i=1}^n \left[{y_i} \underline{x}_i - P(\underline{x}_i;\underline{\beta})\underline{x}_i\right]$

The Newton-Raphson algorithm requires the second-derivative or Hessian matrix.

$\frac{\partial^{2} l}{\partial \underline{\beta} \partial \underline{\beta}^T }= \sum_{i=1}^n - \underline{x_i} \frac{(exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)(1+exp(\underline{\beta}^T \underline{x}_i))-\underline{x}_i exp(\underline{\beta}^T\underline{x}_i)exp(\underline{\beta}^T\underline{x}_i)}{(1+exp(\underline{\beta}^T \underline{x}))^2}$

(note: $\frac{\partial\underline{\beta}^T\underline{x}_i}{\partial \underline{\beta}^T}=\underline{x}_i^T$ you can check it here, it's a very useful website including a Matrix Reference Manual that you can find information about linear algebra and the properties of real and complex matrices.)

$=\sum_{i=1}^n - \underline{x}_i \frac{(-\underline{x}_i exp(\underline{\beta}^T\underline{x}_i) \underline{x}_i^T)}{(1+exp(\underline{\beta}^T \underline{x}))(1+exp(\underline{\beta}^T \underline{x}))}$ (by cancellation)
$=\sum_{i=1}^n - \underline{x}_i \underline{x}_i^T P(\underline{x}_i;\underline{\beta}）)[1-P(\underline{x}_i;\underline{\beta})]）$(since $P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}$ and $1-P(\underline{x}_i;\underline{\beta})=\frac{1}{1+exp(\underline{\beta}^T \underline{x})}$)

The same second derivative can be achieved if we reduce the occurrences of beta to 1 by the identity$\frac{a}{1+a}=1-\frac{1}{1+a}$

And solving $\frac{\partial}{\partial \underline{\beta}^T}\sum_{i=1}^n \left[{y_i} \underline{x}_i-\left[1-\frac{1}{1+exp(\underline{\beta}^T \underline{x}_i)}\right]\underline{x}_i\right]$

Starting with $\,\underline{\beta}^{old}$, the Newton-Raphson update is

$\,\underline{\beta}^{new}\leftarrow \,\underline{\beta}^{old}- (\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T})^{-1}(\frac{\partial l}{\partial \underline{\beta}})$ where the derivatives are evaluated at $\,\underline{\beta}^{old}$

The iteration will terminate when $\underline{\beta}^{new}$ is very close to $\underline{\beta}^{old}$.

The iteration can be described in matrix form.

• Let $\,\underline{Y}$ be the column vector of $\,y_i$. ($n\times1$)
• Let $\,X$ be the ${d}\times{n}$ input matrix.
• Let $\,\underline{P}$ be the ${n}\times{1}$ vector with $i$th element $P(\underline{x}_i;\underline{\beta}^{old})$.
• Let $\,W$ be an ${n}\times{n}$ diagonal matrix with $i$th element $P(\underline{x}_i;\underline{\beta}^{old})[1-P(\underline{x}_i;\underline{\beta}^{old})]$

then

$\frac{\partial l}{\partial \underline{\beta}} = X(\underline{Y}-\underline{P})$

$\frac{\partial ^2 l}{\partial \underline{\beta}\partial \underline{\beta}^T} = -XWX^T$

The Newton-Raphson step is

$\underline{\beta}^{new} \leftarrow \underline{\beta}^{old}+(XWX^T)^{-1}X(\underline{Y}-\underline{P})$

This equation is sufficient for computation of the logistic regression model. However, we can simplify further to uncover an interesting feature of this equation.

\begin{align} \underline{\beta}^{new} &= (XWX^T)^{-1}(XWX^T)\underline{\beta}^{old}+(XWX^T)^{-1}XWW^{-1}(\underline{Y}-\underline{P})\\ &=(XWX^T)^{-1}XW[X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})]\\ &=(XWX^T)^{-1}XWZ \end{align}

where $Z=X^T\underline{\beta}^{old}+W^{-1}(\underline{Y}-\underline{P})$

Recall that linear regression by least square finds the following minimum: $\min_{\underline{\beta}}(\underline{y}-\underline{\beta}^T X)^T(\underline{y}-\underline{\beta}^TX)$

we have $\underline\hat{\beta}=(XX^T)^{-1}X\underline{y}$

Similarly, we can say that $\underline{\beta}^{new}$ is the solution of a weighted least square problem:

$\underline{\beta}^{new} \leftarrow \min_{\underline{\beta}}(Z-X\underline{\beta}^T)W(Z-X\underline{\beta})$

#### WLS

Actually, the weighted least squares estimator minimizes the weighted sum of squared errors $S(\beta) = \sum_{i=1}^{n}w_{i}[y_{i}-\mathbf{x}_{i}^{T}\beta]^{2}$ where $\displaystyle w_{i}\gt 0$. Hence the WLS estimator is given by $\hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}y_{i}\right]$

A weighted linear regression of the iteratively computed response $\mathbf{z}=\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})$

Therefore, we obtain

\begin{align} & \hat\beta^{WLS}=\left[\sum_{i=1}^{n}w_{i}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \right]^{-1}\left[ \sum_{i=1}^{n}w_{i}\mathbf{x}_{i}z_{i}\right] \\& = \left[ \mathbf{XWX}^{T}\right]^{-1}\left[ \mathbf{XWz}\right] \\& = \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{XW}(\mathbf{X}^{T}\beta^{old}+\mathbf{W}^{-1}(\mathbf{y}-\mathbf{p})) \\& = \beta^{old}+ \left[ \mathbf{XWX}^{T}\right]^{-1}\mathbf{X}(\mathbf{y}-\mathbf{p}) \end{align}

note:Here we obtain $\underline{\beta}$, which is a $d\times{1}$ vector, because we construct the model like $\underline{\beta}^T\underline{x}$. If we construct the model like $\underline{\beta}_0+ \underline{\beta}^T\underline{x}$, then similar to linear regression, $\underline{\beta}$ will be a $(d+1)\times{1}$ vector.

Choosing $\displaystyle\beta=0$ seems to be a suitable starting value for the Newton-Raphson iteration procedure in this case. However, this does not guarantee convergence. The procedure will usually converge since the log-likelihood function is concave(or convex). In the case that it does not, we can just prove the local convergence of the method, which means the iteration would converge only if the initial point is closed enough to the exact solution. However, in practice, choosing an appropriate initial value is really trivial, namely, it is not often to find a initial too far from the exact solution to make the iteration invalid. <ref>C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, chapter 5 </ref> Besides, step-size halving will solve this problem. <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer

2009),121.</ref>

#### Pseudo Code

1. $\underline{\beta} \leftarrow 0$
2. Set $\,\underline{Y}$, the label associated with each observation $\,i=1...n$.
3. Compute $\,\underline{P}$ according to the equation $P(\underline{x}_i;\underline{\beta})=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}$ for all $\,i=1...n$.
4. Compute the diagonal matrix $\,W$ by setting $\,w_i,i$ to $P(\underline{x}_i;\underline{\beta}）)[1-P(\underline{x}_i;\underline{\beta})]$ for all $\,i=1...n$.
5. $Z \leftarrow X^T\underline{\beta}+W^{-1}(\underline{Y}-\underline{P})$.
6. $\underline{\beta} \leftarrow (XWX^T)^{-1}XWZ$.
7. If the new $\underline{\beta}$ value is sufficiently close to the old value, stop; otherwise go back to step 3.

### Comparison with Linear Regression

• Similarities
1. They are both to attempt to estimate $\,P(Y=k|X=x)$ (For logistic regression, we just mentioned about the case that $\,k=0$ or $\,k=1$ now).
2. They are both have linear boundaris.
note:For linear regression, we assume the model is linear. The boundary is $P(Y=k|X=x)=\underline{\beta}^T\underline{x}_i+\underline{\beta}_0=0.5$ (linear)
For logistic regression, the boundary is $P(Y=k|X=x)=\frac{exp(\underline{\beta}^T \underline{x})}{1+exp(\underline{\beta}^T \underline{x})}=0.5 \Rightarrow exp(\underline{\beta}^T \underline{x})=1\Rightarrow \underline{\beta}^T \underline{x}=0$ (linear)
• Differences
1. Linear regression: $\,P(Y=k|X=x)$ is linear function of $\,x$, $\,P(Y=k|X=x)$ is not guaranteed to fall between 0 and 1 and to sum up to 1.
2. Logistic regression: $\,P(Y=k|X=x)$ is a nonlinear function of $\,x$, and it is guaranteed to range from 0 to 1 and to sum up to 1.

### Comparison with LDA

1. The linear logistic model only consider the conditional distribution $\,P(Y=k|X=x)$. No assumption is made about $\,P(X=x)$.
2. The LDA model specifies the joint distribution of $\,X$ and $\,Y$.
3. Logistic regression maximizes the conditional likelihood of $\,Y$ given $\,X$: $\,P(Y=k|X=x)$
4. LDA maximizes the joint likelihood of $\,Y$ and $\,X$: $\,P(Y=k,X=x)$.
5. If $\,\underline{x}$ is d-dimensional,the number of adjustable parameter in logistic regression is $\,d$. The number of parameters grows linearly w.r.t dimension.
6. If $\,\underline{x}$ is d-dimensional,the number of adjustable parameter in LDA is $\,(2d)+d(d+1)/2+2=(d^2+5d+4)/2$. The number of parameters grows quardratically w.r.t dimension.
7. As logistic regression relies on fewer assumptions, it seems to be more robust.
8. In practice, Logistic regression and LDA often give the similar results.

#### By example

Now we compare LDA and Logistic regression by an example. Again, we use them on the 2_3 data.

 >>load 2_3;
>>[U, sample] = princomp(X');
>>sample = sample(:,1:2);
>>plot (sample(1:200,1), sample(1:200,2), '.');
>>hold on;
>>plot (sample(201:400,1), sample(201:400,2), 'r.');

First, we do PCA on the data and plot the data points that represent 2 or 3 in different colors. See the previous example for more details.
 >>group = ones(400,1);
>>group(201:400) = 2;

Group the data points.
 >>[B,dev,stats] = mnrfit(sample,group);
>>x=[ones(1,400); sample'];

Now we use mnrfit to use logistic regression to classfy the data. This function can return B which is a $(d+1)\times{(k–1)}$ matrix of estimates, where each column corresponds to the estimated intercept term and predictor coefficients. In this case, B is a $3\times{1}$ matrix.
 >> B
B =0.1861
-5.5917
-3.0547

This is our $\underline{\beta}$. So the posterior probabilities are:
$P(Y=1 | X=x)=\frac{exp(0.1861-5.5917X_1-3.0547X_2)}{1+exp(0.1861-5.5917X_1-3.0547X_2)}$.
$P(Y=2 | X=x)=\frac{1}{1+exp(0.1861-5.5917X_1-3.0547X_2)}$
The classification rule is:
$\hat Y = 1$, if $\,0.1861-5.5917X_1-3.0547X_2\gt =0$
$\hat Y = 2$, if $\,0.1861-5.5917X_1-3.0547X_2\lt 0$
 >>f = sprintf('0 = %g+%g*x+%g*y', B(1), B(2), B(3));
>>ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))])

Plot the decision boundary by logistic regression.
This is a decision boundary by logistic regression.The line shows how the two classes split.
 >>[class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');
>>k = coeff(1,2).const;
>>l = coeff(1,2).linear;
>>f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));
>>h=ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);

Plot the decision boundary by LDA. See the previous example for more information about LDA in matlab.
From this figure, we can see that the results of Logistic Regression and LDA are very similar.

## 2009.10.21

### Multi-Class Logistic Regression

Our earlier goal with logistic regression was to model the posteriors for a 2 class classification problem with a linear function bounded by the interval [0,1]. In that case our model was,

$\log\left(\frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right)= \log\left(\frac{\frac{\exp(\beta^T x)}{1+\exp(\beta^T x)}}{\frac{1}{1+\exp(\beta^T x)}}\right) =\beta^Tx$

We can extend this idea to the more general case with K-classes. This model is specified with K - 1 terms where the Kth class in the denominator can be chosen arbitrarily.

$\log\left(\frac{P(Y=i|X=x)}{P(Y=K|X=x)}\right)=\beta_i^Tx,\quad i \in \{1,\dots,K-1\}$

The posteriors for each class are given by,

$P(Y=i|X=x) = \frac{\exp(\beta_i^T x)}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}, \quad i \in \{1,\dots,K-1\}$

$P(Y=K|X=x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(\beta_k^T x)}$

Seeing these equations as a weighted least squares problem makes them easier to derivate.

Note that we still retain the property that the sum of the posteriors is 1. In general the posteriors are no longer complements of each other as in true in the 2 class problem where we could express $\displaystyle P(Y=1|X=x)=1-P(Y=0|X=x)$. Fitting a Logistic model for the K>2 class problem isn't as 'nice' as in the 2 class problem since we don't have the same simplification.

### Perceptron (Foundation of Neural Network)

Figure 1: Diagram of a linear perceptron.

Recall the use of Least Squares regression as a classifier, shown to be identical to LDA. To classify points with least squares we take the sign of a linear combination of data points and assign a label equivalent to +1 or -1.

Least Squares returns the sign of a linear combination of data points as the class label

sign($(\underline{\beta}^T \underline{x} + {\beta}_0)$

In the 1950s Frank Rosenblatt developed an iterative linear classifier while at Cornell University known as the Perceptron. The concept of a perceptron was fundamental to the later development of the Artificial Neural Network models. The perceptron is a simple type of neural network which models the electrical signals of biological neurons. In fact, it was the first neural network to be algorithmically described. <ref>Simon S. Haykin, Neural Networks and Learning Machines, (Prentice Hall 2008). </ref>

As in other linear classification methods like Least Squares, Rosenblatt's classifier determines a hyperplane for the decision boundary. Linear methods all determine slightly different decision boundaries, Rosenblatt's algorithm seeks to minimize the distance between the decision boundary and the misclassified points <ref>H. Trevor, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer 2009),156.</ref>.

Particular to the iterative nature of the solution, the problem has no global mean (not convex). It does not converge to give a unique hyperplane, and the solutions depend on the size of the gap between classes. If the classes are separable then the algorithm is shown to converge to a local mean. The proof of this convergence is known as the perceptron convergence theorem. However, for overlapping classes convergence to a local mean cannot be guaranteed.

If we find a hyperplane that is not unique between 2 classes, there will be infinitely many solutions obtained from the perceptron algorithm.

As seen in Figure 1, after training, the perceptron determines the label of the data by computing the sign of a linear combination of components.

#### A Perceptron Example

The perceptron network can figure out the decision boundray line even if we dont know how to draw the line. We just have to give it some examples first. For example:

1,0,0 +1
1,0,1 +1
1,1,0 +1
0,0,1 -1
0,1,1 -1
1,1,1 -1

Then the perceptron starts out not knowing how to separate the answers so it guesses. For example we input 1,0,0 and it guesses -1. But the right answer is +1. So the perceptron adjusts its line and we try the next example. Eventually the perceptron will have all the answers right.

y=[1;1;1;-1;-1;-1];
x=[1,0,0;1,0,1;,1,1,0;0,0,1;0,1,1;1,1,1]';
b_0=0;
b=[1;1;1];
rho=.5;
for j=1:100;
changed=0;
for i=1:6
d=(b'*x(:,i)+b_0)*y(i);
if d<0
b=b+rho*x(:,i)*y(i);
b_0=b_0+rho*y(i);
changed=1;
end
end
if changed==0
break;
end
end


## The Perceptron (Lecture October 23, 2009)

File:misclass.png
Figure 2: This figure shows a misclassified point and the movement of the decision boundary.

A Perceptron can be modeled as shown in Figure 1 of the previous lecture where$\,x_0$ is the model intercept and $x_{1},\ldots,x_{d}$ represent the feature data, $\sum_{i=0}^d \beta_{j}x_{j}$ is a linear combination of some weights of these inputs, and $I(\sum_{i=1}^d \beta_{j}x_{j})$, where $\,I$ indicates the sign of the expression and returns the label of the data point.

The Perceptron algorithm seeks a linear boundary between two classes. A linear decision boundary can be represented by$\underline{\beta}^T\underline{x}+\beta_{0}.$ The algorithm begins with an arbitrary hyperplane $\underline{\beta}^T\underline{x}+\beta_{0}$ (initial guess). Its goal is to minimize the distance between the decision boundary and the misclassified data points. This is illustrated in Figure 2. It attempts to find the optimal $\underline\beta$ by iteratively adjusting the decision boundary until all points are on the correct side of the boundary. It terminates when there are no misclassified points.

File:distance2.jpg
Figure 3: This figure illustrates the derivation of the distance between the decision boundary and misclassified points

Derivation: The distance between the decision boundary and misclassified points.

If $\underline{x_{1}}$ and $\underline{x_{2}}$both lie on the decision boundary then,

\begin{align} \underline{\beta}^T\underline{x_{1}}+\beta_{0} &= \underline{\beta}^T\underline{x_{2}}+\beta_{0} \\ \underline{\beta}^T (x_{1}-x_{2})&=0 \end{align}

$\underline{\beta}^T (x_{1}-x_{2})$ denotes an inner product. Since the inner product is 0 and $(\underline{x_{1}}-\underline{x_{2}})$ is a vector lying on the decision boundary, $\underline{\beta}$ is orthogonal to the decision boundary.

Let $\underline{x_{i}}$ be a misclassified point.

Then the projection of the vector $\underline{x_{i}}$ on the direction that is orthogonal to the decision boundary is $\underline{\beta}^T\underline{x_{i}}$. Now, if $\underline{x_{0}}$ is also on the decision boundary, then $\underline{\beta}^T\underline{x_{0}}+\beta_{0}=0$ and so $\underline{\beta}^T\underline{x_{0}}= -\beta_{0}$. Looking at Figure 3, it can be seen that the distance between $\underline{x_{i}}$ and the decision boundary is the absolute value of $\underline{\beta}^T\underline{x_{i}}+\beta_{0}.$

Consider $y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0}).$

Notice that if $\underline{x_{i}}$ is classified correctly then this product is positive. This is because if it is classified correctly, then either both ($\underline{\beta}^T\underline{x_{i}}+\beta_{0})$ and$\displaystyle y_{i}$ are positive or they are both negative. However, if $\underline{x_{i}}$ is classified incorrectly then one of $(\underline{\beta}^T\underline{x_{i}}+\beta_{0})$ and $\displaystyle y_{i}$ is positive and the other is negative. The result is that the above product is negative for a point that is misclassified.

For the algorithm, we need only consider the distance between the misclassified points and the decision boundary.

Consider $\phi(\underline{\beta},\beta_{0})= -\displaystyle\sum_{i\in M} –y_{i}(\underline{\beta}^T\underline{x_{i}}+\beta_{0})$

which is a summation of positive numbers and where $\displaystyle M$ is the set of all misclassified points.
The goal now becomes to $\min_{\underline{\beta},\beta_{0}} \phi(\underline{\beta},\beta_{0}).$

This can be done using a gradient descent approach, which is a numerical method that takes one predetermined step in the direction of the gradient, getting closer to a minimum at each step, until the gradient is zero. A problem with this algorithm is the possibility of getting stuck in a local minimum. To continue, the following derivatives are needed:

$\frac{\partial \phi}{\partial \underline{\beta}}= -\displaystyle\sum_{i \in M}y_{i}\underline{x_{i}} \ \ \ \ \ \ \ \ \ \ \ \frac{\partial \phi}{\partial \beta_{0}}= -\displaystyle\sum_{i \in M}y_{i}$

Then the gradient descent type algorithm (Perceptron Algorithm) is

$\begin{pmatrix} \underline{\beta}^{\mathrm{new}}\\ \underline{\beta_0}^{\mathrm{new}} \end{pmatrix} = \begin{pmatrix} \underline{\beta}^{\mathrm{old}}\\ \underline{\beta_0}^{\mathrm{old}} \end{pmatrix} +\rho \begin{pmatrix} y_i \underline{x_i}\\ y_i \end{pmatrix}$

where $\displaystyle\rho$ is the magnitude of each step called the "learning rate" or the "convergence rate". The algorithm continues until $\begin{pmatrix} \underline{\beta}^{\mathrm{new}}\\ \underline{\beta_0}^{\mathrm{new}} \end{pmatrix} = \begin{pmatrix} \underline{\beta}^{\mathrm{old}}\\ \underline{\beta_0}^{\mathrm{old}} \end{pmatrix}$ or until it has iterated a specified number of times. If the algorithm converges, it has found a linear classifier, ie., there are no misclassified points.

#### Problems with the Algorithm and Issues Affecting Convergence

1. If the data is not separable, then the Perceptron algorithm will not converge since it cannot find a linear classifier that classifies all of the points correctly.
2. Convergence rates depend on the size of the gap between classes. If the gap is large, then the algorithm converges quickly. However, if the gap is small, the algorithm converges slowly. This problem can be eliminated by using basis expansions technique. To be specific, we try to find a hyperplane not in the original space, but in the enlarged space obtained by using some basis functions.
3. If the classes are separable, there exists infinitely many solutions to Perceptron, all of which are hyperplanes.
4. The speed of convergence of the algorithm is also dependent on the value of $\displaystyle\rho$, the learning rate. A larger value of $\displaystyle\rho$ could yield quicker convergence, but if this value is too large, it may also result in “skipping over” the minimum that the algorithm is trying to find and possibly oscillating forever between the last two points, before and after the min.
5. A perfect separation is not always available even desirable. If observations comes from different classes sharing the same imput, the classification model seems to be overfitting and will generally have poor predictive performance.
6. The perceptron convergence theorem states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to ﬁnd an exact solution in a ﬁnite number of steps. Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge<ref>

Pattern Recognition and Machine Learning,Christopher M. Bishop,194

</ref>.

#### Comment on gradient descent algorithm

Consider yourself on the peak and you want to get to the land as fast as possible. So which direction should you step? Intuitively it should be the direction in which the height decreases fastest, which is given by the gradient. However, if the mountain has a saddle shape and you initially stand in the middle, then you will finally arrive at the saddle point (local minimum) and get stuck there.

In addition, note that in the final form of our gradient descent algorithm, we get rid of the summation over $\,i$ (all data points). Actually, this is an alternative of the original gradient descent algorithm (sometimes called batch gradient descent) known as Stochastic gradient descent, where we approximate the true gradient by only evaluating on a single training example. This means that $\,{\beta}$ gets improved by computation of only one sample. When there is a large data set, say, population database, it's very time-consuming to do summation over millions of samples. By Stochastic gradient descent, we can treat the problem sample by sample and still get decent result in practice.

## Neural Networks (NN) - October 28, 2009

A neural network is a parallel, distributed information processing structure consisting of processing elements interconnected together with signal channels called connections. Each processing element has a single output connection with branches that "fan out" onto as many connections as desired each carrying the same signal - the processing element output signal. <ref> Theory of the Backpropagation Neural Network, R. Necht-Nielsen </ref> It is a multistage regression or classification model represented by a network. Figure 1 is an example of a typical neural network but it can have many different forms.

Figure 1: General Structure of a Neural Network.

A regression problem typically has only one unit in the output layer. In a k-class classification problem, there are usually k units in the output layer that each represent the probability of class k and each $\displaystyle y_k$ is coded (0,1).

### Activation Function

Activation Function is a term that is frequently used in classification by NN.

In perceptron, we have a "sign" function that takes the sign of a weighted sum of input features.

File:signfuncperceptron.png
The sign function is of the form File:signfunc1.png and is not continuous at 0. Thus, we replace it by a smooth function $\displaystyle \sigma$ of the form File:signfunc2.png and call it the activation function.
The choice of this function $\displaystyle \sigma$ is determined by the properties of the data and the assumed distribution of target variables, but for multiple binary classiﬁcation problems $\sigma(a)=\frac {1}{1+e^{-a}}$ (inverse-logit) form is often used.

Note: A key difference between the perceptron and NN is that the neural network uses continuous nonlinearities in the units, for the purpose of differentiation, whereas the perceptron often uses a non-differentiable activation function. The neural network function is differentiable with respect to the network parameters so that a gradient descent method can be used in training. Moreover, perceptron is a linear classifier, while NN, by combining layers of perceptrons, is able to classify non-linear problems through proper training.

By assigning some weights to the connectors in the neural network (see diagram above) we weigh the input that comes into the perceptron, to get an output that in turn acts as an input to the next layer of perceptrons, and so on for each layer. This type of neural network is called Feed-Forward Neural Network. Applications to Feed-Forward Neural Networks include data reduction, speech recognition, sensor signal processing, and ECG abnormality detection, to name a few. <ref>J. Annema, Feed-Forward Neural Networks, (Springer 1995), pp. 9 </ref>

### Back-propagation

For a while, the Neural Network model was just an idea, since there were no algorithms for training the model until 1986, when Geoffrey Hinton <ref> http://www.cs.toronto.edu/~hinton/backprop.html </ref> came up with an algorithm called back-propagation. After that, a number of other training algorithms and various configurations of Neural Networks were implemented.

When we were talking about perceptrons, we applied gradient descent algorithm for optimizing the weights. Back-propagation uses this idea of gradient descent to train neural network based on the chain rule in calculus.

Assume that last output layer has only one unit, so we are working with a regression problem. Later we will see how this can be extended to more output layers and thus turn into a classificaiton problem.

For simplicity, there is only 1 unit at the end and assume for the moment we are doing regression.

Note that we make a distinction between the input weights $\displaystyle (w_i)$ and hidden weights $\displaystyle (u_i)$.

Within each perceptron we have a function $\displaystyle z_i=\sigma(a_i)$ that takes input $\displaystyle a_i$ and outputs $\displaystyle z_i's$. The $\displaystyle z_i's$ are the inputs into the final output of the model $\Rightarrow \hat y_i=\sum_{i=1}^p w_i z_i$

We can find the error of the neural network output by evaluating the squared difference between the true classification and the resulting classification output $\Rightarrow \displaystyle error=||y-\hat y ||^2$

First find derivative of the model error with respect to output weights $\displaystyle w_i$
$\frac{\partial err}{\partial w_i}=\frac{\partial err}{\partial \hat y} \cdot \frac{\partial \hat y}{\partial w_i}$
$\frac{\partial err}{\partial w_i}=2(y-\hat y) \cdot z_i$

Now we need to find the derivative of the model error with respect to hidden weights $\displaystyle u_i's$
Consider the following diagram that opens up the hidden layers of the neural network:

i j are reversed!

Notice that the weighted sum on the output of the perceptrons at layer $\displaystyle l$ are the inputs into the perceptrons at layer $\displaystyle j$ and so on for all hidden layers.

So, using the chain rule
$\frac{\partial err}{\partial u_{jl}}=\frac{\partial err}{\partial a_j} \cdot \frac{\partial a_j}{\partial u_{jl}}$
$\frac{\partial err}{\partial u_{jl}}=\delta_j \cdot z_l$

Note that a change in $\,a_j$ causes changes in all $\,a_i$ in the next layer which the error based on, thus we need to sum over i in the chain: $\delta_j = \frac{\partial err}{\partial a_j} = \sum_i \frac{\partial err}{\partial a_i} \cdot \frac{\partial a_i}{\partial a_j} =\sum_i \delta_i \cdot \frac{\partial a_i}{\partial a_j}$
$\,\frac{\partial a_i}{\partial a_j}=\frac{\partial a_i}{\partial z_j} \cdot \frac{\partial z_j}{\partial a_j}=u_{ij} \cdot \sigma'(a_j)$ Using the activation function $\,\sigma(\cdot)$

So $\delta_j = \sum_i \delta_i \cdot u_{ij} \cdot \sigma'(a_j)$
$\delta_j = \sigma'(a_j)\sum_i \delta_i \cdot u_{ij}$

Having calculated the error that the output creates, we can propagate this error back to the previous layers while adjusting the weights to solve a particular problem.

## Neural Networks (NN) - October 30, 2009

### Back-propagation

The idea is that we first feed an input from the training set to the Neural Network, then find the error rate at the output and then we propagate the error to previous layers and for each edge of weight $\,u_{ij}$ we find $\frac{\partial \mathrm{err}}{\partial u_{ij}}$. Having the error rates at hand we adjust the weight of each edge by taking steps proportional to the negative of the gradient to decrease the error at output. The next step is to apply the next input from the training set and go through the described adjustment procedure. The overview of Back-propagation algorithm:

1. Feed a point $\,x$ in the training set to the network, and find the output of all the nodes.
2. Evaluate $\,\delta_k=y_k-\hat{y_k}$ for all output units, where $y_k$ is the expected output and $\hat{y_k}$ is the real output.
3. By propagating to the previous layers evaluate all $\,\delta_j$s for hidden units: $\,\delta_j=\sigma'(a_j)\sum_i \delta_i u_{ij}$ where $i$ is associated to the previous layer.
4. Using $\frac{\partial \mathrm{err}}{\partial u_{jl}} = \delta_j\cdot z_l$ find all the derivatives.
5. Adjust each weight by taking steps proportional to the negative of the gradient: $u_{jl}^{\mathrm{new}} \leftarrow u_{jl}^{\mathrm{old}} -\rho \frac{\partial \mathrm{err}}{u_{ij}}$
6. Feed the next point in the training set and repeat the above steps.

This still leaves the question of how to initialize the weights $\,u_{ij}, w_i$. The method of choosing weights mentioned in class was to randomize the weights before the first step. This is not likely to be near the optimal solution in every case, but is simple to implement. An alternative is to use an orthogonal least squares method to find the initial weights <ref>http://www.mitpressjournals.org/doi/abs/10.1162/neco.1995.7.5.982</ref>. Regression is performed on the weights and output by using a linear approximation of $\,\sigma(a_i)$, and finds optimal weights in the linear model. Back propagation is used afterward to find the optimal solution, since the NN is non-linear.

### Dimensionality reduction application

Figure 1: Bottleneck configuration for applying dimensionality reduction.

One possible application of Neural Networks is to perform dimensionality reduction, like other techniques, e.g., PCA, MDS, LLE and Isomap.

Consider the following configuration as shown in figure 1: As we go forward in layers of this Neural Network, the number of nodes is reduced, until we reach a layer with the number of nodes representing the desired dimensionality. However, note that at the very first few layers the number of nodes may not be strictly decreasing, as long as finally it can reach a layer with less nodes. From now on in the Neural Network the previous layers are mirrored. So at the output layer we have the same number of states as we have in the input layer. Now note that if we feed the network with each point and get an output approximately equal to the fed input, that means at the output the same input is reconstructed from the middle layer units. So the output of the middle layer units can represent the input with less dimensions.

To train this Neural Network, we feed the network with a training point and through back propagation we adjust the network weights based on the error between the input layer and the reconstruction at the output layer. Our low dimensional mapping will be the observed output from the middle layer. Data reconstruction consists of putting the low dimensional data through the second half of the network.

### Deep Neural Network

Back-propagation in practice may not work well when there are too many hidden layers, since the $\,\delta$ may become negligible and the errors vanish. This is a numerical problem, where it is difficult to estimate the errors. So in practice configuring a Neural Network with Back-propagation faces some subtleties. Deep Neural Networks became popular two or three years ago, when introduced by Bradford Nill in his PhD thesis. Deep Neural Network training algorithm deals with the training of a Neural Network with a large number of layers.

The approach of training the deep network is to assume the network has only two layers first and train these two layers. After that we train the next two layers, so on and so forth.

Although we know the input and we expect a particular output, we do not know the correct output of the hidden layers, and this will be the issue that the algorithm mainly deals with. There are two major techniques to resolve this problem: using Boltzman machine to minimize the energy function, which is inspired from the theory in atom physics concerning the most stable condition; or somehow finding out what output of the second layer is most likely to lead us to the expected output at the output layer.

### Neural Networks in Practice

Now that we know so much about Neural Networks, what are suitable real world applications? Neural Networks have already been successfully applied in many industries.

Since neural networks are good at identifying patterns or trends in data, they are well suited for prediction or forecasting needs, such as customer research, sales forecasting, risk management and so on.

Take a specific marketing case for example. A feedforward neural network was trained using back-propagation to assist the marketing control of airline seat allocations. The neural approach was adaptive to the rule. The system is used to monitor and recommend booking advice for each departure.

### Issues with Neural Network

When Neural Networks was first introduced they were thought to be modeling human brains, hence they were given the fancy name "Neural Network". But now we know that they are just logistic regression layers on top of each other but have nothing to do with the real function principle in the brain.

We do not know why deep networks turn out to work quite well in practice. Some people claim that they mimic the human brains, but this is unfounded. As a result of these kinds of claims it is important to keep the right perspective on what this field of study is trying to accomplish. For example, the goal of machine learning may be to mimic the 'learning' function of the brain, but necessarily the processes the brain uses to learn.

As for the algorithm, since it does not have a convex form, we still face the problem of local minimum, although people have devised other techniques to avoid this dilemma.

In sum, Neural Network lacks a strong learning theory to back up its "success", thus it's hard for people to wisely apply and adjust it. Having said that, it is not an active research area in machine learning. NN still has wide applications in the engineering field such as in control.

## Complexity Control October 30, 2009

File:overfitting-model.png
Figure 2. The overfitting model passes through all the points of the training set, but has poor predictive power for new points. In exchange the line model has some error on the training points but has extracted the main characteristic of the training points, and has good predictive power.

There are two issues that we have to avoid in Machine Learning:

1. Overfitting
2. Underfitting

Overfitting occurs when our model is heavily complex with so many degrees of freedom, that we can learn every detail of the training set. Such a model will have very high precision on the training set but will show very poor ability to predict outcomes of new instances.

In a Neural Network if the depth is too much, the network will have so many degrees of freedom and will learn every characteristic of the training data set. That means it will show a very precise outcome of the training set but will not be able to generalize the commonality of the training set to predict the outcome of new cases.

The underfitting occurs when the model we picked to describe the data is not complex enough, and has high error rate on the training set. There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.

Example

1. Consider the example showed in the figure. We have a training set and we like to find a model which fits it the best. We can find a polynomail of high degree which almost passes through all the points in the training set. But, in fact the training set is comming from a line model. Now the problem is although the complex model has less error on the training set it diverges from the line in other ranges which we have no training point. Because of that the high degree polynomail has very poor predictive result on test cases. This is an example of overfitting model.
2. Now consider a training set which comes from a polynomail of degree two model. If we model this training set with a polynomail of degree one, our model will have high error rate on the training set, and is not complex enough to describe the problem.
3. Consider a simple classification example. If our classification rule takes as input only the colour of a fruit and concludes that it is a banana, then it is not a good classifier. The reason is that just because a fruit is a yellow, does not mean that it is a banana. We can add complexity to our model to make it a better classifier by considering more features typical of bananas, such as size and shape. If we continue to make our model more and more complex in order to improve our classifier, we will eventually reach a point where the quality of our classifier no longer improves, ie., we have overfit the data. This occurs when we have considered so many features that we have perfectly described the existing bananas, but if presented with a new banana of slightly different shape than the existing bananas, for example, it cannot be detected. This is the tradeoff; what is the right level of complexity?

## Complexity Control - Nov 2, 2009

Overfitting occurs when the model becomes too complex and underfitting occurs when it is not complex enough, both of which are not desirable. To control complexity, it is necessary to make assumptions for the model before fitting the data. Assumptions that we can make for a model are with polynomials or neural network. There are other ways as well.

Figure 1: An example of a model with a family of polynomials

We do not want a model to get too complex, so we control it by making an assumption on the model. With complexity control, we want a model or a classifier with a low error rate.

### How do we choose a good classifier?

Our goal is to find a classifier that minimizes the true error rate. Recall the empirical error rate

$\, L_{h}= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})$

h is a classifier and we want to minimize the error rate. So we apply $\displaystyle x_1$ to $\displaystyle x_n$ to $\displaystyle h$, and take the average to get the empirical true error rate estimation of probability that $h(x_{i}) \neq y_{i}$

Figure 2

There is a downward bias to this estimate meaning that it is always less than the true error rate.

If there is a change in our complexity from low to high, our error rate is always decreasing. When we apply our model to the test data, our error rate will start to decrease to a point and it will increase since the model hasn't seen it before. This can be explained since training error will decrease when we fit the model better by increasing its complexity, but as we have seen, this complex model will not generalize well, resulting in a larger test error.

We use our test data (from the test sample line shown on Figure 2) to get our empirical error rate. Right complexity is defined as where error rate (or training data) and test data is minimum; and this is one idea behind complexity control.

Figure 3

We assume that we have samples $\,X1, . . . ,Xn$ that follow some (possibly unknown) distribution. We want to estimate a parameter $\,f$ of the unknown distribution. This parameter may be the mean $\,E(X_i)$, the variance $\,var(X_i)$ or some other quantities.

The unknown parameter $\,f$ is a ﬁxed real number $f\in R$. To estimate it, we use an estimator which is a function of our observations $\hat{f}(X_1,...,X_n)$.

$Bias (\hat{f}) = E(\hat{f}) - f$

$MSE (\hat{f}) = E[(\hat{f} - f)^2]$

$Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]$

One property we desire of the estimator is that it is correct on average, that is, it is unbiased. $Bias (\hat{f}) = E(\hat{f}) - f=0$. However, there is a more important property for an estimator than just being unbiased: the mean squared error. It reflects the error that the estimator makes.

Hence, our goal is to minimize $MSE (\hat{f})$

From figure 3, we can see that the relationship of the three parameters is: $MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f})$, thus given a Mean Squared Error (MSE), if we have a low bias, then we will have a high variance and vice versa.

A Test error is a good estimation on MSE. We want to have a somewhat balanced bias and variance (not high on bias or variance), although it will have some bias.

Referring to Figure 2, overfitting happens after the point where training data (training sample line) starts to decrease and test data (test sample line) starts to increase. There are 2 main approaches to avoid overfitting:

1) Estimating error rate

$\hookrightarrow$ Empirical training error is not a good estimation

$\hookrightarrow$ Empirical test error is a better estimation

$\hookrightarrow$ Cross-Validation is fast

$\hookrightarrow$ Computing error bound (analytically) using some probability inequality (but we will not be discussing too much about it in the course)

2) Regularization

$\hookrightarrow$ Use of shrinkage method

$\hookrightarrow$ Decrease the chance of overfitting by controlling the weights

## Cross-Validation (CV) - Introduction

Figure 1: Illustration of Cross-Validation

Cross-Validation is used to estimate error. Here is a general introduction to CV:

$\hookrightarrow$ We have a set of data

$\hookrightarrow$ We divide it into 2 parts, Training data (T) and Validation data (V)

$\hookrightarrow$ For our calculation, we pretend that we do not know the label of V and we use T to learn about the data

$\hookrightarrow$ We compute the empirical error rate on V and the model hasn't seen V yet

CV has different variations

## Cross-validation

Figure 1: Classical/Standard cross-validation

Cross-validation is the simplest and most widely used method to estimate the true error. It comes from the observation that although training error always decreases with the increasing complexity of the model, the test error starts to increase from a certain point, which is noted as overfitting. Since test error estimates MSE (mean square error) best, people come up with the idea to randomly separate the data set into training set and validation set which is used to simulate test set.

Then we only use training set to implement the algorithm while keep the validation set untouched. As a result, the validation set will be totally unknown to the trained model. The error rate is then estimated by:

$\hat L(h) = \frac{1}{|\nu|}\sum_{X_i \in \nu}(h(x_i) \neq y_i)$, where $\nu$ is the cardinality of the validation set.

When we change the complexity, the error generated by the validation set will have the same behavior as the test set, so we are able to choose the best parameters to get the lowest error.

### K-fold Cross-validation

File:k-fold.png
Figure 2: K-fold cross-validation

Above is the simplest form of cross-validation. However, in reality, data is hard to collect and usually we are suffered from the curse of dimensionality, which requires even bigger data set. Consequently, we may not afford to sacrifice part of the limited resources. To address this problem, we divide the data set into $\,K$ subsets roughly equal in size. The usual choice is $\,K = 10$ For every $\,k$th $( \,k \in [ 1, K ] )$ part, we use the rest $\,K-1$part to fit the model and test on the $\,k$ part to estimate the prediction error $\hat L_k$, where

$\hat L(h) = \frac{1}{K}\sum_{k=1}^K\hat L_k$

For example, suppose we want to fit a polynomial model to the data set and split the set into four equal subsets shown in Figure 2. First we choose the degree to be 1, i.e. a lineal model. Next we use the first three sets as training sets and the last as validation set, then the 1st, 2nd, 4th subsets as training set and the 3rd as validatio set, so on and so forth until all the subsets have been the validation set once(all observations are used for both training and validation). After we get $\hat L_1, \hat L_2, \hat L_3, \hat L_4$, we can calculate $\hat L$ for degree 1 model. Similarly, we can estimate the error for n degree model and generate a simulating curve. Now we are able to choose the right degree which is corresponding to the minimum error. Also, we can use this way to find the optimal unit number of hidden layers of neural networks. We can begin with 1 unit number, then 2,3 and so on and so forth. Then find the unit number of hidden layers with lowest average error.

### Leave-one-out Cross-validation

If we set the subset number $\,K$ in the above discussion to be the cardinality of the whole data set, which means everytime the test set consists of only one observation and all the others are training data, hence the name leave-one-out cross-validation.

However, through the above example, we can see that this method is computationally expensive: for every possible value of the parameter, $\,K$ times training process is needed. This difficiency is even more obvious in leave-one-out cross-validation. Fortunately, it can be shown under some conditions, this can be done without iteration.

## Regularization for Neural Network —— Weight Decay

Figure 1: activation function

Weight decay training is suggested as an implementation for achieving a robust neural network which is insensitive to noise. Since the number of hidden layers in NN is usually decided by certain domain knowledge, it may easily get into the problem of overfitting.

It can be seen from Figure 1 that when the weight is in the vicinity of zero, the operative part of the activation function shows linear behavior. The NN then collapses to an approximately linear model. Note that a linear model is the simplest model, we can avoid overfitting by constraining the weights to be small. This gives us a hint to initialize the random weights to be close to zero.

Formally, we penalize nonlinear weights by adding a penalty term in the error function. Now the regularized error function becomes:

$\,REG = err + \lambda(\sum_{i}|w_i|^2 + \sum_{jk}|u_{jk}|^2)$, where $\,err$ is the original error in back-propagation; $\,w_i$ is the weights of the output layer; $\,u_{jk}$ is the weights of the hidden layers.

As in back-propagation, we take partial derivative with respect to the weights:

$\frac{\partial REG}{\partial w_i} = \frac{\partial err}{\partial w_i} + 2\lambda w_i$

$\frac{\partial REG}{\partial u_{jk}} = \frac{\partial err}{\partial u_{jk}} + 2\lambda u_{jk}$

$w^{new} = w^{old} - \rho(\frac{\partial REG}{\partial w} - 2\lambda w)$

$u^{new} = u^{old} - \rho(\frac{\partial REG}{\partial u} - 2\lambda w)$

Note that here $\,\lambda$ serves as a trade-off parameter, tuning between the error rate and the linearity. Actually, we may also set $\,\lambda$ by cross-validation. The tuning parameter is important since weights of zero will lead to zero derivatives and the algorithm will not change. On the other hand, starting with weights that are too large means starting with a nonlinear model which can often lead to poor solutions. <ref>Trevor Hastie, Robert Tibshirani, Jerome Friedman, Elements of Statistical Learning (Springer 2009) pp.398</ref>

<references/>