# STAT 441/841 / CM 463/763 - Tuesday, 2011/09/20

## Wiki Course Notes

Students will need to contribute to the wiki for 20% of their grade. Access via wikicoursenote.com Go to editor sign-up, and use your UW userid for your account name, and use your UW email.

primary (10%) Post a draft of lecture notes within 48 hours. You will need to do this 1 or 2 times, depending on class size.

secondary (10%) Make improvements to the notes for at least 60% of the lectures. More than half of your contributions should be technical rather than editorial. There will be a spreadsheet where students can indicate what they've done and when. The instructor will conduct random spot checks to ensure that students have contributed what they claim.

## Classification (Lecture: Sep. 20, 2011)

### Introduction

Machine learning (ML) methodology in general is an artificial intelligence approach to establish and train a model to recognize the pattern or underlying mapping of a system based on a set of training examples consisting of input and output patterns. Unlike in classical statistics where inference is made from small datasets, machine learning involves drawing inference from an overwhelming amount of data that could not be reasonably parsed by manpower.

In machine learning, pattern recognition is the assignment of some sort of output value (or label) to a given input value (or instance), according to some specific algorithm. The approach of using examples to produce the output labels is known as learning methodology. When the underlying function from inputs to outputs exists, it is referred to as the target function. The estimate of the target function which is learnt or output by the learning algorithm is known as the solution of learning problem. In case of classification this function is referred to as the decision function.

In the broadest sense, any method that incorporates information from training samples in the design of a classifier employs learning. Learning tasks can be classified along different dimensions. One important dimension is the distinction between supervised and unsupervised learning. In supervised learning a category label for each pattern in the training set is provided. The trained system will then generalize to new data samples. In unsupervised learning , on the other hand, training data has not been labeled, and the system forms clusters or natural grouping of input patterns based on some sort of measure of similarity and it can then be used to determine the correct output value for new data instances.

The first category is known as pattern classification and the second one as clustering. Pattern classification is the main focus in this course.

Classification problem formulation : Suppose that we are given n observations. Each observation consists of a pair: a vector $\mathbf{x}_i\subset \mathbb{R}^d, \quad i=1,...,n$, and the associated label $y_i$. Where $\mathbf{x}_i = (x_{i1}, x_{i2}, ... x_{id}) \in \mathcal{X} \subset \mathbb{R}^d$ and $Y_i$ belongs to some finite set $\mathcal{Y}$.

The classification task is now looking for a function $f:\mathbf{x}_i\mapsto y$ which maps the input data points to a target value (i.e. class label). Function $f(\mathbf{x},\theta)$ is defined by a set of parametrs $\mathbf{\theta}$ and the goal is to train the classifier in a way that among all possible mappings with different parameters the obtained decision boundary gives the minimum classification error.

### Definitions

The true error rate for classifier $h$ is the error with respect to the unknown underlying distribution when predicting a discrete random variable Y from a given input X.

$L(h) = P(h(X) \neq Y )$

The empirical error rate is the error of our classification function $h(x)$ on a given dataset with known outputs (e.g. training data, test data)

$\hat{L}_n(h) = (1/n) \sum_{i=1}^{n} \mathbf{I}(h(X_i) \neq Y_i)$ where h is a clssifier and $\mathbf{I}()$ is an indicator function. The indicator function is defined by

$\mathbf{I}(x) = \begin{cases} 1 & \text{if } x \text{ is true} \\ 0 & \text{if } x \text{ is false} \end{cases}$

So in this case, $\mathbf{I}(h(X_i)\neq Y_i) = \begin{cases} 1 & \text{if } h(X_i)\neq Y_i \text{ (i.e. misclassification)} \\ 0 & \text{if } h(X_i)=Y_i \text{ (i.e. classified properly)} \end{cases}$

For example, suppose we have 100 new data points with known (true) labels

$X_1 ... X_{100}$ $y_1 ... y_{100}$

To calculate the empirical error, we count how many times our function $h(X)$ classifies incorrectly (does not match $y$) and divide by n=100.

### Bayes Classifier

The principle of the Bayes Classifier is to calculate the posterior probability of a given object from its prior probability via Bayes' Rule, and then assign the object to the class with the largest posterior probability<ref> http://www.wikicoursenote.com/wiki/Stat841#Bayes_Classifier </ref>.

First recall Bayes' Rule, in the format $P(Y|X) = \frac{P(X|Y) P(Y)} {P(X)}$

P(Y|X)  : posterior , probability of $Y$ given $X$

P(X|Y)  : likelihood, probability of $X$ being generated by $Y$

P(Y)  : prior, probability of $Y$ being selected

P(X)  : marginal, probability of obtaining $X$

We will start with the simplest case: $\mathcal{Y} = \{0,1\}$

$r(x) = P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x)} = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}$

Bayes' rule can be approached by computing either one of the following:

1) The posterior: $\ P(Y=1|X=x)$ and $\ P(Y=0|X=x)$

2) The likelihood: $\ P(X=x|Y=1)$ and $\ P(X=x|Y=0)$

The former reflects a Bayesian approach. The Bayesian approach uses previous beliefs and observed data (e.g., the random variable $\ X$) to determine the probability distribution of the parameter of interest (e.g., the random variable $\ Y$). The probability, according to Bayesians, is a degree of belief in the parameter of interest taking on a particular value (e.g., $\ Y=1$), given a particular observation (e.g., $\ X=x$). Historically, the difficulty in this approach lies with determining the posterior distribution. However, more recent methods such as Markov Chain Monte Carlo (MCMC) allow the Bayesian approach to be implemented <ref name="PCAustin">P. C. Austin, C. D. Naylor, and J. V. Tu, "A comparison of a Bayesian vs. a frequentist method for profiling hospital performance," Journal of Evaluation in Clinical Practice, 2001</ref>.

The latter reflects a Frequentist approach. The Frequentist approach assumes that the probability distribution (including the mean, variance, etc.) is fixed for the parameter of interest (e.g., the variable $\ Y$, which is not random). The observed data (e.g., the random variable $\ X$) is simply a sampling of a far larger population of possible observations. Thus, a certain repeatability or frequency is expected in the observed data. If it were possible to make an infinite number of observations, then the true probability distribution of the parameter of interest can be found. In general, frequentists use a technique called hypothesis testing to compare a null hypothesis (e.g. an assumption that the mean of the probability distribution is $\ \mu_0$) to an alternative hypothesis (e.g. assuming that the mean of the probability distribution is larger than $\ \mu_0$) <ref name="PCAustin"/>. For more information on hypothesis testing see <ref>R. Levy, "Frequency hypothesis testing, and contingency tables" class notes for LING251, Department of Linguistics, University of California, 2007. Available: http://idiom.ucsd.edu/~rlevy/lign251/fall2007/lecture_8.pdf </ref>.

There was some class discussion on which approach should be used. Both the ease of computation and the validity of both approaches were discussed. A main point that was brought up in class is that Frequentists consider X to be a random variable, but they do not consider Y to be a random variable because it has to take on one of the values from a fixed set (in the above case it would be either 0 or 1 and there is only one correct label for a given value X=x). Thus, from a Frequentist's perspective it does not make sense to talk about the probability of Y. This is actually a grey area and sometimes Bayesians and Frequentists use each others' approaches. So using Bayes' rule doesn't necessarily mean you're a Bayesian. Overall, the question remains unresolved.

The Bayes Classifier uses $\ P(Y=1|X=x)$

$P(Y=1|X=x) = \frac{P(X=x|Y=1) P(Y=1)} {P(X=x|Y=1) P(Y=1) + P(X=x|Y=0) P(Y=0)}$

P(Y=1) : the prior, based on belief/evidence beforehand

denominator : marginalized by summation

$h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases}$

The set $\mathcal{D}(h) = \{ x : P(Y=1|X=x) = P(Y=0|X=x)... \}$

which defines a decision boundary.

$h^*(x) = \begin{cases} 1 \ \ if \ \ P(Y=1|X=x) \gt P(Y=0|X=x) \\ 0 \ \ \ \ \ \ otherwise \end{cases}$

Theorem: The Bayes Classifier is optimal, i.e., if $h$ is any other classification rule, then $L(h^*) \lt = L(h)$

Proof: Consider any classifier $h$. We can express the error rate as

$P( \{h(X) \ne Y \} ) = E_{X,Y} [ \mathbf{1}_{\{h(X) \ne Y \}} ] = E_X \left[ E_Y[ \mathbf{1}_{\{h(X) \ne Y \}}| X] \right]$

To minimize this last expression, it suffices to minimize the inner expectation. Expanding this expectation:

$E_Y[ \mathbf{1}_{\{h(X) \ne Y \}}| X] = \sum_{y \in Supp(Y)} P( h(X) \ne y | X) \mathbf{1}_{\{h(X) \ne y \} }$

which, in the two-class case, simplifies to

$= P( h(X) \ne 0 | X) \mathbf{1}_{\{h(X) \ne 0 \} } + P( h(X) \ne 1 | X) \mathbf{1}_{\{h(X) \ne 1 \} }$
$= r(X) \mathbf{1}_{\{h(X) \ne 0 \} } + (1-r(X))\mathbf{1}_{\{h(X) \ne 1 \} }$

where $r(x)$ is defined as above. We should 'choose' h(X) to equal the label that minimizes the sum. Consider if $r(X)\gt 1/2$, then $r(X)\gt 1-r(X)$ so we should let $h(X) = 1$ to minimize the sum. Thus the Bayes classifier is the optimal classifier.

Why then do we need other classification methods? Because X densities are often/typically unknown. I.e., $f_k(x)$ and/or $\pi_k$ unknown.

$P(Y=k|X=x) = \frac{P(X=x|Y=k)P(Y=k)} {P(X=x)} = \frac{f_k(x) \pi_k} {\sum_k f_k(x) \pi_k}$ f_k(x) is referred to as the class conditional distribution (~likelihood).

Therefore, we rely on some data to estimate quantities.

### Three Main Approaches

1. Empirical Risk Minimization: Choose a set of classifiers H (e.g., line, neural network) and find $h^* \in H$ that minimizes (some estimate of) L(h).

2. Regression: Find an estimate ($\hat{r}$) of function $r$ and define $h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases}$

The $1/2$ in the expression above is a threshold set for the regression prediction output.

In general regression refers to finding a continuous, real valued y. The problem here is more difficult, because of the restricted domain (y is a set of discrete label values).

3. Density Estimation: Estimate $P(X=x|Y=0)$ from $X_i$'s for which $Y_i = 0$ Estimate $P(X=x|Y=1)$ from $X_i$'s for which $Y_i = 1$ and let $P(Y=?) = (1/n) \sum_{i=1}^{n} Y_i$

Define $\hat{r}(x) = \hat{P}(Y=1|X=x)$ and $h(x) = \begin{cases} 1 \ \ \hat{r}(x) \gt 1/2 \\ 0 \ \ otherwise \end{cases}$

It is possible that there may not be enough data to estimate from for density estimation. But the main problem lies with high dimensional spaces, as the estimation results may not be good (high error rate) and sometimes even infeasible. The term curse of dimensionality was coined by Bellman <ref>R. E. Bellman, Dynamic Programming. Princeton University Press, 1957</ref> to describe this problem.

As the dimension of the space goes up, the learning requirements go up exponentially.

The simplest approach to classification is to use the third approach.

### Multi-Class Classification

Generalize to case Y takes on k>2 values.

Theorem: $Y \in \mathcal{Y} = \{1,2,..., k\}$ optimal rule

$\ h^{*}(x) = argmax_k P(Y=k|X=x)$

where $P(Y=k|X=x) = \frac{f_k(x) \pi_k} {\sum_r f_r \pi_r}$

### Examples of Classification

• Face detection in images.
• Medical diagnosis.
• Detecting credit card fraud (fraudulent or legitimate).
• Speech recognition.
• Handwriting recognition.

There are also some interesting reads on Bayes Classification:

## LDA and QDA

Discriminant function analysis finds features that best allow discrimination between two or more classes. The approach is similar to analysis of Variance (ANOVA) in that discriminant function analysis looks at the mean values to determine if two or more classes are very different and should be separated. Once the discriminant functions (that separate two or more classes) have been determined, new data points can be classified (i.e. placed in one of the classes) based on the discriminant functions <ref> StatSoft, Inc. (2011). Electronic Statistics Textbook. [Online]. Available: http://www.statsoft.com/textbook/discriminant-function-analysis/. </ref>. Linear discriminant analysis (LDA) and Quadratic discriminant analysis (QDA) are methods of discriminant analysis that are best applied to linearly and quadradically separable classes, respectively. Fisher discriminant analysis (FDA) another method of discriminant analysis that is different from linear discriminant analysis, but oftentimes both terms are used interchangeably.

### LDA

The simplest method is to use approach 3 (above) and assume a parametric model for densities. Assume class conditional is Gaussian.

$\mathcal{Y} = \{ 0,1 \}$ assumed (i.e., 2 labels)

$h(x) = \begin{cases} 1 \ \ P(Y=1|X=x) \gt P(Y=0|X=x) \\ 0 \ \ otherwise \end{cases}$

$P(Y=1|X=x) = \frac{f_1(x) \pi_1} {\sum_k f_k \pi_k} \ \$ (denom = P(x))

1) Assume Gaussian distributions

$f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \text{exp}\big(-\frac{1}{2}(\mathbf{x-\mu_k}) \Sigma_k^{-1}(\mathbf{x-\mu_k}) )$

must compare $\frac{f_1(x) \pi_1} {p(x)}$ with $\frac{f_0(x) \pi_0} {p(x)}$ Note that the p(x) denom can be ignored: $f_1(x) \pi_1$ with $f_0(x) \pi_0$

To find the decision boundary, set $f_1(x) \pi_1 = f_0(x) \pi_0$

$\frac{1}{(2\pi)^{d/2} |\Sigma_1|^{1/2}} exp(-\frac{1}{2}(\mathbf{x - \mu_1}) \Sigma_1^{-1}(\mathbf{x-\mu_1}) )\pi_1 = \frac{1}{(2\pi)^{d/2} |\Sigma_0|^{1/2}} exp(-\frac{1}{2}(\mathbf{x -\mu_0}) \Sigma_0^{-1}(\mathbf{x-\mu_0}) )\pi_0$

2) Assume $\Sigma_1 = \Sigma_0$, we can use $\Sigma = \Sigma_0 = \Sigma_1$.

$\frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} exp(-\frac{1}{2}(\mathbf{x -\mu_1}) \Sigma^{-1}(\mathbf{x-\mu_1}) )\pi_1 = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} exp(-\frac{1}{2}(\mathbf{x- \mu_0}) \Sigma^{-1}(\mathbf{x-\mu_0}) )\pi_0$

3) Cancel $(2\pi)^{-d/2} |\Sigma|^{-1/2}$ from both sides.

$exp(-\frac{1}{2}(\mathbf{x - \mu_1}) \Sigma^{-1}(\mathbf{x-\mu_1}) )\pi_1 = exp(-\frac{1}{2}(\mathbf{x - \mu_0}) \Sigma^{-1}(\mathbf{x-\mu_0}) )\pi_0$

4) Take log of both sides.

$-\frac{1}{2}(\mathbf{x - \mu_1}) \Sigma^{-1}(\mathbf{x-\mu_1}) )+ \text{log}(\pi_1) = -\frac{1}{2}(\mathbf{x - \mu_0}) \Sigma^{-1}(\mathbf{x-\mu_0}) )+ \text{log}(\pi_0)$

5) Subtract one side from both sides, leaving zero on one side.

$-\frac{1}{2}(\mathbf{x - \mu_1})^T \Sigma^{-1} (\mathbf{x-\mu_1}) + \text{log}(\pi_1) - [-\frac{1}{2}(\mathbf{x - \mu_0})^T \Sigma^{-1} (\mathbf{x-\mu_0}) + \text{log}(\pi_0)] = 0$

$\frac{1}{2}[-\mathbf{x}^T \Sigma^{-1}\mathbf{x - \mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + 2\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{x} + \mathbf{x}^T \Sigma^{-1}\mathbf{x} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} - 2\mathbf{\mu_0}^T \Sigma^{-1} \mathbf{x} ] + \text{log}(\frac{\pi_1}{\pi_0}) = 0$

Cancelling out the terms quadratic in $\mathbf{x}$ and rearranging results in

$\frac{1}{2}[-\mathbf{\mu_1}^T \Sigma^{-1} \mathbf{\mu_1} + \mathbf{\mu_0}^T \Sigma^{-1} \mathbf{\mu_0} + (2\mathbf{\mu_1}^T \Sigma^{-1} - 2\mathbf{\mu_0}^T \Sigma^{-1}) \mathbf{x}] + \text{log}(\frac{\pi_1}{\pi_0}) = 0$

We can see that the first pair of terms is constant, and the second pair is linear in x. Therefore, we end up with something of the form $ax + b = 0$. For more about LDA <ref>http://sites.stat.psu.edu/~jiali/course/stat597e/notes2/lda.pdf</ref>

## LDA and QDA Continued (Lecture: Sep. 22, 2011)

If we relax assumption 2 (i.e. $\Sigma_1 \neq \Sigma_0$) then we get a quadratic equation that can be written as ${x}^Ta{x}+b{x} + c = 0$

### Generalizing LDA and QDA

Theorem:

Suppose that $\,Y \in \{1,\dots,K\}$, if $\,f_k(\mathbf{x}) = Pr(X=\mathbf{x}|Y=k)$ is Gaussian. The Bayes Classifier is

$\,h^*(\mathbf{x}) = \arg\max_{k} \delta_k(\mathbf{x})$

Where

$\,\delta_k(\mathbf{x}) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top\Sigma_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k) + log (\pi_k)$

When the Gaussian variances are equal $\Sigma_1 = \Sigma_0$ (e.g. LDA), then

$\,\delta_k(\mathbf{x}) = \mathbf{x}^\top\Sigma^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu}_k^\top\Sigma^{-1}\boldsymbol{\mu}_k + log (\pi_k)$

(To compute this, we need to calculate the value of $\,\delta$ for each class, and then take the one with the max. value).

### In practice

We estimate the prior to be the chance that a random item from the collection belongs to class k, e.g.

$\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}$

The mean to be the average item in set k, e.g.

$\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i$

and calculate the covariance of each class e.g.

$\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top$

If we wish to use LDA we must calculate a common covariance, so we average all the covariances e.g.

$\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{r=1}^{k}n_r}$

Where: $\,n_r$ is the number of data points in class $\,r$, $\,\Sigma_r$ is the covariance of class $\,r$, $\,n$ is the total number of data points, and $\,k$ is the number of classes.

### Computation

For QDA we need to calculate: $\,\delta_k(\mathbf{x}) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top\Sigma_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k) + log (\pi_k)$

Lets first consider when $\, \Sigma_k = I, \forall k$. This is the case where each distribution is spherical, around the mean point.

#### Case 1

When $\, \Sigma_k = I$

We have:

$\,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^\top I(\mathbf{x}-\boldsymbol{\mu}_k) + log (\pi_k)$

but $\ \log(|I|)=\log(1)=0$

and $\, (\mathbf{x}-\boldsymbol{\mu}_k)^\top I(\mathbf{x}-\boldsymbol{\mu}_k) = (\mathbf{x}-\boldsymbol{\mu}_k)^\top(\mathbf{x}-\boldsymbol{\mu}_k)$ is the squared Euclidean distance between two points $\,\mathbf{x}$ and $\,\boldsymbol{\mu}_k$

Thus in this condition, a new point can be classified by its distance away from the center of a class, adjusted by some prior.

Further, for two-class problem with equal prior, the discriminating function would be the bisector of the 2-class's means.

#### Case 2

When $\, \Sigma_k \neq I$

Using the Singular Value Decomposition (SVD) of $\, \Sigma_k$ we get $\, \Sigma_k = U_kS_kV_k^\top$. In particular, $\, U_k$ is a collection of eigenvectors of $\, \Sigma_k\Sigma_k^*$, and $\, V_k$ is a collection of eigenvectors of $\,\Sigma_k^*\Sigma_k$. Since $\, \Sigma_k$ is a symmetric matrix<ref> http://en.wikipedia.org/wiki/Covariance_matrix#Properties </ref>, $\, \Sigma_k = \Sigma_k^*$, so we have $\, \Sigma_k = U_kS_kU_k^\top$.

For $\,\delta_k$, the second term becomes what is also known as the Mahalanobis distance <ref>P. C. Mahalanobis, "On The Generalised Distance in Statistics," Proceedings of the National Institute of Sciences of India, 1936</ref> :

\begin{align} (\mathbf{x}-\boldsymbol{\mu}_k)^\top\Sigma_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)&= (\mathbf{x}-\boldsymbol{\mu}_k)^\top U_kS_k^{-1}U_k^T(\mathbf{x}-\boldsymbol{\mu}_k)\\ & = (U_k^\top \mathbf{x}-U_k^\top\boldsymbol{\mu}_k)^\top S_k^{-1}(U_k^\top \mathbf{x}-U_k^\top \boldsymbol{\mu}_k)\\ & = (U_k^\top \mathbf{x}-U_k^\top\boldsymbol{\mu}_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top \mathbf{x}-U_k^\top\boldsymbol{\mu}_k) \\ & = (S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top\boldsymbol{\mu}_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top \boldsymbol{\mu}_k) \\ & = (S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top\boldsymbol{\mu}_k)^\top(S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top \boldsymbol{\mu}_k) \\ \end{align}

If we think of $\, S_k^{-\frac{1}{2}}U_k^\top$ as a linear transformation that takes points in class $\,k$ and distributes them spherically around a point, like in case 1. Thus when we are given a new point, we can apply the modified $\,\delta_k$ values to calculate $\ h^*(\,x)$. After applying the singular value decomposition, $\,\Sigma_k^{-1}$ is considered to be an identity matrix such that

$\,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}[(S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top\boldsymbol{\mu}_k)^\top(S_k^{-\frac{1}{2}}U_k^\top \mathbf{x}-S_k^{-\frac{1}{2}}U_k^\top \boldsymbol{\mu}_k)] + log (\pi_k)$

and,

$\ \log(|I|)=\log(1)=0$

For applying the above method with classes that have different covariance matrices (for example the covariance matrices $\ \Sigma_0$ and $\ \Sigma_1$ for the two class case), each of the covariance matrices has to be decomposed using SVD to find the according transformation. Then, each new data point has to be transformed using each transformation to compare its distance to the mean of each class (for example for the two class case, the new data point would have to be transformed by the class 1 transformation and then compared to $\ \mu_0$ and the new data point would also have to be transformed by the class 2 transformation and then compared to $\ \mu_1$).

The difference between Case 1 and Case 2 (i.e. the difference between using the Euclidean and Mahalanobis distance) can be seen in the illustration below.

Illustration of Euclidean distance (a) and Mahalanobis distance (b) where the contours represent equidistant points from the center using each distance metric. Source: <ref>R. De Maesschalck, D. Jouan-Rimbaud and D. L. Massart, "Tutorial - The Mahalanobis distance," Chemometrics and Intelligent Laboratory Systems, 2000 </ref>

As can be seen from the illustration above, the Mahalanobis distance takes into account the distribution of the data points, whereas the Euclidean distance would treat the data as though it has a spherical distribution. Thus, the Mahalanobis distance applies for the more general classification in Case 2, whereas the Euclidean distance applies to the special case in Case 1 where the data distribution is assumed to be spherical.

Generally, we can conclude that QDA provides a better classifier for the data then LDA because LDA assumes that the covariance matrix is identical for each class, but QDA does not. QDA still uses Gaussian distribution as a class conditional distribution. In our real life, this distribution can not be happened each time, so we have to use other distribution as a complement.

### The Number of Parameters in LDA and QDA

Both LDA and QDA require us to estimate some parameters. Here is a comparison between the number of parameters needed to be estimated for LDA and QDA:

LDA: Since we just need to compare the differences between one given class and remaining $\,K-1$ classes, totally, there are $\,K-1$ differences. For each of them, $\,a^{T}x+b$ requires $\,d+1$ parameters. Therefore, there are $\,(K-1)\times(d+1)$ parameters.

QDA: For each of the differences, $\,x^{T}ax + b^{T}x + c$ requires $\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1$ parameters. Therefore, there are $(K-1)(\frac{d(d+3)}{2}+1)$ parameters.

A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.

## Trick: Using LDA to do QDA

There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the Kernel trick that will be discussed later in the course.

In this approach the feature vector is augmented with the quadratic terms (i.e. new dimensions are introduced) where the original data will be projected to that dimensions. We then apply LDA on the new higher-dimensional data.

The motivation behind this approach is to take advantage of the fact that fewer parameters have to be calculated in LDA , as explained in previous sections, and therefore have a more robust system in situations where we have fewer data points.

If we look back at the equations for LDA and QDA, we see that in LDA we must estimate $\,\mu_1$, $\,\mu_2$ and $\,\Sigma$. In QDA we must estimate all of those, plus another $\,\Sigma$; the extra $\,\frac{d(d-1)}{2}$ estimations make QDA less robust with fewer data points.

### Theoretically

Suppose we have a quadratic function to estimate: $g(\mathbf{x}) = y = \mathbf{x}^T\mathbf{v}\mathbf{x} + \mathbf{w}^T\mathbf{x}$.

Using this trick, we introduce two new vectors, $\,\hat{\mathbf{w}}$ and $\,\hat{\mathbf{x}}$ such that:

$\hat{\mathbf{w}} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]^T$

and

$\hat{\mathbf{x}} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]^T$

We can then apply LDA to estimate the new function: $\hat{g}(\mathbf{x},\mathbf{x}^2) = \hat{y} =\hat{\mathbf{w}}^T\hat{\mathbf{x}}$.

Note that we can do this for any $\, x$ and in any dimension; we could extend a $D \times n$ matrix to a quadratic dimension by appending another $D \times n$ matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a $\,sin(x)$ dimension. Pay attention, We don't do QDA with LDA. If we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA.

## Principal Component Analysis (PCA) (Lecture: Sep. 27, 2011)

Principal Component Analysis (PCA) is a method of dimensionality reduction/feature extraction that transforms the data from a D dimensional space into a new coordinate system of dimension d, where d <= D ( the worst case would be to have d=D). The goal is to preserve as much of the variance in the original data as possible when switching the coordinate systems. Give data on D variables, the hope is that the data points will lie mainly in a linear subspace of dimension lower than D. In practice, the data will usually not lie precisely in some lower dimensional subspace.

The new variables that form a new coordinate system are called principal components (PCs). PCs are denoted by $\ \mathbf{u}_1, \mathbf{u}_2, ... , \mathbf{u}_D$. The principal components form a basis for the data. Since PCs are orthogonal linear transformations of the original variables there is at most D PCs. Normally, not all of the D PCs are used but rather a subset of d PCs, $\ \mathbf{u}_1, \mathbf{u}_2, ... , \mathbf{u}_d$, to approximate the space spanned by the original data points $\ \mathbf{x}=[x_1, x_2, ... , x_D]^T$. We can choose d based on what percentage of the original data we would like to maintain.

The first PC, $\ \mathbf{u}_1$ is called first principal component and has the maximum variance, thus it accounts for the most significant variance in the data. The second PC, $\ \mathbf{u}_2$ is called second principal component and has the second highest variance and so on until PC, $\ \mathbf{u}_D$ which has the minimum variance.

Let $u = \mathbf{w}^T\mathbf{x}=\mathbf{w}^TS\mathbf{w}$ Where $\ S$ is the sample covariance matrix.

We would like to find the $\ \mathbf{w}$ which gives us maximum variation:

$\ \max (Var(\mathbf{w}^T \mathbf{x})) = \max (\mathbf{w}^T S \mathbf{w})$

Note: we require the constraint $\ \mathbf{w}^T \mathbf{w} = 1$ because if there is no constraint on the length of $\ \mathbf{w}$ then there is no upper bound. With the constraint, the direction and not the length that maximizes the variance can be found.

#### Lagrange Multiplier

Before we proceed, we should review Lagrange multipliers.

"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]

Lagrange multipliers are used to find the maximum or minimum of a function $\displaystyle f(x,y)$ subject to constraint $\displaystyle g(x,y)=0$

we define a new constant $\lambda$ called a Lagrange Multiplier and we form the Lagrangian,

$\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)$

If $\displaystyle f(x^*,y^*)$ is the max of $\displaystyle f(x,y)$, there exists $\displaystyle \lambda^*$ such that $\displaystyle (x^*,y^*,\lambda^*)$ is a stationary point of $\displaystyle L$ (partial derivatives are 0).
In addition $\displaystyle (x^*,y^*)$ is a point in which functions $\displaystyle f$ and $\displaystyle g$ touch but do not cross. At this point, the tangents of $\displaystyle f$ and $\displaystyle g$ are parallel or gradients of $\displaystyle f$ and $\displaystyle g$ are parallel, such that:

$\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g$

where,
$\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow$ the gradient of $\, f$
$\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow$ the gradient of $\, g$

#### Example :

Suppose we want to maximize the function $\displaystyle f(x,y)=x-y$ subject to the constraint $\displaystyle x^{2}+y^{2}=1$. We can apply the Lagrange multiplier method to find the maximum value for the function $\displaystyle f$; the Lagrangian is:

$\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)$

We want the partial derivatives equal to zero:

$\displaystyle \frac{\partial L}{\partial x}=1+2 \lambda x=0$

$\displaystyle \frac{\partial L}{\partial y}=-1+2\lambda y=0$

$\displaystyle \frac{\partial L}{\partial \lambda}=x^2+y^2-1$

Solving the system we obtain two stationary points: $\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)$ and $\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)$. In order to understand which one is the maximum, we just need to substitute it in $\displaystyle f(x,y)$ and see which one as the biggest value. In this case the maximum is $\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)$.

### Determining w :

Use the Lagrange multiplier conversion to obtain: $\displaystyle L(\mathbf{w}, \lambda) = \mathbf{w}^T S\mathbf{w} - \lambda (\mathbf{w}^T \mathbf{w} - 1)$ where $\displaystyle \lambda$ is a constant

Take the derivative and set it to zero: $\displaystyle{\partial L \over{\partial \mathbf{w}}} = 0$

To obtain: $\displaystyle 2S\mathbf{w} - 2 \lambda \mathbf{w} = 0$

Rearrange to obtain: $\displaystyle S\mathbf{w} = \lambda \mathbf{w}$

where $\displaystyle w$ is eigenvector of $\displaystyle S$ and $\ \lambda$ is the eigenvalue of $\displaystyle S$ as $\displaystyle S\mathbf{w}= \lambda \mathbf{w}$ , and $\displaystyle \mathbf{w}^T \mathbf{w}=1$ , then we can write

$\displaystyle \mathbf{w}^T S\mathbf{w}= \mathbf{w}^T\lambda \mathbf{w}= \lambda \mathbf{w}^T \mathbf{w} =\lambda$

Note that the PCs decompose the total variance in the data in the following way :

$\sum_{i=1}^{D} Var(u_i)$

$= \sum_{i=1}^{D} (\lambda_i)$

$\ = Tr(S)$ ---- (S is a co-variance matrix, and therefore it's symmetric)

$= \sum_{i=1}^{D} Var(x_i)$

## Principal Component Analysis (PCA) Continued (Lecture: Sep. 29, 2011)

As can be seen from the above expressions, $\ Var(\mathbf{w}^\top \mathbf{w}) = \mathbf{w}^\top S \mathbf{w}= \lambda$ where lambda is an eigenvalue of the sample covariance matrix $\ S$ and $\ \mathbf{w}$ is its corresponding eigenvector. So $\ Var(u_i)$ is maximized if $\ \lambda_i$ is the maximum eigenvalue of $\ S$ and the first principal component (PC) is the corresponding eigenvector. Each successive PC can be generated in the above manner by taking the eigenvectors of $\ S$<ref>www.wikipedia.org/wiki/Eigenvalues_and_eigenvectors</ref> that correspond to the eigenvalues:

$\ \lambda_1 \geq ... \geq \lambda_D$

such that

$\ Var(u_1) \geq ... \geq Var(u_D)$

### Alternative Derivation

Another way of looking at PCA is to consider PCA as a projection from a higher D-dimension space to a lower d-dimensional subspace that minimizes the squared reconstruction error. The squared reconstruction error is the difference between the original data set $\ X$ and the new data set $\hat{X}$ obtained by first projecting the original data set into a lower d-dimensional subspace and then projecting it back into the the original higher D-dimension space. Since information is (normally) lost by compressing the the original data into a lower d-dimensional subspace, the new data set will (normally) differ from the original data even though both are part of the higher D-dimension space. The reconstruction error is computed as shown below.

#### Reconstruction Error

$e = \sum_{i=1}^{n} || x_i - \hat{x}_i ||^2$

#### Minimize Reconstruction Error

Suppose $\bar{x} = 0$ where $\hat{x}_i = x_i - \bar{x}$

Let $\ f(y) = U_d y$ where $\ U_d$ is a D by d matrix with d orthogonal unit vectors as columns.

Fit the model to the data and minimize the reconstruction error:

$\ min_{U_d, y_i} \sum_{i=1}^n || x_i - U_d y_i ||^2$

Differentiate with respect to $\ y_i$:

$\frac{\partial e}{\partial y_i} = 0$

we can rewrite reconstruction-error as : $\ e = \sum_{i=1}^n(x_i - U_d y_i)^T(x_i - U_d y_i)$

$\ \frac{\partial e}{\partial y_i} = 2(-U_d)(x_i - U_d y_i) = 0$

since $\ U_d(x_i - U_d y_i)$ is a linear combination of the columns of $\ U_d$,

which are independent (orthogonal to each other) we can conclude that:

$\ x_i - U_d y_i = 0$ or equivalently,

$\ x_i = U_d y_i$

$\ y_i = U_d^T x_i$

Find the orthogonal matrix $\ U_d$:

$\ min_{U_d} \sum_{i=1}^n || x_i - U_d U_d^T x_i||^2$

#### PCA Implementation Using Singular Value Decomposition

A unique solution can be obtained by finding the Singular Value Decomposition (SVD) of $\ X$:

$\ X = U S V^T$

For each rank d, $\ U_d$ consists of the first d columns of $\ U$. Also, the covariance matrix can be expressed as follows $\ S = \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T$.

Simply put, by subtracting the mean of each of the data point features and then applying SVD, one can find the principal components:

$\tilde{X} = X - \mu$

$\ \tilde{X} = U S V^T$

Where $\ X$ is a d by n matrix of data points and the features of each data point form a column in $\ X$. Also, $\ \mu$ is a d by n matrix with identical columns each equal to the mean of the $\ x_i$'s, ie $\mu_{:,j}=\frac{1}{n}\sum_{i=1}^n x_i$. Note that the arrangement of data points is a convention and indeed in Matlab or conventional statistics, the transpose of the matrices in the above formulae is used.

As the $\ S$ matrix from the SVD has the eigenvalues arranged from largest to smallest, the corresponding eigenvectors in the $\ U$ matrix from the SVD will be such that the first column of $\ U$ is the first principal component and the second column is the second principal component and so on.

### Examples

Note that in the Matlab code in the examples below, the mean was not subtracted from the datapoints before performing SVD. This is what was shown in class. However, to properly perform PCA, the mean should be subtracted from the datapoints.

#### Example 1

Consider a matrix of data points $\ X$ with the dimensions 560 by 1965. 560 is the number of elements in each column. Each column is a vector representation of a 20x28 grayscale pixel image of a face (see image below) and there is a total of 1965 different images of faces. Each of the images are corrupted by noise, but the noise can be removed by projecting the data back to the original space taking as many dimensions as one likes (e.g, 2, 3 4 0r 5). The corresponding Matlab commands are shown below:

</ref>
 >> % start with a 560 by 1965 matrix X that contains the data points
>>
>> % set the colors to grayscale
>> colormap gray
>>
>> % show image in column 10 by reshaping column 10 into a 20 by 28 matrix
>> imagesc(reshape(X(:,10),20,28)')
>>
>> % perform SVD, if X matrix if full rank, will obtain 560 PCs
>> [S U V] = svd(X);
>>
>> % reconstruct X ( project X onto the original space) using only the first ten principal components
>> Y_pca = U(:, 1:10)'*X;
>>
>> % show image in column 10 of X_hat which is now a 560 by 1965 matrix
>> imagesc(reshape(X_hat(:,10),20,28)')


The reason why the noise is removed in the reconstructed image is because the noise does not create a major variation in a single direction in the original data. Hence, the first ten PCs taken from $\ U$ matrix are not in the direction of the noise. Thus, reconstructing the image using the first ten PCs, will remove the noise.

#### Example 2

Consider a matrix of data points $\ X$ with the dimensions 64 by 400. 64 is the number of elements in each column. Each column is a vector representation of a 8x8 grayscale pixel image of either a handwritten number 2 or a handwritten number 3 (see image below) and there are a total of 400 different images, where the first 200 images show a handwritten number 2 and the last 200 images show a handwritten number 3.

An example of the handwritten number images used in Example 2. Source: <ref>A. Ghodsi, "PCA" class notes for STAT841, Department of Statistics and Actuarial Science, University of Waterloo, 2011. </ref>

The corresponding Matlab commands for performing PCA on the data points are shown below:

 >> % start with a 64 by 400 matrix X that contains the data points
>>
>> % set the colors to grayscale
>> colormap gray
>>
>> % show image in column 2 by reshaping column 2 into a 8 by 8 matrix
>> imagesc(reshape(X(:,2),8,8))
>>
>> % perform SVD, if X matrix if full rank, will obtain 64 PCs
>> [U S V] = svd(X);
>>
>> % project data down onto the first two PCs
>> Y = U(:,1:2)'*X;
>>
>> % show Y as an image (can see the change in the first PC at column 200,
>> % when the handwritten number changes from 2 to 3)
>> imagesc(Y)
>>
>> % perform PCA using Matlab build-in function (do not use for assignment)
>> % also note that due to the Matlab convention, the transpose of X is used
>> [COEFF, Y] = princomp(X');
>>
>> % again, use the first two PCs
>> Y = Y(:,1:2);
>>
>> % use plot digits to show the distribution of images on the first two PCs
>> images = reshape(X, 8, 8, 400);
>> plotdigits(images, Y, .1, 1);


Using the plotdigits function in Matlab, clearly illustrates that the first PC captured the differences between the numbers 2 and 3 as they are projected onto different regions of the axis for the first PC. Also, the second PC captured the tilt of the handwritten numbers as numbers tilted to the left or right were projected onto different regions of the axis for the second PC.

#### Example 3

(Not discussed in class) In the news recently was a story that captures some of the ideas behind PCA. Over the past two years, Scott Golder and Michael Macy, researchers from Cornell University, collected 509 million Twitter messages from 2.4 million users in 84 different countries. The data they used were words collected at various times of day and they classified the data into two different categories: positive emotion words and negative emotion words. Then, they were able to study this new data to evaluate subjects' moods at different times of day, while the subjects were in different parts of the world. They found that the subjects generally exhibited positive emotions in the mornings and late evenings, and negative emotions mid-day. They were able to "project their data onto a smaller dimensional space" using PCS. Their paper, "Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures," is available in the journal Science.<ref>http://www.pcworld.com/article/240831/twitter_analysis_reveals_global_human_moodiness.html</ref>.

Assumptions Underlying Principal Component Analysis can be found here<ref>http://support.sas.com/publishing/pubcat/chaps/55129.pdf</ref>

#### Example 4

(Not discussed in class) A somewhat well known learning rule in the field of neural networks called Oja's rule can be used to train networks of neurons to compute the principal component directions of data sets. <ref>A Simplified Neuron Model as a Principal Component Analyzer. Erkki Oja. 1982. Journal of Mathematical Biology. 15: 267-273</ref> This rule is formulated as follows

$\,\Delta w = \eta yx -\eta y^2w$

where $\,\Delta w$ is the neuron weight change, $\,\eta$ is the learning rate, $\,y$ is the neuron output given the current input, $\,x$ is the current input and $\,w$ is the current neuron weight. This learning rule shares some similarities with another method for calculating principal components: power iteration. The basic algorithm for power iteration (taken from wikipedia: <ref>Wikipedia. http://en.wikipedia.org/wiki/Principal_component_analysis#Computing_principal_components_iteratively</ref>) is shown below

$\mathbf{p} =$ a random vector
do c times:
$\mathbf{t} = 0$ (a vector of length m)
for each row $\mathbf{x} \in \mathbf{X^T}$
$\mathbf{t} = \mathbf{t} + (\mathbf{x} \cdot \mathbf{p})\mathbf{x}$
$\mathbf{p} = \frac{\mathbf{t}}{|\mathbf{t}|}$
return $\mathbf{p}$


Comparing this with the neuron learning rule we can see that the term $\, \eta y x$ is very similar to the $\,\mathbf{t}$ update equation in the power iteration method, and identical if the neuron model is assumed to be linear ($\,y(x)=x\mathbf{p}$) and the learning rate is set to 1. Additionally, the $\, -\eta y^2w$ term performs the normalization, the same function as the $\,\mathbf{p}$ update equation in the power iteration method.

### Observations

Some observations about the PCA were brought up in class:

• PCA assumes that data is on a linear subspace or close to a linear subspace. For non-linear dimensionality reduction, other techniques are used. Amongst the first proposed techniques for non-linear dimensionality reduction are Locally Linear Embedding (LLE) and Isomap. More recent techniques include Maximum Variance Unfolding (MVU) and t-Distributed Stochastic Neighbor Embedding (t-SNE). Kernel PCAs may also be used, but they depend on the type of kernel used and generally do not work well in practice. (Kernels will be covered in more detail later in the course.)
• Finding the number of PCs to use is not straightforward. It requires knowledge about the instrinsic dimentionality of data. In practice, oftentimes a heuristic approach is adopted by looking at the eigenvalues ordered from largest to smallest. If there is a "dip" in the magnitude of the eigenvalues, the "dip" is used as a cut off point and only the large eigenvalues before the "dip" are used. Otherwise, it is possible to add up the eigenvalues from largest to smallest until a certain percentage value is reached. This percentage value represents the percentage of variance that is preserved when projecting onto the PCs corresponding to the eigenvalues that have been added together to achieve the percentage.
• It is a good idea to normalize the variance of the data before applying PCA. This will avoid PCA finding PCs in certain directions due to the scaling of the data, rather than the real variance of the data.
• PCA can be considered as an unsupervised approach, since the main direction of variation is not known beforehand, i.e. it is not completely certain which dimension the first PC will capture. The PCs found may not correspond to the desired labels for the data set. There are, however, alternate methods for performing supervised dimensionality reduction.
• (Not in class) Even though the traditional PCA method does not work well on data set that lies on a non-linear manifold. A revised PCA method, called c-PCA, has been introduced to improve the stability and convergence of intrinsic dimension estimation. The approach first finds a minimal cover (a cover of a set X is a collection of sets whose union contains X as a subset<ref>http://en.wikipedia.org/wiki/Cover_(topology)</ref>) of the data set. Since set covering is an NP-hard problem, the approach only finds an approximation of minimal cover to reduce the complexity of the run time. In each subset of the minimal cover, it applies PCA and filters out the noise in the data. Finally the global intrinsic dimension can be determined from the variance results from all the subsets. The algorithm produces robust results.<ref>Mingyu Fan, Nannan Gu, Hong Qiao, Bo Zhang, Intrinsic dimension estimation of data by principal component analysis, 2010. Available: http://arxiv.org/abs/1002.2050</ref>
• (Not in class) While PCA finds the mathematically optimal method (as in minimizing the squared error), it is sensitive to outliers in the data that produce large errors PCA tries to avoid. It therefore is common practice to remove outliers before computing PCA. However, in some contexts, outliers can be difficult to identify. For example in data mining algorithms like correlation clustering, the assignment of points to clusters and outliers is not known beforehand. A recently proposed generalization of PCA based on a Weighted PCA increases robustness by assigning different weights to data objects based on their estimated relevancy.<ref>http://en.wikipedia.org/wiki/Principal_component_analysis</ref>
• (Not in class) Comparison between PCA and LDA: Principal Component Analysis (PCA)and Linear Discriminant Analysis (LDA) are two commonly used techniques for data classification and dimensionality reduction. Linear Discriminant Analysis easily handles the case where the within-class frequencies are unequal and their performances has been examined on randomly generated test data. This method maximizes the ratio of between-class variance to the within-class variance in any particular data set thereby guaranteeing maximal separability. ... The prime difference between LDA and PCA is that PCA does more of feature classification and LDA does data classification. In PCA, the shape and location of the original data sets changes when transformed to a different space whereas LDA doesn’t change the location but only tries to provide more class separability and draw a decision region between the given classes. This method also helps to better understand the distribution of the feature data." <ref> Balakrishnama, S., Ganapathiraju, A. LINEAR DISCRIMINANT ANALYSIS - A BRIEF TUTORIAL. http://www.isip.piconepress.com/publications/reports/isip_internal/1998/linear_discrim_analysis/lda_theory.pdf </ref>

### Summary

The PCA algorithm can be summarized into the following steps:

1. Recover basis
$\ \text{ Calculate } XX^T=\Sigma_{i=1}^{t}x_ix_{i}^{T} \text{ and let } U=\text{ eigenvectors of } XX^T \text{ corresponding to the largest } d \text{ eigenvalues.}$
2. Encode training data
$\ \text{Let } Y=U^TX \text{, where } Y \text{ is a } d \times t \text{ matrix of encodings of the original data.}$
3. Reconstruct training data
$\hat{X}=UY=UU^TX$.
4. Encode test example
$\ y = U^Tx \text{ where } y \text{ is a } d\text{-dimensional encoding of } x$.
5. Reconstruct test example
$\hat{x}=Uy=UU^Tx$.

### Dual PCA

Singular value decomposition allows us to formulate the principle components algorithm entirely in terms of dot products between data points and limit the direct dependence on the original dimensionality d. Now assume that the dimensionality d of the d × n matrix of data X is large (i.e., d >> n). In this case, the algorithm described in previous sections become impractical. We would prefer a run time that depends only on the number of training examples n, or that at least has a reduced dependence on n. Note that in the SVD factorization $\ X = U \Sigma V^T$, the eigenvectors in $\ U$ corresponding to non-zero singular values in $\ \Sigma$ (square roots of eigenvalues) are in a one-to-one correspondence with the eigenvectors in $\ V$ . After performing dimensionality reduction on $\ U$ and keep only the first l eigenvectors, corresponding to the top l non-zero singular values in $\ \Sigma$, these eigenvectors will still be in a one-to-one correspondence with the first l eigenvectors in $\ V$ :

$\ X V = U \Sigma$

$\ \Sigma$ is square and invertible, because its diagonal has non-zero entries. Thus, the following conversion between the top l eigenvectors can be derived:

$\ U = X V \Sigma^{-1}$

Now Replacing $\ U$ with $\ X V \Sigma^{-1}$ gives us the dual form of PCA.

## Fisher Discriminant Analysis (FDA) (Lecture: Sep. 29, 2011 - Oct. 04, 2011)

Fisher Discriminant Analysis (FDA) is sometimes called Fisher Linear Discriminant Analysis (FLDA) or just Linear Discriminant Analysis (LDA). This causes confusion with the Linear Discriminant Analysis (LDA) technique covered earlier in the course. The LDA technique covered earlier in the course has a normality assumption and is a boundary finding technique. The FDA technique outlined here is a supervised feature extraction technique. FDA differs from PCA as well because PCA does not use the class labels, $\ y_i$, of the data $\ (x_i,y_i)$ while FDA organizes data into their classes by finding the direction of maximum separation between classes.

• PCA

- Find a rank d subspace which minimize the squared reconstruction error:

$\Sigma = |x_i - \hat{x} |^2$

where $\hat{x}$ is projection of original data.

One main drawback of the PCA technique is that the direction of greatest variation may not produce the classification we desire. For example, imagine if the data set above had a lightening filter applied to a random subset of the images. Then the greatest variation would be the brightness and not the more important variations we wish to classify. As another example , if we imagine 2 cigar like clusters in 2 dimensions, one cigar has $y = 1$ and the other $y = -1$. The cigars are positioned in parallel and very closely together, such that the variance in the total data-set, ignoring the labels, is in the direction of the cigars. For classification, this would be a terrible projection, because all labels get evenly mixed and we destroy the useful information. A much more useful projection is orthogonal to the cigars, i.e. in the direction of least overall variance, which would perfectly separate the data-cases (obviously, we would still need to perform classification in this 1-D space.) See figure below <ref>www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf</ref>. FDA circumvents this problem by using the labels, $\ y_i$, of the data $\ (x_i,y_i)$ i.e. the FDA uses supervised learning. The main difference between FDA and PCA is that, in PCA we are interested in transforming the data to a new coordinate system such that the greatest variance of data lies on the first coordinate, but in FDA, we project the data of each class onto a point in such a way that the resulting points would be as far apart from each other as possible. The FDA goal is achieved by projecting data onto a suitably chosen line that minimizes the within class variance, and maximizes the distance between the two classes i.e. group similar data together and spread different data apart. This way, new data acquired can be compared, after a transformation, to where these projections, using some well-chosen metric.

We first consider the cases of two-classes. Denote the mean and covariance matrix of class $i=0,1$ by $\mathbf{\mu}_i$ and $\mathbf{\Sigma}_i$ respectively. We transform the data so that it is projected into 1 dimension i.e. a scalar value. To do this, we compute the inner product of our $dx1$-dimensional data, $\mathbf{x}$, by a to-be-determined $dx1$-dimensional vector $\mathbf{w}$. The new means and covariances of the transformed data:

$\mu'_i:\rightarrow \mathbf{w}^{T}\mathbf{\mu}_i$
$\Sigma'_i :\rightarrow \mathbf{w}^{T}\mathbf{\sigma}_i \mathbf{w}$

The new means and variances are actually scalar values now, but we will use vector and matrix notation and arguments throughout the following derivation as the multi-class case is then just a simpler extension.

### Goals of FDA

As will be shown in the objective function, the goal of FDA is to maximize the separation of the classes (between class variance) and minimize the scatter within each class (within class variance). That is, our ideal situation is that the individual classes are as far away from each other as possible and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case). An interesting note is that R. A. Fisher who FDA is named after, used the FDA technique for purposes of taxonomy, in particular for categorizing different species of iris flowers. <ref name="RAFisher">R. A. Fisher, "The Use of Multiple measurements in Taxonomic Problems," Annals of Eugenics, 1936</ref>. It is very easy to visualize what is meant by within class variance (i.e. differences between the iris flowers of the same species) and between class variance (i.e. the differences between the iris flowers of different species) in that case.

First, we need to reduce the dimensionality of covariate to one dimension (two-class case) by projecting the data onto a line. That is take the d-dimensional input values x and project it to one dimension by using $z=\mathbf{w}^T \mathbf{x}$ where $\mathbf{w}^T$ is 1 by d and $\mathbf{x}$ is d by 1.

Goal: choose the vector $\mathbf{w}=[w_1,w_2,w_3,...,w_d]^T$ that best seperate the data, then we perform classification with projected data $z$ instead of original data $\mathbf{x}$ .

$\hat{{\mu}_0}=\frac{1}{n_0}\sum_{i:y_i=0} x_i$

$\hat{{\mu}_1}=\frac{1}{n_1}\sum_{i:y_i=1} x_i$

$\mathbf{x}\rightarrow\mathbf{w}^{T}\mathbf{x}$.
$\mathbf{\mu}\rightarrow\mathbf{w}^{T}\mathbf{\mu}$.
$\mathbf{\Sigma}\rightarrow\mathbf{w}^{T}\mathbf{\Sigma}\mathbf{w}$

1) Our first goal is to minimize the individual classes' covariance. This will help to collapse the data together. We have two minimization problems

$\min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{\Sigma}_0 \mathbf{w}$

and

$\min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{\Sigma}_1 \mathbf{w}$.

But these can be combined:

$\min_{\mathbf{w}} \mathbf{w} ^{T}\mathbf{\Sigma}_0 \mathbf{w} + \mathbf{w}^{T} \mathbf{\Sigma}_1 \mathbf{w}$
$= \min_{\mathbf{w}} \mathbf{w} ^{T}( \mathbf{\Sigma_0} + \mathbf{\Sigma_1} ) \mathbf{w}$

Define $\mathbf{S}_W =\mathbf{\Sigma_0} + \mathbf{\Sigma_1}$, called the within class variance matrix.

2) Our second goal is to move the minimized classes as far away from each other as possible. One way to accomplish this is to maximize the distances between the means of the transformed data i.e.

$\max_{\mathbf{w}} |\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1|^2$

Simplifying:

$\max_{\mathbf{w}} \,(\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1)^T (\mathbf{w}^{T}\mathbf{\mu}_0 - \mathbf{w}^{T}\mathbf{\mu}_1)$
$= \max_{\mathbf{w}}\, (\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w} \mathbf{w}^{T} (\mathbf{\mu}_0-\mathbf{\mu}_1)$
$= \max_{\mathbf{w}} \,\mathbf{w}^{T}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}\mathbf{w}$

Recall that $\mathbf{\mu}_i$ are known. Denote

$\mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}$

This matrix, called the between class variance matrix, is a rank 1 matrix, so an inverse does not exist. Altogether, we have two optimization problems we must solve simultaneously:

1) $\min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_W} \mathbf{w}$
2) $\max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w}$

There are other metrics one can use to both minimize the data's variance and maximizes the distance between classes, and other goals we can try to accomplish (see metric learning, below...one day), but Fisher used this elegant method, hence his recognition in the name, and we will follow his method.

We can combine the two optimization problems into one after noting that the negative of max is min:

$\max_{\mathbf{w}} \; \alpha \mathbf{w}^{T} \mathbf{S_B} \mathbf{w} - \mathbf{w}^{T} \mathbf{S_W} \mathbf{w}$

The $\alpha$ coefficient is a necessary scaling factor: if the scale of one of the terms is much larger than the other, the optimization problem will be dominated by the larger term. This means we have another unknown, $\alpha$, to solve for. Instead, we can circumvent the scaling problem by looking at the ratio of the quantities, the original solution Fisher proposed:

$\max_{\mathbf{w}} \frac{\mathbf{w}^{T} \mathbf{S_B} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}}$

This optimization problem can be shown<ref> http://www.socher.org/uploads/Main/optimizationTutorial01.pdf </ref> to be equivalent to the following optimization problem:

$\max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w}$

(optimized function)

subject to:

${\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}} = 1$

(constraint)

A heuristic understanding of this equivalence is that we have two degrees of freedom: direction and scalar. The scalar value is irrelevant to our discussion. Thus, we can set one of the values to be a constant. We can use Lagrange multipliers to solve this optimization problem:

$L( \mathbf{w}, \lambda) = \mathbf{w}^{T} \mathbf{S_B} \mathbf{w} - \lambda(\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}-1)$
$\Rightarrow \frac{\partial L}{\partial \mathbf{w}} = 2 \mathbf{S}_B \mathbf{w} - 2\lambda \mathbf{S}_W\mathbf{w}$

Setting the partial derivative to 0 gives us a generalized eigenvalue problem:

$\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w}$
$\Rightarrow \mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w}$

This is a generalized eigenvalue problem and $\ \mathbf{w}$ can be computed as the eigenvector corresponds to the largest eigenvalue of

$\mathbf{S}_W^{-1} \mathbf{S}_B$

It is very likely that $\mathbf{S}_W$ has an inverse. If not, the pseudo-inverse<ref> http://en.wikipedia.org/wiki/Generalized_inverse </ref><ref> http://www.mathworks.com/help/techdoc/ref/pinv.html </ref> can be used. In Matlab the pseudo-inverse function is named pinv. Thus, we should choose $\mathbf{w}$ to equal the eigenvector of the largest eigenvalue as our projection vector.

In fact we can simplify the above expression further in the case of two classes. Recall the definition of $\mathbf{S}_B = (\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T}$. Substituting this into our expression:

$\mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1)(\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w} = \lambda \mathbf{w}$
$(\mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) ) ((\mathbf{\mu}_0-\mathbf{\mu}_1)^{T} \mathbf{w}) = \lambda \mathbf{w}$

This second term is a scalar value, let's denote it $\beta$. Then

$\mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) = \frac{\lambda}{\beta} \mathbf{w}$
$\Rightarrow \, \mathbf{S}_W^{-1}(\mathbf{\mu}_0-\mathbf{\mu}_1) \propto \mathbf{w}$

(this equation indicates the direction of the separation). All we are interested in the direction of $\mathbf{w}$, so to compute this is sufficient to finding our projection vector. Though this will not work in higher dimensions, as $\mathbf{w}$ would be a matrix and not a vector in higher dimensions.

### Extensions to Multiclass Case

If we have $\ k$ classes, we need $\ k-1$ directions i.e. we need to project $\ k$ 'points' onto a $\ k-1$ dimensional hyperplane. What does this change in our above derivation? The most significant difference is that our projection vector,$\mathbf{w}$, is no longer a vector but instead is a matrix $\mathbf{W}$, where $\mathbf{W}$ is a d*(k-1) matrix if X is in d-dim. We transform the data as:

$\mathbf{x}' :\rightarrow \mathbf{W}^{T} \mathbf{x}$

so our new mean and covariances for class k are:

$\mathbf{\mu_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\mu_k}$
$\mathbf{\Sigma_k}' :\rightarrow \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W}$

What are our new optimization sub-problems? As before, we wish to minimize the within class variance. This can be formulated as:

$\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{\Sigma_1} \mathbf{W} + \dots + \mathbf{W}^{T} \mathbf{\Sigma_k} \mathbf{W}$

Again, denoting $\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}$, we can simplify above expression:

$\min_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_W \mathbf{W}$

Similarly, the second optimization problem is:

$\max_{\mathbf{W}} \mathbf{W}^{T} \mathbf{S}_B \mathbf{W}$

What is $\mathbf{S}_B$ in this case? It can be shown that $\mathbf{S}_T = \mathbf{S}_B + \mathbf{S}_W$ where $\mathbf{S}_T$ is the covariance matrix of all the data. From this we can compute $\mathbf{S}_B$.

Next, if we express $\mathbf{W} = ( \mathbf{w}_1 , \mathbf{w}_2 , \dots ,\mathbf{w}_k )$ observe that, for $\mathbf{A} = \mathbf{S}_B , \mathbf{S}_W$:

$Tr(\mathbf{W}^{T} \mathbf{A} \mathbf{W}) = \mathbf{w}_1^{T} \mathbf{A} \mathbf{w}_1^{T} + \dots + \mathbf{w}_k \mathbf{A} \mathbf{w}_k$

where $\ Tr()$ is the trace of a matrix. Thus, following the same steps as in the two-class case, we have the new optimization problem:

$\max_{\mathbf{W}} \frac{ Tr(\mathbf{W}^{T} \mathbf{S}_B \mathbf{W}) }{Tr(\mathbf{W}^{T} \mathbf{S}_W \mathbf{W})}$

The first (k-1) eigenvector of $\mathbf{S}_W^{-1} \mathbf{S}_B$ are required (k-1) direction. That is why under multiclass case, for the k-class problem, we need to project initial points onto k-1 direction.

subject to:

$Tr( \mathbf{W} \mathbf{S_W} \mathbf{W}^{T}) = 1$

Again, in order to solve the above optimization problem, we can use the Lagrange multiplier <ref> http://en.wikipedia.org/wiki/Lagrange_multiplier </ref>:

\begin{align}L(\mathbf{W},\Lambda) = Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}] - \Lambda\left\{ Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}] - I \right\}\end{align}.

where $\ \Lambda$ is a d by d diagonal matrix.

Then, we differentiating with respect to $\mathbf{W}$:

\begin{align}\frac{\partial L}{\partial \mathbf{W}} = (\mathbf{S}_{B} + \mathbf{S}_{B}^{T})\mathbf{W} - \Lambda (\mathbf{S}_{W} + \mathbf{S}_{W}^{T})\mathbf{W}\end{align} = 0.

Thus:

\begin{align}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{S}_{W}\mathbf{W}\end{align}
\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{W} = \Lambda\mathbf{W}\end{align}

where, $\mathbf{\Lambda} =\begin{pmatrix}\lambda_{1} & & 0\\&\ddots&\\0 & &\lambda_{d}\end{pmatrix}$

The above equation is of the form of an eigenvalue problem. Thus, for the solution the k-1 eigenvectors corresponding to the k-1 largest eigenvalues should be chosen as the projection matrix, $\mathbf{W}$. In fact, there should only by k-1 eigenvectors corresponding to k-1 non-zero eigenvalues using the above equation.

### Summary

FDA has two optimization problems:

1) $\min_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_W} \mathbf{w}$
2) $\max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w}$

where $\mathbf{S}_W = \mathbf{\Sigma_1} + \dots + \mathbf{\Sigma_k}$ is called the within class variance and $\ \mathbf{S}_B = \mathbf{S}_T - \mathbf{S}_W$ is called the between class variance where $\mathbf{S}_T$ is the variance of all the data together.

Every column of $\mathbf{w}$ paranell a single eigenvector.

The two optimization problems are combined as follows:

$\max_{\mathbf{w}} \frac{\mathbf{w}^{T} \mathbf{S_B} \mathbf{w}}{\mathbf{w}^{T} \mathbf{S_W} \mathbf{w}}$

By adding a constraint as shown:

$\max_{\mathbf{w}} \mathbf{w}^{T} \mathbf{S_B} \mathbf{w}$

subject to:

$\mathbf{w}^{T} \mathbf{S_W} \mathbf{w} = 1$

Lagrange multipliers can be used and essentially the problem becomes an eigenvalue problem:

\begin{align}\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w} = \lambda\mathbf{w}\end{align}

And $\ w$ can be computed as the k-1 eigenvectors corresponding to the largest k-1 eigenvalues of $\mathbf{S}_W^{-1} \mathbf{S}_B$.

### Variations

Some adaptations and extensions exist for the FDA technique (Source: <ref>R. Gutierrez-Osuna, "Linear Discriminant Analysis" class notes for Intro to Pattern Analysis, Texas A&M University. Available: [1]</ref>):

1) Non-Parametric LDA (NPLDA) by Fukunaga

This method does not assume that the Gaussian distribution is unimodal and it is actually possible to extract more than k-1 features (where k is the number of classes).

2) Orthonormal LDA (OLDA) by Okada and Tomita

This method finds projections that are orthonormal in addition to maximizing the FDA objective function. This method can also extract more than k-1 features (where k is the number of classes).

3) Generalized LDA (GLDA) by Lowe

This method incorporates additional cost functions into the FDA objective function. This causes classes with a higher cost to be placed further apart in the lower dimensional representation.

### Optical Character Recognition (OCR) using FDA

Optical Character Recognition (OCR) is a method to translate scanned, human-readable text into machine-encoded text. In class, we have employed FDA to recognize digits. A paper <ref>Manjunath Aradhya, V.N., Kumar, G.H., Noushath, S., Shivakumara, P., "Fisher Linear Discriminant Analysis based Technique Useful for Efficient Character Recognition", Intelligent Sensing and Information Processing, 2006.</ref> describes the use of FDA to recognize printed documents written in English and Kannada, the fifth most popular language in India. The researchers conducted two types of experiments: one on printed Kannada and English documents and another on handwritten English characters. In the first type of experiments, they conducted four experiments: i) clear and degraded characters in specific fonts; ii) characters in various size; iii) characters in various fonts; iv) characters with noise. In experiment i, FDA achieved 98.2% recognition rate with 12 projection vectors in 21,560 samples. In experiment ii, it achieved 96.9% recognition rate with 10 projection vectors in 11,200 samples. In experiment iii, it achieved 93% recognition rate with 17 projection vectors in 19,850 samples. In experiment iv, it achieved 96.3% recognition rate with 14 projection vectors in 20,000 samples. Overall, the recognition by FDA was very satisfying. In the second type of experiment, a total of 12,400 handwriting samples from 200 different writers were collected. With 175 samples for training purpose, the recognition rate by FDA is 92% with 35 projection vectors.

### Facial Recognition using FDA

The Fisherfaces method of facial recognition uses PCA and FDA in a similar way to using just PCA. However, it is more advantageous than using on PCA because it minimizes variation within each class and maximizes class separation. The PCA only method is, therefore, more sensitive to lighting and pose variations. In studies done by Belhumeir, Hespanda, and Kiregeman (1997) and Turk and Pentland (1991), this method had a 96% recognition rate. <ref>Bagherian, Elham. Rahmat, Rahmita. Facial Feature Extraction for Face Recognition: a Review. International Symposium on Information Technology, 2008. ITSim2 article number 4631649.</ref>

## Linear and Logistic Regression (Lecture: Oct. 06, 2011)

### Linear Regression

Both Regression and Classification are aimed to find a function h which maps data X to feature Y. In regression, $\ y$ is a continuous variable. In classification, $\ y$ is a discrete variable. In linear regression, data is modeled using a linear function, and unknown parameters are estimated from the data. Regression problems are easier to formulate into functions (since $\ y$ is continuous) and it is possible to solve classification problems by treating them like regression problems. In order to do so, the requirement in classification that $\ y$ is discrete must first be relaxed. Once $\ y$ has been found using regression techniques, it is possible to determine the discrete class corresponding to the $\ y$ that has been found to solve the original classification problem. The discrete class is obtained by defining a threshold where $\ y$ values below the threshold belong to one class and $\ y$ values above the threshold belong to another class.

When running a regression we are making two assumptions, 1) there is a linear relationship between two variables (i.e. X and Y) and 2) this relationship is additive (i.e. Y= x1 + x2 + …+xN). Technically, linear regression estimates how much Y changes when X changes one unit.

More formally: a more direct approach to classification is to estimate the regression function $\ r(\mathbf{x}) = E[Y | X]$ without bothering to estimate $\ f_k(\mathbf{x})$. For the linear model, we assume that either the regression function $r(\mathbf{x})$ is linear, or the linear model has a reasonable approximation.

Here is a simple example. If $\ Y = \{0,1\}$ (a two-class problem), then $\, h^*(\mathbf{x})= \left\{\begin{matrix} 1 &\text{, if } \hat r(\mathbf{x})\gt \frac{1}{2} \\ 0 &\mathrm{, otherwise} \end{matrix}\right.$

Basically, we can use a linear function $\ f(x, \beta) = y_i = \mathbf{\beta\,}^T \mathbf{x_{i}} + \mathbf{\beta\,_0}$ , $\mathbf{x_{i}} \in \mathbb{R}^{d}$ and use the least squares approach to fit the function to the given data. This is done by minimizing the following expression:

$\min_{\mathbf{\beta}} \sum_{i=1}^n (y_i - \mathbf{\beta}^T \mathbf{x_{i}} - \mathbf{\beta_0})^2$

For convenience, $\mathbf{\beta}$ and $\mathbf{\beta}_0$ can be combined into a d+1 dimensional vector, $\tilde{\mathbf{\beta}}$. And an extra term 1 is appended to $\ x$. Thus, the function to be minimized can now be re-expressed as:

$\ LS = \min_{\tilde{\beta}} \sum_{i=1}^{n} (y_i - \tilde{\beta}^T \tilde{x_i} )^2$

$\ LS = \min_{\tilde{\beta}} || y - X \tilde{\beta} ||^2$

where

$\tilde{\mathbf{\beta}} = \left( \begin{array}{c}\mathbf{\beta_{1}} \\ \\ \vdots \\ \\ \mathbf{\beta}_{d} \\ \\ \mathbf{\beta}_{0} \end{array} \right) \in \mathbb{R}^{d+1}$ and

$\tilde{x} = \left( \begin{array}{c}{x_{1}} \\ \\ \vdots \\ \\ {x}_{d} \\ \\ 1 \end{array} \right) \in \mathbb{R}^{d+1}$.

where $\tilde{\mathbf{\beta}}$ is a d+1 by 1 matrix(a d+1 dimensional vector)

Here $\ y$ and $\tilde{\beta}$ are vectors and $\ X$ is a n by d+1 matrix with each row represents a data point with a 1 as the last entry. X also can be seen as a matrix in which each column represents a feature and the $\ (d+1)^{th}$ column is an all-one vector corresponding to $\ \beta_0$ .

$\ {\tilde{\beta}}$ that minimizes the error is:

$\ \frac{\partial LS}{\partial \tilde{\beta}} = -2(X^T)(y-X\tilde{\beta})=0$, which gives us $\ {\tilde{\beta}} = (X^TX)^{-1}X^Ty$. When $\ X^TX$ is singular we have to use pseudo inverse for obtaining optimal $\ \tilde{\beta}$.

Using regression to solve classification problems is not mathematically correct, if we want to be true to classification. However, this method works well in practice, if the problem is not complicated. When we have only two classes (for which the target values are encoded as $\ \frac{-n}{n_1}$ and $\ \frac{n}{n_2}$, where $\ n_i$ is the number of data points in class i and n is the total number of points in the data set) this method is identical to LDA.

#### Matlab Example

The following is the code and the explanation for each step.

Again, we use the data in 2_3.m.

 >>load 2_3;
>>[U, sample] = princomp(X');
>>sample = sample(:,1:2);


We carry out Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 2.

 >>y = zeros(400,1);
>>y(201:400) = 1;


We let y represent the set of labels coded as 0 and 1.

 >>x=[sample;ones(1,400)];


Construct x by adding a row of vector 1 to data.

 >>b=inv(x*x')*x*y;


Calculate b, which represents $\beta$ in the linear regression model.

 >>x1=x';
>>for i=1:400
if x1(i,:)*b>0.5
plot(x1(i,1),x1(i,2),'.')
hold on
elseif x1(i,:)*b < 0.5
plot(x1(i,1),x1(i,2),'r.')
end
end


Plot the fitted y values.

File:linearregression.png
the figure shows that the classification of the data points in 2_3.m by the linear regression model

#### Practical Usefulness

Linear regression in general is not very useful for classification purposes. One of the main problems is that new data may not always have a positive ("more successful") impact on the linear regression learning algorithm due to the non-linear "binary" form of the classes. Consider the following simple example:

The boundary decision at $r(x)=0.5$ was added for visualization purposes. Clearly, linear regression categorises this data properly. However, consider adding one more datum:

This datum actually skews linear regression to the point that it misclassifies some of the data points that should be labelled '1'. This shows how linear regression cannot adapt well to binary classification problems.

#### general guidelines for building a regression model

1. Make sure all relevant predictors are included. These are based on your research question, theory and knowledge on the topic. 2. Combine those predictors that tend to measure the same thing (i.e. as an index). 3. Consider the possibility of adding interactions (mainly for those variables with large effects) 4. Strategy to keep or drop variables:

 1. Predictor not significant and has the expected sign -> Keep it
2. Predictor not significant and does not have the expected sign -> Drop it
3. Predictor is significant and has the expected sign -> Keep it
4. Predictor is significant but does not have the expected sign -> Review, you may need more variables, it
may be interacting with another variable in the model or there may be an error in the data.<ref>


### Logistic Regression

Logistic regression is a more advanced method for classification, and is more commonly used. In statistics, logistic regression (sometimes called the logistic model or logit model) is used for prediction of the probability of occurrence of an event by fitting data to a logit function logistic curve. It is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index. Logistic regression is used extensively in the medical and social sciences fields, as well as marketing applications such as prediction of a customer's propensity to purchase a product or cease a subscription.<ref>http://en.wikipedia.org/wiki/Logistic_regression</ref>

We can define a function
$f_1(x)= P(Y=1| X=x) = (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})$

$P(Y=1 | X=x)$

This is a valid conditional density function since the two components ($f_1$ and $f_2$, shown just below) sum to 1 and remain in [0, 1].

It looks similar to a step function, but we have relaxed it so that we have a smooth curve, and can therefore take the derivative.

The range of this function is (0,1) since

$\lim_{x \to -\infty}f_1(\mathbf{x}) = 0$ and $\lim_{x \to \infty}f_1(\mathbf{x}) = 1$.

As shown on this graph of $\ P(Y=1 | X=x)$.

Then we compute the complement of f1(x), and get

$f_2(x)= P(Y=0| X=x) = 1-f_1(x) = (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})$, denoted $f_2$.

$P(Y=0 | X=x)$

Function $f_2$ is commonlly called Logistic function, and it behaves like
$\lim_{x \to -\infty}f_2(\mathbf{x}) = 1$ and
$\lim_{x \to \infty}f_2(\mathbf{x}) = 0$.

As shown on this graph of $\ P(Y=0 | X=x)$.

Since $f_1$ and $f_2$ specify the conditional distribution, the Bernoulli distribution is appropriate for specifying the likelihood of the class. Conveniently code the two classes via 0 and 1 responses, then the likelihood of $y_i$ for given input $x_i$ is given by,

$f(y_i|\mathbf{x_i}) = (f_1(\mathbf{x_i}))^{y} (1-f_1\mathbf{x_i}))^{1-y} = (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{y_i} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{1-y_i}$

Thus y takes value 1 with success probability $f_1$ and value 0 with failure probability $1 - f_1$. We can use this to derive the likelihood for N training observations, and search for the maximizing parameter $\beta$.

In general, we can think of the problem as having a box with some knobs. Inside the box is our objective function which gives the form to classify our input ($x_i$) to our output ($y_i$). The knobs in the box are functioning like the parameters of the objective function. Our job is to find the proper parameters that can minimize the error between our output and the true value. So we have turned our machine learning problem into an optimization problem.

Since we need to find the parameters that maximize the chance of having our observed data coming from the distribution of $f (x|\theta)$, we need to introduce Maximum Likelihood Estimation.

#### Maximum Likelihood Estimation

Given iid data points $({\mathbf{x}_i})_{i=1}^n$ (i.i.d) and density function $f(\mathbf{x}|\mathbf{\theta})$, where the form of f is known but the parameters $\theta$ are our unknown. For example, we know that the data come from a Gaussian distribution but we don't know the mean and variance of the distribution. The maximum likelihood estimation of $\theta\,_{ML}$ is a set of parameters that maximize the probability of observing $({\mathbf{x}_i})_{i=1}^n$ given $\theta\,_{ML}$.

$\theta_\mathrm{ML} = \underset{\theta}{\operatorname{arg\,max}}\ f(\mathbf{x}|\theta)$.

There was some discussion in class regarding the notation. In literature, Bayesians use $f(\mathbf{x}|\mu)$ while Frequentists use $f(\mathbf{x};\mu)$. In practice, these two are equivalent.

Our goal is to find theta to maximize $\mathcal{L}(\theta\,) = f(\underline{\mathbf{x}}|\;\theta) = \prod_{i=1}^n f(\mathbf{x_i}|\theta)$. where $\underline{\mathbf{x}}=\{x_i\}_{i=1}^{n}$ (The second equality holds because data points are iid.)

In many cases, it’s more convenient to work with the natural logarithm of the likelihood. (Recall that the logarithm preserves minumums and maximums.) $\ell(\theta)=\ln\mathcal{L}(\theta\,)$

$\ell(\theta\,)=\sum_{i=1}^n \ln f(\mathbf{x_i}|\theta)$

Applying Maximum Likelihood Estimation to $f(y|\mathbf{x})= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{y} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})^{1-y}$, gives

$\mathcal{L}(\mathbf{\beta\,})=\prod_{i=1}^n (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{y_i} (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{1-y_i}$

$\ell(\mathbf{\beta\,}) = \sum_{i=1}^n \left[ y_i \ln(P(Y=y_i|X=x_i)) + (1-y_i) \ln(1-P(Y=y_i|X=x_i))\right]$

This is the likelihood function we want to maximize. Note that $-\ell(\mathbf{\beta\,})$ can be interpreted as the cost function we want to minimize. Simplifying, we get:

\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) + (1-y_i) (\ln{1} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}))\right) \\[10pt]&{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - (1-y_i) \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i ({\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})) - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) + y_i \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \\[10pt] &{} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}

\begin{align} {\frac{\partial \ell}{\partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}} \mathbf{x_i} \right) \\[8pt] & {}= \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(\mathbf{x_i} | \mathbf{\beta\,}) \mathbf{x_i}\right) \end{align}

Now set $\frac{\partial \ell}{\partial \mathbf{\beta\,}}$ equal to 0, and $\mathbf{\beta\,}$ can be numerically solved by Newton's method.

#### Newton's Method

Newton's Method (or Newton-Raphson method) is a numerical method to find better approximations to the solutions of real-valued function. The function usually does not have an analytical form.

The goal is to find $\mathbf{x}$ such that $f(\mathbf{x}) = 0$, such Xs are called the roots of function f. The recursion can be implemented by $\mathbf{x_1} = \mathbf{x_0} - \frac{f(\mathbf{x_0})}{f'(\mathbf{x_0})}.\,\!$.

It takes an initial guess $\mathbf{x_0}$ and the direction $\ \frac{f(x_0)}{f'(x_0)}$ that moves toward a better approximation. It then finds a newer and better $\mathbf{x_1}$. Taking this $\mathbf{x_1}$ as $\mathbf{x_0}$ in the second run, it finds a newer and better $\mathbf{x_1}$ than the previous $\mathbf{x_1}$. Repeating the same process, the $\mathbf{x_1}$ will be sufficiently accurate to the actual solutions.

##### Matlab Example

Below is the Matlab code to find a root of the function $\,y=x^2-2500$ from the initial guess of $\,x=90$. The roots of this equation are trivially solved analytically to be $\,x=\pm 50$.

x=1:100;
y=x.^2 - 2500;  %function to find root of
plot(x,y);

x_opt=90;  %starting guess
x_traversed=[];
y_traversed=[];
error=[];

for i=1:6,
y_opt=x_opt^2-2500;
y_prime_opt=2*x_opt;

%save results of each iteration
x_traversed=[x_traversed x_opt];
y_traversed=[y_traversed y_opt];
error=[error abs(y_opt)];

%update minimum
x_opt=x_opt-(y_opt/y_prime_opt);
end

hold on;
plot(x_traversed,y_traversed,'r','LineWidth',2);
title('Progressions Towards Root of y=x^2 - 2500');
legend('y=x^2 - 2500','Progression');
xlabel('x');
ylabel('y');

hold off;
figure();
semilogy(1:6,error);
title('Error vs Iteration');
xlabel('Iteration');
ylabel('Absolute Y Error');


In this example the Newton method converges to an optimum to within machine precision in only 6 iterations as can be seen from the plot of the Y deviate below.

• Linear regression implements a statistical model that, when relationships between the independent variables and the dependent variable are almost linear, shows optimal results.
• Linear regression is often inappropriately used to model non-linear relationships.
• Linear regression is limited to predicting numeric output.
• A lack of explanation about what has been learned can be a problem.

Logistic regression has several advantages over discriminant analysis:

• It is more robust: the independent variables don't have to be normally distributed, or have equal variance in each group.
• It does not assume a linear relationship between the IV and DV.
• It may handle nonlinear effects.
• You can add explicit interaction and power terms.
• The DV need not be normally distributed.
• There is no homogeneity of variance assumption.
• Normally distributed error terms are not assumed.
• It does not require that the independent variables be interval.
• It does not require that the independent variables be unbounded.

### Comparison Between Logistic Regression And Linear Regression

Linear regression is a regression where the explanatory variable X and response variable Y are linearly related. Both X and Y can be continuous variables, and for every one unit increase in the explanatory variable, there is a set increase or decrease in the response variable Y. A closed form solution exists for the least squares estimate of $\beta$.

Logistic regression is a regression where the explanatory variable X and response variable Y are not linearly related. The response variable provides the probability of occurrence of an event. X can be continuous but Y must be a categorical variable (e.g., can only assume two values, i.e. 0 or 1). For every one unit increase in the explanatory variable, there is a set increase or decrease in the probability of occurrence of the event. No closed form solution exists for the least squares estimate of $\beta$.

In terms of making assumptions on the data set: In LDA, we assumed that the probability density function (PDF) of each class and priors were Gaussian and Bernoulli, respectively. However, in Logistic Regression, we assumed that the PDF of each class had a parametric form and we ignored the priors. Therefore, we may conclude that Logistic regression has less assumptions than LDA.

## Newton-Raphson Method (Lecture: Oct 11, 2011)

Previously we had derivated the log likelihood function for the logistic function.

\begin{align} L(\beta\,) = \prod_{i=1}^n \left( (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{y_i}(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})^{1-y_i} \right) \end{align}

After taking log, we can have:

\begin{align} \ell(\beta\,) = \sum_{i=1}^n \left( y_i \ln{\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}} + (1 - y_i) \ln{\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}} \right) \end{align}

This implies that:

\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i \left( {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln(1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}) \right) - (1 - y_i)\ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}

\begin{align} {\ell(\mathbf{\beta\,})} & {} = \sum_{i=1}^n \left(y_i {\mathbf{\beta\,}^T \mathbf{x_i}} - \ln({1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})\right) \end{align}

Our goal is to find the $\beta\,$ that maximizes ${\ell(\mathbf{\beta\,})}$. We use calculus to do this ie solve ${\frac{\partial \ell}{\partial \mathbf{\beta\,}}}=0$. To do this we use the famous numerical method of Newton-Raphson. This is an iterative method where we calculate the first and second derivative at each iteration.

#### Newton's Method

Here is how we usually implement Newton's Method: $\mathbf{x_{n+1}} = \mathbf{x_n} - \frac{f(\mathbf{x_n})}{f'(\mathbf{x_n})}.\,\!$. In our particular case, we look for x such that g'(x) = 0, and implement it by $\mathbf{x_{n+1}} = \mathbf{x_n} - \frac{f'(\mathbf{x_n})}{f''(\mathbf{x_n})}.\,\!$.
In practice, the convergence speed depends on |F'(x*)|, where F(x) = $\mathbf{x} - \frac{f(\mathbf{x})}{f'(\mathbf{x})}.\,\!$. The smaller the |F'(x*)| is, the faster the convergence is.

The first derivative is typically called the score vector.

\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}} \mathbf{x_i} \right) \\[8pt] \end{align}

\begin{align} S(\beta\,) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \sum_{i=1}^n \left(y_i \mathbf{x_i} - P(x_i|\beta) \mathbf{x_i} \right) \\[8pt] \end{align}

where $\ P(x_i|\beta) = \frac{e^{\beta^T x_i}}{1+e^{\beta^T x_i}}$

The negative of the second derivative is typically called the information matrix.

\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})(1 - \frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) \right) \\[8pt] \end{align}

\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (\frac{e^{\mathbf{\beta\,}^T \mathbf{x_i}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}})(\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x_i}}}) \right) \\[8pt] \end{align}

\begin{align} I(\beta\,) {}= -{\frac{\partial^2 \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \sum_{i=1}^n \left(\mathbf{x_i}\mathbf{x_i}^T (P(x_i|\beta))(1 - P(x_i|\beta)) \right) \\[8pt] \end{align}

again where $\ P(x_i|\beta) = \frac{e^{\beta^T x_i}}{1+e^{\beta^T x_i}}$

$\, \beta\,^{new} \leftarrow \beta\,^{old}-\frac {f(\beta\,^{old})}{f'(\beta\,^{old})}$

We then use the following update formula to calcalute continually better estimates of the optimal $\beta\,$. It is not typically important what you use as your initial estimate $\beta\,^{(1)}$ is. (However, some improper beta will cause I to be a singular matrix).

$\beta\,^{(r+1)} {}= \beta\,^{(r)} + (I(\beta\,^{(r)}))^{-1} S(\beta\,^{(r)} )$

#### Matrix Notation

Let $\mathbf{y}$ be a (n x 1) vector of all class labels. This is called the response in other contexts.

Let $\mathbb{X}$ be a (n x (d+1)) matrix of all your features. Each row represents a data point. Each column represents a feature/covariate.

Let $\mathbf{p}^{(r)}$ be a (n x 1) vector with values $P(\mathbf{x_i} |\beta\,^{(r)} )$

Let $\mathbb{W}^{(r)}$ be a (n x n) diagonal matrix with $\mathbb{W}_{ii}^{(r)} {}= P(\mathbf{x_i} |\beta\,^{(r)} )(1 - P(\mathbf{x_i} |\beta\,^{(r)} ))$

The score vector, information matrix and update equation can be rewritten in terms of this new matrix notation, so the first derivative is

\begin{align} S(\beta\,^{(r)}) {}= {\frac{\partial \ell}{ \partial \mathbf{\beta\,}}}&{} = \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})\end{align}

And the second derivative is

\begin{align} I(\beta\,^{(r)}) {}= -{\frac{\partial^{2} \ell}{\partial \mathbf {\beta\,} \partial \mathbf{\beta\,}^T}}&{} = \mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X} \end{align}

Therfore, we can fit a regression problem as follows

$\beta\,^{(r+1)} {}= \beta\,^{(r)} + (I(\beta\,^{(r)}))^{-1}S(\beta\,^{(r)} ) {}$

$\beta\,^{(r+1)} {}= \beta\,^{(r)} + (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)})$

#### Iteratively Re-weighted Least Squares

If we reorganize this updating formula we can see it is really iteratively solving a least squares problem each time with a new weighting.

$\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}(\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X}\beta\,^{(r)} + \mathbb{X}^T(\mathbf{y} - \mathbf{p}^{(r)}))$

$\beta\,^{(r+1)} {}= (\mathbb{X}^T\mathbb{W}^{(r)}\mathbb{X})^{-1}\mathbb{X}^T\mathbb{W}^{(r)}\mathbf(z)^{(r)}$

where $\mathbf{z}^{(r)} = \mathbb{X}\beta\,^{(r)} + (\mathbb{W}^{(r)})^{-1}(\mathbf{y}-\mathbf{p}^{(r)})$

Recall that linear regression by least squares finds the following minimum: $\ \min_{\beta}(y-X \beta)^T(y-X \beta)$

Similarly, we can say that $\ \beta^{(r+1)}$ is the solution of a weighted least square problem in the new space of $\ \mathbf{z}$: ( compare the equation of $\ \beta^{(r+1)}$ with the solution of weighted least square $\ {\tilde{\beta}} = (X^TX)^{-1}X^Ty$ )

$\beta^{(r+1)} \leftarrow arg \min_{\beta}(\mathbf{z}-X \beta)^T W (\mathbf{z}-X \beta)$

#### Fisher Scoring Method

Fisher Scoring is a method very similiar to Newton-Raphson. It uses the expected Information Matrix as opposed to the observed information matrix. This distinction simplifies the problem and in perticular the computational complexity. To learn more about this method & logistic regression in general you can take Stat431/831 at the University of Waterloo.

### Multi-class Logistic Regression

In a multi-class logistic regression we have K classes. For 2 classes K and l

$\frac{P(Y=l|X=x)}{P(Y=K|X=x)} = e^{\beta_l^T x}$
(this is resulting from $f_1(x)= (\frac{e^{\mathbf{\beta\,}^T \mathbf{x}}}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})$ and $f_2(x)= (\frac{1}{1+e^{\mathbf{\beta\,}^T \mathbf{x}}})$ )

We call $log(\frac{P(Y=l|X=x)}{P(Y=k|X=x)}) = (\beta_l-\beta_k)^T x$ , the log ratio of the posterior probabilities as the logit transformation. The decision boundary between the 2 classes is the set of points where the logit transformation is 0.

For each class from 1 to K-1 we then have:

$log(\frac{P(Y=1|X=x)}{P(Y=K|X=x)}) = \beta_1^T x$

$log(\frac{P(Y=2|X=x)}{P(Y=K|X=x)}) = \beta_2^T x$

$log(\frac{P(Y=K-1|X=x)}{P(Y=K|X=x)}) = \beta_{K-1}^T x$

Note that choosing Y=K is arbitrary and any other choice is equally valid.

Based on the above the posterior probabilities are given by: $P(Y=k|X=x) = \frac{e^{\beta_k^T x}}{1 + \sum_{i=1}^{K-1}{e^{\beta_i^T x}}}\;\;for \; k=1,\ldots, K-1$

$P(Y=K|X=x)=\frac{1}{1+\sum_{i=1}^{K-1}{e^{\beta_i^T x}}}$

### Logistic Regression Vs. Linear Discriminant Analysis (LDA)

Logistic Regression Model and Linear Discriminant Analysis (LDA) are widely used for classification. Both models build linear boundaries to classify different groups. Also, the categorical outcome variables (i.e. the dependent variables) must be mutually exclusive.

LDA used more parameters.

However, these two models differ in their basic approach. While Logistic Regression is more relaxed and flexible in its assumptions, LDA assumes that its explanatory variables are normally distributed, linearly related and have equal covariance matrices for each class. Therefore, it can be expected that LDA is more appropriate if the normality assumptions and equal covariance assumption are fulfilled in its explanatory variables. But in all other situations Logistic Regression should be appropriate.

Also, the total number of parameters to compute is different for Logistic Regression and LDA. If the explanatory variables have d dimensions and there are two classes to categorize, we need to estimate $\ d+1$ parameters in Logistic Regression (all elements of the d by 1 $\ \beta$ vector plus the scalar $\ \beta_0$) and the number of parameters grows linearly w.r.t. dimension, while we need to estimate $2d+\frac{d*(d+1)}{2}+2$ parameters in LDA (two mean values for the Gaussians, the d by d symmetric covariance matrices, and two priors for the two classes) and the number of parameters grows quadratically w.r.t. dimension.

Note that the number of parameters also corresponds to the minimum number of observations needed to compute the coefficients of each function. Techniques do exist though for handling high dimensional problems where the number of parameters exceeds the number of observations. Logistic Regression can be modified using shrinkage methods to deal with the problem of having less observations than parameters. When maximizing the log likelihood, we can add a $-\frac{\lambda}{2}\sum^{K}_{k=1}\|\beta_k\|_{2}^{2}$ penalization term where K is the number of classes. This resulting optimization problem is convex and can be solved using Newton-Raphson method as given in Zhu and hastie (2004). LDA involves the inversion of a d x d covariance matrix. When d is bigger than n (where n is the number of observations) this matrix has rank n < d and thus is singular. When this is the case, we can either use the pseudo inverse or perform regularized discriminant analysis which solves this problem. In RDA, we define a new covariance matrix $\, \Sigma(\gamma) = \gamma\Sigma + (1 - \gamma)diag(\Sigma)$ with $\gamma \in [0,1]$. Cross validation can be used to calculate the best $\, \gamma$. More details on RDA can be found in Guo et al. (2006).

Because the Logistic Regression model has the form $log\frac{f_1(x)}{f_0(x)} = \beta{x}$, we can clearly see the role of each input variable in explaining the outcome. This is one advantage that Logistic Regression has over other classification methods and is why it is so popular in data analysis.

In terms of the performance speed, since LDA is non-iterative, unlike Logistic Regression which uses the iterative Newton-Raphson method, LDA can be expected to be faster than Logistic Regression.

### Example

(Not discussed in class.) One application of logistic regression that has recently been used is predicting the winner of NFL games. Previous predictors, like Yards Per Carry (YPC), were used to build probability models for games. Now, the Success Rate (SR), defined as the percentage of runs in which the a team’s point expectancy has improved, is shown to be a better predictor of a team's performance. SR is based on down, distance and yard line and is less susceptible to rare breakaway plays that can be considered outliers. More information can be found at [2].

## Perceptron

Simple perceptron
Simple perceptron where $\beta_0$ is defined as 1

Perceptron is a simple, yet effective, linear separator classifier. The perceptron is the building block for neural networks. It was invented by Rosenblatt in 1957 at Cornell Labs, and first mentioned in the paper "The Perceptron - a perceiving and recognizing automaton". The perceptron is used on linearly separable data sets. The LS computes a linear combination of factor of input and returns the sign.

For a 2 class problem, and a set of inputs with d features, a perceptron will use a weighted sum and it will classify the information using the sign of the result (i.e it uses a step function as it's activation function ). The figures on the right give an example of a perceptron. In these examples, $\ x^i$ is the i-th feature of a sample and $\ \beta_i$ is the i-th weight. $\beta_0$ is defined as the bias. The bias alters the position of the decision boundary between the 2 classes. From a geometrical point of view, Perceptron assigns label "1" to elements on one side of vector $\ \beta$ and label "-1" to to elements on the other of $\ \beta$, where $\ \beta$ is a vector of $\ \beta_i$s.

Perceptrons are generally trained using gradient descent. This type of learning can have 2 side effects:

• If the data sets are well separated, the training of the perceptron can lead to multiple valid solutions.
• If the data sets are not linearly separable, the learning algorithm will never finish.

Perceptrons are the simplest kind of a feedforward neural network. A perceptron is the building block for other neural networks such as Multi-Layer Perceptron (MLP) which uses multiple layers of perceptrons with nonlinear activation functions so that it can classify data that is not linearly separable.

### History of Perceptrons and Other Neural Models

One of the first perceptron-like models is the "McCulloch-Pitts Neuron" model developed by McCulloch and Pitts in the 1940's <ref> W. Pitts and W. S. McCulloch, "How we know universals: the perception of auditory and visual forms," Bulletin of Mathematical Biophysics, 1947.</ref>. It uses a weighted sum of the inputs that is fed through an activation function, much like the perceptron. However, unlike the perceptron, the weights in the "McCulloch-Pitts Neuron" model are not adjustable, so the "McCulloch-Pitts Neuron" is unable to perform any learning based on the input data.

As stated in the introduction of the perceptron section, the Perceptron was developed by Rosenblatt around 1960. Around the same time as the perceptron was introduced, the Adaptive Linear Neuron (ADALINE) was developed by Widrow <ref name="Widrow"> B. Widrow, "Generalization and information storage in networks of adaline 'neurons'," Self Organizing Systems, 1959.</ref>. The ADALINE differs from the standard perceptron by using the weighted sum (the net) to adjust the weights in the learning phase. The standard perceptron uses the output to adjust its weights (i.e. the net after it passed through the activation function).

Since both the perceptron and ADALINE are only able to handle data that is linearly separable Multiple ADALINE (MADALINE) was introduced <ref name="Widrow"/>. MADALINE is a two layer network to process multiple inputs. Each layer contains a number of ADALINE units. The lack of an appropriate learning algorithm prevented more layers of units to be cascaded at the time and interest in "neural networks" receded until the 1980's when the backpropagation algorithm was applied to neural networks and it became possible to implement the Multi-Layer Perceptron (MLP).

Many importand advances have been boosted by the use of inexpensive computer emulations. Following an initial period of enthusiasm, the field survived a period of frustration and disrepute. During this period when funding and professional support was minimal, important advances were made by relatively few reserchers. These pioneers were able to develop convincing technology which surpassed the limitations identified by Minsky and Papert. Minsky and Papert, published a book (in 1969) in which they summed up a general feeling of frustration (against neural networks) among researchers, and was thus accepted by most without further analysis. Currently, the neural network field enjoys a resurgence of interest and a corresponding increase in funding.<ref> http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html#Historical background </ref>

## Perceptron Learning Algorithm (Lecture: Oct. 13, 2011)

Perceptron learning is done by minimizing the cost function, $\phi(\boldsymbol{\beta}, \beta_0)$, which is same as minimizing the sum of the distances between the misclassified points and the decision boundary. To achieve this, we define a cost function, $\phi(\boldsymbol{\beta}, \beta_0)$, as a summation of the distance between all misclassified points and the hyper-plane, or the decision boundary. To minimize this cost function, we need to estimate $\boldsymbol{\beta, \beta_0}$.

$\min \phi(\boldsymbol{\beta}, \beta_0)$ = {distance of all misclassified points}

The logic is as follows:

Distance between the point $\ x$ and the decision boundary hyperplane $\ L$ (black line). Note that the vector $\ \beta$ is orthogonal to the decision boundary hyperplane and that points $\ x_0, x_1, x_2$ are arbitrary points on the decision boundary hyperplane.

1) Because a hyper-plane $\,L$ can be defined as

$\, L=\{x: f(x)=\beta^Tx+\beta_0=0\},$

For any two arbitrary points $\,x_1$ and $\,x_2$ on $\, L$, we have

$\,\beta^Tx_1+\beta_0=0$,

$\,\beta^Tx_2+\beta_0=0$,

such that

$\,\beta^T(x_1-x_2)=0$.

Therefore, $\,\beta$ is orthogonal to the hyper-plane and it is the normal vector.

2) For any point $\,x_0$ in $\ L,$ $\,\;\;\beta^Tx_0+\beta_0=0$, which means $\, \beta^Tx_0=-\beta_0$.

3) We set $\,\beta^*=\frac{\beta}{||\beta||}$ as the unit normal vector of the hyper-plane$\, L$. For simplicity we call $\,\beta^*$ norm vector. The distance of point $\,x$ to $\ L$ is given by

$\,\beta^{*T}(x-x_0)=\beta^{*T}x-\beta^{*T}x_0 =\frac{\beta^Tx}{||\beta||}+\frac{\beta_0}{||\beta||} =\frac{(\beta^Tx+\beta_0)}{||\beta||}$

Where $\,x_0$ is any point on $\ L$. Hence, $\,\beta^Tx+\beta_0$ is proportional to the distance of the point $\,x$ to the hyper-plane$\, L$.

4) The distance from a misclassified data point $\,x_i$ to the hyper-plane $\, L$ is

$\,d_i = -y_i(\boldsymbol{\beta}^Tx_i+\beta_0)$

where $\,y_i$ is a target value, such that $\,y_i=1$ if $\boldsymbol{\beta}^Tx_i+\beta_0\lt 0$, $\,y_i=-1$ if $\boldsymbol{\beta}^Tx_i+\beta_0\gt 0$

Since we need to find the distance from the hyperplane to the misclassified data points, we need to add a negative sign in front. When the data point is misclassified, $\boldsymbol{\beta}^Tx_i+\beta_0$ will produce an opposite sign of $\,y_i$. Since we need a positive sign for distance, we add a negative sign.

### Perceptron Learning using Gradient Descent

The gradient descent is an optimization method that finds the minimum of an objective function by incrementally updating its parameters in the negative direction of the derivative of this function. In our case, the objective function to be minimized is classification error and the parameters of this function are the weights associated with the inputs, $\beta$ . The gradient descent algorithm updates the weights as follows:

$\beta^{\mathrm{new}} \leftarrow \beta^{\mathrm{old}} \rho \frac{\partial Err}{\partial \beta}$

$\rho$ is called the learning rate.
The Learning Rate $\rho$ is positively related to the step size of convergence of $\min \phi(\boldsymbol{\beta}, \beta_0)$. i.e. the larger $\rho$ is, the larger the step size is.

The classification error is defined as the distance of misclassified observations to the decision boundary:

To minimize the cost function $\phi(\boldsymbol{\beta}, \beta_0) = -\sum\limits_{i\in M} y_i(\boldsymbol{\beta}^Tx_i+\beta_0)$ where $\ M=\{\text {all points that are misclassified}\}$
$\cfrac{\partial \phi}{\partial \boldsymbol{\beta}} = - \sum\limits_{i\in M} y_i x_i$ and $\cfrac{\partial \phi}{\partial \beta_0} = -\sum\limits_{i \in M} y_i$

Therefore, the gradient is $\nabla D(\beta,\beta_0) = \left( \begin{array}{c} -\displaystyle\sum_{i \in M}y_{i}x_i \\ -\displaystyle\sum_{i \in M}y_{i} \end{array} \right)$

Using the gradient descent algorithm to solve these two equations, we have $\begin{pmatrix} \boldsymbol{\beta}^{\mathrm{new}}\\ \beta_0^{\mathrm{new}} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\beta}^{\mathrm{old}}\\ \beta_0^{\mathrm{old}} \end{pmatrix} + \rho \begin{pmatrix} y_i x_i\\ y_i \end{pmatrix}$

If the data is linearly-separable, the solution is theoretically guaranteed to converge to a separating hyperplane in a finite number of iterations. In this situation the number of iterations depends on the learning rate and the margin. However, if the data is not linearly separable there is no guarantee that the algorithm converges.

$\begin{pmatrix} \beta^0\\ \beta_0^0 \end{pmatrix}$

Note that we consider the offset term $\,\beta_0$ separately from $\ \beta$ to distinguish this formulation from those in which the direction of the hyperplane ($\ \beta$) has been considered.

A major concern about gradient descent is that it may get trapped in local optimal solutions. Many works such as this paper by Cetin et al. and this paper by Atakulreka et al. have been done to tackle this issue.

Features

• A Perceptron can only discriminate between two classes at a time.
• When data is (linearly) separable, there are many solutions depending on the starting point.
• Even though convergence to a solution is guaranteed if the solution exists, the finite number of steps until convergence can be very large.
• The smaller the gap between the two classes, the longer the time of convergence.
• When the data is not separable, the algorithm will not converge.
• A learning rate that is too high will make the perceptron periodically oscillate around the solution unless additional steps are taken.
• The L.S compute a linear combination of feature of input and return the sign.
• This were called Perceptron in the engineering literate in late 1950.
• Learning rate affects the accuracy of the solution and the number of iterations directly.

Separability and convergence

The training set D is said to be linearly separable if there exists a positive constant $\,\gamma$ and a weight vector $\,\beta$ such that $\,(\beta^Tx_i+\beta_0)y_i\gt \gamma$ for all $\,1 \lt i \lt n$. That is, if we say that $\,\beta$ is the weight vector of Perceptron and $\,y_i$ is the true label of $\,x_i$, then the signed distance of the $\,x_i$ from $\,\beta$ is greater than a positive constant $\,\gamma$ for any $\,(x_i, y_i)\in D$.

Novikoff (1962) proved that the perceptron algorithm converges after a finite number of iterations if the data set is linearly separable. The idea of the proof is that the weight vector is always adjusted by a bounded amount in a direction that it has a negative dot product with, and thus can be bounded above by $O(\sqrt{t})$where t is the number of changes to the weight vector. But it can also be bounded below by$\, O(t)$because if there exists an (unknown) satisfactory weight vector, then every change makes progress in this (unknown) direction by a positive amount that depends only on the input vector. This can be used to show that the number t of updates to the weight vector is bounded by $(\frac{2R}{\gamma} )^2$ , where R is the maximum norm of an input vector.<ref>http://en.wikipedia.org/wiki/Perceptron</ref>

### Choosing a Proper Learning Rate

choosing different learning rates affect the performance of gradient descent optimization algorithm.

Choice of a learning rate value will affect the final result of gradient descent algorithm. If the learning rate is too small then the algorithm would take too long to converge which could cause problems for the situations where time is an important factor. If the learning rate is chosen too be too large, then the optimal point can be skipped and never converge. In fact, if the step size is too large, larger than twice the largest eigenvalue of the second derivative matrix (Hessian) of cost function, then gradient steps will go upward instead of downward. However, the step size is not the only factor than can cause these kind of situations: even with the same learning rate and different initial values algorithm might end up in different situations. In general it can be said that having some prior knowledge could help in choice of initial values and learning rate.

There are different methods of choosing the step size in an gradient descent optimization problem. The most common method is choosing a fixed learning rate and finding a proper value for it by trial and error. This for sure is not the most sophisticated method, but the easiest one. Learning rate can also be adaptive; that means the value of learning rate can be different at each step of the algorithm. This can be specially a helpful approach when one is dealing with on-line training and non-stationary environments (i.e. when data characteristics vary over time). In such a case learning rate has to be adapted at each step of the learning algorithm. Different approaches and algorithms for learning rate adaptation can be found in <ref> V P Plagianakos, G D Magoulas, and M N Vrahatis, Advances in convex analysis and global optimization Pythagorion 2000 (2001), Volume: 54, Publisher: Kluwer Acad. Publ., Pages: 433-444. </ref>.

The learning rate leading to a local error minimum in the error function in one learning step is optimal. <ref>[Duda, Richard O., Hart, Peter E., Stork, David G. "Pattern Classification". Second Edition. John Wiley & Sons, 2001.]</ref>

### Application of Perceptron: Branch Predictor

Perceptron could be used for both online and batch learning. Online learning tasks take place in a sequence of trials. In each round of trial, the learner is given an instance and is asked to use his current knowledge to predict a label for the point. In online learning, the true label of the point is revealed to learner at each round after he makes a prediction. At the last stage of each round the learner has a chance to use the feedback he received on the true label of the instance to help improve his belief about the data for future trials.

Instruction pipelining is a technique to increase the throughput in modern microprocessor architecture. A microprocessor instruction can be broken into several independent steps. In a single CPU clock cycle, several instructions at different stage can be executed at the same time. However, a problem arises with a branch, e.g. if-and-else- statement. It is not known whether the instructions inside the if- or else- statements will be executed until the condition is executed. This stalls the pipeline.

A branch predictor is used to address this problem. Using a predictor the pipelined processor predicts the execution path and speculatively executes instructions in the branch. Neural networks are good technique for prediction; however, they are expensive for microprocessor architecture. A research studied the use of perceptron, which is less expensive and simpler to implement, as the branch predictor. The inputs are the history of binary outcomes of the executed branches. The output of the predictor is whether a particular branch will be taken. Every time a branch is executed and its true outcome is known, it can be used to train the predictor. The experiments showed that with a 4 Kb hardware, a global perceptron predictor has a misprediction rate of 1.94%, a superior accuracy. <ref>Daniel A. Jimenez , Calvin Lin, "Neural Methods for Dynamic Branch Prediction", ACM Transactions on Computer Systems, 2002</ref>

## Feed-Forward Neural Networks

• The term 'neural networks' is used because historically, it was used to describe the processes of the brain (e.g. synapses).
• A neural network is a multistate regression model which is typically represented by a network diagram (see right).
Feed Forward Neural Network
• The feedforward neural network was the first and arguably simplest type of artificial neural network devised. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in the network.<ref>http://en.wikipedia.org/wiki/Feedforward_neural_network</ref>
• For regression, typically k = 1 (the number of nodes in the last layer), there is only one output unit $y_1$ at the end.
• For c-class classification, there are typically c units at the end with the cth unit modelling the probability of class c, each $y_c$ is coded as 0-1 variable for the cth class.
• Neural networks are known as universal approximators, where a two-layer feed-forward neural network can approximate any continuous function to an arbitrary accuracy (assuming sufficient hidden nodes exist and that the necessary parameters for the neural network can be found) <ref name="CMBishop">C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006</ref>. It should be noted that fitting training data to a very high accuracy may lead to overfitting, which is discussed later in this course.
• We often use Perceptron to blocks in Feed-Forward neural networks. We can easily to solve the problem by using Perceptron in many different classes. Feed-Forward neural networks looks like a complicated system of Perceptrons. We can regard the neural networks as an unit or a subset of Neural Network. Feed-Forward neural networks include many hidden layers of perceptron.

### Backpropagation (Finding Optimal Weights)

There are many algorithms for calculating the weights in a feed-forward neural network. One of the most used approaches is the backpropagation algorithm. The application of the backpropagation algorithm for neural networks was popularized in the 1980's by researchers like Rumelhart, Hinton and McClelland (even though the backpropagation algorithm had existed before then). <ref>S. Seung, "Multilayer perceptrons and backpropagation learning" class notes for 9.641J, Department of Brain & Cognitive Sciences, MIT, 2002. Available: [3] </ref>

As the learning part of the network (the first part being feed-forward), backpropagation consists of "presenting an input pattern and changing the network parameters to bring the actual outputs closer to the desired teaching or target values." It is one of the "simplest, most general methods for the supervised training of multilayer neural networks." (pp. 288-289) <ref>[Duda, Richard O., Hart, Peter E., Stork, David G. "Pattern Classification". Second Edition. John Wiley & Sons, 2001.]</ref>

For the backpropagation algorithm, we consider three hidden layers of nodes

Refer to figure from October 18th lecture where $\ l$ represents the column of nodes in the first column,
$\ i$ represents the column of nodes in the second column, and
$\ k$ represents the column of nodes in the third column.

We want the output of the feed forward neural network $\hat{y}$ to be as close to the known target value $\ y$ as possible (i.e. we want to minimize the distance between $\ y$ and $\hat{y}$). Mathematically, we would write it as: Minimize $(\left| y- \hat{y}\right|)^2$

Instead of the sign function that has no derivative we use the so called logistic function (a smoothed form of the sign function):

$\sigma(a)=\frac{1}{1+e^{-a}}$

"Notice that if σ is the identity function, then the entire model collapses to a linear model in the inputs. Hence a neural network can be thought of as a nonlinear generalization of the linear model, both for regression and classification." <ref>Friedman, J., Hastie, T. and Tibshirani, R. (2008) “The Elements of Statistical Learning”, 2nd ed, Springer.</ref>

Logistic function is a common sigmoid curve .It can model the S-curve of growth of some population $\sigma$. The initial stage of growth is approximately exponential; then, as saturation begins, the growth slows, and at maturity, growth stops.

To solve the optimization problem, we take the derivative with respect to weight $u_{il}$:
$\cfrac{\partial \left|y- \hat{y}\right|^2}{\partial u_{il}} = \cfrac{\partial \left|y- \hat{y}\right|^2}{\partial a_j} \cdot \cfrac{\partial a_j}{\partial u_{il}}$ by Chain rule
$\cfrac{\partial \left|y- \hat{y}\right|^2}{\partial u_{il}} = \delta_j \cdot z_l$

where $\delta_j = \cfrac{\partial \left|y- \hat{y}\right|^2}{\partial a_j}$ which will be computed recursively.

$\ a_i=\sum_{l}z_lu_{il}$

$\ z_i=\delta(a_i)$

$\ a_j=\sum_{i}z_iu_{ji}$

## Backpropagation Continued (Lecture: Oct. 18, 2011)

Nodes from three hidden layers within the neural network are considered for the backpropagation algorithm. Each node has been divided into the weighted sum of the inputs $\ a$ and the output of the activation function $\ z$. The weights between the nodes are denoted by $\ u$.

From the figure to the right it can be seen that the input ($\ a$'s) can be expressed in terms of the weighted sum of the outputs of the previous nodes and output ($\ z$'s) can be expressed as the input as follows:

$\ a_i = \sum_l z_l u_{il}$

$\ z_i = \sigma(a_i)$

The goal is to optimize the weights to reduce the L2-norm between the target output values $\ y$ (i.e. the correct labels) and the actual output of the neural network $\ \hat{y}$:

$\left|y - \hat{y}\right|^2$

Since the L2-norm is differentiable, the optimization problem can be tackled by differentiating $\left|y - \hat{y}\right|^2$ with respect to each weight in the hidden layers. By using the chain rule we get:

$\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{il}} = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_i}\cdot \cfrac{\partial a_i}{\partial u_{il}} = \delta_{i}z_l$

where $\ \delta_i = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_i}$

The above equation essentially shows the effect of changes in the input $\ a_i$ on the overall output $\ \hat{y}$ as well as the effect of changes in the weights $\ u_{il}$ on the input $\ a_i$. In the above equation, $\ z_l$ is a known value (i.e. it can be calculated directly), whereas $\ \delta_i$ is unknown but can be expressed as a recursive definition in terms of $\ \delta_j$:

$\delta_i = \cfrac{\partial |y - \hat{y}|^2}{\partial a_i} = \sum_{j} \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j}\cdot \cfrac{\partial a_j}{\partial a_i}$

$\delta_i = \sum_{j}\delta_j\cdot\cfrac{\partial a_j}{\partial z_i}\cdot\cfrac{\partial z_i}{\partial a_i}$

$\delta_i = \sum_{j} \delta_j\cdot u_{ji} \cdot \sigma'(a_i)$

where $\delta_j = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_j}$

The above equation essentially shows the effect of changes in the input $\ a_j$ on the overall output $\ \hat{y}$ as well as the effect of changes in input $\ a_i$ on the input $\ a_j$.

The recursive definition of $\ \delta_i$ can be considered as a cost function for achieving the original goal of optimizing the weights to minimize $\left|y - \hat{y}\right|^2$:

$\delta_i= \sigma'(a_i)\sum_{j}\delta_j \cdot u_{ji}$.

Now considering $\ \delta_k$ for the output layer:

$\delta_k= \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial a_k}$.

where $\,a_k = \hat{y}$ because an activation function is not applied in the output layer. So, our calculation becomes:

$\delta_k = \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial \hat{y}}$

$\delta_k = -2|y - \hat{y}|$
$u_{il} \leftarrow u_{il} - \rho \cfrac{\partial |y - \hat{y}| ^2}{\partial u_{il}}$

Since $\ y$ is known and $\ \hat{y}$ can be computed for each data point (assuming small, random, initial values for the weights of the neural network), $\ \delta_k$ can be calculated and "backpropagated" (i.e. the $\ \delta$ values for the layer before the output layer can be computed using $\ \delta_k$ and then the $\ \delta$ values for the layer before the layer before the output layer can be computed etc.). Once all $\ \delta$ values are known, the errors due to each of the weights $\ u$ will be known and techniques like gradient descent can be used to optimize the weights. However, as the cost function for $\ \delta_i$ shown above is not guaranteed to be convex, convergence to a global minimum is no guaranteed. This also means that changing the order in which the training points are fed into the network or changing the initial random values for the weights may lead to finding different results for the optimized weights (i.e. different local minima may be reached).

### Overview of Full Backpropagation Algorithm

The network weights are updated using the backpropagation algorithm when each training data point $\ x$is fed into the feed forward neural network (FFNN). This update procedure is done using the following steps:

• First arbitrarily choose some random weights (preferably close to zero) for your network.
• Apply $\ x$ to the FFNN's input layer, and calculate the outputs of all input neurons.
• Propagate the outputs of each hidden layer forward, one hidden layer at a time, and calculate the outputs of all hidden neurons.
• Once $\ x$ reaches the output layer, calculate the output(s) of all output neuron(s) given the outputs of the previous hidden layer.
• At the output layer, compute $\,\delta_k = -2(y_k - \hat{y}_k)$ for each output neuron(s).
• Then compute $\cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{il}} = \delta_{i}z_l$ for all weights $\,u_{il}$, where $\delta_i= \sigma'(a_i)\sum_{j}\delta_j \cdot u_{ji}$.
• Then update $u_{il}^{\mathrm{new}} \leftarrow u_{il}^{\mathrm{old}} - \rho \cdot \cfrac{\partial \left|y - \hat{y}\right|^2}{\partial u_{il}}$ for all weights $\,u_{il}$.
• Continue for next data points until weights converge.

#### Epochs

It is common to cycle through the all of the data points multiple times in order to reach convergence. An epoch represents one cycle in which you feed all of your datapoints through the neural network. It is good practice to randomized the order you feed the points to the neural network within each epoch; this can prevent your weights changing in cycles. The number of epochs required for convergence depends greatly on the learning rate & convergence requirements used.

### Limitations

• The convergence obtained from backpropagation learning is very slow.
• The convergence in backpropagation learning is not guaranteed.
• The result may generally converge to any local minimum on the error surface, since stochastic gradient descent exists on a surface which is not flat.
• Numerical problems may be encountered when there are a large number of hidden layers, as the errors at each layer may become very small and vanish.

### Deep Neural Network

Increasing the number of units within a hidden layer can increase the "flexibility" of the neural network, i.e. the network is able to fit to more complex functions. Increasing the number of hidden layers on the other hand can increase the "generalizability" of the neural network, i.e. the network is able to generalize well to new data points that it was not trained on. A deep neural network is a neural network with many hidden layers. Deep neural networks were introduced in recent years by the same researchers (Hinton et al. <ref name="HintonDeepNN"> G. E. Hinton, S. Osindero and Y. W. Teh, "A Fast Learning Algorithm for Deep Belief Nets", Neural Computation, 2006. </ref>) that introduced the backpropagation algorithm to neural networks. The increased number of hidden layers in deep neural networks cannot be directly trained using backpropagation, because the errors at each layer will become very small and vanish as stated in the limitations section. To get around this problem, deep neural networks are trained a few layers at a time (i.e. two layers at a time). This process is still not straightforward as the target values for the hidden layers are not well defined (i.e. it is unknown what the correct target values are for the hidden layers given a data point and a label). Restricted Boltzmann Machines (RBM) and Greedy Learning Algorithms have been used to address this issue. For more information about how deep neural networks are trained, please refer to <ref name="HintonDeepNN"/>. A comparison of various neural network layouts including deep neural networks on a database of handwritten digits can be found at THE MNIST DATABASE.

An interesting structure of the deep neural network is where the number of nodes in each hidden layer decreases towards the "center" of the network and then increases again. See figure below for an illustration.

A specific architecture for deep neural networks with a "bottleneck".

The central part with the least number of nodes in the hidden layer can be seen a reduced dimensional representation of the input data features. It would be interesting to compare the dimensionality reduction effect of this kind of deep neural network to a cascade of PCA.

It is known that training DNNs is hard <ref>http://ecs.victoria.ac.nz/twiki/pub/Courses/COMP421_2010T1/Readings/TrainingDeepNNs.pdf</ref> since randomly initializing weights for the network and applying gradient descent can find poor local minimums. In order to better train DNNs, Exploring Strategies for Training Deep Neural Networks looks at 3 principles to better train DNNs:

1. Pre-training one layer at a time in a greedy way,
2. Using unsupervised learning at each layer,
3. Fine-tuning the whole network with respect to the ultimate criterion.

Their experiments show that by providing hints at each layer for the representation, the weights can be initialized such that a more optimal minimum can be reached.

### Applications of Neural Networks

• Sales forecasting
• Industrial process control
• Customer research
• Data validation
• Risk management
• Target marketing

<ref> Reference:http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html#Applications of neural networks </ref>

## Model Selection (Complexity Control)

Selecting a proper statistical model for a given data set is a well-known problem in pattern recognition and machine learning. Systems with the optimal complexity have a good generalization to yet unobserved data. In the complexity control problem, we are looking for an appropriate model order which gives us the best generalization capability for the unseen data points, while fitting the seen data well. Model complexity here can be defined in terms of over-fitting and under-fitting situations defined in the following section.

## Over-fitting and Under-fitting

File:overfitting-model.png
Example of overfitting and underfitting situations. The blue line is a high-degree polynomial which goes through most of the training data points and gives a very low training error, however has a very poor generalization for the unseen data points. The red line, on the other hand, is underfitted to the training data samples.

There are two situations which should be avoided in classification and pattern recognition systems:

1. Overfitting
2. Underfitting

In short, Overfitting occurs when the model tries to capture every detail of the data. This can happen if the model has too many parameters compared to the number of observations. Overfitted models have large testing errors but small training errors. On the other hand, Underfitting occurs when the model does not capture the complexity of the data. This happens when the model has a large training error.

Suppose there is no noise in the training data, then we would face no problem with over-fitting, because in this case every training data point lies on the underlying function, and the only goal is to build a model that is as complex as needed to pass through every training data point.

However, in the real-world, the training data are noisy, i.e. they tend to not lie exactly on the underlying function, instead they may be shifted to unpredictable locations by random noise. If the model is more complex than what it needs to be in order to accurately fit the underlying function, then it would end up fitting most or all of the training data. Consequently, it would be a poor approximation of the underlying function and have poor prediction ability on new, unseen data.

The danger of overfitting is that the model becomes susceptible to predicting values outside of the range of training data. It can cause wild predictions in multilayer perceptrons, even with noise-free data. To avoid Overfitting, techniques such as Cross Validation and Model Comparison might be necessary. The size of the training set is also important. The training set should have a sufficient number of data points which are sampled appropriately, so that it is representative of the whole data space.

In a Neural Network, if the number of hidden layers or nodes is too high, the network will have many degrees of freedom and will learn every characteristic of the training data set. That means it will fit the training set very precisely, but will not be able to generalize the commonality of the training set to predict the outcome of new cases.

Underfitting occurs when the model we picked to describe the data is not complex enough, and has a high error rate on the training set. There is always a trade-off. If our model is too simple, underfitting could occur and if it is too complex, overfitting can occur.

### Different Approaches for Complexity Control

We would like to have a classifier that minimizes the true error rate $\ L(h)$:

$\ L(h)=Pr\{h(x)\neq y\}$

Model complexity

Because the true error rate cannot be determined directly in practice, we can try using the empirical true error rate (i.e. training error rate):

$\ \hat L(h)= \frac{1}{n} \sum_{i=1}^{n} I(h(x_{i}) \neq y_{i})$

However, the empirical true error rate (i.e. training error rate) is biased downward. Minimizing this error rate does not find the best classifier model, but rather ends up overfitting to the training data. Thus, this error rate cannot be used.

The complexity of a fitting model depends on the degree of the fitting function. According to the graph, the area on the LHS of the critical point is considered as under-fitting. This inaccuracy is resulted by the low complexity of fitting. The area on the RHS of the critical point is over-fitting, because it's not generalized.

As illustrated in the figure to the right, the training error rate is always less than the true error rate, i.e. "biased downward". Also, the training error will always decrease with an increase in the complexity of the model used to fit the data. This does not reflect the behavior of the true error rate. The true error rate will have a unique minimum as the model complexity changes.

So, if the training error rate is the only criteria used for picking a model, overfitting can occur. An overfitted model has low training error rate, but is not able to generalize well to new test data points. On the other hand, underfitting can occur when a model that is not complex enough is picked (e.g. using a first order model for data that follows a second order trend). Both training and test error rates will be high in that case. The best choice for the model complexity is where the true error rate reaches its minimum point. Thus, model selection involves controlling the complexity of the model. The true error rate can be approximated using the test error rate, i.e. the test error follows the same trend that the true error rate does when the model complexity is changed. In this case, we assume there is a test data set $\,x_1, . . . ,x_n$ and these points follow some unknown distribution. In order to find out this distribution, we can make some estimationg of some unknown parameters, such as $\,f$, the mean $\,E(x_i)$, the variance $\,var(x_i)$ and more.

To estimate $\,f$, we use an observation function as our estimator.

$\hat{f}(x_1,...,x_n)$.

$Bias (\hat{f}) = E(\hat{f}) - f$

$MSE (\hat{f}) = E[(\hat{f} - f)^2]=Varince (\hat f)+Bias^2(\hat f )$

$Variance (\hat{f}) = E[(\hat{f} - E(\hat{f}))^2]$

This estimaor is unbiased.

$Bias (\hat{f}) = E(\hat{f}) - f=0$

which means that we just need to minimize $MSE (\hat{f})$.

$\implies MSE (\hat{f})=Variance (\hat{f})+Bias ^2(\hat{f})$.

Thus given the Mean Squared Error (MSE), if we have a low bias, then we will have a high variance and vice versa.

In order to avoid overfitting, there are two main strategies:

1. Estimate the error rate
1. Cross-validation
2. Computing error bound ( probability in-equality )
2. Regulazition
1. We basically make the function (model) smooth by limiting the complexity or by limiting the size of weights.

### Cross Validation

File:k-fold.png
Graphical illustration of 4-fold cross-validation. V is the part used for validation and T is used for training.

Cross-validation is an approach for avoiding overfitting while modelling data that bases the choice of model parameters on a portion of the training set, while using the rest of the set for validation, i.e., some of the data is left out when fitting the model. One round of the process involves partitioning the data set into two complementary subsets, fitting the model to one subset (called the training set), and testing the model against the other subset (called the validation or testing subset). This is usually repeated several times using different partitions in order to reduce variability, and the validation results are then averaged over the rounds.

#### LOO: Leave-one-out cross-validation

When the dataset is very small, leaving one tenth out depletes our data too much, but making the validation set too small makes the estimate of the true error unstable (noisy). One solution is to do a kind of round-robin validation: for each complexity setting, learn a classifier on all the training data minus one example and evaluate its error the remaining example. Leave-one-out error is defined as:

LOO error: $\frac {1}{n} \sum_{i} 1 (h(x_i; D_-i)\neq y_i)$ where $D_-i$ is the dataset minus ith example and $h(x_i; D_-i)$ is the classifier learned on $D_-i$. LOO error is an unbiased estimate of the error of our learning algorithm (for a given complexity setting) when given $n-1$ examples.

#### K-Fold Cross Validation

Instead of minimizing the training error, here we minimize the validation error.

A common type of cross-validation that is used for relatively small data sets is K-fold cross-validation, the algorithm for which can be stated as follows:

Let h denote a classification model to be fitted to a given data set.

1. Randomly partition the original data set into K subsets of approximately the same size. A common choice for K is K = 10.
2. For k = 1 to K do the following
1. Remove subset k from the data set
2. Estimate the parameters of each different classification model based only on the remaining data points. Denote the resulting function by h(k)
3. Use h(k) to predict the data points in subset k. Denote by \begin{align}\hat L_k(h)\end{align} the observed error rate.
3. Compute the average error $\hat L(h) = \frac{1}{K} \sum_{k=1}^{K} \hat L_k(h)$

The best classifier is the model that results in the lowest average error rate.

A common variation of k-fold cross-validation uses a single observation from the original sample as the validation data, and the remaining observations as the training data. This is then repeated such that each sample is used once for validation. It is the same as a K-fold cross-validation with K being equal to the number of points in the data set, and is referred to as leave-one-out cross-validation. <ref> stat.psu.edu/~jiali/course/stat597e/notes2/percept.pdf</ref>

#### Alternatives to Cross Validation for model selection:

1. Akaike Information Criterion (AIC): This approach ranks models by their AIC values. The model with the minimum AIC is chosen. The formula of AIC value is: $AIC = 2k + 2log(L_{max})$, where $k$ is the number of parameters and $L_{max}$ is the maximum value of the likelihood function of the model. This selection method penalizes the number of parameters.<ref>http://en.wikipedia.org/wiki/Akaike_information_criterion</ref>
2. Bayesian Information Criterion (BIC): It is similar to AIC but penalizes the number of parameters even more. The formula of BIC value is: $BIC = klog(n) - 2log(L)$, where $n$ is the sample size.<ref>http://en.wikipedia.org/wiki/Bayesian_information_criterion</ref>

## Model Selection Continued (Lecture: Oct. 20, 2011)

### Error Bound Computation

Apart from cross validation, another approach for estimating the error rates of different models is to find a bound to the error. This works well theoretically to compare different models, however, in practice the error bounds are not a good indication of which model to pick because the error bounds are not tight. This means that the actual error observed in practice may be a lot better than what was indicated by the error bounds. This is because the error bounds indicate the worst case errors and by only comparing the error bounds of different models, the worst case performance of each model is compared, but not the overall performance under normal conditions.

### Penalty Function

Another approach for model selection to avoid overfitting is to use regularization. Regularization involves adding extra information or restrictions to the problem in order to prevent overfitting. This additional information can be in the form of a function penalizing high complexity (penalty function). So in regularization, instead of minimizing the squared error alone we attempt to minimize the squared error plus a penalty function. A common penalty function is the euclidean norm of the parameter vector multiplied by some scaling parameter. The scaling parameter allows for balancing the relative importance of the two terms.
This means minimizing the following new objective function:
$\left|y-\hat{y}\right|^2+f(\theta)$
where $\ \theta$ is model complexity and $\ f(\theta)$ is the penalty function. The penalty function should increase as the model increases in complexity. This way it counteracts the downward bias of the training error rate. There is no optimal choice for a penalty function but they should all increase all the complexity and size of the estimates increase.

There is no optimal choice for the penalty function but they all seek to solve the same problem. Suppose you have models of order 1,2,...,K such that the models of class k-1 are a subset of the models in class k. An example of this is linear regression where a model of order k is the model with the first k explanatory covariates. If you do not include a penalty term and minimize the squared error alone you will always choose the largest most complex model (K). But the problem with this is the gain from including more complexity might be incredibly small. The gain in accuracy may in fact be no better than you would expect from including a covariate drawn from a N(0,1) distribution. If this is the case then clearly we don't want to include such a covariate. And in general if the increase in accuracy is below a certain level then it is preferable to stay with the simpler model. By adding a penalty term, no matter how small it is, you know at least at some point these insignificant gains in accuracy will be outweighed by increase in penalty. By effectively choosing and scaling your penalty function you can have your objective function approximate the true error as opposed to the training error.

#### Example: Penalty Function in Neural Network Model Selection

In MLP neural networks, the activation function is of the form of a logistic function, where the function behaves almost linearly when the input is close to zero (i.e., the weights of the neural network are close to zero), while the function behaves non-linearly as the magnitude of the input increases (i.e., the weights of the neural network become larger). In order to penalize additional model complexity (i.e., unnecessary non-linearities in the model), large weights will be penalized by the penalty function.

The objective function to minimize with respect to the weights $\ u_{ji}$ is:

$\ Reg=\left|y-\hat{y}\right|^2 + \lambda*\sum_{i=1}^{n}(u_{ji})^2$ If the weight start to grow, then $\sum_{i=1}^{n}(u_{ji})^2$ becomes larger and $\left|y-\hat{y}\right|^2$ becomes smaller.

The derivative of the objective function with respect to the weights $\ u_{ji}$ is:
$\cfrac{\partial Reg}{\partial u_{ji}} = \cfrac{\partial \left|y-\hat{y}\right|^2}{\partial u_{ji}}+2*\lambda*u_{ji}$

This objective function is used during gradient descent. In practice, cross validation is used to determine the value of $\ \lambda$ in the objective function.

We can do CV to choose $\lambda$. In case of any model, the least "complex" is the linear model. gradually let the complexity to grow then complexity begins to rise.

We want non-linear model but not too curvy.

#### Penalty Functions in Practice

In practice, we only apply the penalty function to the parametrized terms. That is, the bias term is not regularized, since it is simply the DC component and is not associated with a feature. Although this makes little difference, the concept is clear that the bias term should not be considered when determining the relative weights of the features.

In particular, we update the weights as follows:

$u_{ji} := \begin{cases} u_{ji} + \alpha * \cfrac{\partial \left|y-\hat{y}\right|^2}{\partial u_{ji}} &bias term\\ u_{ji} + \alpha * \cfrac{\partial \left|y-\hat{y}\right|^2}{\partial u_{ji}}+2*\lambda*u_{ji} &otherwise \end{cases}$

## Radial Basis Function Neural Network (RBF NN)

Radial Basis Function Network(RBF) NN is a type of neural network with only one hidden layer in addition to an input and output layer. Each node within the hidden layer uses a radial basis activation function, hence the name of the RBF NN. A radial basis function is a real-valued function whose value depends only on the distance from center. One of the most commonly used radial basis functions is Gaussian. The weights from the input layer to the hidden layer are always "1" in a RBF NN, while the weights from the hidden layer to the output layer are adjusted during training. The output unit implements a weighted sum of hidden unit outputs. The input into an RBF NN is nonlinear while the output is linear. Due to their nonlinear approximation properties, RBF NNs are able to model complex mappings, which perceptron based neural networks can only model by means of multiple hidden layers. It can be trained without back propagation since it has a closed-form solution. RBF NNs have been successfully applied to a large diversity of applications including interpolation, chaotic time series modeling, system identification, control engineering, electronic device parameter modeling, channel equalization, speech recognition, image restoration, shape-form-shading, 3-D object modeling, motion estimation and moving object segmentation, data fusion, etc. <ref>www-users.cs.york.ac.uk/adrian/Papers/Others/OSEE01.pdf</ref>

#### The Network System

1. Input:
n data points $\mathbf{x}_i\subset \mathbb{R}^d, \quad i=1,...,n$
2. Basis function (the single hidden layer):
$\mathbf{\phi}_{n*m}$, where $m$ is the number of the neurons/basis functions that project original data points into a new space.
There are many choices for the basis function. The commonly used is radial basis:
$\phi_j(\mathbf{x}_i)=e^{-|\mathbf{x}_i-\mathbf{\mu}_j|^2}$
3. Weights associated with the last layer: $\mathbf{W}_{m*k}$, where k is the number of classes in the output $\mathbf{Y}$.
4. Output: $\mathbf{Y}$, where
$y_k(x)=\sum_{j=1}^{m}(W_{jk}*\phi_j(x))$
Alternatively, the output $\mathbf{Y}$ can be written as $Y=\phi*W$

where

$\hat{Y}_{n,k} = \left[ \begin{matrix} \hat{y}_{1,1} & \hat{y}_{1,2} & \cdots & \hat{y}_{1,k} \\ \hat{y}_{2,1} & \hat{y}_{2,2} & \cdots & \hat{y}_{2,k} \\ \vdots &\vdots & \ddots & \vdots \\ \hat{y}_{n,1} & \hat{y}_{n,2} & \cdots & \hat{y}_{n,k} \end{matrix}\right]$ is the matrix of output variables.
$\Phi_{n,m} = \left[ \begin{matrix} \phi_{1}(\mathbf{x}_1) & \phi_{2}(\mathbf{x}_1) & \cdots & \phi_{m}(\mathbf{x}_1) \\ \phi_{1}(\mathbf{x}_2) & \phi_{2}(\mathbf{x}_2) & \cdots & \phi_{m}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{1}(\mathbf{x}_n) & \phi_{2}(\mathbf{x}_n) & \cdots & \phi_{m}(\mathbf{x}_n) \end{matrix}\right]$ is the matrix of Radial Basis Functions.
$W_{m,k} = \left[ \begin{matrix} w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,1} & w_{m,2} & \cdots & w_{m,k} \end{matrix}\right]$ is the matrix of weights.

Here, $k$ is the number of outputs, $n$ is the number of data points, and $m$ is the number of hidden units. If $k = 1$, $\hat Y$ and $W$ are column vectors. If m = n, then $\mathbf{\mu}_i = \mathbf{x}_i$, so $\phi_{i}$ checks to see how similar the two data points are.

$Y=\phi W$ where Y and $\phi$ are known while W is unknown. The object function is $\psi=|Y-\Phi W|^2$ and we want to $\underset{W}{\mbox{min}} |Y-\Phi W|^2$. Therefore, to get the optimal weight, $W=(\phi^T \phi)^{-1}\phi^TY$

#### Network Training

To construct m basis functions, first cluster data points into m groups. Then find the centre of each cluster $\mu_1$ to $\mu_m$.

Clustering: the K-means algorithm <ref>This section is taken from Wikicourse notes stat441/841 fall 2010.</ref>
K-means is a commonly applied technique in clustering observations into groups by minimizing the distance of individual observations from the center of the cluster it is in. The most common K-means algorithm used is referred to as Lloyd's algorithm:

Step1: Select the number of clusters m

Step2: Randomly select m observations from the n observations, to be used as m initial centers.

Step2 (Alternative): Randomly assign all data points to clusters and use the means of those clusters as the initial centers.

Step3: For each data point from the rest of observations, compute the distance to each of the initial centers and classify it into the cluster with the minimum distance.

Step4: Obtain updated cluster centers by computing the mean of all the observations in the corresponding clusters.

Step5: Repeat Step 3 and Step 4 until all of the differences between the old cluster centers and new cluster centers are acceptable.

Having constructed the basis functions, next minimize the objective function with respect to $\mathbf{W}$:
$min \;\left|| Y-\phi*W\right ||_2^{2}$

The solution to the problem is $\ W=(\phi^T*\phi)^{-1}*\phi^T*Y$

Matlab example:

clear all;
clc;
P=ionosphere(:,1:(end-1));
P=P';
T=ionosphere(:,end);
T=T';
net=newff(minmax(P),[4,1],{'logsig','purelin'},'trainlm');
net.trainParam.show=100;
net.trainParam.mc=0.9;
net.trainParam.mu=0.05;
net.trainParam.mu_dec=0.1;
net.trainParam.mu_inc=5;
net.trainParam.lr=0.5;
net.trainParam.goal=0.01;
net.trainParam.epochs=5000;
net.trainParam.max_fail=10;
net.trainParam.mem_reduc=2;
net.trainParam.alpha=0.1;
net.trainParam.delt_inc=1;
net.trainParam.delt_dec=0.1;
net=init(net);
net,tr]=train(net,P,T);
A = sim(net,P);
E = T - A;
disp('the training error:')
MSE=mse(E)


### Single Basis Function vs. Multiple Basis Functions

Suppose the data points belong to a mixture of Gaussian distributions.

Under single basis function approach, every class in $\mathbf{Y}$ is represented by a single basis function. This approach is similar to the approach of linear discriminant analysis.

Compare $y_k(x)=\sum_{j=1}^{m}(W_{jk}*\phi_j(x))$
with $P(Y|X)=\frac{P(X|Y)*P(Y)}{P(X)}$.
Here, the basis function $\mathbf{\phi}_{j}$ can be thought of as equivalent to $\frac{P(X|Y)}{P(X)}$.

Under multiple basis function approach, a layer of j basis functions are placed between $\mathbf{Y}$ and $\mathbf{X}$. The probability function of the joint distribution of $\mathbf{X}$, $\mathbf{J}$ and $\mathbf{Y}$ is

$\,P(X,J,Y)=P(Y)*P(J|Y)*P(X|J)$

Here, instead of using single Gaussian to represent each class, we use a "mixture of Gaussian" to represent.
The probability funcion of $\mathbf{Y}$ conditional on $\mathbf{X}$ is

$P(Y|X)=\frac{P(X,Y)}{P(X)}=\frac{\sum_{j}{P(X,J,Y)}}{P(X)}$

Multiplying both the nominator and the denominator by $\ P(J)$ yields

$\ P(Y|X)=\sum_{j}{P(J|X)*P(Y|J)}$
where $\ P(J|X)$ tells that, with given X (data), how likely the data is in the Gaussian J, and $\ P(Y|J)$ tells that, with given Gaussian J, how likely this Gaussian belongs to class K.

since
$\ P(J|X)=\frac{P(X|J)*P(J)}{P(X)}$ and $\ P(Y|J)=\frac{P(Y|J)*P(Y)}{P(J)}$

If the weights in the radial basis neural network have proper properties of probability function, then the basis function $\mathbf{\phi}_j$ can be thought of as $\ P(J|X)$, representing the probability that $\mathbf{x}$ is in Gaussian class j; and the weight function W can be thought of as $\ P(Y|J)$, representing the probability that a data point belongs to class k given that the point is from Gaussian class j.

In conclusion, given a mixture of Gaussian distributions, multiple basis function approach is better than single basis function, since the former produces a non-linear boundary.

## RBF Network Complexity Control (Lecture: Oct. 25, 2011)

When performing model selection, overfitting is a common issue. As model complexity increases, there comes a point where the model becomes worse and worse at fitting real data even though it fits the training data better. It becomes too sensitive to small perturbations in the training data that should be treated as noise to allow flexibility in the general case. In this section we will show that training error (empiricial error from the training data) is a poor estimator for true error and that minimizing training error will increase complexity and result in overfitting. We will show that test error (empirical error from the test data) is a better estimator of true error.

First, some notation is defined.

The assumption for the training data set is that it consists of the true model values $\ f(x_i)$ plus some additive Gaussian noise $\ \epsilon_i$:

$\ y_i = f(x_i)+\epsilon_i$ where $\ \epsilon \sim N(0,\sigma^2)$

$\ y_i = true\,model + noise$

### Important Notation

Let:

• $\displaystyle f(x)$ denote the true model.
• $\hat f(x)$ denote the prediction/estimated model, which is generated from a training data set $\displaystyle T = \{(x_i, y_i)\}^n_{i=1}$. The observation is not accurate.

Remark: $\hat f(x_i) = \hat y_i$.

• $\displaystyle err$ denote the empirical error based on actual data points. This can be either test error or training error depending on the data points used. This is the difference between $(y-\hat{y})^2$
• $\displaystyle Err$ denote the true error or generalization error, and is what we are trying to minimize. It is the difference between $(f-\hat{f})^2$
• $\displaystyle MSE=E[(\hat f(x)-f(x))^2]$ denote the mean squared error.

We use the training data to estimate our model parameters.

$D=\{(x_i,y_i)\}_{i=1}^n$

For a given point $y_0$, the expectation of the empirical error is:

\begin{align} E[(\hat{y_0}- y_0)^2] &= E[(\hat{f_0}- f_0 -\epsilon_0)^2] \\ &=E[(\hat{f_0}-f_0)^2 + \epsilon_0^2 - 2 \epsilon_0 (\hat{f_0}-f_0)] \\ &=E[(\hat{f_0}-f_0)^2] + E[\epsilon_0^2] - 2 E [ \epsilon_0 (\hat{f_0}-f_0)] \\ &=E[(\hat{f_0}-f_0)^2] + \sigma^2 - 2 E [ \epsilon_0 (\hat{f_0}-f_0)] \end{align}

This is the formula partitions the training error into the true error and others errors. Our goal is to select the model that minimizes the true error so we must try to understand the effects of these other error terms if we are to use training error as a estimate for the true error.

The first term is essentially true error. The second term is a constant. The third term is problematic, since in general this expectation is not 0. We will break this into 2 cases to simplify the third term.

##### Case 1: Estimating Error using Data Points from Test Set

In Case 1, the empirical error is test error and the data points used to calculate test error are from the test set, not the training set. That is, $y_0 \notin T$.

We can rewrite the third term in the following way, since both $y_0$ and $\hat{f_0}$ have expectation $f_0$, the true value, which is a constant and not random.

\begin{align} E [ \epsilon_0 (\hat{f_0}-f_0)] &= E [ (y_0-f_0) (\hat{f_0}-f_0)] \\ & = cov{(y_0,\hat{f_0})} \end{align}

(The reason why covariance is here since $\displaystyle y_i$ is a new point, $\hat f$ and $\displaystyle y_i$ are independent.)

Consider $\ f_0$ is a mean.

Since $y_0$ is not part of the training set, it is independent of the model $\hat{f_0}$ generated by the training set. Therefore,

$y_0 \notin T \to y_0 \perp \hat{f}$

$\ cov{(y_0,\hat{f}_0)}=0$

The equation for the expectation of empirical error simplifies to the following:

$E[(y_0-\hat{y_0})^2] = E[(f_0-\hat{f_0})^2] + \sigma^2$

This result applies to every output value in the test data set, so we can generalize this equation by summing over all m data points that have NOT been seen by the model:

\begin{align} \sum_{i=1}^m{(y_i-\hat{y_i})^2} &= \sum_{i=1}^m{(f_i-\hat{f_i})^2)} + m \sigma^2 \\ err &= Err + m \sigma^2 \\ & = Err + constant\\ \end{align}

Rearranging to solve for true error, we get

$\ Err = err - m \sigma^2$

We see that test error is a good estimator for true error upto a constant additive value, since they only differ by a constant. Minimizing test error is equal to minimize true error. Moreover, the true error is less than the empirical error. There is no term adding unnecessary complexity. This is the justification for Cross Validation.

To avoid over-fitting or under-fitting using cross-validation, a validation data set selected so that it is independent from the estimated model.

### Case 2: Estimating Error using Data Points from Training Set

In Case 2, the data points used to calculate error are from the training set, so $\ y_0 \in T$, i.e. $\ (x_i, y_i)$ is in the training set. We will show that this results in a worse estimator for true error.

Now $\ y_0$ has been used to estimate $\ \hat{f}$ so they are not independent. We use Stein's lemma to simplify the term $\ E[\epsilon_0 (\hat{f_0} - f_0)]$.

Stein's Lemma states that if $\ x \sim N(\theta,\sigma^2)$ and $\ g(x)$ is differentiable, then

$E\left[g(x) (x - \theta)\right] = \sigma^2 E \left[ \frac{\partial g(x)}{\partial x} \right]$

Substitute $\ \epsilon_0$ for $\ x$ and $\ (\hat{f_0}-f_0)$ for $\ g(x)$. Note that $\ \hat{f_0}$ is a function of the noise, since as noise changes, $\hat{f_0}$ will change. Using Stein's Lemma, we get:

\begin{align} E[\epsilon_0 (\hat{f_0}-f_0)] &= \sigma^2 E \left[ \frac{\partial (\hat{f_0}-f_0)}{\partial \epsilon_0} \right]\\ &=\sigma^2 E\left[\frac{\partial \hat{f_0}}{\partial \epsilon_0}\right]\\ &=\sigma^2 E\left[\frac{\partial \hat{f_0}}{\partial y_0}\right]\\ &=\sigma^2 E\left[D_0\right] \end{align}

Remark: $\frac{\partial (\hat{f_0} - f_0)}{\partial y_0} = \frac{\partial (\hat{f_0} - f_0)}{\partial \epsilon_0} * \frac{\partial \epsilon_0}{\partial y} = \frac{\partial (\hat{f_0} - f_0)}{\partial \epsilon_0} * \frac{\partial (y_0 - \hat{y_0})}{\partial y}$

The reason why $\frac{\partial (\hat{f_0})}{\partial \epsilon_0} = 0$ is that $f_0$ is a constant instead of a function.

where $\frac{\partial (y_0 - \hat{y_0})}{\partial y} = 1$

We take $\ D_0 = \frac{\partial \hat{f_0}}{\partial y_0}$, where $\ D_0$ represents the derivative of the fitted model with respect to the observations. The equation for the expectation of empirical error becomes:

$E[(y_0-\hat{y_0})^2] = E[(f_0-\hat{f_0})^2] + \sigma^2 - 2 \sigma^2 E[D_0]$

Generalizing the equation for all n data points in the training set:

$\sum_{i=1}^n{(y_i-\hat{y_i})^2} = \sum_{i=1}^n{(f_i-\hat{f_i})^2} + n \sigma^2 - 2 \sigma^2 \sum_{i=1}^n{D_i}$

Based on the notation defined above, we then have:

$err = Err + n \sigma^2 - 2 \sigma^2 \sum_{i=1}^n{D_i}$

$Err = err - n \sigma^2 + 2 \sigma^2 \sum_{i=1}^n{D_i}$

This equation for the true error is called Stein's unbiased risk estimator (SURE). It is an unbiased estimator of the mean-squared error of a given estimator, in a deterministic estimation scenario. In other words, it provides an indication of the accuracy of a given estimator. This is important since, in deterministic estimation, the true mean-squared error of an estimator generally depends on the value of the unknown parameter and thus cannot be determined completely.

Note that $\ D_i$ depends on complexity of the model. It measures how sensitive the model is to small perturbations in a single $\ y_i$ in the training set. As complexity increases, the model will try to chase every little change and will be more sensitive to such perturbations. Minimizing training error without accounting for the impact of this term will result in overfitting. Thus, we need to know how to find $\ D_i$. Below we show an example, applying SURE to RBFs, where computing $\ D_i$ is straightforward.

### SURE for RBF Network Complexity Control

Problem: Assuming we want to fit our data using a radial basis function network, how many radial basis functions should be used? The network size has to compromise the approximation quality, which usually improves as the network grows, and the training effort, which increases with the network size. Moreover, too complex models can show insufficient generalization properties (overfitting) requiring small networks. Furthermore, in terms of hardware or software realization smaller networks occupy less area due to reduced memory needs. Hence, controlling the network size is one major task during training. For further information about RBF network complexity control check [4]

We can use Stein's unbiased risk estimator (SURE) to give us an approximation for how many RBFs to use.

The SURE equation is

$\mbox{Err}=\mbox{err} - n\sigma^2 + 2\sigma^2\sum_{i=1}^n D_i$

where $\ Err$ is the true error, $\ err$ is the empirical error, $\ n$ is the number of training samples, $\ \sigma^2$ is the variance of the noise of the training samples and $\ D_i$ is derivative of the model output with respect to true output as shown below

$D_i=\frac{\partial \hat{f_i}}{\partial y_i}$

Optimal Number of Basis in RBF

The optimal number of basis functions should be rearranged in order to minimize the generalization error $\ err$.

The formula for an RBF network is:

$\hat{f}=\Phi W$

where $\ \hat{f}$ is a matrix of RBFN outputs for each training sample, $\ \Phi$ is the matrix of neuron outputs for each training sample, and $\ W$ is the weight vector between each neuron and the output. Suppose we have m + 1 neurons in the network, where one has a constant function.

Given the training labels $\ Y$ we define the empirical error and minimize it

$\underset{W}{\mbox{min}} |Y-\Phi W|^2$

$\, W=(\Phi^T \Phi)^{-1} \Phi^T Y$

$\hat{f}=\Phi(\Phi^T \Phi)^{-1} \Phi^T Y$

For simplification let $\ H$ be the hat matrix defined as

$\, H=\Phi(\Phi^T \Phi)^{-1} \Phi^T$

Our optimal output then becomes

$\hat{f}=H Y$

We calculate $D$ from the SURE equation. We now consider applying SURE to Radial Basis Function networks specifically. Based on SURE, the optimum number of basis functions should be assigned so that the generalization error $\displaystyle err$ is minimized. Based on the RBF Network, by setting $\frac{\partial err}{\partial W}$ equal to zero we obtain the least squares solution of $\ W = (\Phi^{T}\Phi)^{-1}\Phi^{T}Y$. Then the fitted values are $\hat{Y} = \hat{f} = \Phi W = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}Y = HY$, where $\ H = \Phi(\Phi^{T}\Phi)^{-1}\Phi^{T}$ is the hat matrix for this model.

Consider only one node of the network. In this case we can write: $\hat f_i=\,H_{i1}y_1+\,H_{i2}y_2+\cdots+\,H_{ii}y_i+\cdots+\,H_{in}y_n$.

Note here that $\,H$ depends on the input vector $\displaystyle x_i$ but not on the observation $\displaystyle y_i$.

By taking the derivative of $\ \hat f_i$ with respect to $\displaystyle y_i$, we can readily obtain:

$\sum_{i=1}^n \frac {\partial \hat f}{\partial y_i}=\sum_{i=1}^n \,H_{ii}$

$D_i= \frac{\partial \hat f_i}{\partial y_i}=\frac{\partial [HY]_i}{\partial y_i}$ , $\hat f_i=\sum_{j}\,H_{ij}*Y_j$

Here we recall that $\sum_{i=1}^n\,D_{i}= \sum_{i=1}^n \,H_{ii}= \,Trace(H)$, the sum of the diagonal elements of $\,H$. Using the permutation property of the trace function we can further simplify the expression as follows: $\,Trace(H)= Trace(\Phi(\Phi^{T}\Phi)^{-1}\Phi^{T})= Trace(\Phi^{T}\Phi(\Phi^{T}\Phi)^{-1})=m$, by the trace cyclical permutation property, where $\displaystyle m$ is the number of basis functions in the RBF network (and hence $\displaystyle \Phi$ has dimension $\displaystyle n \times m$).

#### Sketch of Trace Cyclical Property Proof:

For $\, A_{mn}, B_{nm}, Tr(AB) = \sum_{i=1}^{n}\sum_{j=1}^{m}A_{ij}B_{ji} = \sum_{j=1}^{m}\sum_{i=1}^{n}B_{ji}A_{ij} = Tr(BA)$.
With that in mind, for $\, A_{nn}, B_{nn} = CD, Tr(AB) = Tr(ACD) = Tr(BA)$ (from above) $\, = Tr(CDA)$.

Note that since $\displaystyle \Phi$ is a projection of the input matrix $\,X$ onto a basis set spanned by $\,m$, the number of basis functions, that sometimes an extra $\displaystyle \Phi_0$ term is included without any input to represent the intercept of a fitted model. In this case, if considering an intercept, then $\,Trace(H)= m+1$.

The SURE equation then becomes

$\, \mbox{Err}=\mbox{err} - n\sigma^2 + 2\sigma^2(m+1)$

As the number of RBFs $\ m$ increases the empirical error $\ err$ decreases, but the right term of the SURE equation increases. An optimal true error $\ Err$ can be found by increasing $\ m$ until $\ Err$ begins to grow. At that point the estimate to the minimum true error has been reached.

The value of m that gives the minimum true error estimate is the optimal number of basis functions to be implemented in the RBF network, and hence is also the optimal degree of complexity of the model.

One way to estimate the noise variance is

$\hat{\sigma}^2=\frac{\sum (y-\hat{y})^2}{n-1}$

This application of SURE is straightforward because minimizing Radial Basis Function error reduces to a simple least squares estimator problem with a linear solution. This makes computing $\ D_i$ quite simple. In general, $\ D_i$ can be much more difficult to solve for.

### RBF Network Complexity Control (Alternate Approach)

An alternate approach (not covered in class) to tackling RBF Network complexity control is controlling the complexity by similarity <ref name="Eickhoff">R. Eickhoff and U. Rueckert, "Controlling complexity of RBF networks by similarity," Proceedings of European Symposium on Artificial Neural Networks, 2007</ref>. In <ref name="Eickhoff" />, the authors suggest looking at the similarity between the basis functions multiplied by their weight by determining the cross-correlations between the functions. The cross-correlation is calculated as follows:

$\ \rho_{ij} = \frac{E[g_i(x)g_j(x)]}{\sqrt(E[g^2_i(x)]E[g^2_j(x)])}$

where $\ E[]$ denotes the expectation and $\ g_i(x)$ and $\ g_j(x)$ would denote two of the basis functions multiplied by their respective weights.

If the cross-correlation between two functions is high, <ref name="Eickhoff" /> suggests that the two basis functions be replaced with one basis function that covers the same region of both basis functions and that the corresponding weight of this new basis function be the average of the weights of the two basis functions. For the case of Gaussian radial basis functions, the equations for finding the new weight ($\ w_{new}$), mean ($\ c_{new}$) and variance ($\ \sigma_{new}$) are as follows:

$\ w_{new} = \frac{w_i + w_j}{2}$

$\ c_{new} = \frac{1}{w_i \sigma^n_i + w_j \sigma^n_j}(w_i \sigma^n_i c_i + w_j \sigma^n_j c_j)$

$\ \sigma^2_{new} = \left(\frac{\sigma_i + \sigma_j}{2}+ \frac{min(||m-c_i||,||m-c_j||)}{2}\right)^2$

where $\ n$ denotes the input dimension and $\ m$ denotes the total number of radial basis functions.

This process is repeated until the cross-correlation between the basis functions falls below a certain threshold, which is a tunable parameter.

Note 1) Though not extensively discussed in <ref name="Eickhoff" />, this approach to RBF Network complexity control presumably requires a starting RBF Network with a large number basis functions.

Note 2) This approach does not require the repeated implementation of differently sized RBF Networks to determine the empirical error, unlike the approach using SURE. However, the SURE approach is backed up by theory to find the number of radial basis functions that optimizes the true error and does not rely on some tunable threshold. It would be interesting to compare the results of both approaches (in terms of the resulting RBF Network obtained and the test error).

### Generalized SURE for Exponential Families

As we know, Stein’s unbiased risk estimate (SURE) is limited to be applied for the independent, identically distributed (i.i.d.) Gaussian model. However, in some recent work, some researchers tried to work on obtaining a SURE counterpart for general, instead of deriving estimate by dominating least-squares estimation, and this technique made SURE extend its application to a wider area.

You may look at Yonina C. Eldar, Generalized SURE for Exponential Families: Applications to Regularization, IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 57, NO. 2, FEBRUARY 2009 for more information.

Fully Tuned Radial Basis Function Neural Networks for Flight Control <ref> http://www.springer.com/physics/complexity/book/978-0-7923-7518-0;jsessionid=985F21372AC7AE1B654F1EADD11B296F.node3 </ref>

Radial Basis Function (RBF) Networks <ref>http://documents.wolfram.com/applications/neuralnetworks/index6.html</ref>

An Example of RBF Networks <ref>http://reference.wolfram.com/applications/neuralnetworks/ApplicationExamples/12.1.2.html</ref>

This paper suggests an objective approach in determining proper samples to find good RBF networks with respect to accuracy <ref>http://www.wseas.us/e-library/conferences/2009/hangzhou/MUSP/MUSP41.pdf</ref>.

## Support Vector Machines (Lecture: Oct. 27, 2011)

Support vector machines (SVMs) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. SVMs are kernel machines based on the principle of structural risk minimization, which are used in applications of regression and classification; however, they are mostly used as binary classifiers. Although the subject can be said to have started in the late seventies (Vapnik, 1979), it is receiving increasing attention recently by researchers. It is such a powerful method that in the few years since its introduction has outperformed most other systems in a wide variety of applications, especially in pattern recognition.

The current standard incarnation of SVM is known as "soft margin" and was proposed by Corinna Cortes and Vladimir Vapnik [5]. In practice the data is not usually linearly separable. Although theoretically we can make the data linearly separable by mapping it into higher dimensions, the issues of how to obtain the mapping and how to avoid overfitting are still of concern. A more practical approach to classifying non-linearly separable data is to add some error tolerance to the separating hyperplane between the two classes, meaning that a data point in class A can cross the separating hyperplane into class B by a certain specified distance. This more generalized version of SVM is the so-called "soft margin" support vector machine and is generally accepted as the standard form of SVM over the hard margin case in practice today. [6]

Support Vector Machines are motivated by the idea of training linear machines with margins. It involves preprocessing the data to represent patterns in a high dimension (generally much higher than the original feature space). Note that using a suitable non-linear mapping to a sufficiently high dimensional space, the data will always be separable. (p. 263) <ref>[Duda, Richard O., Hart, Peter E., Stork, David G. "Pattern Classification". Second Edition. John Wiley & Sons, 2001.]</ref>

A suitable way to describe the interest in SVM can be seen in the following quote. "The problem which drove the initial development of SVMs occurs in several guises - the bias variance tradeoff (Geman, Bienenstock and Doursat, 1992), capacity control (Guyon et al., 1992), overfitting (Montgomery and Peck, 1992) - but the basic idea is the same. Roughly speaking, for a given learning task, with a given finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set, and the “capacity” of the machine, that is, the ability of the machine to learn any training set without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist’s lazy brother, who declares that if it’s green, it’s a tree. Neither can generalize well. The exploration and formalization of these concepts has resulted in one of the shining peaks of the theory of statistical learning (Vapnik, 1979). A Tutorial on Support Vector Machines for Pattern Recognition

##### Support Vector Method Solving Real-world Problems

No matter whether the training data are linearly-separable or not, the linear boundary produced by any of the versions of SVM is calculated using only a small fraction of the training data rather than using all of the training data points. This is much like the difference between the median and the mean.

SVM can also be considered a special case of Tikhonov regularization. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. The key features of SVM are the use of kernels, the absence of local minima, the sparseness of the solution (i.e. few training data points are needed to construct the linear decision boundary) and the capacity control obtained by optimizing the margin.(Shawe-Taylor and Cristianini (2004)).

Another key feature of SVM, as discussed below, is the use of slack variables to control the amount of tolerable misclassification on the training data, which form the soft margin SVM. This key feature can serve to improve the generalization of SVM to new data. SVM has been used successfully in many real-world problems:

- Pattern Recognition, such as Face Detection , Face Verification, Object Recognition, Handwritten Character/Digit Recognition, Speaker/Speech Recognition, Image Retrieval , Prediction;

- Text and Hypertext categorization;

- Image classification;

- Bioinformatics, such as Protein classification, Cancer classification;

Please refer to here for more applications.

##### Structural Risk Minimization and VC Dimension

Linear learning machines are the fundamental formulations of SVMs. The objective of the linear learning machine is to find the linear function that minimizes the generalization error from a set of functions which can approximate the underlying mapping between the input and output data. Consider a learning machine that implements linear functions in the plane as decision rules

$f(\mathbf{x},\boldsymbol{\beta}, \beta_0)=sign (\boldsymbol{\beta}^T\mathbf{x}+\beta_0)$

With n given training data with input values $\mathbf{x}_i \in \mathbb{R}^d$ and output values $y_i\in\{-1,+1\}$. The empirical error is defined as

$\Re_{emp} (\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^n |y_i-f(\mathbf{x},\boldsymbol{\beta}, \beta_0)|= \frac{1}{n}\sum_{i=1}^n |y_i-sign (\boldsymbol{\beta}^T\mathbf{x}+\beta_0)|$

where $\boldsymbol{\theta}=(\mathbf{x},\boldsymbol{\beta})$

The generalization error can be expressed as

$\Re (\boldsymbol{\theta}) = \int|y-f(\mathbf{x},\boldsymbol{\theta})|p(\mathbf{x},y)dxdy$

which measures the error for all input/output patterns that are generated from the underlying generator of the data characterized by the probability distribution $p(\mathbf{x},y)$ which is considered to be unknown. According to statistical learning theory, the generalization (test) error can be upper bounded in terms of training error and a confidence term as shown in

$\Re (\boldsymbol{\theta})\leq \Re_{emp} (\boldsymbol{\theta}) +\sqrt{\frac{h(ln(2n/h)+1)-ln(\eta/4)}{n}}$

The term on left side represents generalization error. The first term on right hand side is empirical error calculated from the training data and the second term is called VC confidence which is associated with the VC dimension h of the learning machine. VC dimension is used to describe the complexity of the learning system. The relationship between these three items is illustrated in figure below:

File:risk.png
The relation between expected risk, empirical risk and VC confidence in SVMs.

Thus, even though we don’t know the underlying distribution based on which the data points are generated, it is possible to minimize the upper bound of the generalization error in place of minimizing the generalization error. That means one can minimize the expression in the right hand side of the inequality above.

Unlike the principle of Empirical Risk Minimization (ERM) applied in Neural Networks which aims to minimize the training error, SVMs implement Structural Risk Minimization (SRM) in their formulations. SRM principle takes both the training error and the complexity of the model into account and intends to find the minimum of the sum of these two terms as a trade-off solution (as shown in figure above) by searching a nested set of functions of increasing complexity.

##### Introduction

Support Vector Machineis a popular linear classifier. Suppose that we have a data set with two classes which could be separated using a hyper-plane. Support Vector Machine (SVM) is a method which will give us the "best" hyper-plane. There are other classifiers that find a hyper-plane that separate the data, namely Perceptron. However, the output of Perceptron and many other algorithms depends on the input parameters, so every run of Percetron can give you a different output. On the other hand, SVM tries to find the hyper-plane that separates the data and have the farthest distance from the points. This is also known as the Max-Margin hyper-plane.

No matter whether the training data are linearly-separable or not, the linear boundary produced by any of the versions of SVM is calculated using only a small fraction of the training data rather than using all of the training data points. This is much like the difference between the median and the mean. SVM can also be considered a special case of Tikhonov regularization. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. The key features of SVM are the use of kernels, the absence of local minima, the sparseness of the solution (i.e. few training data points are needed to construct the linear decision boundary) and the capacity control obtained by optimizing the margin.(Shawe-Taylor and Cristianini (2004)). Another key feature of SVM, as discussed below, is the use of slack variables to control the amount of tolerable misclassification on the training data, which form the soft margin SVM. This key feature can serve to improve the generalization of SVM to new data.

With Perceptron, there can be infinitely many separating hyperplanes such that the training error will be zero. But the question is that among all these possible solution which one is the best. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier. This makes sense because at test time, more points will be observed and they may be closer to the other class, so the safest choice for the hyper-plane would be the one farthest from both classes.

One of the great things about SVM is that not only it has solid theoretical guarantees, but also it works very well in practice.

To summarize:

What we mean by margin is the distance between the hyperplane and the closest point in a class.

If the data is Linearly separable, then there exists infinitely many solution hyperplanes. Of those, infinitely many hyperplanes, one of them is the best choice for the solution. Then the best decision to make is the hyperplane which is furthest from both classes. Our goal is to find a hyperplane among all possible hyperplanes which is furthest from both classes. This is to say, find the hyperplane that has maximum margin. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum margin classifier; or equivalently, the perceptron of optimal stability.

What we mean by margin is the distance between the hyperplane and the closest point in a class.

If the mean value were to be used instead of the closest point, then an outlier may pull the hyperplane into the data which would incorrectly classify the known data points. This is the reason why we use the closest point instead of the expected value.
##### Setting
What is $d_i$
• We assume that the data is linearly separable
• Our classifier will be of the form $\boldsymbol\beta^T\mathbf{x} + \beta_0$
• We will assume that our labels are $y_i \in \{-1,1\}$

The goal is to classify the point $\mathbf{x_i}$ based on the $sign \{d_i\}$ where $d_i$ is the signed distance between $\mathbf{x_i}$ and the hyperplane.

Now we are going to check how far this point is from the hyperplane, and the parts on one side of the hyperplane will have a negative value and the parts on the other side will have a positive value. Points are classified by the sign of the data point. So $\mathbf{x_i}$ would be classified using $d_i$

### Side Note: A memory from the past of Dr. Ali Ghodsi

When the aforementioned Professor was a small child, grade 2. He was often careless with the accuracy of certain curly brackets, when writing what one can only assume was math proofs. One day, his teacher grew impatient and demanded that a page of perfect curly brackets be produced by the young Dr. (He may or may not have been a doctor at the time) And now, whenever Dr. Ghodsi writes a tidy curly bracket, he is reminded of this and it always brings a smile to his face.

From memories of the past.

(the number 20 was involved in the story, either the number of pages or the number of lines)

##### Case 1: Linearly Separable (Hard Margin)

In this case, the classifier will be $\boldsymbol {\beta^T}x \boldsymbol {} + \beta_0$ and $\ y \in \{-1, 1\}$ The classify point $\boldsymbol {x_i}$ based on sign $\ \{d_i\}$, where $\ d_i$ is the signed distance between $\boldsymbol {x_i}$ and the hyperplane.

##### Objective Function
Look at it being perpendicular

Observation 1: $\boldsymbol\beta$ is orthogonal to hyper-plane. Because, for any two arbitrary points $\mathbf{x_1, x_2}$ on the plane we have:

$\boldsymbol\beta^T\mathbf{x_1} + \beta_0 = 0$

$\boldsymbol\beta^T\mathbf{x_2} + \beta_0 = 0$

So $\boldsymbol\beta^T (\boldsymbol{x_1}-\boldsymbol{x_2}) = 0$. Thus, $\boldsymbol\beta^T \perp (\boldsymbol{x_1} - \boldsymbol{x_2})$, which implies that $\boldsymbol \beta$ is a normal vector to the hyper-plane.

Observation 2: If $\boldsymbol x_0$ is a point on the hyper-plane, then there exists a $\ \beta_0$ such that, $\boldsymbol\beta^T\boldsymbol{x_0}+\beta_0 = 0$. So $\boldsymbol\beta^T\boldsymbol{x_0} = - \beta_0$. This along with observation 1 imply there exists a $\ \beta_0$ such that, $\boldsymbol\beta^T\boldsymbol{x} = - \beta_0$ for all $\boldsymbol{x}$ on the hyperplane.

Observation 3: Let $\ d_i$ be the signed distance of point $\boldsymbol{x_i}$ from the plane. The $\ d_i$ is the projection of $(\boldsymbol{x_i} - \boldsymbol{x_0})$ on the direction of $\boldsymbol\beta$. In other words, $d_i \propto \boldsymbol\beta^T(\mathbf{x - x_0})$.(normalize $\beta$)

\begin{align} \displaystyle d_i &= \frac{\boldsymbol\beta^T(\boldsymbol{x_i} - \boldsymbol{x_0})}{\vert \boldsymbol\beta\vert}\\ & = \frac{\boldsymbol{\beta^Tx_i}- \boldsymbol{\beta^Tx_0}}{\vert \boldsymbol\beta\vert}\\ & = \frac{\boldsymbol{\beta^Tx_i}+ \beta_0}{\vert \boldsymbol\beta\vert} \end{align}

Observation 4: Let margin be the distance between the hyper-plane and the closest point. Since $d_i$ is the signed distance between the hyperplane and point $\boldsymbol{x_i}$, we can define the positive distance of point $\boldsymbol{x_i}$ from the hyper-plane as $(y_id_i)$.

\begin{align} \displaystyle \text{Margin} &= \min\{y_i d_i\}\\ &= \min\{ \frac{y_i(\boldsymbol\beta^T\mathbf{x_i} + \beta_0)}{|\boldsymbol\beta|} \} \end{align}

Our goal is to maximize the margin. This is also known as the Max/Min problem in Optimization. When defining the hyperplane, what is important is the direction of $\boldsymbol\beta$. Value of $\beta_0$ does not change the direction of the hyper-plane, it is only the distance from the origin. Note that if we assume that the points do not lie on the hyper-plane, then the margin is positive:

\begin{align} \displaystyle &y_i(\boldsymbol\beta^T\mathbf{x_i} + \beta_0) \geq 0 &&\\ &y_i(\boldsymbol\beta^T\mathbf{x_i} + \beta_0) \geq C &&\mbox{ for some positive C } \\ &y_i(\frac{\boldsymbol\beta^T}{C}\mathbf{x_i} + \frac{\beta_0}{C}) \geq 1 &&\mbox{ Divide by C}\\ &y_i(\boldsymbol\beta^{*T}\mathbf{x_i} + \beta^*_0) \geq 1 && \mbox{ By setting }\boldsymbol\beta^* = \frac{\boldsymbol\beta}{C}, \boldsymbol\beta_0^* = \frac{\boldsymbol\beta_0}{C}\\ &y_i(\boldsymbol\beta^{T}\mathbf{x_i} + \beta_0) \geq 1 && \mbox{ By setting }\boldsymbol\beta\gets\boldsymbol\beta^*, \boldsymbol\beta_0\gets\boldsymbol\beta_0^*\\ \end{align}

So with a bit of abuse of notation we can assume that

$y_i(\boldsymbol\beta^T\mathbf{x_i} + \beta_0) \geq 1$

Therefore, the problem translates to:

$\, \min\{\frac{1}{|\boldsymbol\beta|}\}$

So, it is possible to re-interpret the problem as:

$\, \max \frac 12 \vert \boldsymbol\beta \vert^2\,$ s.t. $\,y_i (\boldsymbol\beta^{T} \boldsymbol{x_i}+ \beta_0) \geq 1$

$\, \vert \boldsymbol\beta \vert$ could be any norm, but for simplicity we use L2 norm. We use $\frac 12 \vert \boldsymbol\beta \vert^2$ instead of $|\boldsymbol\beta|$ to make the function differentiable. To solve the above optimization problem we can use Lagrange multipliers as follows

##### Support Vectors

Support vectors are the training points that determine the optimal separating hyperplane that we seek. Also, they are the most difficult points to classify and at the same time the most informative for classification.

##### Writing Lagrangian Form of Support Vector Machine

The Lagrangian form using Lagrange multipliers and constraints that are discussed below is introduced to ensure that the optimization conditions are satisfied, as well as finding an optimal solution (the optimal saddle point of the Lagrangian for the classic quadratic optimization). The problem will be solved in dual space by introducing $\,\alpha_i$ as dual constraints, this is in contrast to solving the problem in primal space as function of the betas. A simple algorithm for iteratively solving the Lagrangian has been found to run well on very large data sets, making SVM more usable. Note that this algorithm is intended to solve Support Vector Machines with some tolerance for errors - not all points are necessarily classified correctly. Several papers by Mangasarian explore different algorithms for solving SVM.

Dual form of the optimization problem:

\begin{align} \displaystyle L(\boldsymbol\beta, \beta_0, \boldsymbol\alpha) &= \frac 12 \vert \boldsymbol\beta \vert^2 - \sum_{i=1}^n \alpha_i \left[ y_i (\boldsymbol{\beta^T x_i}+\beta_0) -1 \right]\\ &= \frac 12 \vert \boldsymbol\beta \vert^2 - \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i \boldsymbol{x_i} - \sum_{i=1}^n \alpha_i y_i \beta_0 - \sum_{i=1}^n \alpha_i \end{align}

To find the optimal value, we set the derivatives equal to zero: $\,\frac{\partial L}{\partial \boldsymbol{\beta}} = 0$ and $\,\frac{\partial L}{\partial \beta_0} = 0$.

\begin{align} \displaystyle &\frac{\partial L}{\partial \boldsymbol{\beta}} = \boldsymbol\beta - \sum_{i=1}^n \alpha_i y_i \boldsymbol{x_i} = 0 &\Longrightarrow& \boldsymbol\beta = \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i}\\ &\frac{\partial L}{\partial \beta_0} = - \sum_{i=1}^n \alpha_i y_i = 0 &\Longrightarrow& \sum_{i=1}^n \alpha_i y_i = 0 \end{align}

To get the dual form of the optimization problem we replace the above two equations in definition of $L(\boldsymbol\beta, \beta_0, \boldsymbol\alpha)$.

We have: \begin{align} \displaystyle L(\boldsymbol\beta, \beta_0, \boldsymbol\alpha) &= \frac 12 \boldsymbol\beta^T\boldsymbol\beta - \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i \boldsymbol{x_i} - \sum_{i=1}^n \alpha_i y_i \beta_0 - \sum_{i=1}^n \alpha_i\\ &= \frac 12 \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i} - \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i} - 0 + \sum_{i=1}^n \alpha_i\\ &= - \frac 12 \boldsymbol\beta^T \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i} + \sum_{i=1}^n \alpha_i\\ &= - \frac 12 \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i}^T \sum_{i=1}^n \alpha_i y_i\boldsymbol{x_i} + \sum_{i=1}^n \alpha_i\\ &= \sum_{i=1}^n \alpha_i - \frac 12 \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_jy_iy_j\boldsymbol{x_i}^T\boldsymbol{x_j} \end{align}

The above function is a dual objective function, so we should minimize it:

\begin{align} \displaystyle \min_\alpha &\sum_{i=1}^n \alpha_i - \frac 12 \sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j}\\ s.t.\; & \alpha_i \geq 0\\ & \sum_{i=1}^n \alpha_i y_i = 0 \end{align}

The dual function is a quadratic function of several variables subject to linear constraints. This optimization problem is called Quadratic Programming and is much easier than the primal function. It is possible to to write to dual form using matrices:

\begin{align} \displaystyle \min_\alpha \,& \boldsymbol\alpha^T\boldsymbol{1} - \frac 12 \boldsymbol\alpha^T S \boldsymbol\alpha\\ s.t.\; & \boldsymbol\alpha \geq 0\\ & \boldsymbol\alpha^Ty = 0\\ & S = ([y_1,\dots, y_n]\odot X)^T ([y_1,\dots, y_n]\odot X) \end{align}

Since $S = ([y_1,\dots, y_n]\odot X)^T ([y_1,\dots, y_n]\odot X)$, S is a positive semi-definite matrix. This means that the dual function is convex.[7]. This means that the dual function does not have any local minimum that is not global. So it is relatively easy to find the global minimum.

This is a much simpler optimization problem and we can solve it by Quadratic programming. Quadratic programming (QP) is a special type of mathematical optimization problem. It is the problem of optimizing (minimizing or maximizing) a quadratic function of several variables subject to linear constraints on these variables. The general form of such a problem is minimize with respect to $\,x$

$f(x) = \frac{1}{2}x^TQx + c^Tx$

subject to one or more constraints of the form:

$\,Ax\le b$, $\,Ex=d$.

A good description of general QP problem formulation and solution can be find link here.

##### Discussion on the Dual of the Lagrangian

As mentioned in the previous section, solving the dual form of the Lagrangian requires quadratic programming. Quadratic programming can be used to minimize a quadratic function subject to a set of constraints. In general, for a problem with N variables, the quadratic programming solution has a computational complexity of $\ O(N^3)$ <ref name="CMBishop" />. The original problem formulation only has (d+1) variables that need to be found (i.e. the values of $\ \beta$ and $\ \beta_0$), where d is the dimensionality of the data points. However, the dual form of the Lagrangian has n variables that need to be found (i.e. all the $\ \alpha$ values), where n is the number of data points. It is likely that n is larger than (d+1) (i.e. the number of data points is larger than the dimensionality of the data plus 1), which makes the dual form of the Lagrangian seem computationally inefficient <ref name="CMBishop" />. However, the dual of the Lagrangian allows the inner product $\ x_i^T x_j$ to be expressed using a kernel formulation which allows the data to be transformed into higher feature spaces and thus allowing seemingly non-linearly separable data points to be separated, which is a highly useful feature described in more detail in the next class <ref name="CMBishop" />.

##### Support Vector Method Packages

One of the popular Matlab toolboxes for SVM is LIBSVM, which has been developed in the department of Computer Science and Information Engineering, National Taiwan University, under supervision of Chih-Chung Chang and Chih-Jen Lin. In this page they have provided the society with many different interfaces for LIBSVM like Matlab, C++, Python, Perl, and many other languages, each one of those has been developed in different institutes and by variety of engineers and mathematicians. In this page you can also find a thorough introduction to the package and its various parameters.

A very helpful tool which you can find on the LIBSVM page is a graphical interface for SVM; it is an applet by which we can draw points corresponding to each of the two classes of the classification problem and by adjusting the SVM parameters, observe the resulting solution.

If you found LIBSVM helpful and wanted to use it for your research, please cite the toolbox.

A pretty long list of other SVM packages and comparison between all of them in terms of language, execution platform, multiclass and regression capabilities, can be found here.

The top 3 SVM software are:

1. LIBSVM

2. SVMlight

3. SVMTorch

More information which introduces SVM software and their comparison can be found here and here.

## Support Vector Machine Continued (Lecture: Nov. 1, 2011)

In the previous lecture we considered the case when data is linearly separable. The goal of the Support Vector Machine classifier is to find the hyper-plane that maximizes the margin from each of the two classes. We derived the following optimization problem based on the SVM methodology:

$\, \min_{\beta , \beta_0} \frac{1}{2}{|\boldsymbol{\beta}|}^2$

Subject to the constraint:

$\,y_i(\boldsymbol{\beta}^T\mathbf{x}_i+\beta_0)\geq1$

This is the primal form of the optimization problem. Then we derived the dual of this problem:

$\, \max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$

Subject to constraints:

$\,\alpha_i\geq 0$

$\,\sum_i \alpha_i y_i =0$

The is a quadratic programming problem. QP problems have been thoroughly studied and they can be efficiently solved. This particular problem has a convex objective function as well as convex constraints. This makes solving the problem easier and ensures that the computed solution is the global maximum. These properties are of significant importance for classifiers and thus are one of the most important strengths of the SVM classifier.

We are able to find $\,\beta$ when $\,\alpha$ is found:

$\, \boldsymbol{\beta} = \sum_i \alpha_i y_i \mathbf{x}_i$

But in order to find the hyper-plane uniquely we also need to find $\,\beta_0$.

When finding the dual objective function, there is a set of conditions called KKT that should be satisfied.

### Examining KKT Conditions

KKT stands for Karush-Kuhn-Tucker (initially named after Kuhn and Tucker's work in the 1950's, however, it was later discovered that Karush had stated the conditions back in the late 1930's) <ref name="CMBishop" />

The K.K.T. conditions are as follows: stationarity, primal feasibility, dual feasibility, and complementary slackness.

It gives us a closer look into the Lagrangian equation and the associated conditions.

Suppose we want to find $\, min_x f(x)$ subject to the constraint $\, g_i(x)\geq 0 , \forall{x}$ :

$\, \mathcal{L} (x,\alpha_i)=f(x)-\sum_i \alpha_i g_i(x)$

If $\, x^*$ is the optimal point, the necessary conditions for $\, x^*$ to be the local minimum :

1) $\, \frac{\partial \mathcal{L}}{\partial x} (x^*) = 0$ that is $\, f'(x^*) - \Sigma_i{\alpha_ig'(x^*)}=0$ stationarity

2) $\, \alpha_i\geq 0 ,$ dual feasibility

3) $\, \alpha_i g_i(x^*)=0 ,$ complementary slackness

4) $\, g_i(x^*)\geq 0 ,$ primal feasibility

If any of the above four conditions are not satisfied, then the primal function is not feasible.

##### Support Vectors

Support vectors are the training points that determine the optimal separating hyperplane that we seek. Also, they are the most difficult points to classify and at the same time the most informative for classification.

In our case, the $g_i({x})$ function is:

$\,g_i(x) = y_i(\beta^Tx_i+\beta_0)-1$

Substituting $\,g_i$ into KKT condition 3, we get $\,\alpha_i[y_i(\beta^Tx_i+\beta_0)-1] = 0$. <br\>In order for this condition to be satisfied either
$\,\alpha_i= 0$ or
$\,y_i(\beta^Tx_i+\beta_0)=1$

All points $\,x_i$ will be either 1 or greater than 1 distance unit away from the hyperplane.

Case 1: a point away from the margin

If $\,y_i(\beta^Tx_i+\beta_0) \gt 1 \Rightarrow \alpha_i = 0$.

If point $\, x_i$ is not on the margin, then the corresponding $\,\alpha_i=0$.

Case 2: a point on the margin

If $\,y_i(\beta^Tx_i+\beta_0) = 1 \Rightarrow \alpha_i \gt 0$. <br\>If point $\, x_i$ is on the margin, then the corresponding $\,\alpha_i\gt 0$.

Points on the margin, with corresponding $\,\alpha_i \gt 0$, are called support vectors.

The optimal hyperplane is determined by only a few support vectors. Since it is impossible for us to know a priori which of the training data points would end up as the support vectors, it is necessary for us to work with the entire training set to find the optimal hyperplane.

To compute $\ \beta_0$, we need to choose any $\,\alpha_i \gt 0$, this will satisfy:

$\,y_i(\beta^Tx_i+\beta_0) = 1$.

We can compute $\,\beta = \sum_i \alpha_i y_i x_i$, substitute $\ \beta$ in $\,y_i(\beta^Tx_i+\beta_0) = 1$ and solve for $\ \beta_0$.

Whatever we derived so far was based on the assumption that the data is linearly separable but that might not be the case always and it might be impossible to find a hyper-plane that completely separates two class of data . This case is called hard margin hyper-plane.

### Kernel Trick

We talked about the curse of dimensionality at the beginning of this course. However, we now turn to the power of high dimensions in order to find a hyperplane between two classes of data points that can linearly separate the transformed (mapped) data in a space that has a higher dimension than the space in which the training data points reside.

To understand this, imagine a two dimensional prison where a two dimensional person is constrained. Suppose magically we give the person a third dimension, then he can escape from the prison. In other words, the prison and the person are linearly separable now with respect to the third dimension. The intuition behind the kernel trick is basically to map data to a higher dimension in which the mapped data are linearly separable by a hyperplane, even if the original data are not linearly separable.

The original optimal hyperplane algorithm proposed by Vladimir Vapnik in 1963 was a linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin hyperplanes. The algorithm is very similar, except that every dot product is replaced by a non-linear kernel function as below. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. We have seen SVM as a linear classification problem that finds the maximum margin hyperplane in the given input space. However, for many real world problems a more complex decision boundary is required. The following simple method was devised in order to solve the same linear classification problem but in a higher dimensional space, a feature space, under which the maximum margin hyperplane is better suited.

In Machine Learning, the kernel trick is a way of mapping points into an inner product space, hoping that the new space is more suitable for classification.

Example:

$\left[\begin{matrix} \,x \\ \,y \\ \end{matrix}\right] \rightarrow\ \left[\begin{matrix} \,x^2 \\ \,y^2 \\ \, \sqrt{2}xy \\ \end{matrix}\right]$

$k(x,y)=\phi^{T}(x)\phi(y)$

$\left[\begin{matrix} \,x_1 \\ \,y_1 \\ \end{matrix}\right] \rightarrow\ \left[\begin{matrix} \,x_1^2 \\ \,y_1^2 \\ \, \sqrt{2}x_1y_1 \\ \end{matrix}\right]$

$\left[\begin{matrix} \,x_2 \\ \,y_2 \\ \end{matrix}\right] \rightarrow\ \left[\begin{matrix} \,x_2^2 \\ \,y_2^2 \\ \, \sqrt{2}x_2y_2 \\ \end{matrix}\right]$

$\left[\begin{matrix} \,x_1^2 \\ \,y_1^2 \\ \, \sqrt{2}x_1y_1 \\ \end{matrix}\right] ^{T} * \left[\begin{matrix} \,x_2^2 \\ \,y_2^2 \\ \, \sqrt{2}x_2y_2 \\ \end{matrix}\right] = K(\left[\begin{matrix} \,x_1 \\ \,y_1 \\ \end{matrix}\right],\left[\begin{matrix} \,x_2 \\ \,y_2 \\ \end{matrix}\right] )$

$\sum_i \alpha_i - \frac{1}{2} \sum_{ij} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$

where $\mathbf{x}_i^T\mathbf{x}_j = \phi^{T}(x)\phi(y)= k(x_i,x_j)$

$\left[\begin{matrix} \,k(x_1, x_1)& \,k(x_1, x_2)& \cdots &\,k(x_1, x_n) \\ \vdots& \vdots& \vdots& \vdots\\ \,k(x_n, x_1)& \,k(x_n, x_2)& \cdots &\,k(x_n, x_n) \\ \end{matrix}\right]$

In most of the real world cases the data points are not linearly separable. How can the above methods be generalized to the case where the decision function is not a linear function of the data? Boser, Guyon and Vapnik, 1992, showed that a rather old trick (Aizerman, 1964) can be used to accomplish this in an astonishingly straightforward way. First notice that the only way in which the data appears in the dual-form optimization problem is in the form of dot products: $\mathbf{x}_i^T.\mathbf{x}_j$ . Now suppose we first use a non-linear operator $\Phi \mathbf(x)$ to map the data points to some other higher dimensional space (possibly infinite dimensional) $\mathcal{H}$ (called Hilbert space or feature space), where they can be classified linearly. Figure below illustrates this concept:

File:kernell trick.jpg
Mapping of not-linearly separable data points in a two-dimensional space to a three-dimensional space where they can be linearly separable by means of a kernel function.

In other words, a linear learning machine can be employed in the higher dimensional feature space to solve the original non-linear problem. Then of course the training algorithm would only depend on the data through dot products in $\mathcal{H}$, i.e. on functions of the form $\lt \Phi (\mathbf{x}_i),\Phi (\mathbf{x}_j)\gt$. Note that the actual mapping $\Phi \mathbf(x)$ does not need to be known, only the inner product of the mapping is needed for modifying the support vector machine such that it can separate non-linearly separable data. Avoiding the actual mapping to the higher dimensional space is preferable, because higher dimensional spaces may have problems due to the curse of dimensionality.

So the hypothesis in this case would be

$f(\mathbf{x}) = \boldsymbol{\beta}^T \Phi (\mathbf{x}) + \beta_0$

which is linear in terms of the new space that $\Phi (\mathbf{x})$ maps the data to, but non-linear in the original space. Now we can extend all the presented optimization problems for the linear case, for the transformed data in the feature space. If we define the kernel function as

$K (\mathbf{x}_i,\mathbf{x}_j) = \lt \Phi (\mathbf{x}_i),\Phi (\mathbf{x}_j)\gt = \Phi(\mathbf{x}_i)^T \Phi (\mathbf{x}_j)$

where $\ \Phi$ is a mapping from input space to an (inner product) feature space. Then the corresponding dual form is

$L(\boldsymbol{\alpha}) =\sum_{i=1}^n \alpha_i - \frac 12 \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_jy_iy_j K (\mathbf{x}_i,\mathbf{x}_j)$

subject to $\sum_{i=1}^n \alpha_i y_i=0 \quad \quad \alpha_i \geq 0,\quad i=1, \cdots, n$

The cost function $L(\boldsymbol{\alpha})$ is convex and quadratic in terms of the unknown parameters. This problem is solved through quadratic programming. The KKT conditions for this equation lead to the following final decision rule:

$L(\mathbf{x}, \boldsymbol{\alpha}^{\ast}, \beta_0) =\sum_{i=1}^{N_{sv}} y_i \alpha_i^{\ast} K (\mathbf{x}_i,\mathbf{x}) + \beta_0$

where $\ N_{sv}$ and $\ \alpha_i$ denote number of support vectors and the non-zero Lagrange multipliers corresponding to the support vectors respectively.

Several typical choices of kernels are linear, polynomial, Sigmoid or Multi-Layer Perceptron (MLP) and Gaussian or Radial Basis Function (RBF) kernel. Their expressions are as following:

Linear kernel: $K (\mathbf{x}_i,\mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j$

Polynomial kernel: $K (\mathbf{x}_i,\mathbf{x}_j) = (1 + \mathbf{x}_i^T\mathbf{x}_j)^p$

Sigmoid (MLP) kernel: $K (\mathbf{x}_i,\mathbf{x}_j) = \tanh (k_1\mathbf{x}_i^T\mathbf{x}_j +k_2)$

Gaussian (RBF) kernel: $\ K(\mathbf{x}_i,\mathbf{x}_j) = \exp\left[\frac{-(\mathbf{x}_i - \mathbf{x}_j)^T (\mathbf{x}_i - \mathbf{x}_j)}{2\sigma^2 }\right]$

Kernel functions satisfying Mercer's conditions not only enables implicit mapping of data from input space to feature space but also ensure the convexity of the cost function which leads to the unique optimum. Mercer condition states that a continuous symmetric function $K \mathbf(x,y)$ must be positive semi-definite to be a kernel function which can be written as inner product between the data pairs. Note that we would only need to use K in the training algorithm, and would never need to explicitly even know what $\ \Phi$ is.

Furthermore, one can construct new kernels from previously defined kernels.[8] Given two kernels $K_1 (\mathbf{x}_i,\mathbf{x}_j)$ and $K_2 (\mathbf{x}_i,\mathbf{x}_j)$, properties include:

1. $K (\mathbf{x}_i,\mathbf{x}_j) = \alpha K_1 (\mathbf{x}_i,\mathbf{x}_j) + \beta K_2 (\mathbf{x}_i,\mathbf{x}_j)$ for $\alpha , \beta \geq 0$

2. $K (\mathbf{x}_i,\mathbf{x}_j) = K_1 (\mathbf{x}_i,\mathbf{x}_j) K_2 (\mathbf{x}_i,\mathbf{x}_j)$

3. $K (\mathbf{x}_i,\mathbf{x}_j) = K_1 (f ( \mathbf{x}_i ) ,f ( \mathbf{x}_j ) )$ where $\, f \colon X \rightarrow X$

4. $K (\mathbf{x}_i,\mathbf{x}_j) = f ( K_1 ( \mathbf{x}_i , \mathbf{x}_j )$ where $\, f$ is a polynomial with positive coefficients.

In the case of Gaussian or RBF kernel for example, $\mathcal{H}$ is infinite dimensional, so it would not be very easy to work with $\Phi$ explicitly. However, if one replaces $\lt (\mathbf{x}_i). (\mathbf{x}_j)\gt$ by $K (\mathbf{x}_i,\mathbf{x}_j)$ everywhere in the training algorithm, the algorithm will happily produce a support vector machine which lives in an infinite dimensional space, and furthermore do so in roughly the same amount of time it would take to train on the un-mapped data. All the considerations of the previous sections hold, since we are still doing a linear separation, but in a different space.

The choice of which kernel would be best for a particular application has to be determined through trial and error. Normally, the Gaussian or RBF kernel are best suited for classification tasks including SVM.

The video below shows a graphical illustration of how a polynomial kernel works to a get better sense of kernel concept:

#### Kernel Properties

Kernel functions must be continuous, symmetric, and most preferably should have a positive (semi-) definite Gram matrix. The Gram matrix is the matrix whose elements are $\ g_{ij} = K(x_i,x_j)$. Kernels which are said to satisfy the Mercer's theorem are positive semi-definite, meaning their kernel matrices have no non-negative Eigen values. The use of a positive definite kernel ensures that the optimization problem will be convex and solution will be unique. <ref> Reference:http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html#kernel_properties</ref>

Furthermore, kernels can be categorized into classes based on their properties <ref name="Genton"> M. G. Genton, "Classes of Kernels for Machine Learning: A Statistics Perspective," Journal of Machine Learning Research 2, 2001</ref>:

• Nonstationary kernels are explicitly dependent on both inputs (e.g., the polynomial kernel).
• Stationary kernels are invariant to translation (e.g., the Gaussian kernel which only looks at the distance between the inputs).
• Reducible kernels are nonstationary kernels that can be reduced to stationary kernels via a bijective deformation (for more detailed information see <ref name = "Genton" />).

#### Further Information of Kernel Functions

Beyond the kernel functions we discussed in class, such as Linear Kernel, Polynomial Kernel and Gaussian Kernel functions, many more kernel functions can be used in the application of kernel methods for machine learning.

Some examples are: Exponential Kernel, Laplacian Kernel, ANOVA Kernel, Hyperbolic Tangent (Sigmoid) Kernel, Rational Quadratic Kernel, Multiquadric Kernel, Inverse Multiquadric Kernel, Circular Kernel, Spherical Kernel, Wave Kernel, Power Kernel, Log Kernel, Spline Kernel, B-Spline Kernel, Bessel Kernel, Cauchy Kernel, Chi-Square Kernel, Histogram Intersection Kernel, Generalized Histogram Intersection Kernel, Generalized T-Student Kernel, Bayesian Kernel, Wavelet Kernel, etc.

### Case 2: Linearly Non-Separable Data (Soft Margin)

The original SVM was specifically made for separable data. But, this is a very strong requirement, so it was suggested by Vladimir Vapnik and Corinna Cortes later on to remove this requirement. This is called Soft Margin Support Vector Machine. One of the advantages of SVM is that it is relatively easy to generalize it to the case that the data is not linearly separable.

In the case when 2 data sets are not linearly separable, it is impossible to have a hyperplane that completely separates 2 classes of data. In this case the idea is to minimize the number of points that cross the margin and are miss-classified .So we are going to minimize that are going to violate the constraint:

$\, y_i(\beta^T x_i + \beta_0) \geq 1$

Hence we allow some of the points to cross the margin (or equivalently violate our constraint) but on the other hand we penalize our objective function (so that the violations of the original constraint remains low):

$\, min (1/2 |\beta|^2 +\gamma \sum_i \zeta_i)$

And now our constraint is as follows:

$\, y_i(\beta^T x_i + \beta_0) \geq 1-\zeta_i$

$\, \zeta_i \geq 0$

We have to check that all KKT conditions are satisfied:

$\, \mathcal{L}(\beta,\beta_0,\zeta_i,\alpha_i,\lambda_i)=1/2|\beta|^2+\gamma \sum_i \zeta_i -\sum_i \alpha_i(y_i(\beta^T x_i +\beta_0)-1+\zeta_i) - \sum_i \lambda_i \zeta_i$

$\, 1) \frac{\partial\mathcal{L}}{\partial \beta}=\beta-\sum_i \alpha_i y_i x_i \rArr \beta=\sum_i \alpha_i y_i x_i$

$\, 2) \frac{\partial\mathcal{L}}{\partial \beta_0}=\sum_i \alpha_i y_i =0$

$\, 3) \frac{\partial\mathcal{L}}{\partial \zeta_i}=\gamma - \alpha_i - \lambda_i$

Now we have to write this into a Lagrangian form.

## Support Vector Machine Continued (Lecture: Nov. 3, 2011)

### Case 2: Linearly Non-Separable Data (Soft Margin [9]) Continued

Recall from last time that soft margins are used instead of hard margins when we are using SVM to classify data points that are not linearly separable.

##### Soft Margin SVM Derivation of Dual

The soft-margin SVM optimization problem is defined as:

$\min( \frac{1}{2}|\boldsymbol{\beta}|^2 + \gamma\sum_i \zeta_i)$ subject to the constraints $y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) \ge 1-\zeta_i$,

where $\gamma\sum_i \zeta_i$ is the penalty function that penalizes the slack variable. we have $\zeta_i \ge 0$.

In other words, we have relaxed the constraint for each $\boldsymbol{x_i}$ so that it can violate the margin by an amount $\zeta_i$. As such, we want to make sure that all $\zeta_i$ values are as small as possible. So, we penalize them in the objective function by a factor of some chosen $\gamma$.

##### Forming the Lagrangian

In this case we have have two constraints in the Lagrangian primal form and therefore we optimize with respect to two dual variables $\, \alpha$ and $\,\lambda$,

$L(\boldsymbol{\beta},\beta_0,\zeta_i,\alpha_i,\lambda_i) = \frac{1}{2} |\boldsymbol{\beta}|^2 + \gamma \sum_i \zeta_i - \sum_i \alpha_i [y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0)-1+\zeta_i] - \sum_i \lambda_i \zeta_i$ where $- \sum_i \alpha_i [y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0)-1+\zeta_i] = -\boldsymbol{\beta}^T\sum_i\alpha_i y_i x_i-\beta_0\sum_i\alpha_iy_i+\sum_i\alpha_i-\sum_i\alpha_i\zeta_i$

##### Apply KKT conditions

\begin{align} 1) &\frac{\partial \mathcal{L}}{\partial \boldsymbol{\beta}} = \boldsymbol{\beta}-\sum_i \alpha_i y_i \boldsymbol{x_i} = 0 \\ & \implies \boldsymbol{\beta} = \sum_i \alpha_i y_i \boldsymbol{x_i} \\ &\frac{\partial \mathcal{L}}{\partial \beta_0} = \sum_i \alpha_i y_i = 0 \\ &\frac{\partial \mathcal{L}}{\partial \zeta_i} = \gamma - \alpha_i - \lambda_i = 0 \\ & \implies \boldsymbol{\gamma} = \alpha_i - \lambda_i \\ 2) &\alpha_i \ge 0, \lambda_i \ge 0, dual feasibility \\ 3) &\alpha_i [y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0)-1+\zeta_i] = 0, and \lambda_i \zeta_i = 0 \\ 4) &y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) \ge 1-\zeta_i, \zeta_i \ge 0 \\ \end{align}

##### Objective Function

Simplifying the Lagrangian the same way we did with the hard margin case, we get the following:

\begin{align} L &= \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} + \gamma \sum_i \zeta_i - \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} - \beta_0 \sum_i \alpha_i y_i + \sum_i \alpha_i - \sum_i \alpha_i \zeta_i - \sum_i \lambda_i \zeta_i \\ &= -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} + \sum_i \alpha_i - 0 + (\sum_i \gamma \zeta_i - \sum_i \alpha_i \zeta_i - \sum_i \lambda_i \zeta_i) \\ &= -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} + \sum_i \alpha_i + \sum_i (\gamma - \alpha_i - \lambda_i) \zeta_i \\ &= -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} + \sum_i \alpha_i \end{align}

subject to the constaints:

$\alpha_i \ge 0$

$\displaystyle\sum_i \alpha_i y_i = 0$

$\lambda_i \ge 0$

Notice that the Lagrangian is the exact same as the hard margin case. The only difference with the soft margin case is the additional constraint $\lambda_i \ge 0$. However, $\gamma$ doesn't actually appear directly in the objective function. But, we can discern the following:

$\lambda_i = 0 \implies \alpha_i = \gamma$

$\lambda_i \gt 0 \implies \alpha_i \lt \gamma$

Thus, we can derive that the only difference with the soft margin case is the constraint $0 \le \alpha_i \le \gamma$. This problem can be solved with quadratic programming.

##### Soft Margin SVM Formulation Summary

In summary, the primal form of the soft-margin SVM is given by:

\begin{align} \min_{\boldsymbol{\beta}, \boldsymbol{\zeta}} \quad & \frac{1}{2}|\boldsymbol{\beta}|^2 + \gamma\sum_i \zeta_i \\ \text{s.t. } & y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) \ge 1-\zeta_i, \qquad i=1,...,M \end{align}

The corresponding dual form which we derived above is:

\begin{align} \max_{\boldsymbol{\alpha}} \quad & \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \boldsymbol{x_i}^T \boldsymbol{x_j} \\ \text{s.t. } & \sum_i \alpha_i y_i = 0 \\ & 0 \le \alpha_i \le \gamma, \qquad i=1,...,M \end{align}

Note, the soft-margin dual objective is identical to hard margin dual objective! The only difference is now $\,\alpha_i$ variables cannot be unbounded and are restricted to be a maximum of $\,\gamma$. This restriction allows the optimization problem to become feasible when the data is non-seperable. In the hard-margin case, when $\,\alpha_i$ is unbounded there may be no finite maximum for the objective and we would not be able to converge to a solution.

Also note, $\,\gamma$ is a model parameter and must be chosen to a fixed constant. It controls the size of margin versus violations. In a data set with a lot of noise (or non-seperability) you may want to choose a smaller $\,\gamma$ to ensure a large margin. In practice, $\,\gamma$ is chosen by cross-validation---which tests the model on a held out sample to determine which $\,\gamma$ gives the best result. However, it may be troublesome to work with $\,\gamma$ since $\,\gamma \in (0, \infty)$. So often a variant formulation, known as $\,\nu$-SVM is used which uses a better scaled parameter $\,\nu \in (0,1)$ instead of $\,\gamma$ to balance margin versus separability.

Finally note that as $\,\gamma \rightarrow \infty$, the soft-margin SVM converges to hard-margin, as we do not allow any violation.

##### Soft Margin SVM Problem Interpretation

Like in the case of hard-margin the dual formulation for soft-margin given above allows us to interpret the role of certain points as support vectors.

We consider three cases:

Case 1: $\,\alpha_i=\gamma$

From KKT condition 1 (third part), $\,\gamma - \alpha_i - \lambda_i = 0$ implies $\,\lambda_i = 0$.

From KKT condition 3 (second part) $\,\lambda_i \zeta_i = 0$ this now suggests $\,\zeta_i \gt 0$.

Thus this is a point that violates the margin, and we say $\,x_i$ is inside the margin.

Case 2: $\,\alpha_i=0$

From KKT condition 1 (third part), $\,\gamma - \alpha_i - \lambda_i = 0$ implies $\,\lambda_i \gt 0$.

From KKT condition 3 (second part) $\,\lambda_i \zeta_i = 0$ this now implies $\,\zeta_i = 0$.

Finally, from KKT condition 3 (first part), $y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) \gt 1-\zeta_i$, and since $\,\zeta_i = 0$, the point is classified correctly and we say $\,x_i$ is outside the margin. In particular, $\,x_i$ does not play a role in determining the classifier and if we ignored it, we would get the same result.

Case 3: $\,0 \lt \alpha_i \lt \gamma$

From KKT condition 1 (third part), $\,\gamma - \alpha_i - \lambda_i = 0$ implies $\,\lambda_i \gt 0$.

From KKT condition 3 (second part) $\,\lambda_i \zeta_i = 0$ this now implies $\,\zeta_i = 0$.

Finally, from KKT condition 3 (first part), $y_i(\boldsymbol{\beta}^T \boldsymbol{x_i} + \beta_0) = 1-\zeta_i$, and since $\,\zeta_i = 0$, the point is on the margin and we call it a support vector.

These three scenarios are depicted in Fig..

Case 4: if $\zeta_i \ge 0$ implies $\lambda_i=0$ this now implies $\alpha_i=\gamma$ from which we know that $y_1(\beta^*\mathbf{x}+\beta_0)\ge 1-\zeta_i$ it is closer to the boundary, so $x_i$ is inside the margin.

##### Soft Margin SVM with Kernel

Like hard-margin SVM, we can use the kernel trick to find a non-linear classifier using the dual formulation.

In particular, we define a non-linear mapping for $\boldsymbol{x_i}$ as $\Phi(\boldsymbol{x_i})$, then in dual objective we compute $\Phi^T(\boldsymbol{x_i}) \Phi(\boldsymbol{x_j})$ instead of $\boldsymbol{x_i}^T \boldsymbol{x_j}$. Using a kernel function $K(\boldsymbol{x_i}, \boldsymbol{x_j}) = \Phi^T(\boldsymbol{x_i}) \Phi(\boldsymbol{x_j})$ from the list provided in the previous lecture notes, we then do not need to explicitly map $\Phi(\boldsymbol{x_i})$.

The dual problem we solve is:

\begin{align} \max_{\boldsymbol{\alpha}} \quad & \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\boldsymbol{x_i}, \boldsymbol{x_j}) \\ \text{s.t. } & \sum_i \alpha_i y_i = 0 \\ & 0 \le \alpha_i \le \gamma, \qquad i=1,...,M \end{align}

where $\, K(\boldsymbol{x_i}, \boldsymbol{x_i})$ is an appropriate kernel function specification.

To make it clear why we do not need to explicitly map $\Phi(\boldsymbol{x_i})$: If we use the kernel trick, both hard- and soft-margin SVMs find the following value for the optimum $\boldsymbol{\beta}$:

$\boldsymbol{\beta} = \sum_i \alpha_i y_i \Phi(\boldsymbol{x_i})$

From the definition of the classifier, the class labels for points are given by:

$\boldsymbol{\beta}^T \Phi(\boldsymbol{x}) + \beta_0$

Plugging the formula for $\boldsymbol{\beta}$ in the expression above we get:

$\sum_i \alpha_i y_i \Phi(\boldsymbol{x_i}) \Phi(\boldsymbol{x}) + \Beta_0$

which, from the properties of kernel functions, is equal to:

$\sum_i \alpha_i y_i K(\boldsymbol{x_i}, \boldsymbol{x_i}) + \Beta_0$

Thus, we do not need to explicitly map $\boldsymbol{x_i}$ to a higher dimension.

##### Soft Margin SVM Implementation

The SVM optimization problem is a quadratic program and we can use any quadratic solver to accomplish this. For example, matlab's optimization toolbox provides quadprog. Alternatively, CVX (by Stephen Boyd) is an excellent optimization toolbox that integrates with matlab and allows one to enter convex optimization problems as though they are written on paper (and it is free).

We prefer to solve the dual since it is an easier problem (and also allows to use a Kernel). Using CVX this would be coded as

K = X*X';           % Linear kernel
H = (y*y') .* K;
cvx_begin
variable  alpha(M,1);
maximize  (sum(alpha) - 0.5*alpha'*H*alpha)
subject to
y'*alpha == 0;
alpha >= 0;
alpha <= gamma
cvx_end


which provides us with optimal $\,\boldsymbol{\alpha}$.

Now we can obtain $\,\beta_0$ by using any point on the margin (i.e. $\,0 \lt \alpha_i \lt \gamma$), and solving

$y_i \left(\sum_j y_j \alpha_j K(\boldsymbol{x_j}, \boldsymbol{x_i}) + \beta_0 \right) = 1$

Note, $\,K(\boldsymbol{x_i}, \boldsymbol{x_j}) = \boldsymbol{x_i}^T \boldsymbol{x_j}$ can also be the linear kernel.

Finally, we can classify a new data point $\,\boldsymbol{x}$, according to

$h(\boldsymbol{x}) = \begin{cases} +1, \ \ \text{if } \sum_j y_j \alpha_j K(\boldsymbol{x_j}, \boldsymbol{x}) + \beta_0 \gt 0\\ -1, \ \ \text{if } \sum_j y_j \alpha_j K(\boldsymbol{x_j}, \boldsymbol{x}) + \beta_0 \lt 0 \end{cases}$

##### Intuitive Connection to Hard Margin Case

The form of the dual in both the Hard Margin & Soft Margin case are exceedingly; the only difference is a further restriction($\ \alpha_i \lt \gamma$) on the dual variable. You could even implement the soft margin problem to solve a case where the hard margin problem is feasible. This is not typically done but doing can give considerable insight into how the soft margin problem reacts to changes in $\ \gamma$. If we let $\ \gamma \to +\infty$ we see that the soft margin problem approaches the hard margin problem. If we examine the primal problem this matches our intuitive expectation. As $\ \gamma \to +\infty$ the penalty for being inside the margin increases to infinity and thus the optimal solution will place paramount importance of having a hard margin.

When choosing $\ \gamma$ one needs to be careful and understand the implications. Values of $\ \gamma$ that are too large will result in slavish dedication to getting as close to a hard margin as possible. This can result in poor decisions especially if there are outliers involved. Values of $\ \gamma$ that are too small do not adequate punish the problem for misclassifying points. It is important to both test different values for $\ \gamma$ and to exercise discretion when selecting possible values of $\ \gamma$ to test. It is also important to examine the impact of outliers as their impact can be extremely destructive to the usefulness of the SVM classifier.

### Multiclass Support Vector Machines

A number of approaches exist for applying SVMs to classification tasks with more than two classes. The most straightforward approach given $\ k$ classes, is to use $\ k$ separate SVMs trained using the one-versus-the-rest approach, which basically trains the $\ i^{th}$ SVM such that the $\ i^{th}$ class is the "first class" (where $\ i = 1, 2, ... k$) and the remaining $\ k-1$ classes are the "second class" for the $\ i^{th}$ SVM <ref name="CMBishop" />. Problems arise with this method as a point may be classified as different classes by each of the SVMs, because the class boundaries found by all SVMs may not be consistent such that each point belongs only to one class. Furthermore, the ratio of data points between the two classes in the one-versus-the-rest approach is not well balanced (the $\ k-1$ classes are likely to have a lot more points than the $\ i^{th}$ class) <ref name="CMBishop" />.

Variations of the one-versus-the-rest approach exist to address its problems:

• The class labels of each SVM can be changed from +1 and -1 to +1 (for the $\ i^{th}$ class) and $\ \frac{-1}{k-1}$ (for the remaining $\ k-1$ classes) in an attempt to counterbalance the unbalanced number of data points for each class <ref>Y. Lee, Y. Lin, and G. Wahba, "Multicategory Support Vector Machines," Technical Report, 2001</ref>.
• The $\ k$ SVMs can be trained at the same time using one objective function <ref>J. Weston and C. Watkins, "Multi-class Support Vector Machines," Proceedings ESANN, 1999</ref>.

Another approach for applying SVM to the multiclass case is called one-versus-one, where $\ \frac{k(k-1)}{2}$ separate SVMs are trained on all possible pair-wise combinations of classes. New points will then belong into the class with the most number of votes from all the separate SVMs. This method may still lead to ambiguities in classification and it requires a lot more training time since $\ \frac{k-1}{2}$ times as many SVMs are required for the one-versus-one than for the one-versus-the-rest approach <ref name="CMBishop" />.

### Advantages of Support Vector Machines

• SVMs provide a good out-of-sample generalization. This means that, by choosing an appropriate generalization grade,

SVMs can be robust, even when the training sample has some bias. This is mainly due to selection of optimal hyperplane.

• SVMs deliver a unique solution, since the optimality problem is convex. This is an advantage compared

to Neural Networks, which have multiple solutions associated with local minima and for this reason may not be robust over different samples.

• State-of-the-art accuracy on many problems.
• SVM can handle any data types by changing the kernel.

### Disadvantages of Support Vector Machines

• Difficulties in choice of the kernel (Which we will study about in future).
• limitation in speed and size, both in training and testing
• Discrete data presents another problem, although with suitable rescaling excellent results have nevertheless been obtained.
• The optimal design for multiclass SVM classifiers is a further area for research.
• A problem with SVMs is the high algorithmic complexity and extensive memory requirements of the required quadratic programming in large-scale tasks.

### Comparison with Neural Networks <ref>www.cs.toronto.edu/~ruiyan/csc411/Tutorial11.ppt</ref>

1. Neural Networks:
1. Hidden Layers map to lower dimensional spaces
2. Search space has multiple local minima
3. Training is expensive
4. Classification extremely efficient
5. Requires number of hidden units and layers
6. Very good accuracy in typical domains

1. SVMs
1. Kernel maps to a very-high dimensional space
2. Search space has a unique minimum
3. Training is extremely efficient
4. Classification extremely efficient
5. Kernel and cost the two parameters to select
6. Very good accuracy in typical domains
7. Extremely robust

### The Naive Bayes Classifier

The naive Bayes classifier is a very simple (and often effective) classifier based on Bayes rule. For further reading check [10]

Bayes assumption is that all the features are conditionally independent given the class label. Even though this is usually false (since features are usually dependent), the resulting model is easy to fit and works surprisingly well.

Each feature or variable $\,x_{ij}$ is independent for $\,j = 1, ..., d$, where $\, \mathbf{x}_i \in \mathbb{R}^d$.

Thus the Bayes classifier is $h(\mathbf{x}) = \arg\max_k \quad \pi_k f_k(\mathbf{x})$

where $\hat{f}_k(\mathbf{x}) = \hat{f}_k(x_1 x_2 ... x_d)= \prod_{j=1}^d \hat{f}_{kj}(x_j)$.

We can see this a direct application of Bayes rule $P(Y=k|X=\mathbf{x}) =\frac{P(X=\mathbf{x}|Y=y) P(Y=y)} {P(X=\mathbf{x})} = \frac{f_k(\mathbf{x}) \pi_k} {\sum_k f_k \pi_k}$,

with $\, f_k(\mathbf{x})=f(x^1)f(x^2)...f(x^d)$ and $\ \mathbf{x} \in \mathbb{R}^d$.

Note, earlier we assume class-conditional densitites which were multivariate normal with a dense covariance matrix. In this case we are forcing the covariance matrix to be a diagonal. This simplification, while not realistic, can provide a more robust model.

As another example, consider the 'iris' dataset in R. We would like to use known data (sepal length, sepal width, petal length, and petal width) to predict species of iris. As is typically done, we will use the maximum a posteriori (MAP) rule to decide the class to which each observation belongs. The code for using a built-in function in R to classify is:

 #If you were to use a built-in function for Naive Bayes Classification,
#this is how it would work:

library(lattice) #these are the libraries from which packages are needed
library(class)
library(e1071)

count = 0 #This will keep track of properly classified objects
attach(iris)
model <- (Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width)
m <- naiveBayes(model, data = iris)
p <- predict(m, iris) 			#You could also use a table here
for(i in 1:length(Species)) {
if (p[i] == Species[i]) {
count = count + 1
}}
misclass = (length(Species)-count)/length(Species)
misclass
#So we get that 4% of the points are misclassified.


In this particular dataset, we would not expect naïve Bayes to be the best approach for classification, since the assumption of independent predictor variables is violated (sepal length and sepal width are related, for example). However, misclassification rate is low, which indicates that naïve Bayes does a good job of classifying these data.

### K-Nearest-Neighbors(k-NN)

Classifying x by assigning it the label most frequently represented among k nearest samples and use a voting scheme.

Given a data point x, find the k nearest data points to x and classify x using the majority vote of these k neighbors (k is a positive integer, typically small.) If k=1, then the object is simply assigned to the class of its nearest neighbor.

1. Ties can be broken randomly.
2. k can be chosen by cross-validation
3. k-nearest neighbor algorithm is sensitive to the local structure of the data<ref>
1. Nearest neighbor rules in effect compute the decision boundary in an implicit manner.
##### Requirements of k-NN:
1. An integer k
2. A set of labeled examples (training data)
3. A metric to measure “closeness”
1. Able to obtain optimal solution in large sample.
2. Simple implementation
3. There are some noise reduction techniques that work only for k-NN to improve the efficiency and accuracy of the classifier.
1. If the training set is too large, it may have poor run-time performance.
2. k-NN is very sensitive to irrelevant features since all features contribute to the similarity and thus to classification.<ref>
1. small training data can lead to high misclassification rate.
2. kNN suffers from the curse of dimensionality. As the number of dimensions of the feature space increases, points become further apart from each other, making it harder to classify new points. Algorithms to solve this problem include approximate nearest neighbour. <ref>P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality. STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing. pg 604-613.</ref>
##### Extensions and Applications

In order to improve the obtained results, we can do following:

1. Preprocessing: smoothing the training data (remove any outliers and isolated points)

Besides classification, k-nearest-neighbours is useful for other tasks as well. For example, the k-NN has been used in Regression or Product Recommendation system<ref> http://www.cs.ucc.ie/~dgb/courses/tai/notes/handout4.pdf</ref>.

SVM is commonly used in Bioinformatics. Common uses include classification of DNA sequences and promoter recognition and identifying disease-related microRNAs. Promoters are short sequences of DNA that act as a signal for gene expression. In one paper, Robertas Damaševičius tries using a power series kernel function and 11 classification rules for data projection to classifty these sequences, to aid active gene location.<ref>Damaševičius, Robertas. "Analysis of Binary Feature Mapping Rules for Promoter Recognition in Imbalanced DNA Sequence Datasets using Support Vector Machine". Proceedings from 4th International IEEE Conference "Intelligent Systems". 2008.</ref> MicroRNAs are non-coding RNAs that target mRNAs for cleavage in protein synthesis. There is growing evidence suggesting that mRNAs "play important roles in human disease development, progression, prognosis, diagnosis and evaluation of treatment response". Therefore, there is increasing research in the role of mRNAs underlying human diseases. SVM has been proposed as a method of classifying positive mRNA disease-associations from negative ones.<ref>Jiang, Qinghua; Wang, Guohua; Zhang, Tianjiao; Wang, Yadong. "Predicting Human microRNA-disease Associations Based on Support Vector Machine". Proceedings from IEEE International Conference on Bioinformatics and Biomedicine. 2010.</ref>

##### Selecting k

Generally speaking, a large k classifies data more precisely than a smaller k as it reduces the overall noise. But as k increases so does the complexity of computation. To determine an optimal k, cross-validation can be used.<ref>http://chem-eng.utoronto.ca/~datamining/dmc/k_nearest_neighbors_reg.htm</ref> Traditionally, k is fixed for each test example. Another approach, namely Adaptive k-nearest neighbor algorithm, was proposed to improve the selection of k. In the algorithm, k is not a fixed number but is dependent on the nearest neighbour of the data point. In training phase, the algorithm calculates the optimal k for each training data point, which is the minimum number of neighbors required to get the correct class label. In the testing phase, it finds out the nearest neighbor of the testing data point and its corresponding optimal k. Then it performs the k-NN algorithm using such k to classify the data point. <ref>Shiliang Sun, Rongqing Huang, "An adaptive k-nearest neighbor algorithm", 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2010.</ref>

1- SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition here

2- SVM application listhere

3- The kernel trick for distances here

4- Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry here

5- General overview of SVM and Kernel Methods. Easy to understand presentation. here

## Supervised Principal Component Analysis (Lecture: Nov. 8, 2011)

Recall that PCA finds the direction of maximum variation and may be used as a pre-processing operation for classification. FDA is a form of supervised dimensionality reduction or feature extraction that finds the direction in which to project the data in order for the data points to be easily separated into their respective classes. PCA differs from FDA because it does not use the labels of the data but FDA uses the labels of the data and thus, is better at finding the directions separating the data points for classification.

Supervised PCA (SPCA) is a generalization of PCA, i.e. PCA is just a special case of SPCA. SPCA can use label information for classification tasks and it has some advantages over FDA. For example, FDA will only project into $\ k-1$ dimensional space regardless of the dimensionality of the data where $\ k$ is the number of classes. This is not always desirable for dimensionality reduction.

Supervised PCA estimatest the sequence of principal components having the maximum dependency on the response variable. It can be solved in closed-form, has a dual formulations that reduces the computational complexity when the dimension of the data is significantly greater than the number of data points, and it can be kernelized. <ref>Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri. Supervised Principal Component Analysis: Visualization, Classification and Regression on Subspaces and Submanifolds , Journal of Pattern Recognition, to appear 2011</ref>

### SPCA Problem Statement

Suppose we are given a set of data $\ \{x_i, y_i\}_{i=1}^n , x_i \in R^{p}, y_i \in R^{l}$ this is a more general than just having one label for each data point, because $\ y$ here could accept any value in $\ R^{l}$. So the assumption of having only discrete values for labels is relaxed here, which means this model can be used for regression as well. Target values ($\ y$) don't have to be in a one dimensional space. Just as for PCA, we are looking for a lower dimensional subspace $\ S = U^T X$, where $\ U$ is an orthogonal projection. However, instead of finding the direction of maximum variation, we are looking for the subspace that contains as much predictive information about $\ Y$ as the original covariate $\ X$, i.e. $\ P(Y|X)=P(Y|U^TX)$. We know that the predictive information must exist between the original covariate $\ X$ and $\ Y$, which are drawn independently from the distribution $\ \{x_i, y_i\}_{i=1}^n$, because if they are completely independent there is no way of doing classification or regression.

### Warning

If we project our data into a high enough dimension, we can fit any data - even noise. In his book "The God gene: how faith is hardwired into our genes", Dean H. Hamer discusses how factor analysis (model which "uses regression modelling techniques to test hypotheses producing error terms" <ref>use regression modelling techniques to test hypotheses producing error terms</ref>) was used to find a correlations between the gene (VMAT2) and a person's belief in God. The full book is available at: <ref>http://books.google.ca/books?id=TmR6uAAHEssC&pg=PA33&lpg=PA33&dq=god+gene+statistics&source=bl&ots=8q-jSwKZ8O&sig=O8OBe2YaPbE0vMp9A6PxEC9DwL0&hl=en&ei=lWO8Tp_nN4H40gGA2uXjBA&sa=X&oi=book_result&ct=result&resnum=2&ved=0CCEQ6AEwAQ#v=onepage&q&f=false </ref>.

It is actually quite a common problem to find far fetched correlations. One such study found correlations between people shopping habits and their genetics. Family members were shown to have far more similar consumer habits than those who did not share DNA. This was then used to explain "fondness for specific products such as chocolate, science-fiction movies, jazz, hybrid cars and mustard." <ref>http://www.businessnewsdaily.com/genetics-incluence-shopping-habits-0593/</ref>.

### Different Techniques for Dimensionality Reduction

• Classical Fisher's Discriminant Analysis (FDA)

The goal of FDA is to reduce the dimensionality of data in order to have separable data points in a new space.

• Metric Learning (ML)

This is a large family of methods.

• Sufficient Dimensionality Reduction (SDR)

This is also a family of methods. In recent years SDR has been used to denote a body of new ideas and methods for dimension reduction. Like Fisher's classical notion of a sufficient statistic, SDR strives for reduction without loss of information. But unlike sufficient statistics, sufficient reductions may contain unknown parameters and thus need to be estimated.

• Supervised Principal Components (BSPC)

A method proposed by Bair et al. This is a different method from the SPCA method discussed in class despite having a similar name.

### Metric Learning

First define a new metric as:

$\ d_A(x_i, x_j)=||x_i -x_j|| = \sqrt{(x_i - x_j)^TA(x_i - x_j)}$

This metric will only satisfy the requisite properties of a metric if $\ A$ is a positive definite matrix. This restriction is often relaxed to positive semi-definate. Relaxing this condition may be required if we wish to disregard uninformative covariated.

Note 1: $\ A$ being positive semi-definite ensures that this metric respects non-negativity and the triangle inequality, but allows $\ d_A(x_i,x_j)=0$ to not imply $\ x_i=x_j$ <ref name="Xing">Xing, EP. Distance metric learning with application to clustering with side-information. [11]</ref>.

Common choices for A

1)$\ A=I$ This represents Euclidean distance.

2)$\ A=D$ where $\ D$ is a diagonal matrix. The diagonal values can be thought of reweighting the importance of each covariate and these weights are learned can be learned from training data.

3)$\ A=D$ where $\ D$ is a diagonal matrix with $\ D_{ii} = Var(i^{th} covariate)^{-1}$ This represents scaling down each covariate so that they all have equal variance and thus have equal impact on the distance. This metric is consistant with and works very well for covariates that are independant and normally distributed.

4)$\ A=\Sigma^{-1}$ where $\ \Sigma$ is the covariance matrix for your set of covariates. This metric is consistant with and works very well for covariates that are normally distributed. The corresponding metric is called Mahalanobis distance.

When dealing with data that are on different measurement scales using choices 3 or 4 are vastly preferable to Euclidean distance as it prevents covariates with large measurement scales from dominating the metric.

For metric learning, construct the Mahalonobis distance over the input space and use it instead of the Euclidean distance. This is really equivalent to transforming the data points using a linear transformation and then computing the Euclidean distance in the new transformed space. To see that this is true, suppose we project each data points on a subspace $\ S$ using $\ x^' = U^Tx$ and calculate the Euclidean distance:

$\ ||x_i^' - x_j^'||_2^2= (U^Tx_i -U^Tx_j)^T(Ux_i -Ux_j) = (x_i -x_j)^TU^TU(x_i -x_j)$

This is the same as Mahalanobis distance in the new space for $\ A=UU^T$.

1 One way to find $\ A$ is to consider the set of similar pairs $\ (x_i,x_j) \in S$ and the set of dissimilar pairs $\ (x_i,x_j) \in D$. Then we can solve the convex optimization problem below <ref name="Xing" />.

$min_A \sum_{(x_i,x_j)\in S} (x_i - x_j)^TA(x_i - x_j)$

s.t. $\sum_{(x_i,x_j)\in D} (x_i - x_j)^TA(x_i - x_j)\ge 1$ and $\ A$ positive semi-definite.

Overall, the metric learning technique will attempt to minimize the squared induced distance between similar points while maximizing the squared induced distance between dissimilar points and search for a metric which allows points from the same class to be near one another and points from different classes to be far from one another.

### Sufficient Dimensionality Reduction (SDR)

The goal of dimensional reduction is to find a function $\ S(\mathbf{x})$ that maps $\ \mathbf{x}$ from $\ \mathbb{R}^n$ to a proper subspace, which means that the dimension of $\ \mathbf{x}$ is being reduced. An example of $\ S(\mathbf{x})$ would be a function that uses several linear combinations of $\ \mathbf{x}$.

For a dimensional reduction to have sufficiency the following statement must hold:

$\ P_{Y|X}(y|x) = P_{Y|S(X)}(y|S(x))$

Which is equivalent to saying that the distribution of $\ y|S(\mathbf{x})$ is the same as $\ y |\mathbf{x}$ [12]

This method aims to find a linear subspace $\ R$ such that the projection onto this subspace preserves $\ P_{Y|X}(y|x)$.

Suppose that $\ S(\mathbf{x}) = U^T\mathbf{x}$ is a sufficient dimensional reduction, then

$\ P_{Y|X}(y|x) = P_{Y|U^TX}(y|U^T x)$

for all $\ x \in X$, and $\ y \in Y$, where $\ U^T X$ is the orthogonal projection of $\ X$ onto $\ R$.

#### Graphical Motivation

In a regression setting, it is often useful to summarize the distribution of $y|\textbf{x}$ graphically. For instance, one may consider a scatter plot of $y$ versus one or more of the predictors. A scatter plot that contains all available regression information is called a sufficient summary plot.

When $\textbf{x}$ is high-dimensional, particularly when the number of features of $\ X$ exceed 3, it becomes increasingly challenging to construct and visually interpret sufficiency summary plots without reducing the data. Even three-dimensional scatter plots must be viewed via a computer program, and the third dimension can only be visualized by rotating the coordinate axes. However, if there exists a sufficient dimension reduction $R(\textbf{x})$ with small enough dimension, a sufficient summary plot of $y$ versus $R(\textbf{x})$ may be constructed and visually interpreted with relative ease.

Hence sufficient dimension reduction allows for graphical intuition about the distribution of $y|\textbf{x}$, which might not have otherwise been available for high-dimensional data.

Most graphical methodology focuses primarily on dimension reduction involving linear combinations of $\textbf{x}$. The rest of this article deals only with such reductions.[13]

#### Other Methods for Reduction

Two very common examples of SDR are Sliced Inverse Regression (SIR) and Sliced Average Variance Estimation (SAVE). More information on SIR can be found here[14].

### Supervised Principal Components (BSPC)

BSPC algorithm:

1. Compute (univariate) standard regression coefficients for each feature j using the following formula:

$\ s_j=\frac{{X_j}^TY}{\sqrt{X_j^T X_j}}$

2. Reduce the data matrix $Xo$ corresponding to all the columns where $\ |S_j|\gt \theta$. Find $\ \theta$ by cross-validation.

3. Compute the first principal component of the reduced data matrix $Xo$

4. Use the principal component calculated in step (3) in a regression model or a classification algorithm to produce the outcome

Bair's SPCA is consistent. In Normal PCA as the number of data points increases PCA takes different directions for the components. However the direction of the first component of SPCA remains consistent as the number of points increase <ref>Bair E., Prediction by supervised principal components. [15]</ref>.

### Hilbert-Schmidt Independence Criterion (HSIC)

"Hilbert-Schmidt Norm of the Cross-Covariance operator" is proposed as an independence criterion in reproducing kernel Hilbert spaces (RKHSs).

The measure is refered to as Hilbert-Schmidt Indepence Criterion (HSIC).

Let $\ z:={(x_1,y_1),...,(x_n,y_n)} \in\ \mathcal{X}$x$\mathcal{Y}$ be a series of n independent observation drawn from $P_{(X,Y)}(x,y)$ . An estimator of HSIC is given by

$HSIC:=\frac{1}{(n-1)^2}Tr(KHBH)$

where H,K,B $\in\mathbb{R}^{n x n}$

$K_{ij} :=k(x_i,x_j),B_{ij}:=b(y_i,y_j), H:=I-\frac{1}{n}ee^{T}$ and where k and b are positive semidefinite kernel functions.

XH is centralized version of X ( subtracting the mean of each row):

$XH=X(I- \frac{1}{n}ee^T)=X -\frac{1}{n}Xee^T$ where each entry in row i of $\frac{1}{n}Xee^T$ is mean of $i^{th}$ row of X

$HBH$ is double centeralized version of B (subtracting mean of row and column)

We introduced a way of measuring independence between two distributions. The key idea is that good features should maximize such dependence. Feature selection for various supervised learning problems is unified under HSIC, and the solutions can be approximated using a backward-elimination algorithm. To explain this, we started by explaining how to tell if two distributions are same. Specifically, if two distributions have different mean values, then we can say right away that these are two different distributions. However, if they share a same mean value, then we need to look at second moments of these distributions, from which we can derive variance. Hence we need to look at higher dimension to tell if two distributions are equal.

It can be mathematically shown(although not done in class) that if we define a mapping,$\ \phi$ of random variable X, which maps X to higher dimension, then there exists a unique mapping between the $\ \mu_x$, which is the average of x in the higher dimension, and the distribution of X. This suggests that $\ \mu_x$ can reproduce the distribution of P.

Hence to figure out if two random variables X and Y have the same distribution, we can take the difference between E$\ \phi$(x) and E$\ \phi$(y), and take the norm of this to see if two distributions are equal. i.e. $|| E \phi (x) - E\phi(y) ||^2$ If this value is equal to 0, then we know that they have the same distribution.

Now to test the independence of $\ P_x$ and $\ P_y$ then we can use the previous formula on $\ P_{xy}$ and $\ (P_x)(P_y)$ - if it equals 0, then two distributions $\ P_x$ and $\ P_y$ are independent. The larger the difference is, then the distributions of X and Y are more different.

Utilizing this, we can find the $\ U^TX$ from $\ P(Y|X)=P(Y|U^TX)$ such that it maximizes the HSIC between $\ Y$, which implies the maximum dependence between $\ U^TX$ and $\ Y$.

you come up with index called HSIC:

$\ KHBH$

X, Y random variables.

K- kernel matrix over X.

B- kernel matrix over Y.

### Kernel Function

Kernel function will be used when calculating $|| E\phi(x) - E\phi(y) ||^2$ The possible kernel functions we can choose are:

• Linear kernel: $\,k(x,y)=x \cdot y$
• Polynomial kernel: $\,k(x,y)=(x \cdot y)^d$
• Gaussian kernel: $e^{-\frac{|x-y|^2}{2\sigma^2}}$
• Delta Kernel: $\,k(x_i,x_j) = \begin{cases} 1 & \text{if }x_i=x_j \\ 0 & \text{if }x_i\ne x_j \end{cases}$

H is a constant matrix of the form: $\ H = I - frac{1}{n}ee^T$

where, $\ e = \left( \begin{array}{c}1 \\ \\ \vdots \\ \\ 1 \\ \\ 1 \end{array} \right)$.

H centralizes any matrix that you multiply it to. So HBH makes B double centred

We wanted the transformation $\ U^TX$ such that it had the maximum dependance to Y. So we use the index HSIC to find the dependance between U^TX and Y and maximize it.

H centralize the mean of X by XH X-mu: the larger the value is, they dependence more of each other.

So basically we want to maximize $\ Tr(KHBH)$

$\ max Tr(KHBH)$

$\ max Tr(X^TUU^TXHBH)$

$\ max Tr(U^TXHBHX^TU)$

we add a constraint to solve this problem

$\ U^TU=I$

Then this is identical to PCA if $\ B=I$

### SPCA: Supervised Principle Compenent Analysis

We need to find $\ U$ to maximize $\ Tr(HKHB)$ where K is a Kernel of $\ U^T X$ (eg: $\ X^T UU^T X$) and $\ B$ is a Kernel of $\ Y$(eg: $\ Y^T Y$):

$\ X$ $\ Y$
$\ U^T X$ $\ Y$
$\ (U^T X)^T (U^T X) = X^T UU^T X$ $\ B$
   $\max \; Tr(HKHB)$
$\ \; \; = \; \max Tr(HX^T UU^T XHB)$
$\ \; \; = \; \max Tr(U^T XHBHX^T U)$
$\ subject \; to \; U^T U = I$


### Supervised Principle Components Analysis and Conventional PCA

Dimensionality Reduction of the 0-1-2 Data, Using PCA
Dimensionality Reduction of the 0-1-2 Data, Using Supervised PCA

This is idential to PCA if B = I

    $(XHBHX^T) = cov(x) = (x-\mu)(x-\mu)^T$


### SPCA

Algorithm 1
- Recover basis: Calculate $Q=XHBHX^T$ and let u=eigenvector of Q corresponding to the top d eigenvalues.
- Encode training data: $Y=U^TXH$ where Y is dxn matrix of the original data
- Reconstruct training data: $\hat{X}=UY=UU^TX$
- Encode test example: $y=U^T(x-\mu)$ where y is a d dimensional encoding of x.
- Reconstruct test example: $\hat{X}=U_y=UU^T(x-\mu)$

### Dual Supervised Principle Component Analysis

Let $Q = XHBHX^T$ and B are both PSD

      $Q = \psi\psi^T$
$B = \Delta\Delta^T$
$\psi = XH\Delta^T$


The solution for U can be expressed as singular value decomposition (SVD) of $\psi$:

      $\psi = U \Sigma V^T$
$\rightarrow \psi V = U \Sigma$
$\rightarrow \psi V \Sigma^-1 = U$
$\rightarrow \Sigma^{-1} V^T \psi^T XH$
$\rightarrow \Sigma^{-1} V^T V \Sigma^T U^T XH$


It gives a relationship between V and U. Your can replace these in the algorithm above and define everything based on V instead of U. By doing this you do not need to find eigenvectors of Q which have a high dimensionality.

Algorithm 2
Recover basis: calculate $\psi^T \psi$ and let V=eigenvector of $\psi^T \psi$ corresponding to the top d eigenvalues. Let $\Sigma$=diagonal matrix of square roots of the top d eigenvalues.

Reconstruct training data: $\hat{X}=UZ=XH\Delta^T V \Sigma^{-2}V^T\Delta H(X^T X)H$

Encode test examples: $y=U^T(x-\mu)=\Sigma^{-1}V^T \Delta H[X^T(x-\mu)]$ where y is a d dimensional encoding of x.

### Towards a Unified Network

B Constraint Component
PCA I $\omega^T \omega = I$
FDA$^{(1)}$ $B_0$ $\omega^T S_\omega \omega = I$ $S_\omega = X B_s X^T$
CFML I$^{(2)}$ $B_0 - B_s$ $\omega^T \omega = I$
CFML II$^{(2)}$ $B_0$ $\omega^T S_\omega \omega = I$ $S_\omega = X B_s X^T$

(1)$B_s=F(F^{T}F)^{-1}F^T$, (2) $B_s=\tfrac{1}{n}FF^{T}$ ,$B_D=H-B_s$, $n$ # of data points, $F$ indicator matrix of cluster, $H$ the centering matrix

### Dual Supervised PCA

B Constraint Component
KPCA I $UU^T = I$ Arbitrary
K-means I $UU^T = I, U\ge 0$ Linear

## Boosting (Lecture: Nov. 10, 2011)

Boosting is a meta-algorithm for starting with a simple classifier and improving the classifer by refitting the data giving higher weight to misclassified samples.

Suppose that $\mathcal{H}$ is a collection of classifiers. Assume that $\ y_i \in \{-1, 1\}$ and that each $\ h(x)\in \{-1, 1\}$. Start with $\ h_1(x)$. Based on how well $\ h_1 (x)$ classifies points, adjust the weights of each input and reclassify. Misclassified points are given higher weight to ensure the classifier "pays more attention" to them, to fit better in the next iteration. The idea behind boosting is to obtain a classification rule from each classifer $h_i(x)\in\mathcal{H}$, regardless of how well it classifies the data on its own (with the proviso that its performance be better than chance), and combine all of these rules to obtain a final classifier that performs well.

An intuitive way to look at boosting and the concept of weight is to think about extreme weightings. Suppose you are doing classification on a set with some points being misclassified. Suppose that any points that have been classified correctly are to be removed from the data. So the weak classifier may do a good job on these new data. This is how early versions of boosting worked, instead of re-weighting.

Adaptive Boosting (AdaBoost) was formulated by Yoav Freund and Robert Schapire. AdaBoost is defined as an algorithm for constructing a “strong” classifier as linear combination $f(x) = \sum_{t=1}^T \alpha_t h_t(x)$ of simple “weak” classifiers $\ h_t(x)$ <ref>http://www.cs.ubbcluj.ro/~csatol/mach_learn/bemutato/BenkKelemen_Boosting.pdf </ref>.

It has the following properties:

• It is a linear classifier with all its desirable properties
• It has good generalization properties
• It is a feature selector with a principled strategy (minimisation of upper bound on empirical error)
• It is close to sequential decision making

#### Algorithm Version 1

1 Set the weights $\ w_i=\frac{1}{n}, i = 1,...,n.$

2 For $\ j =1,...,J$, do the following steps:

a) Find the classifier $\ h_j: x \rightarrow \{-1,1\}$ that minimizes the weighted error $\ L_j$:
$\ h_j= arg \underset{h_j\in \mathcal{H}}{\mbox{min}} L_j$
where $\ L_j = \frac{\sum_{i=1}^{n}w_iI[y_i\ne h_j(x_i)]}{\sum_{i=1}^{n} w_i}$
$\ H$ is a set of classifiers which need to be improved and $\ I$ is
$\, I= \left\{\begin{matrix} 1 & for \quad y_i\neq h_j(x_i) \\ 0 & for \quad y_i = h_j(x_i) \end{matrix}\right.$
b) Let $\alpha_j= log(\frac{1-L_j}{L_j})$
Note that $\ \alpha$ indicates the "goodness" of the classifier, where a larger $\ \alpha$ value indicates a better classifier. Also, $\ \alpha$ is always 0 or positive as long as the classification accuracy is 0.5 or higher. For example, if working with coin flips, then $\ L_j=0.5$ and $\ \alpha=0$.
c) Update the weights:
$\ w_i \leftarrow w_i e^{\alpha_j I[y_i\ne h_j(x_i)]}$
Note that the weights are only increased for points that have been misclassified by a good classifier.

3 The final classifier is: $\ h(x) = sign (\sum_{j=1}^{J}\alpha_j h_j(x))$.

Note that this is basically an aggregation of all the classifiers found and the classification outcomes of better classifiers are weighted more using $\ \alpha$.

#### Algorithm Version 2 <ref>http://www.cs.ubbcluj.ro/~csatol/mach_learn/bemutato/BenkKelemen_Boosting.pdf</ref>

One of the main ideas of this algorithm is to maintain a distribution or set of weights over the training set. Initially, all weights are set equally, but on each round, the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training set.

• Given $\left(x_1,y_1\right),\dots,\left(x_m,y_m\right)$ where ${x_i \in X}$, ${y_i \in \{-1,+1\}}$.
• Initialize weights $D_1(i) = \frac{1}{m}$
• Iterate $t=1,\dots, T$
• Train weak learner using distribution $\ D_t$
• Get weak classifier: $h_t:X\rightarrow R$
• Choose ${\alpha_t \in R}$
• Update the weights: $D_t+1(i) = \frac {D_i e^{-\alpha_t y_i h_t(x_i)}} {Z_t}$
where $\ Z_t$ is a normalization factor (chosen so that $\ D_t+1$ will be a distribution)
• The final classifier is:
$H(x)=\mbox{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$

#### Example

In R, we can do boosting on a simulated classifer. Suppose we are working with the built-in R dataset "iris". These data consist of petal length, sepal length, petal width, and sepal width of three different species of iris. This is an adaptive boosting algorithm as applied to these data.

> crop1 <- iris[1:100,1] #the function "ada" will only handle two classes
> crop2 <- iris[1:100,2] #and the iris dataset has 3. So crop the third off.
> crop3 <- iris[1:100,3]
> crop4 <- iris[1:100,4]
> crop5 <- iris[1:100,5] #This is the response variable, indicating species of iris
> x <- cbind(crop1, crop2, crop3, crop4, crop5) #combine all the columns
> fr1 <- as.data.frame(x, row.names=NULL) #and coerce into a data frame
>
> a = 2 #number of iterations
> AdaBoostDiscrete <- ada(crop5~., data=fr1, iter=a, loss="e", type = "discrete", control = rpart.control())
Call:
ada(crop5 ~ ., data = fr1, iter = a, loss = "e", type = "discrete",
control = rpart.control())

Loss: exponential Method: discrete   Iteration: 2

Final Confusion Matrix for Data:
Final Prediction
True value  1  2
1 50  0
2  0 50

Train Error: 0

Out-Of-Bag Error:  0  iteration= 1

Additional Estimates of number of iterations:

train.err1 train.kap1
1          1

> #Since this yields "perfect" results, we may not need boosting here after all.
> #This was just an illustration of the ada function in R.


• Very simple to implement
• Fairly good generalization
• The prior error need not be known ahead of time

• Suboptimal solution
• Can over fit in presence of noise

### Other boosters

There are many other boosters such as LPBoost, TotalBoost, BrownBoost, MadaBoost, LogitBoost, stochastic boost etc. Some of these boosters, such as AdaBoost, MadaBoost and LogitBoost, can be interpreted as performing a gradient descent to minimize the convex cost function. However, a research study showed that this class of boosters are prone to random classification noise. <ref>Pillip M. Long, Rocco A. Servedio, "Random Classiﬁcation Noise Defeats All Convex Potential Boosters", 2000</ref>

### Relation to SVM

SVM and Boosting are very similar except for the way to measure the margin or the way they optimize their weight vector. SVMs use the $l_2$ norm for both the instance vector and the weight vector, while Boosting uses the $l_1$ norm for the weight vector. ie. SVMs need to use the $l_2$ norm to implicitly compute scalar products in feature space with the help of the kernel trick. No other norm can be expressed in terms of scalar products.

Although SVM and AdaBoost share some similarities. However, there are several important differences:

• Different norms can result in very different margins: In boosting or in SVM, the dimension is usually very high, this makes the difference between $l_1$ norm and $l_2$ norm can be significant enough in the margin values.

e.g suppose the weak hypotheses all have range {-1,1} and that the label y on all examples can be computed by a majority vote of k of the weak hypotheses. In this case, it can be shown that if the number of relevant weak hypotheses is a small fraction of the total number of weak hypotheses then the margin associated with AdaBoost will be much larger than the one associated with support vector machines.

• The computation requirements are different: The difference between the two methods in this regard is that SVM cor-responds to quadratic programming, while AdaBoost corresponds only to linear programming.

## Bagging

Bagging (Bootstrap aggregating) was proposed by Leo Breiman in 1994. Bagging is another meta-algorithm for improving classification results by combining the classification of ramdomly generated training sets. [17][18]

The idea behind bagging is very similar to that behind boosting. However, instead of using multiple classifiers on essentially the same dataset (but with adaptive weights), we sample from the original dataset containing m items B times with replacement, obtaining B samples each with m items. This is called bootstrapping. Then, we train the classifier on each of the bootstrapped samples. Taking a majority vote of a combination of all the classifiers, we arrive at a final classifier for the original dataset. [19]

Bagging is the effective intensice prodecure that can improve on unstable classifiers. It is most useful for highly nonlinear classifiers, such as trees.

As we know the idea of boosting is to incorporate unequal weights in learning h given higher weight to misclassified points. Bagging is a method for reducing the variability of a classifier. The idea is to train classifiers $\ h_{1}(x)$ to $\ h_{B}(x)$ using B bootstrap samples from the data set. The final classification is obtained using an average or 'plurality vote' of the B classifiers as follows:

$\, h(x)= \left\{\begin{matrix} 1 & \frac{1}{B} \sum_{i=1}^{B} h_{b}(x) \geq \frac{1}{2} \\ 0 & \mathrm{otherwise} \end{matrix}\right.$

### Relation to Boosting

We can compare boosting and bagging:

First of all, we found that boosting can help us do the prodedure on stable models, but bagging may not work for stable models.

Secondly, bagging is easier to parallelize and more helpful in practice.

Many classifiers, such as trees, already have underlying functions that estimate the class probabilities at x. An alternative strategy is to average these class probabilities instead of the final classifiers. This approach can produce bagged estimates with lower variance and usually better performance.

## Decision Trees

Decision tree learning is a method commonly used in statistics, data mining and machine learning. The goal is to create a model that predicts the value of a target variable based on several input variables. It is a very flexible classifier, can classify non-linear data and it can be used for classification, regression, or both. A tree is usually used as a visual and analytical decision support tool, where the expected values of competing alternatives are calculated.

It uses principle of divide and conquer for classification. The trees have traditionally been created manually. Trees map features of a decision problem onto a conclusion, or label. We fit a tree model by minimizing some measure of impurity. For a single covariate $\ X_1$ we choose a point t on the real line that splits the real line into two sets $\ R_1 = (-\infty, t] , R_2 = [ t, \infty)$ in a way that minimizes impurity.

File:p.jpg
Node impurity for two-class classification, as a function of the proportion p in class 2. Cross-entropy has been scaled to pass through (0.5,0.5).

Let $\hat{p_s}(j)$ be the proportion of observations in $\boldsymbol R_s$ such that $\ Y_i = j$

$\hat{p_s}(j) = \frac {\sum_{i=1}^n I(Y_i = j, X_i \in \boldsymbol R_s)}{\sum_{i=1}^n I(x_i \in \boldsymbol R_s)}$

Node impurity measures (see figure to the right):

Misclassification error: $\ 1 - \hat{p_s}(j)$
Gini index:$\sum_{j \neq 1} \hat{p_s}(j)\hat{p_s}(i)$

Limitions in Decision Trees

1. Overfitting problem: Decision Trees are extremely flexible models; this flexibility means that they can easily perfectly match any training set. This makes overfitting a prime consideration when training a decision tree. There is no robust way to avoid fitting noise in the data but two common approaches include:

• do not fit all trees, stop when the training set reaches perfection
• fully grow the tree and then prune the resulting tree. Pruning algorithms include cost complexity pruning, minimum description length pruning and pessimistic pruning. This results in a tree with less branches, which can generalize better. <ref>J. R. Quinlan, Decision Trees and Decision Making, IEEE Transactions on Systems, Man and Cybernetics, vol 20, no 2, March/April 1990, pg 339-346.</ref>

2. time-consuming and complex: compare to other decision-making models, decision trees is a relatively easier tool to use, however, if the tree contains a large amount branches, it will become complex in nature and take time to solve the problem. Moreover, decision trees only examine a single field at a time, which leads to rectangular classification boxes. And the complexity adds costs to train people to have the extensive knowledge to complete the decision tree analysis. <ref> http://www.brighthub.com/office/project-management/articles/106005.aspx </ref>

Some specific decision-tree algorithms:

A comparison of bagging and boosting methods using the decision trees classifiers: [22]

### CART (Classfication and Regression Tree)

The Classification and Regression Tree (CART) is a non-parametric Decision tree learning technique that produces either classification or regression trees, depending on whetehr the dependent variable is categorical or numeric, repectively. (Wikipedia) The CART is good in working with outliers during the process. CART will isolate the outliers in a separate node.

1. Easy to understand

2. Map nicely to a set of business rules

3. Applied to real problems

4. Make no prior assumptions about the data

5. Able to process both numerical and categorical data

1. Output attribute must be categorical

2. Limited to one output attribute

3. Decision tree algorithms are unstable

4. Trees created from numeric datasets can be complex

### Ranking Features

In implementation of a tree model it is important how the features are ranked (i.e. in what order the features appear in the tree). The general way to do this is to choose the features with the highest dependence on Y to be the first feature in the tree and then going down the tree with lower dependence.

Feature ranking strategies

1. Fisher score (F-score)

• simple in nature
• efficient in measuring the the discrimination between a feature and the label.
• independent of the classifier.

2 Linear SVM Weight

The following is an algorithm based on linear SVM weights:

• input the training sets: $(x_i, y_i), i = 1, \dots l$
• obtain the sorted feature ranking list as output:
• Using grid search to find the best parameter C.
• Training a $L2-$loss linear SVM model using the best available C.
• Then features can be sorted according to the absolute values of weights.

3. Change of AUC with/without Removing Each Feature

4. Change of Accuracy with/without Removing Each Feature

5. Normalized Information Gain (difference in entropy)

### Random Forest

Decision trees are unstable. An application of bagging is to combine trees into random forest. A random forest is a classiﬁer consisting of a collection of tree-structured classiﬁers $\left \lbrace \ h(x, \Theta_k ), k = 1, . . . \right \rbrace$ where the ${\Theta_k }$ are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input $x$ <ref>Breiman N., Random Forests Machine learning [23]</ref>.

In a random forest, the trees are grown quite similarly to the standard classification tree. However, no pruning is done in the random forest technique.

Compared with other methods, random forests have some positive characteristics:

• runs faster than bagging or boosting
• has similar accuracy as Adaboost, and sometimes even better than Adaboost
• relatively robust to noise
• delivers useful estimates of error, correlation

For larger data sets, more accuracy can be obtained by combining random features with boosting.

This is how a single tree is grown:

First, suppose the number of elements in the training set is K. We then sample K elements with replacement. Second, if there are a total of N inputs to the tree, choose an integer n << N such that for each node of the tree, n variables are randomly selected from N. the best split on these n variables is used to allow the node to make a decision (hence a "decision tree"). Third, grow the tree as large as possible.

Each tree contributes one classification. That is, each tree gets one "vote" to classify an element. The beauty of random forest is that all of these votes are added up, similar to boosting, and the final decision is the result of the vote. This is an extremely robust algorithm.

There are two things that can contribute to error in random forest:

1. correlation between trees 2. the ability of an individual tree to classify well.

This is seen intuitively, since if many trees are very similar to one another, then it is likely they will all classify the elements in the same way. If a single tree is not a very good classifier, it does not matter in the long run because the other trees will compensate for its error. However, if many trees are bad classifiers, the result will be garbage.

To avoid both of the above problems, there is an algorithm to optimize n, the number of variables to use in each decision tree. Unfortunately, an optimal value is not found on its own; instead, an optimal range is found. Thus, to properly program a random forest, there is a parameter that must be "tuned". Looking at various types of error rate, this is easily found (we want to minimize error, as characterized by the Gini index, or the misclassification rate, or the entropy). [24]

An algorithm for the Random Forest can be described as below: we first let $N_trees$ to be the number of trees need to build for each of $N_trees$ iterations, then we select a new bootstrap sample from training set and grow an un-pruned tree on this bootstrap, next, at each internal node, randomly select m predictors and determine the best split using only these predictors. Finally do not perform cost complexity pruning and save tree as is, along side those built thus far. <ref> Albert A. Montillo,Guest lecture: Statistical Foundations of Data Analysis "Random Forests", April,2009. <http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf> </ref>

Boosting: <ref>Chunhua Shen; Zhihui Hao. “A direct formulation for totally-corrective multi-class boosting”. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2011.</ref>

Bagging: <ref>Xiaoyuan Su; Khoshgoftarr, T.M.; Xingquan Zhu. “VoB predictors: Voting on bagging classifications”. 19th IEEE International Conference on Pattern Recognition. 2008.</ref>

Decision Tree: <ref> Zhuowen Tu. “Probabilistic boosting-tree: learning discriminative models for classification, recognition, and clustering”. Tenth IEEE International Conference on Computer Vision. 2005.</ref>

## Graphical Models

Graphical Models are directed graphs where nodes are variables and edges are dependencies.

Y Y
$\downarrow$ $\uparrow$
Generative LDA Linear Discrimanation

Probabilistic Discriminative Models: Model posterior probability P(Y|X) directly (example: LDA).

• Obtain desired posterior probability directly
• Less parameters

• Can generate new points
• Can sample a new point

# Boltzmann Machines

## Introduction

Reference: [2]

Boltzmann machines are networks of connected nodes which, using a stochastic decision-making process, decide to be on or off. These connections need not be directive; that is, they can go back and forth between layers. This type of formulation leads the reader to think immediately of a binomial distribution, with some probability p of each node being on or off. In a classification problem, a Boltzmann Machine is presented with a set of binary vectors, each entry of the vector called a “unit”, with the goal of learning to generate these vectors. [1]

Similar to the neural networks already discussed in class, a Boltzmann Machine must assign weights to inputs, compute some combination of the weights times contributing node values, and optimize the weights such that a certain cost function (such as the relative entropy, as discussed later) is minimized. The cost function depends on the complexity of the model and the “correctness” of the classification. The main idea is to make small updates in the connection weights iteratively.

Boltzmann Machines are often used in generative models. That is, we start with some process seen in real life and try to reproduce it, with a goal of predicting future behaviour of the system by generating from the probability distribution created by the Boltzmann Machine.

## How a Boltzmann Machine Works

Suppose we start with a pattern ɣ that represents some real life dynamical system. The true probability distribution function of this system is f_ɣ. For each element in the vectors associated with this system, we create a visible unit in the Boltzmann Machine whose function is directly related to the value of that element. Then, usually, to capture higher order regularities in the pattern, we create hidden units (similar to Feed-Forward Neural Networks). Sometimes researchers choose not to use hidden units, but this leads to a lack of ability to learn high order regularity [5]. There are two possible values for each node in the Boltzmann Machine: “on” or “off”. There is a difference in energy between these states. Each node must then compute the difference in energy to see which state would be more favourable. This difference is called the “energy gap”.

Each node of the Boltzmann Machine is presented an opportunity to update its status. When a set of input vectors is shown to the layer, a computation takes place within each node to decide to convert to “on” or to remain “off”. The computation is as follows:

$\Delta E_i = E_{-1} - E_{+1} = \sum_j w_{ij}S_j$

Where $w_{ij}$ represents the weight between nodes i and j, and $S_j$ is the state of the jth component.

Then the probability that the node will adopt the “on” state is:

$P(+1) = \frac{1}{(1 + exp( \frac{\delta E}{T}))}$

Where T is the temperature of the system. The probability of any vector v being an output of the system is just the energy of the vector divided by the total energy of the system, or

$P(v) = \frac{e^{-E(v)}}{e^{E(system)}}$

And the energy of a vector is defined as:

$E({v}) = -\sum_i s^{v}_i b_i -\sum_{i\lt j} s^{v}_i s^{v}_j w_{ij}$ [1]

Simulated annealing, a method to improve the search for a global minimum, is being used here. It may not succeed in finding the global minimum on its own [3]. This may be a foreign concept to statisticians. For more information, consult [6] and [7]. The state gets changed to whichever calculation in the logistic function step yields a decrease in energy.

Eventually, through learning, the Boltzmann Machine will reach an equilibrium state, much like a Markov Chain. This equilibrium state will have a low temperature. Once equilibrium has been reached, we can estimate the probability distribution across the nodes of the Boltzmann Machine. Using this information, we can model how the dynamical system will behave in the long run.

Since the system is in equilibrium, we can use the mean value of each visible unit to build a probability model. We wouldn’t want to do these calculations before reaching equilibrium, because they would not be representative of the long-term behaviour of the system. Let this measured distribution be denoted f_δ. Then we are interested in measuring the difference between the true distribution and this measured distribution.

There are several different methods that can be used to compare distributions. One that is commonly used is the relative entropy:

$G(f_\gamma ||f_\delta) = \sum_\gamma f_\gamma ln(\frac{f_\gamma}{f_\delta})$ [5]

We want to minimize this distance, since we want the measured distribution to be as close as possible to the true distribution

## Learning in Boltzmann Machines

The Two-Phase Method

Boltzmann Machines using hidden units are very robust tools. Visible units are coupled, leading to a problem when trying to capture the effects of higher-dimensional regularities. When hidden units are introduced, the system has the ability to define and use these regularities.

One approach to learning Boltzmann Machines is discussed thoroughly in [5]. To summarize, this approach makes use of two phases.

Phase 1: Fix all visible units. Allow the hidden units to change as necessary to obtain equilibrium. Then, look at pairs of units. If two elements of a pair are both “on”, then increment the weight associated with them. So this phase consists entirely of “learning”. There is no control for spurious data.

Phase 2: No units are fixed. Allow all units to change as necessary to obtain equilibrium. Then sample the final equilibrium distribution to find reliable averages of the term s¬_i s_j. Then as before, look for pairs of units that are both “on”, and decrement the weight associated with them. So this is the phase in which spurious data are eliminated.

Alternate between these two phases. Eventually, the equilibrium distribution will be reached and we see that $“ \frac{\partial {G}}{\partial {w_{ij}}} = \frac{-1}{T} (\lt s_i s_j\gt ^{+} - \lt s_i s_j\gt ^{-})$ where $s_i s_j$ are the probabilities of finding units i and j both “on”, when the network is ‘fixed’ and ‘free-running’, respectively” [5]. Another method, for learning Deep Boltzmann Machines, is presented in [2].

## Pros and Cons of using Boltzmann Machines

Pros

• More accurate than backpropagation [5]
• Bayesian interpretation of how good a model is [5]

Cons

• Very slow, because of nested loops necessary to perform phases [5]

There are many topics on which this discussion could be expanded. For example, we could get into a more in-depth discussion of simulated annealing, or look at Restricted Boltzmann Machines (RBMs) for deep learning, or different methods of learning and different measures of error. Another interesting topic would be a discussion on mean field approximation of Boltzmann Machines, which supposedly runs faster.

References:
[1] http://www.scholarpedia.org/article/Boltzmann_machine
[2] http://www.mit.edu/~rsalakhu/papers/dbm.pdf
[3] http://mathworld.wolfram.com/SimulatedAnnealing.html
[4] http://waldron.stanford.edu/~jlm/papers/PDP/Volume%201/Chap7_PDP86.pdf
[5] http://cs.nyu.edu/~roweis/notes/boltz.pdf
[6] http://neuron.eng.wayne.edu/tarek/MITbook/chap8/8_3.html
[7] Bertsimas and Tsitsiklis. Simulated Annealing. Statistical Science. 1993. Vol. 8, No. 1, 10 – 15.

<references />