# Difference between revisions of "stat841f10"

## Proposal Fall 2010

{{

 Template:namespace detect


| type = style | image = | imageright = | style = | textstyle = | text = This article may require cleanup to meet Wikicoursenote's quality standards. The specific problem is: Provide a summary for each topic here.. Please improve this article if you can. (October 8 2010) | small = | smallimage = | smallimageright = | smalltext = }}

## Reference Textbook

The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman (3rd Edition is available)

## Classification - September 21, 2010

### Classification

Statistical classification, or simply known as classification, is an area of supervised learning that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A classifier is a specific technique or method for performing classification. To classify new data, a classifier first uses labeled (classes are known) training data to train a model, and then it uses a function known as its classification rule to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.

Classification has been an important task for people and society since the beginnings of history. According to this link, the earliest application of classification in human society was probably done by prehistory peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle (384 BC - 322 BC) when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being regression, clustering, and dimensionality reduction (feature extraction or manifold learning). Please be noted that some people consider classification to be a broad area that consists of both supervised and unsupervised methods of classifying data. In this view, as can be seen in this link, clustering is simply a special case of classification and it may be called unsupervised classification.

In classical statistics, classification techniques were developed to learn useful information using small data sets where there is usually not enough of data. When machine learning was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too many data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers, a link to a source of which can be found here.

       "We are drowning in information and starving for knowledge."
- Rutherford D. Rogers


In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.

The formal mathematical definition of classification is as follows:

Definition: Classification is the prediction of a discrete random variable $\mathcal{Y}$ from another random variable $\mathcal{X}$, where $\mathcal{Y}$ represents the label assigned to a new data input and $\mathcal{X}$ represents the known feature values of the input.

A set of training data used by a classifier to train its model consists of $\,n$ independently and identically distributed (i.i.d) ordered pairs $\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}$, where the values of the $\,ith$ training input's feature values $\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}$ is a d-dimensional vector and the label of the $\, ith$ training input is $\,Y_{i} \in \mathcal{Y}$ that can take a finite number of values. The classification rule used by a classifier has the form $\,h: \mathcal{X} \mapsto \mathcal{Y}$. After the model is trained, each new data input whose feature values is $\,x$ is given the label $\,\hat{Y}=h(x)$.

As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.

After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule $\ h$ to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.

As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data $\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}$, we could then use the classifier's classification rule $\,h$ to assign any newly-given fruit having known feature values $\,x = (\,x_{color}, x_{diameter} , x_{weight})$ the label $\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}$.

### Examples of Classification

• Email spam filtering (spam vs not spam).

• Detecting credit card fraud (fraudulent or legitimate).

• Face detection in images (face or background).

• Web page classification (sports vs politics vs entertainment etc).

• Steering an autonomous car across the US (turn left, right, or go straight).

• Medical diagnosis (classification of disease based on observed symptoms).

### Independent and Identically Distributed (iid) Data Assumption

Suppose that we have training data X which contains N data points. The Independent and Identically Distributed (IID) assumption declares that the datapoints are drawn independently from identical distributions. This assumption implies that ordering of the data points does not matter, and the assumption is used in many classification problems. For an example of data that is not IID, consider daily temperature: today's temperature is not independent of the yesterday's temperature -- rather, there is a strong correlation between the temperatures of the two days.

### Error rate

The empirical error rate (or training error rate) of a classifier having classification rule $\,h$ is defined as the frequency at which $\,h$ does not correctly classify the data inputs in the training set, i.e., it is defined as $\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})$, where $\,I$ is an indicator variable and $\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.$. Here, $\,X_{i} \in \mathcal{X}$ and $\,Y_{i} \in \mathcal{Y}$ are the known feature values and the true class of the $\,ith$ training input, respectively.

The true error rate $\,L(h)$ of a classifier having classification rule $\,h$ is defined as the probability that $\,h$ does not correctly classify any new data input, i.e., it is defined as $\,L(h)=P(h(X) \neq Y)$. Here, $\,X \in \mathcal{X}$ and $\,Y \in \mathcal{Y}$ are the known feature values and the true class of that input, respectively.

In practice, the empirical error rate is obtained to estimate the true error rate, whose value is impossible to be known because the parameter values of the underlying process cannot be known but can only be estimated using available data. The empirical error rate, in practice, estimates the true error rate quite well in that, as mentioned here, it is an unbiased estimator of the true error rate.

An Error Rate Comparison of Classification Methods [1]

### Decision Theory

we can identify three distinct approaches to solving decision problems, all of which have been used in practical applications. These are given, in decreasing order of complexity, by:

a. First solve the inference problem of determining the class-conditional densities $\ p(x|C_k)$ for each class $\ C_k$ individually. Also separately infer the prior class probabilities $\ p(C_k)$. Then use Bayes’ theorem in the form

\begin{align}p(C_k|x)=\frac{p(x|C_k)p(C_k)}{p(x)} \end{align}

to find the posterior class probabilities $\ p(C_k|x)$. As usual, the denominator in Bayes’ theorem can be found in terms of the quantities appearing in the numerator, because

\begin{align}p(x)=\sum_{k} p(x|C_k)p(C_k) \end{align}

Equivalently, we can model the joint distribution $\ p(x, C_k)$ directly and then normalize to obtain the posterior probabilities. Having found the posterior probabilities, we use decision theory to determine class membership for each new input $\ x$. Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as "generative models", because by sampling from them it is possible to generate synthetic data points in the input space.

b. First solve the inference problem of determining the posterior class probabilities $\ p(C_k|x)$, and then subsequently use decision theory to assign each new $\ x$ to one of the classes. Approaches that model the posterior probabilities directly are called "discriminative models".

c. Find a function $\ f(x)$, called a discriminant function, which maps each input $\ x$ directly onto a class label. For instance, in the case of two-class problems, $\ f(.)$ might be binary valued and such that $\ f = 0$ represents class $\ C_1$ and $\ f = 1$ represents class $\ C_2$. In this case, probabilities play no role.

This topic has brought to you from Pattern Recognition and Machine Learning by Christopher M. Bishop (Chapter 1)

### Bayes Classifier

A Bayes classifier is a simple probabilistic classifier based on applying Bayes' Theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".

In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.

Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods.

In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable efficacy of Bayes classifiers [1]. Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as boosted trees or random forests[2].

An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix.

After training its model using training data, the Bayes classifier classifies any new data input in two steps. First, it uses the input's known feature values and the Bayes formula to calculate the input's posterior probability of belonging to each class. Then, it uses its classification rule to place the input into the most-probable class, which is the one associated with the input's largest posterior probability.

In mathematical terms, for a new data input having feature values $\,(X = x)\in \mathcal{X}$, the Bayes classifier labels the input as $(Y = y) \in \mathcal{Y}$, such that the input's posterior probability $\,P(Y = y|X = x)$ is maximum over all of the members of $\mathcal{Y}$.

Suppose there are $\,k$ classes and we are given a new data input having feature values $\,x$. The following derivation shows how the Bayes classifier finds the input's posterior probability $\,P(Y = y|X = x)$ of belonging to each class $y \in \mathcal{Y}$.

\begin{align} P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\ &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)} \end{align}

Here, $\,P(Y=y|X=x)$ is known as the posterior probability as mentioned above, $\,P(Y=y)$ is known as the prior probability, $\,P(X=x|Y=y)$ is known as the likelihood, and $\,P(X=x)$ is known as the evidence.

In the special case where there are two classes, i.e., $\, \mathcal{Y}=\{0, 1\}$, the Bayes classifier makes use of the function $\,r(x)=P\{Y=1|X=x\}$ which is the posterior probability of a new data input having feature values $\,x$ belonging to the class $\,Y = 1$. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates $\,r(x)$ as follows:

\begin{align} r(x)&=P(Y=1|X=x) \\ &=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\ &=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)} \end{align}

The Bayes classifier's classification rule $\,h^*: \mathcal{X} \mapsto \mathcal{Y}$, then, is

$\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } \hat r(x)\gt \frac{1}{2} \\ 0 &\mathrm{otherwise} \end{matrix}\right.$.

Here, $\,x$ is the feature values of a new data input and $\hat r(x)$ is the estimated value of the function $\,r(x)$ given by the Bayes classifier's model after feeding $\,x$ into the model. Still in this special case of two classes, the Bayes classifier's decision boundary is defined as the set $\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}$. The decision boundary $\,D(h)$ essentially combines together the trained model and the decision function $\,h^*$, and it is used by the Bayes classifier to assign any new data input to a label of either $\,Y = 0$ or $\,Y = 1$ depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as

$\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } P(Y=1|X=x)\gt P(Y=0|X=x) \\ 0 &\mathrm{otherwise} \end{matrix}\right.$.

Bayes Classification Rule Optimality Theorem The Bayes classifier is the optimal classifier in that it results in the least possible true probability of misclassification for any given new data input, i.e., for any generic classifier having classification rule $\,h$, it is always true that $\,L(h^*(x)) \le L(h(x))$. Here, $\,L$ represents the true error rate, $\,h^*$ is the Bayes classifier's classification rule, and $\,x$ is any given data input's feature values.

Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, the estimated values of the components in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the calculated posterior probabilities of inputs to deviate quite a bit from their true values. Estimation of all these probability functions, as likelihood, prior probability, and evidence function is a very expensive task, computationally, which also makes some other classifiers more favorable than Bayes classifier.

A detailed proof of this theorem is available here.

Defining the classification rule:

In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule $\,h^*$:

1) Empirical Risk Minimization: Choose a set of classifiers $\mathcal{H}$ and find $\,h^*\in \mathcal{H}$ that minimizes some estimate of the true error rate $\,L(h^*)$.
2) Regression: Find an estimate $\hat r$ of the function $x$ and define
$\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } \hat r(x)\gt \frac{1}{2} \\ 0 &\mathrm{otherwise} \end{matrix}\right.$.
3) Density Estimation: Estimate $\,P(X=x|Y=0)$ from the $\,X_{i}$'s for which $\,Y_{i} = 0$, estimate $\,P(X=x|Y=1)$ from the $\,X_{i}$'s for which $\,Y_{i} = 1$, and estimate $\,P(Y = 1)$ as $\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}$. Then, calculate $\,\hat r(x) = \hat P(Y=1|X=x)$ and define
$\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } \hat r(x)\gt \frac{1}{2} \\ 0 &\mathrm{otherwise} \end{matrix}\right.$.

Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two.

Multi-class classification:

Suppose there are $\,k$ classes, where $\,k \ge 2$.

In the above discussion, we introduced the Bayes formula for this general case:

\begin{align} P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)} \end{align}

which can re-worded as:

\begin{align} P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i} \end{align}

Here, $\,f_y(x) = P(X=x|Y=y)$ is known as the likelihood function and $\,\pi_y = P(Y=y)$ is known as the prior probability.

In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values $\,x$ into one of the $\,k$ classes.

Theorem

Suppose that $\mathcal{Y}= \{1, \dots, k\}$, where $\,k \ge 2$. Then, the optimal classification rule is $\,h^*(x) = arg max_{i} P(Y=i|X=x)$, where $\,i \in \{1, \dots, k\}$.

Example: We are going to predict if a particular student will pass STAT 441/841. There are two classes represented by $\, \mathcal{Y}\in \{ 0,1 \}$, where 1 refers to pass and 0 refers to fail. Suppose that the prior probabilities are estimated or guessed to be $\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5$. We have data on past student performances, which we shall use to train the model. For each student, we know the following:

Whether or not the student’s GPA was greater than 3.0 (G).
Whether or not the student had a strong math background (M).
Whether or not the student was a hard worker (H).
Whether or not the student passed or failed the course. Note: these are the known y values in the training data.

These known data are summarized in the following tables:

For each student, his/her feature values is $\, x = \{G, M, H\}$ and his or her class is $\, y \in \{0, 1\}$.

Suppose there is a new student having feature values $\, x = \{0, 1, 0\}$, and we would like to predict whether he/she would pass the course. $\,\hat r(x)$ is found as follows:

$\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.05*0.5+0.2*0.5}=\frac{0.025}{0.125}=\frac{1}{5}\lt \frac{1}{2}.$

The Bayes classifier assigns the new student into the class $\, h^*(x)=0$. Therefore, we predict that the new student would fail the course.

Naive Bayes Classifier:

The naive Bayes classifier is a special (simpler) case of the Bayes classifier. It uses an extra assumption: that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. This assumption allows for an easier likelihood function $\,f_y(x)$ in the equation:

\begin{align} P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i} \end{align}

The simper form of the likelihood function seen in the naive Bayes is:

\begin{align} f_y(x) = P(X=x|Y=y) = {\prod_{i=1}^{n} P(X_{i}=x_{i}|Y=y)} \end{align}

The Bayes classifier taught in class was not the naive Bayes classifier.

### Bayesian vs. Frequentist

The Bayesian view of probability and the frequentist view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event.

The Bayesian view of probability states that, for any event E, event E has a prior probability that represents how believable event E would occur prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a belief that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability of event E's occurrence, can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the accuracy for the estimate of the probability of event E's occurrence is higher as long as there are more useful information available regarding events that are relevant to event E. The Bayesian view therefore holds that there is no intrinsic probability of occurrence associated with any event. If one adherers to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, $\,50%$ for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that lay the framework for the Bayesian view of probability is accredited to Thomas Bayes (1702–1761).

In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an intrinsic probability of occurrence associated with every event to which one can carry out many, if not an infinite number, of well-defined independent random trials. In each trial for an event, the event either occurs or it does not occur. Suppose $n_x$ denotes the number of times that an event occurs during its trials and $n_t$ denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the long run, where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., :$P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}$. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as $P(x) \approx \frac{n_x}{n_t}$. If one adherers to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow. This is because one cannot possibly carry out trials for any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher Aristotle. In his work Rhetoric, Aristotle gave the famous line "the probable is that which for the most part happens".

More information regarding the Bayesian and the frequentist schools of thought are available here. Furthermore, an interesting and informative youtube video that explains the Bayesian and frequentist views of probability is available here.

There is useful information about Machine Learning, Neural and Statistical Classification in this link [2] Machine Learning, Neural and Statistical Classification; there is some description of Classification in chapter 2 Classical Statistical Methods in chapter 3 and Modern Statistical Techniques in chapter 4.

### Extension: Statistical Classification Framework

In statistical classification, each object is represented by 'd' (a set of features) a measurement vector, and the goal of classifier becomes finding compact and disjoint regions for classes in a d-dimensional feature space. Such decision regions are defined by decision rules that are known or can be trained. The simplest configuration of a classification consists of a decision rule and multiple membership functions; each membership function represents a class. The following figures illustrate this general framework.

Simple Conceptual Classifier.

The classification process can be described as follows:

A measurement vector is input to each membership function. Membership functions feed the membership scores to the decision rule. A decision rule compares the membership scores and returns a class label.

## Linear and Quadratic Discriminant Analysis

### Introduction

Linear discriminant analysis (LDA) and the related Fisher's linear discriminant are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterize or separate two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

LDA is also closely related to principal component analysis (PCA) and factor analysis in that both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.

LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis.

### Content

First, we shall limit ourselves to the case where there are two classes, i.e. $\, \mathcal{Y}=\{0, 1\}$. In the above discussion, we introduced the Bayes classifier's decision boundary $\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}$, which represents a hyperplane that determines the class of any new data input depending on which side of the hyperplane the input lies in. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions of the data. Linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.

Let us denote the likelihood $\ P(X=x|Y=y)$ as $\ f_y(x)$ and the prior probability $\ P(Y=y)$ as $\ \pi_y$.

First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that both of the two classes have multivariate normal (Gaussian) distributions and the two classes have the same covariance matrix $\, \Sigma$. Under the assumptions of LDA, we have: $\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) \right)$. Now, to derive the Bayes classifier's decision boundary using LDA, we equate $\, P(Y=1|X=x)$ to $\, P(Y=0|X=x)$ and proceed from there. The derivation of $\,D(h^*)$ is as follows:

$\,Pr(Y=1|X=x)=Pr(Y=0|X=x)$
$\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}$ (using Bayes' Theorem)
$\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)$ (canceling the denominators)
$\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0$
$\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0$
$\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0$
$\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)$ (taking the log of both sides).
$\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0$ (expanding out)
$\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1} \mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0$ (canceling out alike terms and factoring).

It is easy to see that, under LDA, the Bayes's classifier's decision boundary $\,D(h^*)$ has the form $\,ax+b=0$ and it is linear in $\,x$. This is where the word linear in linear discriminant analysis comes from.

LDA under the two-classes case can easily be generalized to the general case where there are $\,k \ge 2$ classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes $\,m$ and $\,n$, then all we need to do is follow a derivation very similar to the one shown above, except with the classes $\,1$ and $\,0$ being replaced by the classes $\,m$ and $\,n$. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary $\,D(h^*)$ between classes $\,m$ and $\,n$ to be $\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\left( \mu_m^\top\Sigma^{-1} \mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) \right)=0$ . In addition, for any two classes $\,m$ and $\,n$ for whom we would like to find the Bayes classifier's decision boundary using LDA, if $\,m$ and $\,n$ both have the same number of data, then, in this special case, the resulting decision boundary would lie exactly halfway between the centers (means) of $\,m$ and $\,n$.

The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in this link:

Although the assumption under LDA may not hold true for most real-world data, it nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and revisited in 2000, is essentially a weighted LDA. This model has demonstrated a 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.

According to this link, some of the limitations of LDA include:

• LDA implicitly assumes that the data in each class has a Gaussian distribution.
• LDA implicitly assumes that the mean rather than the variance is the discriminating factor.
• LDA may over-fit the training data.

The following link provides a comparison of discriminant analysis and artificial neural networks [3]

#### Different Approaches to LDA

Data sets can be transformed and test vectors can be classified in the transformed space by two different approaches.

• Class-dependent transformation: This type of approach involves maximizing the ratio of between

class variance to within class variance. The main objective is to maximize this ratio so that adequate class separability is obtained. The class-specific type approach involves using two optimizing criteria for transforming the data sets independently.

• Class-independent transformation: This approach involves maximizing the ratio of overall variance

to within class variance. This approach uses only one optimizing criterion to transform the data sets and hence all data points irrespective of their class identity are transformed using this transform. In this type of LDA, each class is considered as a separate class against all other classes.

The following are some applications that use LDA and QDA:

1- Linear discriminant analysis for improved large vocabulary continuous speech recognition here

2- 2D-LDA: A statistical linear discriminant analysis for image matrix here

3- Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition here

4- The sparse discriminant vectors are useful for supervised dimension reduction for high dimensional data. Naive application of classical Fisher’s LDA to high dimensional, low sample size settings suffers from the data piling problem. In [4] they have use sparse LDA method which selects important variables for discriminant analysis and thereby yield improved classification. Introducing sparsity in the discriminant vectors is very effective in eliminating data piling and the associated overfitting problem.

## Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010

### LDA x QDA

Linear discriminant analysis[5] is a statistical method used to find the linear combination of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix $\, \Sigma$.

Quadratic Discriminant Analysis[6], on the other hand, aims to find the quadratic combination of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix $\, \Sigma$. Instead, QDA makes the assumption that each class $\, k$ has its own covariance matrix $\, \Sigma_k$.

The derivation of the Bayes classifier's decision boundary $\,D(h^*)$ under QDA is similar to that under LDA. Again, let us first consider the two-classes case where $\, \mathcal{Y}=\{0, 1\}$. This derivation is given as follows:

$\,Pr(Y=1|X=x)=Pr(Y=0|X=x)$
$\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}$ (using Bayes' Theorem)
$\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)$ (canceling the denominators)
$\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0$
$\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0$
$\,\Rightarrow \frac{1}{|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0$ (by cancellation)
$\,\Rightarrow -\frac{1}{2}\log(|\Sigma_1|)-\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)+\log(\pi_1)=-\frac{1}{2}\log(|\Sigma_0|)-\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0)+\log(\pi_0)$ (by taking the log of both sides)
$\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top\Sigma_1^{-1}x + \mu_1^\top\Sigma_1^{-1}\mu_1 - 2x^\top\Sigma_1^{-1}\mu_1 - x^\top\Sigma_0^{-1}x - \mu_0^\top\Sigma_0^{-1}\mu_0 + 2x^\top\Sigma_0^{-1}\mu_0 \right)=0$ (by expanding out)
$\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x + \mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0 - 2x^\top(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0) \right)=0$

It is easy to see that, under QDA, the decision boundary $\,D(h^*)$ has the form $\,ax^2+bx+c=0$ and it is quadratic in $\,x$. This is where the word quadratic in quadratic discriminant analysis comes from.

As is the case with LDA, QDA under the two-classes case can easily be generalized to the general case where there are $\,k \ge 2$ classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes $\,m$ and $\,n$, then all we need to do is follow a derivation very similar to the one shown above, except with the classes $\,1$ and $\,0$ being replaced by the classes $\,m$ and $\,n$. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary $\,D(h^*)$ between classes $\,m$ and $\,n$ to be $\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\log(\frac{|\Sigma_m|}{|\Sigma_n|})-\frac{1}{2}\left( x^\top(\Sigma_m^{-1}-\Sigma_n^{-1})x + \mu_m^\top\Sigma_m^{-1}\mu_m - \mu_n^\top\Sigma_n^{-1}\mu_n - 2x^\top(\Sigma_m^{-1}\mu_m-\Sigma_n^{-1}\mu_n) \right)=0$.

### Summarizing LDA and QDA

We can summarize what we have learned so far into the following theorem.

Theorem:

Suppose that $\,Y \in \{1,\dots,K\}$, if $\,f_k(x) = Pr(X=x|Y=k)$ is Gaussian, the Bayes Classifier rule is

$\,h^*(x) = \arg\max_{k} \delta_k(x)$

where,

• In the case of LDA, which assumes that a common covariance matrix is shared by all classes, $\,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k)$, and the Bayes classifier's decision boundary $\,D(h^*)$ is linear in $\,x$.
• In the case of QDA, which assumes that each class has its own covariance matrix, $\,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k)$, and the Bayes classifier's decision boundary $\,D(h^*)$ is quadratic in $\,x$.

Note $\,\arg\max_{k} \delta_k(x)$returns the set of k for which $\,\delta_k(x)$ attains its largest value.

### In practice

We need to estimate the prior, so in order to do this, we use the Maximum Likelihood estimates from the sample for $\,\pi,\mu_k,\Sigma_k$ in place of their true values, i.e.

File:estimation.png
Estimation of the probability of belonging to either class k or l

$\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}$

$\,\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i$

$\,\hat{\Sigma_k} = \frac{1}{n_k}\sum_{i:y_i=k}(x_i-\hat{\mu_k})(x_i-\hat{\mu_k})^\top$

Common covariance, denoted $\Sigma$, is defined as the weighted average of the covariance for each class.

In the case where we need a common covariance matrix, we get the estimate using the following equation:

$\,\Sigma=\frac{\sum_{r=1}^{k}(n_r\Sigma_r)}{\sum_{l=1}^{k}(n_l)}$

Where: $\,n_r$ is the number of data points in class r, $\,\Sigma_r$ is the covariance of class r and $\,n$ is the total number of data points, $\,k$ is the number of classes.

See the details about the estimation of covarience matrices.

### Computation For QDA And LDA

First, let us consider QDA, and examine each of the following two cases.

Case 1: (Example) $\, \Sigma_k = I$

$\, \Sigma_k = I$ for every class $\,k$ implies that our data is spherical. This means that the data of each class $\,k$ is distributed symmetrically around the center $\,\mu_k$, i.e. the isocontours are all circles.

We have:

$\,\delta_k = - \frac{1}{2}log(|I|) - \frac{1}{2}(x-\mu_k)^\top I(x-\mu_k) + log (\pi_k)$

We see that the first term in the above equation, $\,\frac{-1}{2}log(|I|)$, is zero since $\ |I|=1$. The second term contains $\, (x-\mu_k)^\top I(x-\mu_k) = (x-\mu_k)^\top(x-\mu_k)$, which is the squared Euclidean distance between $\,x$ and $\,\mu_k$. Therefore we can find the distance between a point and each center and adjust it with the log of the prior, $\,log(\pi_k)$. The class that has the minimum distance will maximize $\,\delta_k$. According to the theorem, we can then classify the point to a specific class $\,k$.

Case 2: (General Case) $\, \Sigma_k \ne I$

We can decompose this as:

$\, \Sigma_k = U_kS_kV_k^\top = U_kS_kU_k^\top$ (In general when $\,X=U_kS_kV_k^\top$, $\,U_k$ is the eigenvectors of $\,X_kX_k^T$ and $\,V_k$ is the eigenvectors of $\,X_k^\top X_k$. So if $\, X_k$ is symmetric, we will have $\, U_k=V_k$. Here $\, \Sigma_k$ is symmetric, because it is the covariance matrix of $X_k$) and the inverse of $\,\Sigma_k$ is

$\, \Sigma_k^{-1} = (U_kS_kU_k^\top)^{-1} = (U_k^\top)^{-1}S_k^{-1}U_k^{-1} = U_kS_k^{-1}U_k^\top$ (since $\,U_k$ is orthonormal)

So from the formula for $\,\delta_k$, the second term is

\begin{align} (x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)&= (x-\mu_k)^\top U_kS_k^{-1}U_k^T(x-\mu_k)\\ & = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-1}(U_k^\top x-U_k^\top \mu_k)\\ & = (U_k^\top x-U_k^\top\mu_k)^\top S_k^{-\frac{1}{2}}S_k^{-\frac{1}{2}}(U_k^\top x-U_k^\top\mu_k) \\ & = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top I(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\ & = (S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top\mu_k)^\top(S_k^{-\frac{1}{2}}U_k^\top x-S_k^{-\frac{1}{2}}U_k^\top \mu_k) \\ \end{align}

where we have the squared Euclidean distance between $\, S_k^{-\frac{1}{2}}U_k^\top x$ and $\, S_k^{-\frac{1}{2}}U_k^\top\mu_k$.

A transformation of all the data points can be done from $\,x$ to $\,x^*$ where $\, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x$.

A similar transformation of all the centers can be done from $\,\mu_k$ to $\,\mu_k^*$ where $\, \mu_k^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top \mu_k$.

It is now possible to do classification with $\,x^*$ and $\,\mu_k^*$, treating them as in Case 1 above. This strategy is correct because by transforming $\, x$ to $\,x^*$ where $\, x^* \leftarrow S_k^{-\frac{1}{2}}U_k^\top x$, the new variable variance is $I$

Note that when we have multiple classes, we also need to compute $\, log{|\Sigma_k|}$ respectively. Then we compute $\,\delta_k$ for QDA .

Note that when we have multiple classes, they must all have the same transformation, in another word, have same covariance $\,\Sigma_k$,else, ahead of time we would have to assume a data point belongs to one class or the other. All classes therefore need to have the same shape for classification to be applicable using this method. So this method works for LDA.

If the classes have different shapes, in another word, have different covariance $\,\Sigma_k$, can we use the same method to transform all data points $\,x$ to $\,x^*$?

The answer is Yes. Consider that you have two classes with different shapes. Given a data point, justify which class this point belongs to. You just do the transformations corresponding to the 2 classes respectively, then you get $\,\delta_1 ,\delta_2$ ,then you determine which class the data point belongs to by comparing $\,\delta_1$ and $\,\delta_2$ .

In summary, to apply QDA on a data set $\,X$, in the general case where $\, \Sigma_k \ne I$ for each class $\,k$, one can proceed as follows:

Step 1: For each class $\,k$, apply singular value decomposition on $\,X_k$ to obtain $\,S_k$ and $\,U_k$.
Step 2: For each class $\,k$, transform each $\,x$ belonging to that class to $\,x_k^* = S_k^{-\frac{1}{2}}U_k^\top x$, and transform its center $\,\mu_k$ to $\,\mu_k^* = S_k^{-\frac{1}{2}}U_k^\top \mu_k$.
Step 3: For each data point $\,x \in X$, find the squared Euclidean distance between the transformed data point $\,x_k^*$ and the transformed center $\,\mu_k^*$ of each class $\,k$, and assign $\,x$ to class $\,k$ such that the squared Euclidean distance between $\,x_k^*$ and $\,\mu_k^*$ is the least for all possible $\,k$'s.

Now, let us consider LDA. Here, one can derive a classification scheme that is quite similar to that shown above. The main difference is the assumption of a common variance across the classes, so we perform the Singular Value Decomposition once, as opposed to k times.

To apply LDA on a data set $\,X$, one can proceed as follows:

Step 1: Apply singular value decomposition on $\,X$ to obtain $\,S$ and $\,U$.
Step 2: For each $\,x \in X$, transform $\,x$ to $\,x^* = S^{-\frac{1}{2}}U^\top x$, and transform each center $\,\mu$ to $\,\mu^* = S^{-\frac{1}{2}}U^\top \mu$.
Step 3: For each data point $\,x \in X$, find the squared Euclidean distance between the transformed data point $\,x^*$ and the transformed center $\,\mu^*$ of each class, and assign $\,x$ to the class such that the squared Euclidean distance between $\,x^*$ and $\,\mu^*$ is the least over all of the classes.

Kernel QDA In actual data scenarios, it is generally true that QDA will provide a better classifier for the data then LDA because QDA does not assume that the covariance matrix for each class is identical, as LDA assumes. However, QDA still assumes that the class conditional distribution is Gaussian, which is not always the case in real-life scenarios. The link provided at the beginning of this paragraph describes a kernel-based QDA method which does not have the Gaussian distribution assumption.

### The Number of Parameters in LDA and QDA

Both LDA and QDA require us to estimate parameters. The more estimation we have to do, the less robust our classification algorithm will be.

LDA: Since we just need to compare the differences between one given class and remaining $\,K-1$ classes, totally, there are $\,K-1$ differences. For each of them, $\,a^{T}x+b$ requires $\,d+1$ parameters. Therefore, there are $\,(K-1)\times(d+1)$ parameters.

QDA: For each of the differences, $\,x^{T}ax + b^{T}x + c$ requires $\frac{1}{2}(d+1)\times d + d + 1 = \frac{d(d+3)}{2}+1$ parameters. Therefore, there are $(K-1)(\frac{d(d+3)}{2}+1)$ parameters.

A plot of the number of parameters that must be estimated, in terms of (K-1). The x-axis represents the number of dimensions in the data. As is easy to see, QDA is far less robust than LDA for high-dimensional data sets.

Discriminant analysis (DA) is widely used in classification problems. Except LDA and QDA, there is also an intermediate method between LDA and QDA, a regularized version of discriminant analysis (RDA) proposed by Friedman [1989], and it has been shown to be more flexible in dealing with various class distributions. RDA applies the regularization techniques by using two regularization parameters, which are selected to jointly maximize the classification performance. The optimal pair of parameters is commonly estimated via cross-validation from a set of candidate pairs. More detail about this method can be found in the book by Hastie et al. [2001]. On the other hand, the time of computing last long for high dimensional data, especially when the candidate set is large, which limits the applications of RDA to low dimensional data. In 2006, Ye Jieping and Wang Tie develop a novel algorithm for RDA for high dimensional data. It can estimate the optimal regularization parameters from a large set of parameter candidates efficiently. Experiments on a variety of datasets confirm the claimed theoretical estimate of the efficiency, and also show that, for a properly chosen pair of regularization parameters, RDA performs favourably in classification, in comparison with other existing classification methods. For more details, see Ye, Jieping; Wang, Tie Regularized discriminant analysis for high dimensional, low sample size data Conference on Knowledge Discovery in Data: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining; 20-23 Aug. 2006

### Further Reading for Regularized Discriminant Analysis (RDA)

1. Regularized Discriminant Analysis and Reduced-Rank LDA [7]

2. Regularized discriminant analysis for the small sample size in face recognition [8]

3. Regularized Discriminant Analysis and Its Application in Microarrays [9]

## Trick: Using LDA to do QDA - September 28, 2010

There is a trick that allows us to use the linear discriminant analysis (LDA) algorithm to generate as its output a quadratic function that can be used to classify data. This trick is similar to, but more primitive than, the Kernel trick that will be discussed later in the course.

Essentially, the trick involves adding one or more new features (i.e. new dimensions) that just contain our original data projected to that dimension. We then do LDA on our new higher-dimensional data. The answer provided by LDA can then be collapsed onto a lower dimension, giving us a quadratic answer.

### Motivation

Why would we want to use LDA over QDA? In situations where we have fewer data points, LDA turns out to be more robust.

If we look back at the equations for LDA and QDA, we see that in LDA we must estimate $\,\mu_1$, $\,\mu_2$ and $\,\Sigma$. In QDA we must estimate all of those, plus another $\,\Sigma$; the extra $\,\frac{d(d-1)}{2}$ estimations make QDA less robust with fewer data points.

### Theoretically

Suppose we can estimate some vector $\underline{w}^T$ such that

$y = \underline{w}^T\underline{x}$

where $\underline{w}$ is a d-dimensional column vector, and $\underline{x}\ \epsilon\ \mathbb{R}^d$ (vector in d dimensions).

We also have a non-linear function $g(x) = y = \underline{x}^Tv\underline{x} + \underline{w}^T\underline{x}$ that we cannot estimate.

Using our trick, we create two new vectors, $\,\underline{w}^*$ and $\,\underline{x}^*$ such that:

$\underline{w}^{*T} = [w_1,w_2,...,w_d,v_1,v_2,...,v_d]$

and

$\underline{x}^{*T} = [x_1,x_2,...,x_d,{x_1}^2,{x_2}^2,...,{x_d}^2]$

We can then estimate a new function, $g^*(\underline{x},\underline{x}^2) = y^* = \underline{w}^{*T}\underline{x}^*$.

Note that we can do this for any $\, x$ and in any dimension; we could extend a $D \times n$ matrix to a quadratic dimension by appending another $D \times n$ matrix with the original matrix squared, to a cubic dimension with the original matrix cubed, or even with a different function altogether, such as a $\,sin(x)$ dimension. Pay attention, We don't do QDA with LDA. If we try QDA directly on this problem the resulting decision boundary will be different. Here we try to find a nonlinear boundary for a better possible boundary but it is different with general QDA method. We can call it nonlinear LDA.

### By Example

Let's use our trick to do a quadratic analysis of the 2_3 data using LDA.

>> load 2_3;
>> [U, sample] = princomp(X');
>> sample = sample(:,1:2);

We start off the same way, by using PCA to reduce the dimensionality of our data to 2.
>> X_star = zeros(400,4);
>> X_star(:,1:2) = sample(:,:);
>> for i=1:400
for j=1:2
X_star(i,j+2) = X_star(i,j)^2;
end
end

This projects our sample into two more dimensions by squaring our initial two dimensional data set.
>> group = ones(400,1);
>> group(201:400) = 2;
>> [class, error, POSTERIOR, logp, coeff] = classify(X_star, X_star, group, 'linear');
>> sum (class==group)
ans =
375

We can now display our results.
>> k = coeff(1,2).const;
>> l = coeff(1,2).linear;
>> f = sprintf('0 = %g+%g*x+%g*y+%g*(x)^2+%g*(y)^2', k, l(1), l(2),l(3),l(4));
>> ezplot(f,[min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);

The plot shows the quadratic decision boundary obtained using LDA in the four-dimensional space on the 2_3.mat data. Counting the blue and red points that are on the wrong side of the decision boundary, we can confirm that we have correctly classified 375 data points.
Not only does LDA give us a better result than it did previously, it actually beats QDA, which only correctly classified 371 data points for this data set. Continuing this procedure by adding another two dimensions with $x^4$ (i.e. we set X_star(i,j+2) = X_star(i,j)^4) we can correctly classify 376 points.

### Working Example - Diabetes Data Set

Let's take a look at specific data set. This is a diabetes data set from the UC Irvine Machine Learning Repository. It is a fairly small data set by today's standards. The original data had eight variable dimensions. What I did here was to obtain the two prominent principal components from these eight variables. Instead of using the original eight dimensions we will just use these two principal components for this example.

The Diabetes data set has two types of samples in it. One sample type are healthy individuals the other are individuals with a higher risk of diabetes. Here are the prior probabilities estimated for both of the sample types, first for the healthy individuals and second for those individuals at risk:

The first type has a prior probability estimated at 0.651. This means that among the data set, (250 to 300 data points), about 65% of these belong to class one and the other 35% belong to class two. Next, we computed the mean vector for the two classes separately:

Then we computed File:eq3.jpg using the formulas discussed earlier.

Once we have done all of this, we compute the linear discriminant function and found the classification rule. Classification rule:File:eq4.jpg

In this example, if you give me an $\, x$, I then plug this value into the above linear function. If the result is greater than or equal to zero, I claim that it is in class one. Otherwise, it is in class two. Below is a scatter plot of the dominant principle components. The two classes are represented. The first, without diabetes, is shown with red stars (class 1), and the second class, with diabetes, is shown with blue circles (class 2). The solid line represents the classification boundary obtained by LDA. It appears the two classes are not that well separated. The dashed or dotted line is the boundary obtained by linear regression of indicator matrix. In this case, the results of the two different linear boundaries are very close.

It is always good practice to visualize the scheme to check for any obvious mistakes.

• Within training data classification error rate: 28.26%. • Sensitivity: 45.90%. • Specificity: 85.60%.

Below is the contour plot for the density of the diabetes data (the marginal density for $\, x$ is a mixture of two Gaussians, 2 classes). It looks like a single Gaussian distribution. The reason for this is that the two classes are so close together that they merge into a single mode.

### LDA and QDA in Matlab

We have examined the theory behind Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) above; how do we use these algorithms in practice? Matlab offers us a function called classify that allows us to perform LDA and QDA quickly and easily.

In class, we were shown an example of using LDA and QDA on the 2_3 data that is used in the first assignment. The code below applies LDA to the same data set and reproduces that example, slightly modified, and explains each step.

>> load 2_3;
>> [U, sample] = princomp(X');
>> sample = sample(:,1:2);

First, we do principal component analysis (PCA) on the 2_3 data to reduce the dimensionality of the original data from 64 dimensions to 2. Doing this makes it much easier to visualize the results of the LDA and QDA algorithms.

>> plot (sample(1:200,1), sample(1:200,2), '.');
>> hold on;
>> plot (sample(201:400,1), sample(201:400,2), 'r.');

Recall that in the 2_3 data, the first 200 elements are images of the number two handwritten and the last 200 elements are images of the number three handwritten. This code sets up a plot of the data such that the points that represent a 2 are blue, while the points that represent a 3 are red.
See title and legend for information on adding the title and legend.
Before using classify we can set up a vector that contains the actual labels for our data, to train the classification algorithm. If we don't know the labels for the data, then the element in the group vector should be an empty string or NaN. (See grouping data for more information.)
>> group = ones(400,1);
>> group(201:400) = 2;

We can now classify our data.
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'linear');

The full details of this line can be examined in the Matlab help file linked above. What we care about are class, which contains the labels that the algorithm thinks that each data point belongs to, and coeff, which contains information about the line that the algorithm created to separate the data into the two classes.
We can see the efficacy of the algorithm by comparing class to group.
>> sum (class==group)
ans =
369

This compares the value in class to the value in group. The answer of 369 tells us that the algorithm correctly determined the classes of the points 369 times, out of a possible 400 data points. This gives us an empirical error rate of 0.0775.
We can see the line produced by LDA using coeff.
>> k = coeff(1,2).const;
>> l = coeff(1,2).linear;
>> f = sprintf('0 = %g+%g*x+%g*y', k, l(1), l(2));
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);

Those familiar with the programming language C will find the sprintf line refreshingly familiar; those with no exposure to C are directed to Matlab's sprintf page. Essentially, this code sets up the equation of the line in the form 0 = a + bx + cy. We then use the ezplot function to plot the line.
The 2-3 data after LDA is performed. The line shows where the two classes are split.
Let's perform the same steps, except this time using QDA. The main difference with QDA is a slightly different call to classify, and a more complicated procedure to plot the line.
>> [class, error, POSTERIOR, logp, coeff] = classify(sample, sample, group, 'quadratic');
>> sum (class==group)
ans =
371
>> k = coeff(1,2).const;
>> l = coeff(1,2).linear;
>> f = sprintf('0 = %g+%g*x+%g*y+%g*x^2+%g*x*y+%g*y^2', k, l(1), l(2), q(1,1), q(1,2)+q(2,1), q(2,2));
>> ezplot(f, [min(sample(:,1)), max(sample(:,1)), min(sample(:,2)), max(sample(:,2))]);

The 2-3 data after QDA is performed. The curved line shows where QDA splits the two classes. Note that QDA is only correct in 2 more data points compared to LDA; we can see a blue point and a red point that lie on the correct side of the curve produced by QDA that do not lie on the correct side of the line produced by LDA.

classify can also be used with other discriminant analysis algorithms. The steps laid out above would only need to be modified slightly for those algorithms.

Recall: An analysis of the function of princomp in matlab.
In our assignment 1, we learned how to perform Principal Component Analysis using the SVD (Singular Value Decomposition) method. In fact, matlab offers a built-in function called princomp which performs PCA. From the matlab help file on princomp, you can find the details about this function. Here we will analyze Matlab's princomp() code. We find something different than the SVD method we used on our first assignment. The following is Matlab's code for princomp followed by some explanations to emphasize some key steps.

   function [pc, score, latent, tsquare] = princomp(x);
%   PRINCOMP Principal Component Analysis (centered and scaled data).
%   [PC, SCORE, LATENT, TSQUARE] = PRINCOMP(X) takes a data matrix X and
%   returns the principal components in PC, the so-called Z-scores in SC
%   ORES, the eigenvalues of the covariance matrix of X in LATENT,
%   and Hotelling's T-squared statistic for each data point in TSQUARE.
%   Reference: J. Edward Jackson, A User's Guide to Principal Components
%   John Wiley & Sons, Inc. 1991 pp. 1-25.
%   B. Jones 3-17-94
%   Copyright 1993-2002 The MathWorks, Inc.
%   $Revision: 2.9$  $Date: 2002/01/17 21:31:45$
[m,n] = size(x);       %  get the lengh of the rows and columns of matrix x.
r = min(m-1,n);        %  max possible rank of X
avg = mean(x);         %  the mean of every column of X
centerx = (x - avg(ones(m,1),:));
%  centers X by subtracting off column means
[U,latent,pc] = svd(centerx./sqrt(m-1),0);
%  "economy size" decomposition
score = centerx*pc;
%  the representation of X in the principal component space
if nargout < 3
return;
end
latent = diag(latent).^2;
if (r   latent = [latent(1:r); zeros(n-r,1)];
score(:,r+1:end) = 0;
end
if nargout < 4
return;
end
tmp = sqrt(diag(1./latent(1:r)))*score(:,1:r)';
tsquare = sum(tmp.*tmp)';


We should compare the following aspects of the above code with the SVD method:

First, Rows of $\,X$ correspond to observations, columns to variables. When using princomp on 2_3 data in assignment 1, note that we take the transpose of $\,X$.

 >> load 2_3;
>> [U, score] = princomp(X');


Second, princomp centers X by subtracting off column means.

The third, when $\,X=UdV'$, princomp uses $\,V$ as coefficients for principal components, rather than $\,U$.

The following is an example to perform PCA using princomp and SVD respectively to get the same result.

SVD method
 >> load 2_3
>> mn=mean(X,2);
>> X1=X-repmat(mn,1,400);
>> [s d v]=svd(X1');
>> y=X1'*v;

princomp
 >>[U score]=princomp(X');


Then we can see that y=score, v=U.

useful resources: LDA and QDA in Matlab[10],[11],[12]

## Reference

1. Harry Zhang. The optimality of naive bayes. FLAIRS Conference. AAAI Press, 2004

2. Rich Caruana and Alexandru N. Mizil. An empirical comparison of supervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006, ACM.

### Related links to LDA & QDA

LDA:[13]

QDA:[15]

Using discriminant analysis for multi-class classification: an experimental investigation [16]

### Reference articles on solving a small sample size problem when LDA is applied

( Based on Li-Fen Chen, Hong-Yuan Mark Liao, Ming-Tat Ko, Ja-Chen Lin, Gwo-Jong Yu A new LDA-based face recognition system which can solve the small sample size problem Pattern Recognition 33 (2000) 1713-1726 )

Small sample size indicates that the number of samples is smaller than the dimension of each sample. In this case, the within-class covariance we stated in class could be a singular matrix and naturally we cannot find its inverse matrix for further analysis.However, many researchers tried to solve it by different techniques:
1.Goudail et al. proposed a technique which calculated 25 local autocorrelation coefficients from each sample image to achieve dimensionality reduction. (Referenced by F. Goudail, E. Lange, T. Iwamoto, K. Kyuma, N. Otsu, Face recognition system using local autocorrelations and multiscale integration, IEEE Trans. Pattern Anal. Mach. Intell. 18 (10) (1996) 1024-1028.)
2.Swets and Weng applied the PCA approach to accomplish reduction of image dimensionality. (Referenced by D. Swets, J. Weng, Using discriminant eigen features for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell.18 (8) (1996) 831-836.)
3.Fukunaga proposed a more efficient algorithm and calculated eigenvalues and eigenvectors from an m*m matrix, where n is the dimensionality of the samples and m is the rank of the within-class scatter matrix Sw. (Referenced by K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, 1990.)
4.Tian et al. used a positive pseudoinverse matrix instead of calculating the inverse matrix Sw. (Referenced by Q. Tian, M. Barbero, Z.H. Gu, S.H. Lee, Image classification by the Foley-Sammon transform, Opt. Eng. 25 (7) (1986) 834-840.)
5.Hong and Yang tried to add the singular value perturbation in Sw and made Sw a nonsingular matrix. (Referenced by Zi-Quan Hong, Jing-Yu Yang, Optimal discriminant plane for a small number of samples and design method of classifier on the plane, Pattern Recognition 24 (4) (1991) 317-324)
6.Cheng et al. proposed another method based on the principle of rank decomposition of matrices. The above three methods are all based on the conventional Fisher's criterion function. (Referenced by Y.Q. Cheng, Y.M. Zhuang, J.Y. Yang, Optimal fisher discriminant analysis using the rank decomposition, Pattern Recognition 25 (1) (1992) 101-111.)
7.Liu et al. modified the conventional Fisher's criterion function and conducted a number of researches based on the new criterion function. They used the total scatter matrix as the divisor of the original Fisher's function instead of merely using the within-class scatter matrix. (Referenced by K. Liu, Y. Cheng, J. Yang, A generalized optimal set of discriminant vectors, Pattern Recognition 25 (7) (1992) 731-739.)

## Principal Component Analysis - September 30, 2010

### Brief introduction on dimension reduction method

Dimension reduction is a process to reduce the number of variables of the data by some techniques. Principal components analysis (PCA) and factor analysis are two primary classical methods on dimension reduction. PCA is a method to create some new variables by a linear combination of the variables in the data and the number of new variables depends on what proportion of the variance the new ones contribute. On the contrary, factor analysis method tries to express the old variables by the linear combination of new variables. So before creating the expressions, a certain number of factors should be determined firstly by analysis on the features of old variables. In general, the idea of both PCA and factor analysis is to use as less as possible mixed variables to reflect as more as possible information.

### Rough definition

Keepings two important aspects of data analysis in mind:

• Reducing covariance in data
• Preserving information stored in data(Variance is a source of information)

Principal component analysis (PCA) is a dimensionality-reduction method invented by Karl Pearson in 1901 [17]. Depending on where this methodology is applied, other common names of PCA include the Karhunen–Loève transform (KLT) , the Hotelling transform, and the proper orthogonal decomposition (POD). PCA is the simplist eigenvector-based multivariate analysis. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA works by using a user-defined number of the most important directions of variation (dimensions or principal components) of the data to project the data onto these directions so as to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation of our data is usually much easier to visualize and it also exhibits the most informative aspects (dimensions) of our data whilst capturing as much of the variation exhibited by our data as it possibly could.

Furthermore, if one considers the lower dimensional representation produced by PCA as a least square fit of our original data, then it can also be easily shown that this representation is the one that minimizes the reconstruction error of our data. It should be noted, however, that one usually does not have control over which dimensions PCA deems to be the most informative for a given set of data, and thus one usually does not know which dimensions PCA should be selected to be the most informative dimensions in order to create the lower-dimensional representation.

Suppose $\,X$ is our data matrix containing $\,d$-dimensional data. The idea behind PCA is to apply singular value decomposition to $\,X$ to replace the rows of $\,X$ by a subset of it that captures as much of the variance in $\,X$ as possible. First, through the application of singular value decomposition to $\,X$, PCA obtains all of our data's directions of variation. These directions would also be ordered from left to right, with the leftmost directions capturing the most amount of variation in our data and the rightmost directions capturing the least amount. Then, PCA uses a subset of these directions to map our data from its original space to a lower-dimensional space.

By applying singular value decomposition to $\,X$, $\,X$ is decomposed as $\,X = U\Sigma V^T \,$. The $\,d$ columns of $\,U$ are the eigenvectors of $\,XX^T \,$. The $\,d$ columns of $\,V$ are the eigenvectors of $\,X^TX \,$. The $\,d$ diagonal values of $\,\Sigma$ are the square roots of the eigenvalues of $\,XX^T \,$ (also of $\,X^TX \,$), and they correspond to the columns of $\,U$ (also of $\,V$).

We are interested in $\,U$, whose $\,d$ columns are the $\,d$ directions of variation of our data. Ordered from left to right, the $\,ith$ column of $\,U$ is the $\,ith$ most informative direction of variation of our data. That is, the $\,ith$ column of $\,U$ is the $\,ith$ most effective column in terms of capturing the total variance exhibited by our data. A subset of the columns of $\,U$ is used by PCA to reduce the dimensionality of $\,X$ by projecting $\,X$ onto the columns of this subset. In practice, when we apply PCA to $\,X$ to reduce the dimensionality of $\,X$ from $\,d$ to $\,k$, where $k \lt d\,$, we would proceed as follows:

Step 1: Center $\,X$ so that it would have zero mean.
Step 2: Apply singular value decomposition to $\,X$ to obtain $\,U$.
Step 3: Suppose we denote the resulting $\,k$-dimensional representation of $\,X$ by $\,Y$. Then, $\,Y$ is obtained as $\,Y = U_k^TX$. Here, $\,U_k$ consists of the first (leftmost) $\,k$ columns of $\,U$ that correspond to the $\,k$ largest diagonal elements of $\,\Sigma$.

PCA takes a sample of $\, d$ - dimensional vectors and produces an orthogonal(zero covariance) set of $\, d$ 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.

Then we can preserve most of the variance in the sample in a lower dimension by choosing the first $\, k$ Principle Components and approximating the data in $\, k$ - dimensional space, which is easier to analyze and plot.

### Principal Components of handwritten digits

Suppose that we have a set of 130 images (28 by 23 pixels) of handwritten threes.

We can represent each image as a vector of length 644 ($644 = 23 \times 28$). Then we can represent the entire data set as a 644 by 130 matrix, shown below. Each column represents one image (644 rows = 644 pixels).

Using PCA, we can approximate the data as the product of two smaller matrices, which I will call $V \in M_{644,2}$ and $W \in M_{2,103}$. If we expand the matrix product then each image is approximated by a linear combination of the columns of V: $\hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2$, where $\lambda = [\lambda_1, \lambda_2]^T$ is a column of W.

To demonstrate this process, we can compare the images of 2s and 3s. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.

Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.

The first coefficient represents the width of the entire digit, and the second coefficient represents the slant of each handwritten digit.

### Derivation of the first Principle Component

For finding the direction of maximum variation, Let \begin{align}\textbf{w}\end{align} be an arbitrary direction, \begin{align}\textbf{x}\end{align} a data point, and \begin{align}\displaystyle u\end{align} be the length of the projection of \begin{align}\textbf{x}\end{align} in the direction \begin{align}\textbf{w}\end{align}.

\begin{align} \textbf{w} &= [w_1, \ldots, w_D]^T \\ \textbf{x} &= [x_1, \ldots, x_D]^T \\ u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}} \end{align}

The direction \begin{align}\textbf{w}\end{align} is the same as \begin{align}c\textbf{w}\end{align}, for any scalar $c$, so without loss of generality we assume that:

\begin{align} |\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\ u &= \textbf{w}^T \textbf{x}. \end{align}

Let $x_1, \ldots, x_D$ be random variables, then we set our goal as to maximize the variance of $u$,

$\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w}.$

For a finite data set we replace the covariance matrix $\Sigma$ by $s$. The sample covariance matrix

$\textrm{var}(u) = \textbf{w}^T s\textbf{w} .$

The above mentioned variable is the variance of \begin{align}\displaystyle u \end{align} formed by the weight vector \begin{align}\textbf{w} \end{align}. The first principal component is the vector \begin{align}\textbf{w} \end{align} that maximizes the variance,

$\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right)$

where arg max denotes the value of \begin{align}\textbf{w} \end{align} that maximizes the function. Our goal is to find the weight \begin{align}\textbf{w} \end{align} that maximizes this variability, subject to a constraint. Since our function is convex, it has no maximum value. Therefore we need to add a constraint that restricts the length of \begin{align}\textbf{w} \end{align}. However, we are only interested in the direction of the variability, so the problem becomes

$\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right)$

s.t. $\textbf{w}^T \textbf{w} = 1.$

Notice,

$\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \| = \| s \|.$

Therefore the variance is bounded, so the maximum exists. We find the this maximum using the method of Lagrange multipliers.

#### Lagrange Multiplier

Before we can proceed, we must review Lagrange multipliers.

"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]

To find the maximum (or minimum) of a function $\displaystyle f(x,y)$ subject to constraints $\displaystyle g(x,y) = 0$, we define a new variable $\displaystyle \lambda$ called a Lagrange Multiplier and we form the Lagrangian,

$\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)$

If $\displaystyle (x^*,y^*)$ is the max of $\displaystyle f(x,y)$, there exists $\displaystyle \lambda^*$ such that $\displaystyle (x^*,y^*,\lambda^*)$ is a stationary point of $\displaystyle L$ (partial derivatives are 0).
In addition $\displaystyle (x^*,y^*)$ is a point in which functions $\displaystyle f$ and $\displaystyle g$ touch but do not cross. At this point, the tangents of $\displaystyle f$ and $\displaystyle g$ are parallel or gradients of $\displaystyle f$ and $\displaystyle g$ are parallel, such that:

$\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g$

where,
$\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow$ the gradient of $\, f$
$\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow$ the gradient of $\, g$

#### Example

Suppose we wish to maximize the function $\displaystyle f(x,y)=x-y$ subject to the constraint $\displaystyle x^{2}+y^{2}=1$. We can apply the Lagrange multiplier method on this example; the lagrangian is:

$\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)$

We want the partial derivatives equal to zero:

$\displaystyle \frac{\partial L}{\partial x}=1+2 \lambda x=0$

$\displaystyle \frac{\partial L}{\partial y}=-1+2\lambda y=0$

$\displaystyle \frac{\partial L}{\partial \lambda}=x^2+y^2-1$

Solving the system we obtain 2 stationary points: $\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)$ and $\displaystyle (-\sqrt{2}/2,\sqrt{2}/2)$. In order to understand which one is the maximum, we just need to substitute it in $\displaystyle f(x,y)$ and see which one as the biggest value. In this case the maximum is $\displaystyle (\sqrt{2}/2,-\sqrt{2}/2)$.

#### Determining W

Back to the original problem, from the Lagrangian we obtain,

$\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)$

If $\textbf{w}^T \textbf{w}$ is a unit vector then the second part of the equation is 0.

If $\textbf{w}^T \textbf{w}$ is not a unit vector then the second part of the equation increases. Thus decreasing overall $\displaystyle L(\textbf{w},\lambda)$. Maximization happens when $\textbf{w}^T \textbf{w} =1$

(Note that to take the derivative with respect to w below, $\textbf{w}^T S \textbf{w}$ can be thought of as a quadratic function of w, hence the 2sw below. For more matrix derivatives, see section 2 of the Matrix Cookbook)

Taking the derivative with respect to w, we get:

$\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w}$

Set $\displaystyle \frac{\partial L}{\partial \textbf{w}} = 0$, we get

$\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^*$

From the eigenvalue equation $\, \textbf{w}^*$ is an eigenvector of S and $\, \lambda^*$ is the corresponding eigenvalue of S. If we substitute $\displaystyle\textbf{w}^*$ in $\displaystyle \textbf{w}^T S\textbf{w}$ we obtain,

$\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^*$

In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, u1 to have the maximum variance
(i.e. capturing as much variability in $\displaystyle x_1, x_2,...,x_D$ as possible.) Subsequent principal components will take up successively smaller parts of the total variability.

D dimensional data will have D eigenvectors

$\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D$ where each $\, \lambda_i$ represents the amount of variation in direction $\, i$

so that

$Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)$

If two eigenvalues happen to be equal, then the data has the same amount of variation in each of the two directions that they correspond to with. If only one of the two equal eigenvalues are to be chosen for dimensionality reduction, then either will do. Note that if ALL of the eigenvalues are the same then this means that the data is on the surface of a d-dimensional sphere (all directions have the same amount of variation).

Note that the Principal Components decompose the total variance in the data:

$\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = Var(\sum_{i = 1}^n x_i)$

i.e. the sum of variations in all directions is the variation in the whole data

Example from class

We apply PCA to the noise data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images

The Matlab code is as follows:

 load noisy
who
size(X)
imagesc(reshape(X(:,1),20,28)')
colormap gray
imagesc(reshape(X(:,1),20,28)')
m_X=mean(X,2);
mm=repmat(m_X,1,300);
XX=X-mm;
[u s v] = svd(XX);
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components
xHat=xHat+mm;
figure
imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.
colormap gray


Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face

The second one is the denoised image

As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. This is because almost all of the noise in the noisy image is captured by the principal components (directions of variation) that capture the least amount of variation in the image, and these principal components were discarded when we used the few principal components that capture most of the image's variation to generate the image's lower-dimensional representation. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.

#### Application of PCA - Feature Extraction

PCA, depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT), the Hotelling transform or proper orthogonal decomposition (POD). One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).

Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data.

#### General PCA Algorithm

The PCA Algorithm is summarized as follows (taken from the Lecture Slides).

#### Algorithm

Recover basis: Calculate $XX^T =\Sigma_{i=1}^{n} x_i x_{i}^{T}$ and let $U=$ eigenvectors of $X X^T$ corresponding to the top $d$ eigenvalues.

Encoding training data: Let $Y=U^TX$ where $Y$ is a $d \times n$ matrix of encoding of the original data.

Reconstructing training data: $\hat{X}= UY=UU^TX$.

Encode set example: $y=U^T x$ where $y$ is a $d-$dimentional encoding of $x$.

Reconstruct test example: $\hat{x}= Uy=UU^Tx$.

Other Notes:

1. The mean of the data(X) must be 0. This means we may have to preprocess the data by subtracting off the mean(see detailsPCA in Wikipedia.)
2. Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: $X_{D\times n} \longrightarrow Y_{d\times n}$ using mapping $\, U^T X_{D \times n}$.
3. When we reconstruct the training set, we are only using the top d dimensions.This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: $\hat{X}_{D\times n}\longleftarrow Y_{d \times n}$ using mapping $\, U_dY_{d \times n}$, where $\,U_d$ contains the first (leftmost) $\,d$ columns of $\,U$.
4. We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.

#### Feature Extraction Uses and Discussion

PCA, as well as other feature extraction methods not within the scope of the course [18] are used as a first step to classification in enhancing generalization capability: one of the classification aspects that will be discussed later in the course is model complexity. As a classification model becomes more complex over its training set, classification error over test data tends to increase. By performing feature extraction prior to attempting classification, we restrict model inputs to only the most important variables, thus decreasing complexity and potentially improving test results.

Feature selection methods, that are used to select subsets of relevant features for building robust learning models, differ from extraction methods, where features are transformed. Feature selection has the added benefit of improving model interpretability.

### Independent Component Analysis

As we have already seen, the Principal Component Analysis (PCA) performed by the Karhunen-Lokve transform produces features $\ y ( i ) ; i = 0, 1, . . . , N - 1$, that are mutually uncorrelated. The obtained by the KL transform solution is optimal when dimensionality reduction is the goal and one wishes to minimize the approximation mean square error. However, for certain applications, the obtained solution falls short of the expectations. In contrast, the more recently developed Independent Component Analysis (ICA) theory, tries to achieve much more than simple decorrelation of the data. The ICA task is casted as follows: Given the set of input samples $\ x$, determine an $\ N \times N$ invertible matrix $\ W$ such that the entries $\ y(i), i = 0, 1, . . . , N - 1$, of the transformed vector

$\ y = W.x$

are mutually independent. The goal of statistical independence is a stronger condition than the uncorrelatedness required by the PCA. The two conditions are equivalent only for Gaussian random variables. Searching for independent rather than uncorrelated features gives us the means of exploiting a lot more of information, hidden in the higher order statistics of the data.

This topic has brought to you from Pattern Recognition by Sergios Theodoridis and Konstantinos Koutroumbas. (Chapter 6) For further details on the ICA and its varieties, refer to this book.

### References

1. Probabilistic Principal Component Analysis [19]

2. Nonlinear Component Analysis as a Kernel Eigenvalue Problem [20]

3. Kernel principal component analysis [21]

4. Principal Component Analysis [22] and [23]

1. I. T. Jolliffe "Principal component analysis" Available here.

2. James V. Stone "Independent component analysis: a tutorial introduction" Available here.

3. Aapo Hyvärinen, Juha Karhunen, Erkki Oja "Independent component analysis" Available here.

## Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem - October 5, 2010

### Sir Ronald A. Fisher

Fisher's Discriminant Analysis (FDA), also known as Fisher's Linear Discriminant Analysis (LDA) in some sources, is a classical feature extraction technique. It was originally described in 1936 by Sir Ronald Aylmer Fisher, an English statistician and eugenicist who has been described as one of the founders of modern statistical science. His original paper describing FDA can be found here; a Wikipedia article summarizing the algorithm can be found here. In this paper Fisher used for the first time the term DISCRIMINANT FUNCTION. The term DISCRIMINANT ANALYSIS was introduced later by Fisher himself in a subsequent paper which can be found here.

### Introduction

Linear discriminant analysis (LDA) and the related Fisher's linear discriminant are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterize or separate two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

LDA is also closely related to principal component analysis (PCA) and factor analysis in that both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.

LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis.

### Contrasting FDA with PCA

As in PCA, the goal of FDA is to project the data in a lower dimension. You might ask, why was FDA invented when PCA already existed? There is a simple explanation for this that can be found here. PCA is an unsupervised method for classification, so it does not take into account the labels in the data. Suppose we have two clusters that have very different or even opposite labels from each other but are nevertheless positioned in a way such that they are very much parallel to each other and also very near to each other. In this case, most of the total variation of the data is in the direction of these two clusters. If we use PCA in cases like this, then both clusters would be projected onto the direction of greatest variation of the data to become sort of like a single cluster after projection. PCA would therefore mix up these two clusters that, in fact, have very different labels. What we need to do instead, in this cases like this, is to project the data onto a direction that is orthogonal to the direction of greatest variation of the data. This direction is in the least variation of the data. On the 1-dimensional space resulting from such a projection, we would then be able to effectively classify the data, because these two clusters would be perfectly or nearly perfectly separated from each other taking into account of their labels. This is exactly the idea behind FDA.

The main difference between FDA and PCA is that, in FDA, in contrast to PCA, we are not interested in retaining as much of the variance of our original data as possible. Rather, in FDA, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction that is most representative of a particular characteristic e.g. glasses vs. no-glasses). Suppose we have 2-dimensional data, then FDA would attempt to project the data of each class onto a point in such a way that the resulting two points would be as far apart from each other as possible. Intuitively, this basic idea behind FDA is the optimal way for separating each pair of classes along a certain direction.

Please note dimention reduction in PCA is different from subspace cluster , see the details about the subspace cluser [24] {{

 Template:namespace detect


| type = style | image = | imageright = | style = | textstyle = | text = This article may require cleanup to meet Wikicoursenote's quality standards. The specific problem is: Just a thought: how relevant is "Dimensionality reduction techniques" to the concept of "subspace clustering"? As in subspace clustering, the goal is to find a set of features (relevant features, the concept is referred to as local feature relevance in the literature) in the high dimensional space, where potential subspaces accommodating different classes of data points can be defined. This means; the data points are dense when they are considered in a subset of dimensions (features).. Please improve this article if you can. (October 2010) | small = | smallimage = | smallimageright = | smalltext = }} {{

 Template:namespace detect


| type = style | image = | imageright = | style = | textstyle = | text = This article may require cleanup to meet Wikicoursenote's quality standards. The specific problem is: If I'm not mistaken, classification techniques like FDA use labeled training data whereas clustering techniques use unlabeled training data instead. Any other input regarding this would be much appreciated. Thanks. Please improve this article if you can. (October 2010) | small = | smallimage = | smallimageright = | smalltext = }} {{

 Template:namespace detect


| type = style | image = | imageright = | style = | textstyle = | text = This article may require cleanup to meet Wikicoursenote's quality standards. The specific problem is: An extension of clustering is subspace clustering in which different subspace are searched through to find the relavant and appropriate dimentions. High dimentional data sets are roughly equiedistant from each other, so feature selection methods are used to remove the irrelavant dimentions. These techniques do not keep the relative distance so PCA is not useful for these applications. It should be noted that subspace clustering localize their search unlike feature selection algorithms.for more information click here[25]. Please improve this article if you can. (October 2010) | small = | smallimage = | smallimageright = | smalltext = }}

The number of dimensions that we want to reduce the data to depends on the number of classes:
For a 2-classes problem, we want to reduce the data to one dimension (a line), $\displaystyle Z \in \mathbb{R}^{1}$
Generally, for a k-classes problem, we want to reduce the data to k-1 dimensions, $\displaystyle Z \in \mathbb{R}^{k-1}$

As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within-variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, and at the same time the data within each class are as close to each other as possible (collapsed to a single point in the most extreme case).

The following diagram summarizes this goal.

In fact, the two examples above may represent the same data projected on two different lines.

### Distance Metric Learning VS FDA

In many fundamental machine learning problems, the Euclidean distances between data points do not represent the desired topology that we are trying to capture. Kernel methods address this problem by mapping the points into new spaces where Euclidean distances may be more useful. An alternative approach is to construct a Mahalanobis distance (quadratic Gaussian metric) over the input space and use it in place of Euclidean distances. This approach can be equivalently interpreted as a linear transformation of the original inputs, followed by Euclidean distance in the projected space. This approach has attracted a lot of recent interest.

Some of the proposed algorithms are iterative and computationally expensive. In the paper,"Distance Metric Learning VS FDA " written by our instructor, they propose a closed-form solution to one algorithm that previously required expensive semidefinite optimization. They provide a new problem setup in which the algorithm performs better or as well as some standard methods, but without the computational complexity. Furthermore, they show a strong relationship between these methods and the Fisher Discriminant Analysis (FDA). They also extend the approach by kernelizing it, allowing for non-linear transformations of the metric.

Example

In the paper "Distance Metric Learning VS FDA ", classification error rate for three of the six UCI datasets, each learned metric is projected onto a lowdimensional subspace, shown along the x axis are shown as below.

,

### FDA Goals

An intuitive description of FDA can be given by visualizing two clouds of data, as shown above. Ideally, we would like to collapse all of the data points in each cloud onto one point on some projected line, then make those two points as far apart as possible. In doing so, we make it very easy to tell which class a data point belongs to. In practice, it is not possible to collapse all of the points in a cloud to one point, but we attempt to make all of the points in a cloud close to each other while simultaneously far from the points in the other cloud.

#### Example in R

PCA and FDA primary dimension for normal multivariate data, using R.
>> X = matrix(nrow=400,ncol=2)
>> X[1:200,] = mvrnorm(n=200,mu=c(1,1),Sigma=matrix(c(1,1.5,1.5,3),2))
>> X[201:400,] = mvrnorm(n=200,mu=c(5,3),Sigma=matrix(c(1,1.5,1.5,3),2))
>> Y = c(rep("red",200),rep("blue",200))

Create 2 multivariate normal random variables with $\, \mu_1 = \left( \begin{array}{c}1 \\ 1 \end{array} \right), \mu_2 = \left( \begin{array}{c}5 \\ 3 \end{array} \right). ~\textrm{Cov} = \left( \begin{array}{cc} 1 & 1.5 \\ 1.5 & 3 \end{array} \right)$. Create Y, an index indicating which class they belong to.
>> s <- svd(X,nu=1,nv=1)

Calculate the singular value decomposition of X. The most significant direction is in s$v[,1], and is displayed as a black line. >> s2 <- lda(X,grouping=Y)  The lda function, given the group for each item, uses Fischer's Linear Discriminant Analysis (FLDA) to find the most discriminant direction. This can be found in s2$scaling.

Now that we've calculated the PCA and FLDA decompositions, we create a plot to demonstrate the differences between the two algorithms. FLDA is clearly better suited to discriminating between two classes whereas PCA is primarily good for reducing the number of dimensions when data is high-dimensional.

>> plot(X,col=Y,main="PCA vs. FDA example")

Plot the set of points, according to colours given in Y.
>> slope = s$v[2]/s$v[1]
>> intercept = mean(X[,2])-slope*mean(X[,1])
>> abline(a=intercept,b=slope)

Plot the main PCA direction, drawn through the mean of the dataset. Only the direction is significant.
>> slope2 = s2$scaling[2]/s2$scaling[1]
>> intercept2 = mean(X[,2])-slope2*mean(X[,1])
>> abline(a=intercept2,b=slope2,col="red")

Plot the FLDA direction, again through the mean.
>> legend(-2,7,legend=c("PCA","FDA"),col=c("black","red"),lty=1)

Labeling the lines directly on the graph makes it easier to interpret.

FDA projects the data into lower dimensional space, where the distances between the projected means are maximum and the within-class variances are minimum. There are two categories of classification problems:

1. Two-class problem

2. Multi-class problem (addressed next lecture)

### Two-class problem

In the two-class problem, we have prior knowledge that the data points belong to two classes. Conceptually, points of each class form a cloud around the class mean, and each class has an distinct size. To divide points among the two classes, we must determine the class whose mean is closest to each point, and we must also account for the different size of each class given by the covariance of each class.

Assume $\underline{\mu_{1}}=\frac{1}{n_{1}}\displaystyle\sum_{i:y_{i}=1}\underline{x_{i}}$ and $\displaystyle\Sigma_{1}$, represent the mean and covariance of the 1st class, and $\underline{\mu_{2}}=\frac{1}{n_{2}}\displaystyle\sum_{i:y_{i}=2}\underline{x_{i}}$ and $\displaystyle\Sigma_{2}$ represent the mean and covariance of the 2nd class. We have to find a transformation which satisfies the following goals:

1.To make the means of these two classes as far apart as possible

In other words, the goal is to maximize the distance after projection between class 1 and class 2. This can be done by maximizing the distance between the means of the classes after projection. When projecting the data points to a one-dimensional space, all points will be projected to a single line; the line we seek is the one with the direction that achieves maximum separation of classes upon projection. If the original points are $\underline{x_{i}} \in \mathbb{R}^{d}$and the projected points are $\underline{w}^T \underline{x_{i}}$ then the mean of the projected points will be $\underline{w}^T \underline{\mu_{1}}$ and $\underline{w}^T \underline{\mu_{2}}$ for class 1 and class 2 respectively. The goal now becomes to maximize the Euclidean distance between projected means, $(\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})^T (\underline{w}^T\underline{\mu_{1}}-\underline{w}^T\underline{\mu_{2}})$. The steps of this maximization are given below.

2.We want to collapse all data points of each class to a single point, i.e., minimize the covariance within each class

Notice that the variance of the projected classes 1 and 2 are given by $\underline{w}^T\Sigma_{1}\underline{w}$ and $\underline{w}^T\Sigma_{2}\underline{w}$. The second goal is to minimize the sum of these two covariances (the summation of the two covariances is a valid covariance, satisfying the symmetry and positive semi-definite criteria).

{{

 Template:namespace detect


| type = style | image = | imageright = | style = | textstyle = | text = This article may require cleanup to meet Wikicoursenote's quality standards. The specific problem is: In 2. above, I wonder if the computation would be much more complex if we instead find a weighted sum of the covariances of the two classes where the weights are the sizes of the two classes?. Please improve this article if you can. (October 2010) | small = | smallimage = | smallimageright = | smalltext = }}

{{

 Template:namespace detect


| type = style | image = | imageright = | style = | textstyle = | text = This article may require cleanup to meet Wikicoursenote's quality standards. The specific problem is: If using the weighted sum of two covariances, you will need to use the shared mean of the two classes, and the weighted sum will be the shared covariance. Doing this will result in collapsing the two classes into one point, which contradicts the purpose of using FDA. Please improve this article if you can. (December 2010) | small = | smallimage = | smallimageright = | smalltext = }}

As is demonstrated below, both of these goals can be accomplished simultaneously.

Original points are $\underline{x_{i}} \in \mathbb{R}^{d}$
$\ \{ \underline x_1 \underline x_2 \cdot \cdot \cdot \underline x_n \}$

Projected points are $\underline{z_{i}} \in \mathbb{R}^{1}$ with $\underline{z_{i}} = \underline{w}^T \cdot\underline{x_{i}}$ where $\ z_i$ is a scalar

#### 1. Minimizing within-class variance

$\displaystyle \min_w (\underline{w}^T\sum_1\underline{w})$

$\displaystyle \min_w (\underline{w}^T\sum_2\underline{w})$

and this problem reduces to $\displaystyle \min_w (\underline{w}^T(\sum_1 + \sum_2)\underline{w})$
(where $\,\sum_1$ and $\,\sum_2$ are the covariance matrices of the 1st and 2nd classes of data, respectively)

Let $\displaystyle \ s_w=\sum_1 + \sum_2$ be the within-classes covariance. Then, this problem can be rewritten as $\displaystyle \min_w (\underline{w}^Ts_w\underline{w})$.

#### 2. Maximize the distance between the means of the projected data

$\displaystyle \max_w ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2,$

[math]\begin{align} ||\underline{w}^T \mu_1 - \underline{w}^T \mu_2||^2 &= (\underline{w}^T \mu_1 - \underline{w}^T \mu_2)^T(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\ &= (\mu_1^T\underline{w} - \mu_2^T\underline{w})(\underline{w}^T \mu_1 - \underline{w}^T \mu_2)\\ &= (\mu_1 - \mu_2)^T \underline{w} \underline{w}^T (\mu_1 - \mu_2) \\ &= ((\mu_1 - \mu_2)^T \underline{w})^{T} (\underline{w}^T (\mu_1 - \mu_2))^{T} \\ &= \un