# Difference between revisions of "stat841f10"

## Proposal Fall 2010

{{

 Template:namespace detect


| type = style | image = | imageright = | style = | textstyle = | text = This article may require cleanup to meet Wikicoursenote's quality standards. The specific problem is: Provide a summary for each topic here.. Please improve this article if you can. (October 8 2010) | small = | smallimage = | smallimageright = | smalltext = }}

## Reference Textbook

The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, February 2009 Trevor Hastie, Robert Tibshirani, Jerome Friedman (3rd Edition is available)

## Classification - September 21, 2010

### Classification

Statistical classification, or simply known as classification, is an area of supervised learning that addresses the problem of how to systematically assign unlabeled (classes unknown) novel data to their labels (classes or groups or types) by using knowledge of their features (characteristics or attributes) that are obtained from observation and/or measurement. A classifier is a specific technique or method for performing classification. To classify new data, a classifier first uses labeled (classes are known) training data to train a model, and then it uses a function known as its classification rule to assign a label to each new data input after feeding the input's known feature values into the model to determine how much the input belongs to each class.

Classification has been an important task for people and society since the beginnings of history. According to this link, the earliest application of classification in human society was probably done by prehistory peoples for recognizing which wild animals were beneficial to people and which ones were harmful, and the earliest systematic use of classification was done by the famous Greek philosopher Aristotle (384 BC - 322 BC) when he, for example, grouped all living things into the two groups of plants and animals. Classification is generally regarded as one of four major areas of statistics, with the other three major areas being regression, clustering, and dimensionality reduction (feature extraction or manifold learning). Please be noted that some people consider classification to be a broad area that consists of both supervised and unsupervised methods of classifying data. In this view, as can be seen in this link, clustering is simply a special case of classification and it may be called unsupervised classification.

In classical statistics, classification techniques were developed to learn useful information using small data sets where there is usually not enough of data. When machine learning was developed after the application of computers to statistics, classification techniques were developed to work with very large data sets where there is usually too many data. A major challenge facing data mining using machine learning is how to efficiently find useful patterns in very large amounts of data. An interesting quote that describes this problem quite well is the following one made by the retired Yale University Librarian Rutherford D. Rogers, a link to a source of which can be found here.

       "We are drowning in information and starving for knowledge."
- Rutherford D. Rogers


In the Information Age, machine learning when it is combined with efficient classification techniques can be very useful for data mining using very large data sets. This is most useful when the structure of the data is not well understood but the data nevertheless exhibit strong statistical regularity. Areas in which machine learning and classification have been successfully used together include search and recommendation (e.g. Google, Amazon), automatic speech recognition and speaker verification, medical diagnosis, analysis of gene expression, drug discovery etc.

The formal mathematical definition of classification is as follows:

Definition: Classification is the prediction of a discrete random variable $\mathcal{Y}$ from another random variable $\mathcal{X}$, where $\mathcal{Y}$ represents the label assigned to a new data input and $\mathcal{X}$ represents the known feature values of the input.

A set of training data used by a classifier to train its model consists of $\,n$ independently and identically distributed (i.i.d) ordered pairs $\,\{(X_{1},Y_{1}), (X_{2},Y_{2}), \dots , (X_{n},Y_{n})\}$, where the values of the $\,ith$ training input's feature values $\,X_{i} = (\,X_{i1}, \dots , X_{id}) \in \mathcal{X} \subset \mathbb{R}^{d}$ is a d-dimensional vector and the label of the $\, ith$ training input is $\,Y_{i} \in \mathcal{Y}$ that can take a finite number of values. The classification rule used by a classifier has the form $\,h: \mathcal{X} \mapsto \mathcal{Y}$. After the model is trained, each new data input whose feature values is $\,x$ is given the label $\,\hat{Y}=h(x)$.

As an example, if we would like to classify some vegetables and fruits, then our training data might look something like the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides.

After we have selected a classifier and then built our model using our training data, we could use the classifier's classification rule $\ h$ to classify any newly-given vegetable or fruit such as the one shown in the following picture from Professor Ali Ghodsi's Fall 2010 STAT 841 slides after first obtaining its feature values.

As another example, suppose we wish to classify newly-given fruits into apples and oranges by considering three features of a fruit that comprise its color, its diameter, and its weight. After selecting a classifier and constructing a model using training data $\,\{(X_{color, 1}, X_{diameter, 1}, X_{weight, 1}, Y_{1}), \dots , (X_{color, n}, X_{diameter, n}, X_{weight, n}, Y_{n})\}$, we could then use the classifier's classification rule $\,h$ to assign any newly-given fruit having known feature values $\,x = (\,x_{color}, x_{diameter} , x_{weight})$ the label $\, \hat{Y}=h(x) \in \mathcal{Y}= \{apple,orange\}$.

### Examples of Classification

• Email spam filtering (spam vs not spam).

• Detecting credit card fraud (fraudulent or legitimate).

• Face detection in images (face or background).

• Web page classification (sports vs politics vs entertainment etc).

• Steering an autonomous car across the US (turn left, right, or go straight).

• Medical diagnosis (classification of disease based on observed symptoms).

### Independent and Identically Distributed (iid) Data Assumption

Suppose that we have training data X which contains N data points. The Independent and Identically Distributed (IID) assumption declares that the datapoints are drawn independently from identical distributions. This assumption implies that ordering of the data points does not matter, and the assumption is used in many classification problems. For an example of data that is not IID, consider daily temperature: today's temperature is not independent of the yesterday's temperature -- rather, there is a strong correlation between the temperatures of the two days.

### Error rate

The empirical error rate (or training error rate) of a classifier having classification rule $\,h$ is defined as the frequency at which $\,h$ does not correctly classify the data inputs in the training set, i.e., it is defined as $\,\hat{L}_{n} = \frac{1}{n} \sum_{i=1}^{n} I(h(X_{i}) \neq Y_{i})$, where $\,I$ is an indicator variable and $\,I = \left\{\begin{matrix} 1 &\text{if } h(X_i) \neq Y_i \\ 0 &\text{if } h(X_i) = Y_i \end{matrix}\right.$. Here, $\,X_{i} \in \mathcal{X}$ and $\,Y_{i} \in \mathcal{Y}$ are the known feature values and the true class of the $\,ith$ training input, respectively.

The true error rate $\,L(h)$ of a classifier having classification rule $\,h$ is defined as the probability that $\,h$ does not correctly classify any new data input, i.e., it is defined as $\,L(h)=P(h(X) \neq Y)$. Here, $\,X \in \mathcal{X}$ and $\,Y \in \mathcal{Y}$ are the known feature values and the true class of that input, respectively.

In practice, the empirical error rate is obtained to estimate the true error rate, whose value is impossible to be known because the parameter values of the underlying process cannot be known but can only be estimated using available data. The empirical error rate, in practice, estimates the true error rate quite well in that, as mentioned here, it is an unbiased estimator of the true error rate.

An Error Rate Comparison of Classification Methods [1]

### Decision Theory

we can identify three distinct approaches to solving decision problems, all of which have been used in practical applications. These are given, in decreasing order of complexity, by:

a. First solve the inference problem of determining the class-conditional densities $\ p(x|C_k)$ for each class $\ C_k$ individually. Also separately infer the prior class probabilities $\ p(C_k)$. Then use Bayes’ theorem in the form

\begin{align}p(C_k|x)=\frac{p(x|C_k)p(C_k)}{p(x)} \end{align}

to find the posterior class probabilities $\ p(C_k|x)$. As usual, the denominator in Bayes’ theorem can be found in terms of the quantities appearing in the numerator, because

\begin{align}p(x)=\sum_{k} p(x|C_k)p(C_k) \end{align}

Equivalently, we can model the joint distribution $\ p(x, C_k)$ directly and then normalize to obtain the posterior probabilities. Having found the posterior probabilities, we use decision theory to determine class membership for each new input $\ x$. Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as "generative models", because by sampling from them it is possible to generate synthetic data points in the input space.

b. First solve the inference problem of determining the posterior class probabilities $\ p(C_k|x)$, and then subsequently use decision theory to assign each new $\ x$ to one of the classes. Approaches that model the posterior probabilities directly are called "discriminative models".

c. Find a function $\ f(x)$, called a discriminant function, which maps each input $\ x$ directly onto a class label. For instance, in the case of two-class problems, $\ f(.)$ might be binary valued and such that $\ f = 0$ represents class $\ C_1$ and $\ f = 1$ represents class $\ C_2$. In this case, probabilities play no role.

This topic has brought to you from Pattern Recognition and Machine Learning by Christopher M. Bishop (Chapter 1)

### Bayes Classifier

A Bayes classifier is a simple probabilistic classifier based on applying Bayes' Theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".

In simple terms, a Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.

Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods.

In spite of their design and apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable efficacy of Bayes classifiers [1]. Still, a comprehensive comparison with other classification methods in 2006 showed that Bayes classification is outperformed by more current approaches, such as boosted trees or random forests[2].

An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined and not the entire covariance matrix.

After training its model using training data, the Bayes classifier classifies any new data input in two steps. First, it uses the input's known feature values and the Bayes formula to calculate the input's posterior probability of belonging to each class. Then, it uses its classification rule to place the input into the most-probable class, which is the one associated with the input's largest posterior probability.

In mathematical terms, for a new data input having feature values $\,(X = x)\in \mathcal{X}$, the Bayes classifier labels the input as $(Y = y) \in \mathcal{Y}$, such that the input's posterior probability $\,P(Y = y|X = x)$ is maximum over all of the members of $\mathcal{Y}$.

Suppose there are $\,k$ classes and we are given a new data input having feature values $\,x$. The following derivation shows how the Bayes classifier finds the input's posterior probability $\,P(Y = y|X = x)$ of belonging to each class $y \in \mathcal{Y}$.

\begin{align} P(Y=y|X=x) &= \frac{P(X=x|Y=y)P(Y=y)}{P(X=x)} \\ &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)} \end{align}

Here, $\,P(Y=y|X=x)$ is known as the posterior probability as mentioned above, $\,P(Y=y)$ is known as the prior probability, $\,P(X=x|Y=y)$ is known as the likelihood, and $\,P(X=x)$ is known as the evidence.

In the special case where there are two classes, i.e., $\, \mathcal{Y}=\{0, 1\}$, the Bayes classifier makes use of the function $\,r(x)=P\{Y=1|X=x\}$ which is the posterior probability of a new data input having feature values $\,x$ belonging to the class $\,Y = 1$. Following the above derivation for the posterior probabilities of a new data input, the Bayes classifier calculates $\,r(x)$ as follows:

\begin{align} r(x)&=P(Y=1|X=x) \\ &=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x)}\\ &=\frac{P(X=x|Y=1)P(Y=1)}{P(X=x|Y=1)P(Y=1)+P(X=x|Y=0)P(Y=0)} \end{align}

The Bayes classifier's classification rule $\,h^*: \mathcal{X} \mapsto \mathcal{Y}$, then, is

$\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } \hat r(x)\gt \frac{1}{2} \\ 0 &\mathrm{otherwise} \end{matrix}\right.$.

Here, $\,x$ is the feature values of a new data input and $\hat r(x)$ is the estimated value of the function $\,r(x)$ given by the Bayes classifier's model after feeding $\,x$ into the model. Still in this special case of two classes, the Bayes classifier's decision boundary is defined as the set $\,D(h)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}$. The decision boundary $\,D(h)$ essentially combines together the trained model and the decision function $\,h^*$, and it is used by the Bayes classifier to assign any new data input to a label of either $\,Y = 0$ or $\,Y = 1$ depending on which side of the decision boundary the input lies in. From this decision boundary, it is easy to see that, in the case where there are two classes, the Bayes classifier's classification rule can be re-expressed as

$\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } P(Y=1|X=x)\gt P(Y=0|X=x) \\ 0 &\mathrm{otherwise} \end{matrix}\right.$.

Bayes Classification Rule Optimality Theorem The Bayes classifier is the optimal classifier in that it results in the least possible true probability of misclassification for any given new data input, i.e., for any generic classifier having classification rule $\,h$, it is always true that $\,L(h^*(x)) \le L(h(x))$. Here, $\,L$ represents the true error rate, $\,h^*$ is the Bayes classifier's classification rule, and $\,x$ is any given data input's feature values.

Although the Bayes classifier is optimal in the theoretical sense, other classifiers may nevertheless outperform it in practice. The reason for this is that various components which make up the Bayes classifier's model, such as the likelihood and prior probabilities, must either be estimated using training data or be guessed with a certain degree of belief. As a result, the estimated values of the components in the trained model may deviate quite a bit from their true population values, and this can ultimately cause the calculated posterior probabilities of inputs to deviate quite a bit from their true values. Estimation of all these probability functions, as likelihood, prior probability, and evidence function is a very expensive task, computationally, which also makes some other classifiers more favorable than Bayes classifier.

A detailed proof of this theorem is available here.

Defining the classification rule:

In the special case of two classes, the Bayes classifier can use three main approaches to define its classification rule $\,h^*$:

1) Empirical Risk Minimization: Choose a set of classifiers $\mathcal{H}$ and find $\,h^*\in \mathcal{H}$ that minimizes some estimate of the true error rate $\,L(h^*)$.
2) Regression: Find an estimate $\hat r$ of the function $x$ and define
$\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } \hat r(x)\gt \frac{1}{2} \\ 0 &\mathrm{otherwise} \end{matrix}\right.$.
3) Density Estimation: Estimate $\,P(X=x|Y=0)$ from the $\,X_{i}$'s for which $\,Y_{i} = 0$, estimate $\,P(X=x|Y=1)$ from the $\,X_{i}$'s for which $\,Y_{i} = 1$, and estimate $\,P(Y = 1)$ as $\,\frac{1}{n} \sum_{i=1}^{n} Y_{i}$. Then, calculate $\,\hat r(x) = \hat P(Y=1|X=x)$ and define
$\, h^*(x)= \left\{\begin{matrix} 1 &\text{if } \hat r(x)\gt \frac{1}{2} \\ 0 &\mathrm{otherwise} \end{matrix}\right.$.

Typically, the Bayes classifier uses approach 3 to define its classification rule. These three approaches can easily be generalized to the case where the number of classes exceeds two.

Multi-class classification:

Suppose there are $\,k$ classes, where $\,k \ge 2$.

In the above discussion, we introduced the Bayes formula for this general case:

\begin{align} P(Y=y|X=x) &=\frac{P(X=x|Y=y)P(Y=y)}{\Sigma_{\forall i \in \mathcal{Y}}P(X=x|Y=i)P(Y=i)} \end{align}

which can re-worded as:

\begin{align} P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i} \end{align}

Here, $\,f_y(x) = P(X=x|Y=y)$ is known as the likelihood function and $\,\pi_y = P(Y=y)$ is known as the prior probability.

In the general case where there are at least two classes, the Bayes classifier uses the following theorem to assign any new data input having feature values $\,x$ into one of the $\,k$ classes.

Theorem

Suppose that $\mathcal{Y}= \{1, \dots, k\}$, where $\,k \ge 2$. Then, the optimal classification rule is $\,h^*(x) = arg max_{i} P(Y=i|X=x)$, where $\,i \in \{1, \dots, k\}$.

Example: We are going to predict if a particular student will pass STAT 441/841. There are two classes represented by $\, \mathcal{Y}\in \{ 0,1 \}$, where 1 refers to pass and 0 refers to fail. Suppose that the prior probabilities are estimated or guessed to be $\,\hat P(Y = 1) = \hat P(Y = 0) = 0.5$. We have data on past student performances, which we shall use to train the model. For each student, we know the following:

Whether or not the student’s GPA was greater than 3.0 (G).
Whether or not the student had a strong math background (M).
Whether or not the student was a hard worker (H).
Whether or not the student passed or failed the course. Note: these are the known y values in the training data.

These known data are summarized in the following tables:

For each student, his/her feature values is $\, x = \{G, M, H\}$ and his or her class is $\, y \in \{0, 1\}$.

Suppose there is a new student having feature values $\, x = \{0, 1, 0\}$, and we would like to predict whether he/she would pass the course. $\,\hat r(x)$ is found as follows:

$\, \hat r(x) = P(Y=1|X =(0,1,0))=\frac{P(X=(0,1,0)|Y=1)P(Y=1)}{P(X=(0,1,0)|Y=0)P(Y=0)+P(X=(0,1,0)|Y=1)P(Y=1)}=\frac{0.05*0.5}{0.05*0.5+0.2*0.5}=\frac{0.025}{0.125}=\frac{1}{5}\lt \frac{1}{2}.$

The Bayes classifier assigns the new student into the class $\, h^*(x)=0$. Therefore, we predict that the new student would fail the course.

Naive Bayes Classifier:

The naive Bayes classifier is a special (simpler) case of the Bayes classifier. It uses an extra assumption: that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. This assumption allows for an easier likelihood function $\,f_y(x)$ in the equation:

\begin{align} P(Y=y|X=x) &=\frac{f_y(x)\pi_y}{\Sigma_{\forall i \in \mathcal{Y}} f_i(x)\pi_i} \end{align}

The simper form of the likelihood function seen in the naive Bayes is:

\begin{align} f_y(x) = P(X=x|Y=y) = {\prod_{i=1}^{n} P(X_{i}=x_{i}|Y=y)} \end{align}

The Bayes classifier taught in class was not the naive Bayes classifier.

### Bayesian vs. Frequentist

The Bayesian view of probability and the frequentist view of probability are the two major schools of thought in the field of statistics regarding how to interpret the probability of an event.

The Bayesian view of probability states that, for any event E, event E has a prior probability that represents how believable event E would occur prior to knowing anything about any other event whose occurrence could have an impact on event E's occurrence. Theoretically, this prior probability is a belief that represents the baseline probability for event E's occurrence. In practice, however, event E's prior probability is unknown, and therefore it must either be guessed at or be estimated using a sample of available data. After obtaining a guessed or estimated value of event E's prior probability, the Bayesian view holds that the probability, that is, the believability of event E's occurrence, can always be made more accurate should any new information regarding events that are relevant to event E become available. The Bayesian view also holds that the accuracy for the estimate of the probability of event E's occurrence is higher as long as there are more useful information available regarding events that are relevant to event E. The Bayesian view therefore holds that there is no intrinsic probability of occurrence associated with any event. If one adherers to the Bayesian view, one can then, for instance, predict tomorrow's weather as having a probability of, say, $\,50%$ for rain. The Bayes classifier as described above is a good example of a classifier developed from the Bayesian view of probability. The earliest works that lay the framework for the Bayesian view of probability is accredited to Thomas Bayes (1702–1761).

In contrast to the Bayesian view of probability, the frequentist view of probability holds that there is an intrinsic probability of occurrence associated with every event to which one can carry out many, if not an infinite number, of well-defined independent random trials. In each trial for an event, the event either occurs or it does not occur. Suppose $n_x$ denotes the number of times that an event occurs during its trials and $n_t$ denotes the total number of trials carried out for the event. The frequentist view of probability holds that, in the long run, where the number of trials for an event approaches infinity, one could theoretically approach the intrinsic value of the event's probability of occurrence to any arbitrary degree of accuracy, i.e., :$P(x) = \lim_{n_t\rightarrow \infty}\frac{n_x}{n_t}$. In practice, however, one can only carry out a finite number of trials for an event and, as a result, the probability of the event's occurrence can only be approximated as $P(x) \approx \frac{n_x}{n_t}$. If one adherers to the frequentist view, one cannot, for instance, predict the probability that there would be rain tomorrow. This is because one cannot possibly carry out trials for any event that is set in the future. The founder of the frequentist school of thought is arguably the famous Greek philosopher Aristotle. In his work Rhetoric, Aristotle gave the famous line "the probable is that which for the most part happens".

More information regarding the Bayesian and the frequentist schools of thought are available here. Furthermore, an interesting and informative youtube video that explains the Bayesian and frequentist views of probability is available here.

There is useful information about Machine Learning, Neural and Statistical Classification in this link [2] Machine Learning, Neural and Statistical Classification; there is some description of Classification in chapter 2 Classical Statistical Methods in chapter 3 and Modern Statistical Techniques in chapter 4.

### Extension: Statistical Classification Framework

In statistical classification, each object is represented by 'd' (a set of features) a measurement vector, and the goal of classifier becomes finding compact and disjoint regions for classes in a d-dimensional feature space. Such decision regions are defined by decision rules that are known or can be trained. The simplest configuration of a classification consists of a decision rule and multiple membership functions; each membership function represents a class. The following figures illustrate this general framework.

Simple Conceptual Classifier.

The classification process can be described as follows:

A measurement vector is input to each membership function. Membership functions feed the membership scores to the decision rule. A decision rule compares the membership scores and returns a class label.

## Linear and Quadratic Discriminant Analysis

### Introduction

Linear discriminant analysis (LDA) and the related Fisher's linear discriminant are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterize or separate two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

LDA is also closely related to principal component analysis (PCA) and factor analysis in that both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made.

LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis.

### Content

First, we shall limit ourselves to the case where there are two classes, i.e. $\, \mathcal{Y}=\{0, 1\}$. In the above discussion, we introduced the Bayes classifier's decision boundary $\,D(h^*)=\{x: P(Y=1|X=x)=P(Y=0|X=x)\}$, which represents a hyperplane that determines the class of any new data input depending on which side of the hyperplane the input lies in. Now, we shall look at how to derive the Bayes classifier's decision boundary under certain assumptions of the data. Linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) are two of the most well-known ways for deriving the Bayes classifier's decision boundary, and we shall look at each of them in turn.

Let us denote the likelihood $\ P(X=x|Y=y)$ as $\ f_y(x)$ and the prior probability $\ P(Y=y)$ as $\ \pi_y$.

First, we shall examine LDA. As explained above, the Bayes classifier is optimal. However, in practice, the prior and conditional densities are not known. Under LDA, one gets around this problem by making the assumptions that both of the two classes have multivariate normal (Gaussian) distributions and the two classes have the same covariance matrix $\, \Sigma$. Under the assumptions of LDA, we have: $\ P(X=x|Y=y) = f_y(x) = \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) \right)$. Now, to derive the Bayes classifier's decision boundary using LDA, we equate $\, P(Y=1|X=x)$ to $\, P(Y=0|X=x)$ and proceed from there. The derivation of $\,D(h^*)$ is as follows:

$\,Pr(Y=1|X=x)=Pr(Y=0|X=x)$
$\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}$ (using Bayes' Theorem)
$\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)$ (canceling the denominators)
$\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0$
$\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0$
$\,\Rightarrow \exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \right)\pi_1=\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) \right)\pi_0$
$\,\Rightarrow -\frac{1}{2} (x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) + \log(\pi_1)=-\frac{1}{2} (x - \mu_0)^\top \Sigma^{-1} (x - \mu_0) +\log(\pi_0)$ (taking the log of both sides).
$\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( x^\top\Sigma^{-1}x + \mu_1^\top\Sigma^{-1}\mu_1 - 2x^\top\Sigma^{-1}\mu_1 - x^\top\Sigma^{-1}x - \mu_0^\top\Sigma^{-1}\mu_0 + 2x^\top\Sigma^{-1}\mu_0 \right)=0$ (expanding out)
$\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\left( \mu_1^\top\Sigma^{-1} \mu_1-\mu_0^\top\Sigma^{-1}\mu_0 - 2x^\top\Sigma^{-1}(\mu_1-\mu_0) \right)=0$ (canceling out alike terms and factoring).

It is easy to see that, under LDA, the Bayes's classifier's decision boundary $\,D(h^*)$ has the form $\,ax+b=0$ and it is linear in $\,x$. This is where the word linear in linear discriminant analysis comes from.

LDA under the two-classes case can easily be generalized to the general case where there are $\,k \ge 2$ classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes $\,m$ and $\,n$, then all we need to do is follow a derivation very similar to the one shown above, except with the classes $\,1$ and $\,0$ being replaced by the classes $\,m$ and $\,n$. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary $\,D(h^*)$ between classes $\,m$ and $\,n$ to be $\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\left( \mu_m^\top\Sigma^{-1} \mu_m-\mu_n^\top\Sigma^{-1}\mu_n - 2x^\top\Sigma^{-1}(\mu_m-\mu_n) \right)=0$ . In addition, for any two classes $\,m$ and $\,n$ for whom we would like to find the Bayes classifier's decision boundary using LDA, if $\,m$ and $\,n$ both have the same number of data, then, in this special case, the resulting decision boundary would lie exactly halfway between the centers (means) of $\,m$ and $\,n$.

The Bayes classifier's decision boundary for any two classes as derived using LDA looks something like the one that can be found in this link:

Although the assumption under LDA may not hold true for most real-world data, it nevertheless usually performs quite well in practice, where it often provides near-optimal classifications. For instance, the Z-Score credit risk model that was designed by Edward Altman in 1968 and revisited in 2000, is essentially a weighted LDA. This model has demonstrated a 85-90% success rate in predicting bankruptcy, and for this reason it is still in use today.

According to this link, some of the limitations of LDA include:

• LDA implicitly assumes that the data in each class has a Gaussian distribution.
• LDA implicitly assumes that the mean rather than the variance is the discriminating factor.
• LDA may over-fit the training data.

The following link provides a comparison of discriminant analysis and artificial neural networks [3]

#### Different Approaches to LDA

Data sets can be transformed and test vectors can be classified in the transformed space by two different approaches.

• Class-dependent transformation: This type of approach involves maximizing the ratio of between

class variance to within class variance. The main objective is to maximize this ratio so that adequate class separability is obtained. The class-specific type approach involves using two optimizing criteria for transforming the data sets independently.

• Class-independent transformation: This approach involves maximizing the ratio of overall variance

to within class variance. This approach uses only one optimizing criterion to transform the data sets and hence all data points irrespective of their class identity are transformed using this transform. In this type of LDA, each class is considered as a separate class against all other classes.

The following are some applications that use LDA and QDA:

1- Linear discriminant analysis for improved large vocabulary continuous speech recognition here

2- 2D-LDA: A statistical linear discriminant analysis for image matrix here

3- Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition here

4- The sparse discriminant vectors are useful for supervised dimension reduction for high dimensional data. Naive application of classical Fisher’s LDA to high dimensional, low sample size settings suffers from the data piling problem. In [4] they have use sparse LDA method which selects important variables for discriminant analysis and thereby yield improved classification. Introducing sparsity in the discriminant vectors is very effective in eliminating data piling and the associated overfitting problem.

## Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010

### LDA x QDA

Linear discriminant analysis[5] is a statistical method used to find the linear combination of features which best separate two or more classes of objects or events. It is widely applied in classifying diseases, positioning, product management, and marketing research. LDA assumes that the different classes have the same covariance matrix $\, \Sigma$.

Quadratic Discriminant Analysis[6], on the other hand, aims to find the quadratic combination of features. It is more general than linear discriminant analysis. Unlike LDA, QDA does not make the assumption that the different classes have the same covariance matrix $\, \Sigma$. Instead, QDA makes the assumption that each class $\, k$ has its own covariance matrix $\, \Sigma_k$.

The derivation of the Bayes classifier's decision boundary $\,D(h^*)$ under QDA is similar to that under LDA. Again, let us first consider the two-classes case where $\, \mathcal{Y}=\{0, 1\}$. This derivation is given as follows:

$\,Pr(Y=1|X=x)=Pr(Y=0|X=x)$
$\,\Rightarrow \frac{Pr(X=x|Y=1)Pr(Y=1)}{Pr(X=x)}=\frac{Pr(X=x|Y=0)Pr(Y=0)}{Pr(X=x)}$ (using Bayes' Theorem)
$\,\Rightarrow Pr(X=x|Y=1)Pr(Y=1)=Pr(X=x|Y=0)Pr(Y=0)$ (canceling the denominators)
$\,\Rightarrow f_1(x)\pi_1=f_0(x)\pi_0$
$\,\Rightarrow \frac{1}{ (2\pi)^{d/2}|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{ (2\pi)^{d/2}|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0$
$\,\Rightarrow \frac{1}{|\Sigma_1|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1) \right)\pi_1=\frac{1}{|\Sigma_0|^{1/2} }\exp\left( -\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0) \right)\pi_0$ (by cancellation)
$\,\Rightarrow -\frac{1}{2}\log(|\Sigma_1|)-\frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)+\log(\pi_1)=-\frac{1}{2}\log(|\Sigma_0|)-\frac{1}{2} (x - \mu_0)^\top \Sigma_0^{-1} (x - \mu_0)+\log(\pi_0)$ (by taking the log of both sides)
$\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top\Sigma_1^{-1}x + \mu_1^\top\Sigma_1^{-1}\mu_1 - 2x^\top\Sigma_1^{-1}\mu_1 - x^\top\Sigma_0^{-1}x - \mu_0^\top\Sigma_0^{-1}\mu_0 + 2x^\top\Sigma_0^{-1}\mu_0 \right)=0$ (by expanding out)
$\,\Rightarrow \log(\frac{\pi_1}{\pi_0})-\frac{1}{2}\log(\frac{|\Sigma_1|}{|\Sigma_0|})-\frac{1}{2}\left( x^\top(\Sigma_1^{-1}-\Sigma_0^{-1})x + \mu_1^\top\Sigma_1^{-1}\mu_1 - \mu_0^\top\Sigma_0^{-1}\mu_0 - 2x^\top(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0) \right)=0$

It is easy to see that, under QDA, the decision boundary $\,D(h^*)$ has the form $\,ax^2+bx+c=0$ and it is quadratic in $\,x$. This is where the word quadratic in quadratic discriminant analysis comes from.

As is the case with LDA, QDA under the two-classes case can easily be generalized to the general case where there are $\,k \ge 2$ classes. In the general case, suppose we wish to find the Bayes classifier's decision boundary between the two classes $\,m$ and $\,n$, then all we need to do is follow a derivation very similar to the one shown above, except with the classes $\,1$ and $\,0$ being replaced by the classes $\,m$ and $\,n$. Following through with a similar derivation as the one shown above, one obtains the Bayes classifier's decision boundary $\,D(h^*)$ between classes $\,m$ and $\,n$ to be $\,\log(\frac{\pi_m}{\pi_n})-\frac{1}{2}\log(\frac{|\Sigma_m|}{|\Sigma_n|})-\frac{1}{2}\left( x^\top(\Sigma_m^{-1}-\Sigma_n^{-1})x + \mu_m^\top\Sigma_m^{-1}\mu_m - \mu_n^\top\Sigma_n^{-1}\mu_n - 2x^\top(\Sigma_m^{-1}\mu_m-\Sigma_n^{-1}\mu_n) \right)=0$.

### Summarizing LDA and QDA

We can summarize what we have learned so far into the following theorem.

Theorem:

Suppose that $\,Y \in \{1,\dots,K\}$, if $\,f_k(x) = Pr(X=x|Y=k)$ is Gaussian, the Bayes Classifier rule is

$\,h^*(x) = \arg\max_{k} \delta_k(x)$

where,

• In the case of LDA, which assumes that a common covariance matrix is shared by all classes, $\,\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + log (\pi_k)$, and the Bayes classifier's decision boundary $\,D(h^*)$ is linear in $\,x$.
• In the case of QDA, which assumes that each class has its own covariance matrix, $\,\delta_k(x) = - \frac{1}{2}log(|\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k) + log (\pi_k)$, and the Bayes classifier's decision boundary $\,D(h^*)$ is quadratic in $\,x$.

Note $\,\arg\max_{k} \delta_k(x)$returns the set of k for which $\,\delta_k(x)$ attains its largest value.

### In practice

We need to estimate the prior, so in order to do this, we use the Maximum Likelihood estimates from the sample for $\,\pi,\mu_k,\Sigma_k$ in place of their true values, i.e.

File:estimation.png
Estimation of the probability of belonging to either class k or l

$\,\hat{\pi_k} = \hat{Pr}(y=k) = \frac{n_k}{n}$