# f10 Stat841 digest

## Classification - September 21, 2010

• Classification is an area of supervised learning that systematically assigns new, unlabeled data to a label based on the characteristics and attributes obtained from observation.
• Classification is the prediction of a discrete random variable $\mathcal{Y}$ from another random variable $\mathcal{X}$, where $\mathcal{Y}$ represents the label assigned to a new data input and $\mathcal{X}$ represents the known feature values of the input. The classification rule used by a classifier has the form $\,h: \mathcal{X} \mapsto \mathcal{Y}$.
• True error rate is the probability that the classification rule $\,h$ misclassifies a data input. Empirical error rate is the frequency with which the classification rule $\,h$ misclassifies data inputs in the training set. In experimental tasks the true error cannot be measured, so the empirical error rate is used as its estimate.
• Bayes Classifier is a probabilistic classifier based on Bayes' Theorem. Under this classifier an input $\,x$ is assigned to the class $\,y$ whose posterior probability given $\,x$ is the largest. The naive Bayes variant additionally makes strong (naive) independence assumptions, which has the advantage that only a small amount of training data is needed to estimate the parameters required for classification.
• Bayes Classification Rule Optimality Theorem states that the Bayes classifier is the optimal classifier; in other words, the true error rate of the Bayes classification rule is always smaller than or equal to that of any other classification rule.
• Bayes Decision Boundary is the boundary that separates two classes $\,m, n$, obtained by setting the posterior probabilities of the two classes equal, $\,D(h)=\{x: P(Y=m|X=x)=P(Y=n|X=x)\}$.
• Linear Discriminant Analysis (LDA) derives the Bayes classifier decision boundary between two classes under the assumption that both classes are generated from Gaussian distributions and have the same covariance matrix.
• Principal component analysis (PCA) is an appropriate method when you have obtained measures on a number of observed variables and wish to develop a smaller number of artificial variables (called principal components) that account for most of the variance in the observed variables. It is a powerful technique for dimensionality reduction, with applications in data visualization and data mining, among others. It is mostly used for data analysis and for making predictive models.

## Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010

In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors $\Pr(Y=k|X=x)$ we have optimal classification. He also shows that by assuming that the classes have a common covariance matrix $\Sigma_{k}=\Sigma \ \forall k$, the decision boundary between classes $k$ and $l$ is linear (LDA). However, if we do not assume that the classes share the same covariance matrix, the decision boundary is a quadratic function (QDA).
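
To make the linear-versus-quadratic distinction concrete, the log ratio of the class posteriors can be written out under the Gaussian assumption. The symbols $\mu_k$ (class mean) and $\pi_k$ (class prior) are notation assumed here for illustration. With a common covariance matrix $\Sigma$, the terms that are quadratic in $x$ cancel:

$\displaystyle \log\frac{\Pr(Y=k|X=x)}{\Pr(Y=l|X=x)} = \log\frac{\pi_k}{\pi_l} - \frac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + x^T\Sigma^{-1}(\mu_k-\mu_l)$

Setting this expression to zero gives a boundary that is linear in $x$. If each class keeps its own $\Sigma_k$, the terms $-\frac{1}{2}x^T(\Sigma_k^{-1}-\Sigma_l^{-1})x$ and $-\frac{1}{2}\log\frac{|\Sigma_k|}{|\Sigma_l|}$ no longer cancel, so the boundary is quadratic in $x$.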

The following MATLAB examples can be used to demonstrate LDA and QDA.

## Principal Component Analysis - September 30, 2010

Principal component analysis (PCA) is a dimensionality-reduction method invented by Karl Pearson in 1901 [1]. Depending on where this methodology is applied, other common names of PCA include the Karhunen–Loève transform (KLT), the Hotelling transform, and the proper orthogonal decomposition (POD). PCA is the simplest eigenvector-based multivariate analysis. It reduces the dimensionality of the data by revealing the internal structure of the data in a way that best explains the variance in the data. To this end, PCA takes a user-defined number of the most important directions of variation (dimensions or principal components) of the data and projects the data onto these directions to produce a lower-dimensional representation of the original data. The resulting lower-dimensional representation is usually much easier to visualize, and it exhibits the most informative aspects (dimensions) of the data while capturing as much of its variation as possible.
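
As a rough illustration of "projecting onto the direction of greatest variance", the sketch below finds the first principal component of a small data set. It is a minimal sketch, not the lecture's method: it uses power iteration on the sample covariance matrix (an assumption made here to avoid external libraries), and the function name and toy data are hypothetical.

```python
import math

def top_principal_component(X, iters=200):
    """Return (direction, variance) of the first principal component of X.

    X is a list of equal-length rows. Power iteration on the sample
    covariance matrix is an illustrative choice; it assumes the start
    vector is not orthogonal to the leading eigenvector.
    """
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - mean[j] for j in range(d)] for row in X]
    # Sample covariance matrix (d x d).
    C = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(d)]
         for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance captured along v is the Rayleigh quotient v^T C v.
    variance = sum(v[a] * C[a][b] * v[b] for a in range(d) for b in range(d))
    return v, variance

# Toy data lying close to the line y = x.
X = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
direction, variance = top_principal_component(X)

# One-dimensional representation: project the centred points on the direction.
means = [sum(col) / len(X) for col in zip(*X)]
z = [sum(direction[j] * (row[j] - means[j]) for j in range(2)) for row in X]
```

For more than one component, a full eigendecomposition or singular value decomposition of the centred data would normally be used instead.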

## Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem - October 5, 2010

This lecture introduces Fisher's linear discriminant analysis (FDA), which is a supervised dimensionality reduction method. FDA does not assume any distribution of the data, and it works by reducing the dimensionality of the data by projecting the data onto a line. That is, given d-dimensional data, FDA projects it to a one-dimensional representation by $z = \underline{w}^T \underline{x}$ where $\underline{x} \in \mathbb{R}^{d}$ and $\underline{w} = \begin{bmatrix}w_1 \\ \vdots \\w_d \end{bmatrix} _{d \times 1}$.
FDA derives a set of feature vectors by which high-dimensional data can be projected onto a low-dimensional feature space in the sense of maximizing class separability. Furthermore, the lecture clarifies a set of basic FDA concepts such as Fisher's ratio, the ratio of the between-class scatter matrix to the within-class scatter matrix. It also discusses the goals specified by Fisher for his analysis and then proceeds to the mathematical formulation of these goals.
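
The projection can be sketched for two classes in two dimensions. The code below uses the standard closed-form maximizer of Fisher's ratio, $\underline{w} \propto \mathbf{S}_W^{-1}(\underline{\mu}_a - \underline{\mu}_b)$; that formula is not stated in this summary and is brought in here as an assumption, and the function name and toy data are hypothetical.

```python
def fisher_direction(class_a, class_b):
    """Fisher direction w proportional to S_W^{-1} (mean_a - mean_b), 2-D data only."""
    def mean(points):
        n = len(points)
        return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

    def scatter(points, m):
        # Sum of outer products of deviations from the class mean.
        s = [[0.0, 0.0], [0.0, 0.0]]
        for p in points:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s

    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    # Within-class scatter is the sum of the per-class scatter matrices.
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

class_a = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.5]]
class_b = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0], [7.0, 6.5]]
w = fisher_direction(class_a, class_b)

# One-dimensional representations z = w^T x of each class.
z_a = [w[0] * p[0] + w[1] * p[1] for p in class_a]
z_b = [w[0] * p[0] + w[1] * p[1] for p in class_b]
```

On this toy data the two sets of projected values do not overlap, which is the kind of class separability the projection is meant to achieve.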

## Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem - October 7, 2010

This lecture describes a generalization of Fisher's discriminant analysis to more than 2 classes. For the multi-class, or $k$-class problem, we are trying to find a projection from a $d$-dimensional space to a $(k-1)$-dimensional space. Recall that for the $2$-class problem, the objective function was $\displaystyle \max \frac{\mathbf{w}^T\mathbf{S}_{B}\mathbf{w}}{\mathbf{w}^T\mathbf{S}_{W}\mathbf{w}}$. In the $k$-class problem, $\mathbf{W}$ is a $d\times (k-1)$ transformation matrix, $\mathbf{W} =[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}]$, and the objective function becomes $\displaystyle \max \frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]}$.

As in the $2$-class case, this is also a generalized eigenvalue problem, and the solution can be computed as the first $(k-1)$ eigenvectors of $\mathbf{S}_{W}^{-1}\mathbf{S}_{B},$ i.e. $\mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =\lambda_{i}\mathbf{w}_{i}$.

## Linear and Logistic Regression - October 12, 2010

In this Lecture, Prof Ali Ghodsi reviews the LDA as a dimensionality reduction method and introduces 2 models for regression, linear and logistic regression.

Regression analysis is a general statistical technique for modeling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, $\,y$, changes according to changes in $\,X$.

General information on linear regression can be found at the University of South Florida and this MIT lecture.

## Logistic Regression Cont. - October 14, 2010

Traditionally, logistic regression parameters are estimated using maximum likelihood. However, other optimization techniques may be used as well.
Since there is no closed-form solution for the zero of the first derivative of the log-likelihood, the Newton-Raphson algorithm is used. Since the problem is convex, any optimum that Newton's method converges to is a global optimum.
Logistic regression requires fewer parameters than LDA or QDA and is therefore more favorable for high-dimensional data.
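
The Newton-Raphson iteration can be sketched for a model with one feature and an intercept, where the $2 \times 2$ linear system can be solved by hand. This is a minimal sketch under those assumptions; the function name and the toy data (chosen so that the classes overlap and a finite maximum exists) are hypothetical.

```python
import math

def fit_logistic_newton(xs, ys, iters=25):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by Newton-Raphson on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of X^T W X (the negative Hessian)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + (X^T W X)^{-1} * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]   # overlapping classes
b0, b1 = fit_logistic_newton(xs, ys)

# At the maximum-likelihood estimate the gradient is (numerically) zero.
probs = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
grad0 = sum(y - p for y, p in zip(ys, probs))
grad1 = sum((y - p) * x for x, y, p in zip(xs, ys, probs))
```

No step-size control is included, so this plain iteration is not protected against divergence on badly scaled or perfectly separable data.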

## Multi-Class Logistic Regression & Perceptron - October 19, 2010

In this lecture, the topic of logistic regression was finalized by covering multi-class logistic regression, and the new topic of the perceptron was started. The perceptron is a linear classifier for two-class problems. Its main goal is to minimize the distances between the misclassified points and the decision boundary. This will be continued in the following lectures.

## Perceptron Learning Algorithm and Feed Forward Neural Networks - October 21, 2010

In this lecture, we finalize our discussion of the Perceptron by reviewing its learning algorithm, which is based on gradient descent. We then begin the next topic, Neural Networks (NN), and we focus on a NN that is useful for classification: the Feed Forward Neural Network (FFNN). The mathematical model for the FFNN is shown, and we review one of its most popular learning algorithms: Back-Propagation.
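
The perceptron update can be sketched as follows: each misclassified point pulls the weight vector in the direction that reduces its distance to the boundary. This is a minimal sketch assuming labels in $\{-1,+1\}$, one update per misclassified point, and a fixed learning rate; the function name, the parameters `rate` and `max_epochs`, and the toy data are hypothetical.

```python
def train_perceptron(X, y, rate=1.0, max_epochs=1000):
    """Perceptron learning for labels in {-1, +1}; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            activation = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * activation <= 0:   # misclassified, or exactly on the boundary
                # Gradient step on the perceptron criterion for this point.
                w = [wj + rate * yi * xj for wj, xj in zip(w, xi)]
                b += rate * yi
                mistakes += 1
        if mistakes == 0:              # a full pass with no errors: stop
            break
    return w, b

# Linearly separable toy data.
X = [[2.0, 3.0], [3.0, 3.0], [3.0, 4.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_perceptron(X, y)
scores = [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi, yi in zip(X, y)]
```

If the data are not linearly separable, the loop simply stops after `max_epochs` passes without reaching a separating boundary.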

To open the Neural Network discussion, we present a formulation of the universal function approximator. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section.

## Complexity Control - October 26, 2010

Selecting the model structure with an appropriate complexity is a standard problem in pattern recognition and machine learning. Systems with the optimal complexity have a good generalization to unseen data.

A wide range of techniques may be used which alter the system complexity. In this lecture, we introduce the topic of Model Selection, also known as Complexity Control. This involves choosing the model that will best classify any given data by minimizing test error rates or by regularization.

The two problems that we wish to avoid when choosing a model are over-fitting and under-fitting. We present over & under-fitting concepts with an example to illustrate how we choose a good classifier and how to avoid over-fitting.

Moreover, cross-validation was introduced during the lecture. It is a method for estimating generalization error based on "resampling" (Weiss and Kulikowski 1991; Efron and Tibshirani 1993; Hjorth 1994; Plutowski, Sakata, and White 1994; Shao and Tu 1995). The resulting estimates of generalization error are often used for choosing among various models: the model with the smallest estimated generalization error is selected. Finally, we discuss two methods to estimate the true error rate: simple cross-validation and K-fold cross-validation. Both of these methods involve splitting the data into a training set to train the model and a validation set to compute the test error rate.
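
K-fold cross-validation can be sketched as below. The fold assignment (every $k$-th point goes to the same fold) and the simple nearest-class-mean classifier used to exercise it are assumptions made for illustration, and all function names are hypothetical.

```python
def k_fold_error(xs, ys, k, fit, predict):
    """Estimate the error rate by k-fold cross-validation.

    Point i is assigned to fold i % k. Each fold is held out once for
    validation while the model is trained on the remaining folds.
    """
    n = len(xs)
    errors = 0
    for fold in range(k):
        val_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors += sum(predict(model, xs[i]) != ys[i] for i in val_idx)
    return errors / n

def fit_nearest_mean(xs, ys):
    """Toy 1-D classifier: remember the mean of each class."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means

def predict_nearest_mean(means, x):
    """Predict the label whose class mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

xs = [1.0, 1.2, 0.8, 1.5, 3.9, 4.2, 4.0, 3.6, 1.1, 4.1]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
cv_error = k_fold_error(xs, ys, 5, fit_nearest_mean, predict_nearest_mean)
```

To compare candidate models, the same routine would be run once per model and the one with the smallest estimated error kept.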

## Leave-One-Out Cross-Validation and Radial Basis Function Networks - October 28, 2010

In this lecture, we finalize the discussion of cross-validation with leave-one-out cross-validation and begin regression by discussing Radial Basis Function (RBF) Networks. Leave-one-out is similar to k-fold cross-validation, but each validation set consists of a single element, so every data point is held out exactly once. Under linear model conditions, leave-one-out cross-validation performs well asymptotically.

RBF networks are similar to neural networks. There is one hidden layer with each node being a basis function. Between the hidden layer and output layer there are weights. Noise in the actual model creates issues in model selection, so we consider the expected squared difference between actual and estimated outputs.


• A Radial Basis Function (RBF) network is an artificial neural network consisting of an output layer, a hidden layer, and weights from the hidden layer to the output; it has a closed-form solution and can be solved without back-propagation.
• During the model selection process, the complexity (the number of neurons in the hidden layer), the basis function, and the basis function parameters are estimated.
• A common basis function is the RBF (or Gaussian) function.
• Function parameters can be estimated by clustering the data into as many clusters as there are nodes and using the sample mean and variance of each cluster.
• The complexity of the model is determined during the training process (methods like K-fold cross-validation or leave-one-out cross-validation can be used).
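
The closed-form solution for the output weights can be sketched in one dimension. This is a minimal sketch under several assumptions: the Gaussian centres and width are fixed by hand rather than estimated by clustering, there is no intercept, and the weights are the least-squares solution of the normal equations $(\Phi^T\Phi)w = \Phi^T y$. All function names and the toy data are hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def rbf_features(x, centers, width):
    """Hidden-layer outputs: one Gaussian basis function per centre."""
    return [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers]

def fit_rbf(xs, ys, centers, width):
    """Least-squares output weights from the normal equations."""
    Phi = [rbf_features(x, centers, width) for x in xs]
    m = len(centers)
    A = [[sum(row[i] * row[j] for row in Phi) for j in range(m)] for i in range(m)]
    b = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(m)]
    return solve(A, b)

def predict_rbf(x, centers, width, weights):
    return sum(w * phi for w, phi in zip(weights, rbf_features(x, centers, width)))

xs = [i / 10 for i in range(21)]               # inputs 0.0, 0.1, ..., 2.0
ys = [math.sin(math.pi * x) for x in xs]       # noiseless toy target
centers, width = [0.0, 0.5, 1.0, 1.5, 2.0], 0.5
weights = fit_rbf(xs, ys, centers, width)
preds = [predict_rbf(x, centers, width, weights) for x in xs]
mse = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
```

Because the weights come from a linear solve, no iterative training such as back-propagation is needed once the basis functions are fixed.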

## Model Selection (SURE) for RBF Network - November 2nd, 2010

Model selection is the task of selecting a model of optimal complexity for a given set of data. Learning a radial basis function network from data is a parameter estimation problem. A model is selected that has parameters associated with the best observed performance on the training data. Squared error is used as the performance index.

We start from the following notation and assumptions:

• $\hat f(X)$ denotes the prediction/estimated model, which is generated from a training data set $\displaystyle D = \{(x_i, y_i)\}^n_{i=1}$.
• $\displaystyle MSE=E[(\hat f-f)^2]$ denotes the mean squared error, where $\hat f(X)$ is the estimated model and $\displaystyle f(X)$ is the true model.
• The observations are $\displaystyle y_i = f(x_i) + \epsilon_i$, where $\displaystyle \epsilon_i$ is additive Gaussian noise with $\displaystyle \epsilon_i \sim N(0,\sigma^2)$.

We then estimate $\hat f$ from the training data set $D=\{(x_i,y_i)\}^n_{i=1}$. For a single data point, with $\hat y_i = \hat f_i$, expanding the square gives

$\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2E[\epsilon_i(\hat f_i-f_i)]$

We considered the last term, $\displaystyle 2E[\epsilon_i(\hat f_i-f_i)]$, of the mean squared error in two cases:

### Case 1

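A new data point is introduced to the estimated model, i.e. $(x_i,y_i)\not\in D$; this new point belongs to the testing/validation data set $V=\{(x_i,y_i)\}^m_{i=1}$. Because $\hat f$ was estimated without this point, its noise $\epsilon_i$ is independent of $\hat f_i-f_i$ and the last term vanishes.

Summing over the $m$ validation points, we found that in this case $\displaystyle err=Err+m\sigma^2$, where $err$ is the sum of squared differences between predictions and observations and $Err$ is the sum of squared differences between the estimated and true models.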

### Case 2

In this case we do not use new data points to assess the performance of the estimated model; we suppose $(x_i,y_i)\in D$. Then, by applying Stein's Lemma, we obtained the following equation for one data point: $\displaystyle E[(\hat y_i-y_i)^2 ]=E[(\hat f_i-f_i)^2]+\sigma^2-2\sigma^2E\left[\frac {\partial \hat f_i}{\partial y_i}\right]$ or, summing over the $n$ training points,

$\displaystyle err=Err+n\sigma^2-2\sigma^2\sum_{i=1}^n \frac {\partial \hat f_i}{\partial y_i}$, known as Stein's unbiased risk estimate (SURE).

### SURE for RBF Network

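Finally, based on SURE, we derived

$\displaystyle Err=err-n\sigma^2+2\sigma^2m$

for an RBF network with no intercept, where $m$ here denotes the number of basis functions (hidden units), not the size of a validation set.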
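
A short sketch of why the correction term reduces to a count of basis functions, assuming the output weights are fitted by least squares. The symbols $\Phi$ (the $n \times m$ matrix of hidden-layer outputs) and $H$ are notation introduced here for illustration. The fitted values are linear in the observations:

$\displaystyle \hat f = H y, \qquad H = \Phi(\Phi^T\Phi)^{-1}\Phi^T$

so that

$\displaystyle \sum_{i=1}^n \frac{\partial \hat f_i}{\partial y_i} = \sum_{i=1}^n H_{ii} = Tr(H) = Tr\left((\Phi^T\Phi)^{-1}\Phi^T\Phi\right) = Tr(I_m) = m$

Substituting this into $\displaystyle err=Err+n\sigma^2-2\sigma^2\sum_{i=1}^n \frac {\partial \hat f_i}{\partial y_i}$ and solving for $Err$ gives $\displaystyle Err=err-n\sigma^2+2\sigma^2m$.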

## Regularization Weight Decay - November 4, 2010

### Regularization

Regularization is used to prevent overfitting. We add a penalty function to training error such that when the complexity increases, the penalty function increases as well.

Regularization methods are also used for model selection. Well-known model selection techniques include AIC, BIC, and MDL.

### Weight Decay

Weight decay adds a penalty term to the error function. The usual penalty is the sum of squared weights times a decay constant.

$\,REG = err + \rho \sum_{ij}u_{ij}^2$
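
The effect of the decay constant $\rho$ can be seen in the simplest possible setting. The sketch below assumes a linear model with a single weight $u$ rather than a neural network, so that the penalized error has a closed-form minimizer; the function name and toy data are hypothetical.

```python
def fit_with_weight_decay(xs, ys, rho):
    """Minimise sum((y - u*x)^2) + rho * u^2 over a single weight u."""
    # Setting the derivative to zero gives u = sum(x*y) / (sum(x^2) + rho).
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + rho)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger decay constant shrinks the fitted weight towards zero.
weights = [fit_with_weight_decay(xs, ys, rho) for rho in (0.0, 1.0, 10.0)]
```

With $\rho = 0$ the result is the ordinary least-squares weight; increasing $\rho$ trades training error for a smaller weight.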

## Support Vector Machine - November 9, 2010

Support Vector Machines (SVM) are a set of related supervised learning methods used for classification and regression. A support vector machine constructs a maximum margin hyperplane or set of hyperplanes in a higher or infinite dimensional space. The set of points near the class boundaries, support vectors, define the model which can be used for classification, regression or other tasks.

We assume that

• the hyperplane is defined as $\displaystyle \underline{\beta}^{T}\underline{x}+\beta_0=0$
• $\displaystyle y_i\in\{-1,+1\}$
• the data is linearly separable

We discussed the following facts about the optimal hyperplane:

1. $\displaystyle \underline{\beta}$ is orthogonal to the hyperplane:
i.e. $\displaystyle \underline{\beta} \perp (\underline{x}_1-\underline{x}_2)$, where $\displaystyle x_1$ and $\displaystyle x_2$ are on the plane.

2. For any point $\displaystyle x_0$ on the plane $\displaystyle \underline{\beta}^{T}\underline{x}_0+\beta_0=0$
$\displaystyle \Rightarrow \underline{\beta}^{T}\underline{x}_0=-\beta_0$

3. For any point $\displaystyle \underline{x}_i$, the distance of the point to the hyperplane, denoted by $\displaystyle d_i$, is the projection of $\displaystyle (\underline{x}_i-\underline{x}_0)$ on $\displaystyle \underline{\beta}$:
$\displaystyle d_i = \frac {\underline{\beta}^{T}(\underline{x}_i-\underline{x}_0)}{\| \underline{\beta} \|} = \frac {\underline{\beta}^{T}\underline{x}_i-\underline{\beta}^{T}\underline{x}_0}{\| \underline{\beta} \|} = \frac {\underline{\beta}^{T}\underline{x}_i+{\beta}_0}{\| \underline{\beta} \|}$

4. Margin is the distance of the closest point to the hyperplane:
margin = $\displaystyle \min_i (y_i d_i)$
margin = $\displaystyle \min_i \left( \frac {y_i(\underline{\beta}^{T}\underline{x}_i+{\beta}_0)}{\| \underline{\beta} \|} \right)$

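Since $\underline{\beta}$ and $\beta_0$ can be rescaled so that the closest points satisfy $\displaystyle y_i(\beta^T x_i + \beta_0) = 1$, the margin equals $\displaystyle 1/\|\beta\|$. Therefore, in order to maximize the margin we have to minimize $\displaystyle\frac{1}{2}\|\beta\|^2$

s.t. $\displaystyle y_i(\beta^T x_i + \beta_0) \geq 1 \quad \forall i$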

5. We applied Lagrange multipliers to the margin cost function, which resulted in a well-known optimization problem: quadratic programming.
The Lagrangian form is introduced to ensure that the optimization conditions are satisfied, as well as to find an optimal solution.
Therefore, we have a new optimization problem:
$\underset{\alpha}{\max} \sum_{i=1}^n{\alpha_i}- \,\frac{1}{2}\sum_{i=1}^n{\sum_{j=1}^n{\alpha_i\alpha_jy_iy_jx_i^Tx_j}}$
subject to
1) $\,\alpha_i \geq 0$
2) $\sum_{i}{\alpha_i y_i} = 0$

This is a much simpler optimization problem, and we can solve it by quadratic programming (QP), which is a special type of mathematical optimization problem: optimizing (minimizing or maximizing) a quadratic function of several variables subject to linear constraints on these variables.
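
The distance and margin computations from the facts above can be checked numerically for a given hyperplane. This is a minimal sketch; the function names, the hyperplane, and the toy data are hypothetical, and the margin is taken as the smallest value of $y_i d_i$, i.e. the distance of the closest point.

```python
import math

def signed_distance(beta, beta0, x):
    """Signed distance (beta^T x + beta0) / ||beta|| of x from the hyperplane."""
    norm = math.sqrt(sum(b * b for b in beta))
    return (sum(b * xi for b, xi in zip(beta, x)) + beta0) / norm

def margin(beta, beta0, X, y):
    """Smallest y_i * d_i; positive only if every point is on its correct side."""
    return min(yi * signed_distance(beta, beta0, xi) for xi, yi in zip(X, y))

X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]
beta, beta0 = [1.0, 1.0], -2.0    # hypothetical hyperplane x1 + x2 = 2
m = margin(beta, beta0, X, y)     # distance of the closest points, here sqrt(2)
```

Scaling `beta` and `beta0` by the same positive constant leaves the margin unchanged, which is why the constraint can be normalized to $y_i(\beta^T x_i + \beta_0) \geq 1$.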

## Support Vector Machine Cont., Kernel Trick - November 11, 2010

Continued from last lecture, we have created the optimization problem:

$\displaystyle \min_{\underline \beta} \ \tfrac{1}{2} {\| \underline \beta \|}^2$

s.t. $\displaystyle y_i(\underline{\beta}^{T}\underline{x}_i+{\beta}_0) \ge 1 \qquad \forall i$

and its dual is

$\displaystyle \max_{\alpha} \ \sum_{i} {\alpha_i} - \tfrac{1}{2} \sum_{i} \sum_{j} {\alpha_i} {\alpha_j} y_i y_j \underline x_i^T \underline x_j$

s.t. $\displaystyle \alpha_i \ge 0$

$\displaystyle \sum_{i} \alpha_i y_i = 0$

which we can write in matrix form as follows:

$\displaystyle \max_{\underline \alpha} \ \underline \alpha^T \mathbf{1} - \tfrac{1}{2} \underline \alpha^T S \underline \alpha$, where $\displaystyle S_{ij} = y_i y_j \underline x_i^T \underline x_j$

s.t. $\displaystyle \underline \alpha \ge \underline 0$

$\displaystyle \underline \alpha^T \underline y = 0$

By the K.K.T. conditions, $\displaystyle \alpha_i [ y_i(\underline{\beta}^{T}\underline{x}_i+{\beta}_0) - 1 ] = 0$ holds for all points. Therefore:

• If a point is not on the margin, i.e. $\displaystyle y_i(\underline{\beta}^{T}\underline{x}_i+{\beta}_0) \gt 1$, then $\displaystyle \alpha_i = 0$.
• If $\displaystyle \alpha_i \gt 0$, then $\displaystyle y_i(\underline{\beta}^{T}\underline{x}_i+{\beta}_0) = 1$, which means that the point is on the margin; these points are called support vectors.

We can easily find $\displaystyle \underline \beta = \sum_{i} {\alpha_i} y_i \underline x_i$. To find $\displaystyle {\beta}_0$, we can choose a point $\displaystyle i$ with $\displaystyle \alpha_i \gt 0$ and then solve $\displaystyle y_i(\underline{\beta}^{T}\underline{x}_i+{\beta}_0) = 1$ for $\displaystyle {\beta}_0$.
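
Recovering the hyperplane from the dual variables can be sketched on a two-point problem whose dual solution is easy to work out by hand. With $\underline x_1 = (1,1)$, $y_1 = +1$ and $\underline x_2 = (-1,-1)$, $y_2 = -1$, the constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = \alpha$, the dual objective becomes $2\alpha - 4\alpha^2$, and its maximum is at $\alpha = 1/4$. The function name below is hypothetical.

```python
def recover_hyperplane(alphas, X, y, tol=1e-9):
    """Return (beta, beta0) from the dual variables of a hard-margin SVM."""
    d = len(X[0])
    # beta = sum_i alpha_i * y_i * x_i
    beta = [sum(a * yi * xi[j] for a, xi, yi in zip(alphas, X, y))
            for j in range(d)]
    # Any point with alpha_i > 0 is a support vector and lies on the margin:
    # y_i * (beta^T x_i + beta0) = 1.
    i = next(k for k, a in enumerate(alphas) if a > tol)
    # Because y_i is +1 or -1, 1 / y_i equals y_i.
    beta0 = y[i] - sum(b * xj for b, xj in zip(beta, X[i]))
    return beta, beta0

X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alphas = [0.25, 0.25]          # dual solution of this toy problem, derived by hand
beta, beta0 = recover_hyperplane(alphas, X, y)
```

The recovered hyperplane is $0.5x_1 + 0.5x_2 = 0$, and both points satisfy $y_i(\underline\beta^T \underline x_i + \beta_0) = 1$, as expected for support vectors.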

We also discussed kernel methods and how we can obtain a non-linear classification boundary using linear methods.
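
The idea behind the kernel trick can be illustrated with a degree-2 polynomial kernel, chosen here as an example and not necessarily the one used in the lecture. Since the dual depends on the data only through inner products $\underline x_i^T \underline x_j$, replacing them with a kernel value is equivalent to running the linear method in a transformed feature space. The function names are hypothetical.

```python
import math

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = (x^T z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def feature_map(x):
    """Explicit feature map for 2-D inputs whose inner product equals poly_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

x, z = [1.0, 2.0], [3.0, -1.0]
kernel_value = poly_kernel(x, z)
explicit_value = sum(a * b for a, b in zip(feature_map(x), feature_map(z)))
```

The two values agree, so a linear boundary in the three-dimensional feature space corresponds to a quadratic boundary in the original two-dimensional space, without the feature vectors ever being formed inside the optimization.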

## Support Vector Machine, Kernel Trick - Cont. Case II - November 16, 2010

In this case, we suppose that the data is non-separable, and the optimization problem (primal form) becomes:

$\min \frac{1}{2}|\beta|^2+\gamma\sum_{i} {\xi_i}$
$\,s.t.$ $y_i(\beta^Tx_i+\beta_0) \geq 1-\xi_i$
$\xi_i \geq 0$

Its Lagrangian is

$\frac{1}{2} |\beta|^2 + \gamma \sum_{i} \xi_i - \sum_{i} \alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]-\sum_{i} \lambda_i \xi_i$
$\alpha_i \geq 0, \lambda_i \geq 0$

We applied the KKT conditions and found that:

1. Stationarity: $\displaystyle \beta = \sum_{i} \alpha_i y_i x_i$, $\displaystyle \sum_{i} \alpha_i y_i =0$, and $\displaystyle \gamma - \alpha_i - \lambda_i=0 \Rightarrow \gamma = \alpha_i+\lambda_i$, which is the only new condition added for the soft margin.
2. Dual feasibility: $\,\alpha_i \geq 0, \lambda_i \geq 0$.
3. Complementary slackness: $\,\alpha_i[y_i(\beta^T x_i+\beta_0)-1+\xi_i]=0$ and $\,\lambda_i \xi_i=0$.
4. Primal feasibility: $\,y_i( \beta^T x_i+ \beta_0)-1+ \xi_i \geq 0$ and $\,\xi_i \geq 0$.

By substituting these conditions into the Lagrangian, we found the new (dual) optimization problem:

$\displaystyle \max_{\alpha_i} \sum_{i}{\alpha_i} - \frac{1}{2}\sum_{i}{\sum_{j}{\alpha_i \alpha_j y_i y_j x_i^T x_j}}$ such that $\displaystyle 0 \le \alpha_i \le \gamma$ and $\displaystyle \sum_{i}{\alpha_i y_i} = 0$

Then we discussed that we can easily recover the hyperplane by finding $\displaystyle \underline \beta$ and $\displaystyle \beta_0$:

• $\displaystyle \beta$ can easily be found from the first KKT condition, i.e. $\displaystyle \beta = \sum_{i} \alpha_i y_i x_i$
• For $\displaystyle \beta_0$, we have to choose a point that satisfies $\displaystyle 0 \lt \alpha_i \lt \gamma$. For such a point $\displaystyle \lambda_i = \gamma - \alpha_i \gt 0$, so $\displaystyle \xi_i = 0$, and we
solve the following equation (obtained from the third KKT condition) for $\displaystyle \beta_0$:
$\displaystyle y_i( \beta^T x_i+ \beta_0)= 1$
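
The role of the slack variables can be sketched by evaluating the soft-margin primal objective for a fixed hyperplane. For a given $\beta$ and $\beta_0$, the smallest $\xi_i$ that satisfies both constraints is $\max(0,\, 1 - y_i(\beta^T x_i+\beta_0))$; that identity is used below. The function name, the hyperplane, and the toy data are hypothetical, and the hyperplane is not claimed to be optimal.

```python
def soft_margin_objective(beta, beta0, X, y, gamma):
    """Return the primal objective 0.5*||beta||^2 + gamma*sum(xi_i) and the slacks."""
    slacks = []
    for xi, yi in zip(X, y):
        score = yi * (sum(b * v for b, v in zip(beta, xi)) + beta0)
        # Smallest slack satisfying score >= 1 - xi_i and xi_i >= 0.
        slacks.append(max(0.0, 1.0 - score))
    objective = 0.5 * sum(b * b for b in beta) + gamma * sum(slacks)
    return objective, slacks

# The last point carries label -1 but lies on the positive side.
X = [[2.0, 0.0], [1.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, 1, -1, -1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, X, y, gamma=1.0)
```

Only the point on the wrong side of the boundary receives a non-zero slack, and $\gamma$ controls how heavily that violation is penalized relative to the width of the margin.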

At the end of the lecture, we discussed another classification model, the Naive Bayes classifier.

## Classification Models - November 18, 2010

We have discussed the following classification models: