f10 Stat841 digest: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 57: Line 57:
To open the Neural Network discussion, we present a formulation of the universal function approximator. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section.
To open the Neural Network discussion, we present a formulation of the universal function approximator. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section.


== Model Selection - October 26, 2010 ==
== Complexity Control - October 26, 2010 ==


In this lecture, we introduce the topic of Model Selection, also known as Complexity Control. This involves choosing the model that will best classify any given data by minimizing test error rates or by regularization.
In this lecture, we introduce the topic of Model Selection, also known as Complexity Control. This involves choosing the model that will best classify any given data by minimizing test error rates or by regularization.

Revision as of 16:25, 27 October 2010

Classification - September 21, 2010

  • Classification is an area of supervised learning that systematically assigns unlabeled novel data to their label through the characteristics and attributes obtained from observation.
  • Classification is the prediction of a discrete random variable [math]\displaystyle{ \mathcal{Y} }[/math] from another random variable [math]\displaystyle{ \mathcal{X} }[/math], where [math]\displaystyle{ \mathcal{Y} }[/math] represents the label assigned to a new data input and [math]\displaystyle{ \mathcal{X} }[/math] represents the known feature values of the input. The classification rule used by a classifier has the form [math]\displaystyle{ \,h: \mathcal{X} \mapsto \mathcal{Y} }[/math].
  • True error rate is the probability that the classification rule [math]\displaystyle{ \,h }[/math] does not correctly classify any data input. Empirical error rate is the frequency where the classification rule [math]\displaystyle{ \,h }[/math] does not correctly classify any data input in the training set. In experimental tasks true error cannot be measured and as a result the empirical error rate is used as its estimate.
  • Bayes Classifier is a probabilistic classifier by applying Bayes Theorem with strong (naive) independence assumptions. It has the advantage of requiring small training data to estimate the parameters needed for classification. Under this classifier an input [math]\displaystyle{ \,x }[/math] is classified to class [math]\displaystyle{ \,y }[/math] where the posterior probability for [math]\displaystyle{ \,y }[/math] is the largest for input [math]\displaystyle{ \,x }[/math].
  • Bayes Classification Rule Optimality Theorem states that Bayes classifier is the optimal classifier, in other words the true error rate of the Bayes classification rule will always be smaller or equal to any other classification rule
  • Bayes Decision Boundary is the hyperplane boundary that separates the two classes [math]\displaystyle{ \,m, n }[/math] obtained by setting the posterior probability for the two classes equal, [math]\displaystyle{ \,D(h)=\{x: P(Y=m|X=x)=P(Y=n|X=x)\} }[/math].
  • Linear Discriminant Analysis (LDA) for the Bayes classifier decision boundary between two classes makes the assumption that both are generated from Gaussian distribution and have the same covariance matrix.
  • PCA is an appropriate method when you have obtained measures on a number of observed variables and wish to develop a smaller number of artificial variables (called principal components) that will account for most of the variance in the observed variables. This is a powerful technique for dimensionally reduction. It has applications in data visualization, data mining, reducing the dimensionality of a data set and etc. It is mostly used for data analysis and for making predictive models.

Linear and Quadratic Discriminant Analysis cont'd - September 23, 2010

In the second lecture, Professor Ali Ghodsi recapitulates that by calculating the class posteriors [math]\displaystyle{ \Pr(Y=k|X=x) }[/math] we have optimal classification. He also shows that by assuming that the classes have common covariance matrix [math]\displaystyle{ \Sigma_{k}=\Sigma \forall k }[/math], the decision boundary between classes [math]\displaystyle{ k }[/math] and [math]\displaystyle{ l }[/math] is linear (LDA). However, if we do not assume same covariance between the two classes, the decision boundary is a quadratic function (QDA).

The following MATLAB examples can be used to demonstrated LDA and QDA.

Fisher's (Linear) Discriminant Analysis (FDA) - Two Class Problem - October 5, 2010

This lecture introduces Fisher's linear discrimination analysis (FDA), which is a supervised dimensionality reduction method. FDA does not assume any distribution of the data and it works by reducing the dimensionality of the data by projecting the data on a line. That is, given d-dimensional data FDA project it to one-dimensional representation by [math]\displaystyle{ z = \underline{w}^T \underline{x} }[/math] where [math]\displaystyle{ x \in \mathbb{R}^{d} }[/math] and [math]\displaystyle{ \underline{w} = \begin{bmatrix}w_1 \\ \vdots \\w_d \end{bmatrix} _{d \times 1} }[/math]
FDA derives a set of feature vectors by which high-dimensional data can be projected onto a low-dimensional feature space in the sense of maximizing class separability. Furthermore, the lecture clarifies a set of FDA basic concepts like Fisher’s ratio, ratio of between-class scatter matrix to within-class scatter matrix. It also discusses the goals specified by Fisher for his analysis then proceeding by mathematical formulation of these goals.

Fisher's (Linear) Discriminant Analysis (FDA) - Multi-Class Problem - October 7, 2010

This lecture describes a generalization of Fisher's discriminant analysis to more than 2 classes. For the multi-class, or [math]\displaystyle{ k }[/math]-class problem, we are trying to find a projection from a [math]\displaystyle{ d }[/math]-dimensional space to a [math]\displaystyle{ (k-1) }[/math]-dimensional space. Recall that for the [math]\displaystyle{ 2 }[/math]-class problem, the objective function was [math]\displaystyle{ \displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)} }[/math] . In the [math]\displaystyle{ k }[/math]-class problem, [math]\displaystyle{ \mathbf{W} }[/math] is a [math]\displaystyle{ d\times (k-1) }[/math] transformation matrix, [math]\displaystyle{ \mathbf{W} =[\mathbf{w}_{1}, \mathbf{w}_{2},..., \mathbf{w}_{k-1}] }[/math], and the objective function becomes [math]\displaystyle{ \displaystyle \max \frac{Tr[\mathbf{W}^{T}\mathbf{S}_{B}\mathbf{W}]}{Tr[\mathbf{W}^{T}\mathbf{S}_{W}\mathbf{W}]} }[/math]

As in the [math]\displaystyle{ 2 }[/math]-class case, this is also a generalized eigenvalue problem, and the solution can be computed as the first [math]\displaystyle{ (k-1) }[/math] eigenvectors of [math]\displaystyle{ \mathbf{S}_{W}^{-1}\mathbf{S}_{B}, }[/math] i.e. [math]\displaystyle{ \mathbf{S}_{W}^{-1}\mathbf{S}_{B}\mathbf{w}_{i} =\lambda_{i}\mathbf{w}_{i} }[/math].

Linear and Logistic Regression - October 12, 2010

In this Lecture, Prof Ali Ghodsi reviews the LDA as a dimensionality reduction method and introduces 2 models for regression, linear and logistic regression.

Regression analysis is a general statistical technique for modeling and analyzing how a dependent variable changes according to changes in independent variables. In classification, we are interested in how a label, [math]\displaystyle{ \,y }[/math], changes according to changes in [math]\displaystyle{ \,X }[/math].

General information on linear regression can be found at the University of South Florida and this MIT lecture.

Logistic Regression Cont. - October 14, 2010

Traditionally logistic regression parameters are estimated using maximum likelihood. However , other optimization techniques may be used as well.
Since there is no closed form solution for finding the zero of the first derivative of the log likelihood the Newton Raphson algorithm is used. Since the problem is convex Newtons is guaranteed to converge to a global optimum.
Logistic regression requires less parameters than LDA or QDA and is therefore more favorable for high dimensional data.

Multi-Class Logistic Regression & Perceptron - October 19, 2010

In this lecture, the topic of logistic regression was finalized by covering the multi-class logistic regression and new topic of perceptron has started. Perceptron is a linear classifier for two-class problems. The main goal of perceptron is to minimize the distances between the misclassified points and the decision boundary. This will be continued in the following lectures.

Perceptron Learning Algorithm and Feed Forward Neural Networks - October 21, 2010

In this lecture, we finalize our discussion of the Perceptron by reviewing its learning algorithm, which is based on gradient descent. We then begin the next topic, Neural Networks (NN), and we focus on a NN that is useful for classification: the Feed Forward Neural Network (FFNN). The mathematical model for the FFNN is shown, and we review one of its most popular learning algorithms: Back-Propagation.

To open the Neural Network discussion, we present a formulation of the universal function approximator. The mathematical model for Neural Networks is then built upon this formulation. We also discuss the trade-off between training error and testing error -- known as the generalization problem -- under the universal function approximator section.

Complexity Control - October 26, 2010

In this lecture, we introduce the topic of Model Selection, also known as Complexity Control. This involves choosing the model that will best classify any given data by minimizing test error rates or by regularization.

The two problems that we wish to avoid when choosing a model are over-fitting and under-fitting.

We discuss two methods to estimate the true error rate: simple cross-validation and K-fold cross-validation. Both of these methods involve splitting the data into a training set to train the model and a validation set to compute the test error rate.