# Self-Taught Learning

## Introduction and Motivation

Figure 1: Illustration of classification paradigms. An orange outline indicates that data is labeled. Diagram from <ref name="Raina"/>

Self-taught learning is a machine learning paradigm introduced by Stanford researchers in 2007 <ref name="Raina">Raina et al. (2007). "Self-taught Learning: Transfer Learning from Unlabeled Data". Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR.</ref>. It builds on ideas from existing supervised, semi-supervised and transfer learning algorithms. These methods differ in how they use labeled and unlabeled data (Figure 1):

• Supervised learning - All data is labeled and of the same type (shares the same class labels).
• Semi-supervised learning - Only some of the data is labeled, but all of it is of the same type. One drawback is that acquiring unlabeled data of the same type is often difficult and/or expensive.
• Transfer learning - All data is labeled, but some of it is of another type (i.e. has class labels that do not apply to the data set that we wish to classify).

Self-taught learning combines the latter two ideas. It uses labeled data belonging to the desired classes and unlabeled data from other, somehow similar, classes. It is important to emphasize that the unlabeled data need not belong to the class labels we wish to assign, as long as it is related. This fact distinguishes it from semi-supervised learning. Since it uses unlabeled data from new classes, it can be thought of as semi-supervised transfer learning.

The additional unlabeled data can be used to learn a "higher-level feature representation on the inputs" <ref name="Raina"/>. After this representation is found, data can be analyzed and classified in this new space. In other words, the unlabeled data helps with dimension 'reduction' (in reality, the new space is often of higher dimension, but the representation is sparse) and labeled data helps with classification in the new representation.

The main advantage of this approach is that unlabeled similar data is often easier and cheaper to obtain than labeled data belonging to our classes. Additionally, the data from other classes often shares characteristics with data belonging to the desired classes that can be useful in supervised learning. For example, in the context of image classification, unlabeled images can be used to learn edges so that labeled images can be constructed as a linear combination of these base images. Self-taught learning can also be applied to the classification of sound and text.

## Self-Taught Learning Algorithm

The self-taught learning algorithm can be summarized as follows:

1. Use unlabeled data to construct a new representation (typically a sparse high-dimensional one)
2. Express labeled data in this new representation
3. Use existing classification methods in this new space

There is much freedom in how each of the three steps is accomplished. Existing techniques can be used, particularly in the classification step. However, there is also a lot of room for new techniques tailored specifically to self-taught learning.

With that said, a specific algorithm will be discussed for each step to give some idea of how it works.

### Problem Specification

Formally, suppose we have $\,m$ labeled training points $\{(x_{\ell}^{(i)}, y^{(i)}), i = 1,\dots,m\}$, where $x_{\ell}^{(i)} \in \mathbb{R}^n$ and $\,y^{(i)}$ is a class label. Further assume that the data are drawn independently and identically distributed (i.i.d.) from some distribution. We also have a set of unlabeled data $\{x_u^{(i)},i=1,\dots,k\}$, $x_u^{(i)} \in \mathbb{R}^n$. The unlabeled data is not required to come from the same distribution, which differentiates this method from semi-supervised learning; however, it should be relevant in some way, as in transfer learning.

### Constructing Bases Using Unlabeled Data

There are numerous techniques that can potentially be used to construct a new representation of the data. A particular one, the sparse-coding algorithm by Olshausen and Field <ref name="OF1996">Olshausen, B. A., & Field, D. J. (1996). "Emergence of simple-cell receptive field properties by learning a sparse code for natural images". Nature, 381, 607-609.</ref>, will be discussed here. This method aims to solve the optimization problem

$\min_{\textbf{b},\textbf{a}} \sum_{i=1}^k \left( || x_u^{(i)} - \sum_{j=1}^s a_j^{(i)}b_j||^2_2 + \beta||a^{(i)}||_1 \right) \,\,\,\,\,(1)$

subject to the constraints $||b_j||_2 \leq 1$ for all $j=1,\dots,s$. Note that:

• $\textbf{b} = \{b_1,\dots,b_s\}$ are 'basis vectors', with $b_j \in \mathbb{R}^n$
• $\textbf{a} = \{a^{(1)},\dots, a^{(k)}\}$ are 'activations' with each $a^{(i)} \in \mathbb{R}^s$
• $\,s$ is the number of bases that are used in the new representation. $\,s$ can be larger than $\,n$

The first term in the cost function drives $\,b_j$ and $a_j^{(i)}$ toward values such that each unlabeled point can be approximately expressed as a linear combination of the bases, weighted by the activations. The $\,\beta||a^{(i)}||_1$ term acts as a penalty that encourages the activations to be sparse<ref>Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 58, 267-288.</ref><ref>Ng, A.Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. ICML.</ref>. Using penalties other than this $L_1$ regularization tends to give non-sparse activations which, in turn, reduce learning performance.
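To see why the $L_1$ penalty produces exact zeros, it may help to work through a one-dimensional special case. This is an illustration under simplifying assumptions (a single basis vector $b$ with $||b||_2 = 1$ and a single point $x$), not a derivation taken from the paper. Writing $\rho = b^T x$, the problem becomes

$\min_{a} ||x - ab||_2^2 + \beta|a| = \min_{a} \left( a^2 - 2\rho a + \beta|a| \right) + ||x||_2^2$

whose minimizer is the soft-thresholded correlation

$\hat{a} = \mathrm{sign}(\rho)\,\max(|\rho| - \beta/2,\, 0)$

so $\hat{a}$ is exactly zero whenever $|\rho| \leq \beta/2$. With a squared penalty $\beta a^2$ in place of $\beta|a|$, the minimizer would be $\hat{a} = \rho/(1+\beta)$, which is shrunk toward zero but is nonzero whenever $\rho \neq 0$.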

Problem (1) is not convex in $\,\textbf{a}$ and $\,\textbf{b}$ jointly, but it splits into two sub-problems: it is convex in $\,\textbf{a}$ when $\,\textbf{b}$ is held constant, and convex in $\,\textbf{b}$ when $\,\textbf{a}$ is held constant. Since each of these cases can be solved efficiently, a solution to this optimization problem can be found by alternating between optimizing $\,\textbf{a}$ and $\,\textbf{b}$, while fixing the other.
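The alternation can be sketched in a few lines of dependency-free Python. This is a simplified illustration, not the efficient solvers referred to above: it takes a single proximal-gradient step on the activations and a single projected-gradient step on the bases per round, and the function names, step-size bounds, and toy data are all assumptions made for the example.

```python
import math
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def residual(x, a, B):
    """x - sum_j a_j * b_j for one point x and its activation vector a."""
    return [xd - sum(aj * b[d] for aj, b in zip(a, B)) for d, xd in enumerate(x)]

def objective(X, B, A, beta):
    """Value of objective (1): squared reconstruction error plus L1 penalty."""
    total = 0.0
    for x, a in zip(X, A):
        r = residual(x, a, B)
        total += dot(r, r) + beta * sum(abs(aj) for aj in a)
    return total

def soft_threshold(v, t):
    return math.copysign(max(abs(v) - t, 0.0), v)

def update_activations(X, B, A, beta):
    """One proximal-gradient step on every a^(i), with the bases fixed."""
    # 2 * ||B||_F^2 upper-bounds the Lipschitz constant; "or 1.0" guards zero.
    L = 2.0 * sum(dot(b, b) for b in B) or 1.0
    new_A = []
    for x, a in zip(X, A):
        r = residual(x, a, B)
        grad = [-2.0 * dot(b, r) for b in B]
        new_A.append([soft_threshold(aj - gj / L, beta / L)
                      for aj, gj in zip(a, grad)])
    return new_A

def update_bases(X, B, A):
    """One projected-gradient step on the bases, with the activations fixed."""
    L = 2.0 * sum(dot(a, a) for a in A) or 1.0
    grad = [[0.0] * len(B[0]) for _ in B]
    for x, a in zip(X, A):
        r = residual(x, a, B)
        for j, aj in enumerate(a):
            for d, rd in enumerate(r):
                grad[j][d] -= 2.0 * aj * rd
    new_B = []
    for b, g in zip(B, grad):
        stepped = [bd - gd / L for bd, gd in zip(b, g)]
        norm = math.sqrt(dot(stepped, stepped))
        # Project back onto the constraint set ||b_j||_2 <= 1.
        new_B.append([bd / max(norm, 1.0) for bd in stepped])
    return new_B

def sparse_coding(X, s, beta, rounds=50, seed=0):
    rng = random.Random(seed)
    B = []
    for _ in range(s):
        b = [rng.gauss(0.0, 1.0) for _ in range(len(X[0]))]
        norm = math.sqrt(dot(b, b))
        B.append([bd / norm for bd in b])  # start from feasible unit-norm bases
    A = [[0.0] * s for _ in X]
    history = [objective(X, B, A, beta)]
    for _ in range(rounds):
        A = update_activations(X, B, A, beta)
        B = update_bases(X, B, A)
        history.append(objective(X, B, A, beta))
    return B, A, history

# Toy unlabeled data: k = 10 points in R^4, with s = 6 > n bases.
data_rng = random.Random(1)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(10)]
B, A, history = sparse_coding(X, s=6, beta=0.5)
```

Because both steps use conservative step sizes, the recorded objective values in `history` should never increase from one round to the next. A practical implementation would solve each sub-problem more fully and use a numerical library rather than nested lists.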

### Projecting Labeled Data in New Bases and Classification

Figure 2: An example of a patch of an image $x$ represented as a combination of three (of many) basis vectors. Diagram from <ref name="Raina"/>

Once the optimization problem is solved, the set of bases can be used on labeled data (recall that the bases were constructed using unlabeled data). That is, given the bases $b_1,\dots,b_s$, for each $x_{\ell}^{(i)}$, we need to find the corresponding weight $\hat{a}(x_{\ell}^{(i)})$. This is achieved as follows:

$\hat{a}(x_{\ell}^{(i)}) = \arg\min_{a^{(i)}} || x_{\ell}^{(i)} - \sum_{j=1}^s a_j^{(i)}b_j||^2_2 + \beta||a^{(i)}||_1 \,\,\,\,\,(2)$

This is also a convex problem with an efficient solution, yielding an approximate representation of our point in the constructed bases. Note that due to the second term, the activation vector is sparse. Figure 2 shows a sample decomposition of an image patch into a weighted combination of basis images.
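As a minimal sketch of how problem (2) might be solved for a single labeled point, the code below uses cyclic coordinate descent, where each coordinate update has a closed-form soft-thresholding solution. The choice of solver, the function names, and the toy bases are assumptions made for illustration; the source does not prescribe a particular method.

```python
import math

def soft_threshold(v, t):
    """Shrink v toward zero by t; the result is exactly 0 when |v| <= t."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def encode(x, B, beta, sweeps=100):
    """Approximate argmin_a ||x - sum_j a_j b_j||_2^2 + beta * ||a||_1
    by cyclic coordinate descent over the entries of a."""
    a = [0.0] * len(B)
    residual = list(x)  # x - sum_j a_j b_j, kept up to date as a changes
    sq_norms = [sum(bd * bd for bd in b) for b in B]
    for _ in range(sweeps):
        for j, b in enumerate(B):
            if sq_norms[j] == 0.0:
                continue  # a zero basis vector cannot explain anything
            # Correlation of b_j with the residual that excludes b_j's own term.
            rho = sum(bd * rd for bd, rd in zip(b, residual)) + a[j] * sq_norms[j]
            new_aj = soft_threshold(rho, beta / 2.0) / sq_norms[j]
            delta = new_aj - a[j]
            if delta != 0.0:
                residual = [rd - delta * bd for rd, bd in zip(residual, b)]
                a[j] = new_aj
    return a

# Three hypothetical bases in R^2 (more bases than dimensions).
bases = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
codes = encode([2.0, 0.1], bases, beta=1.0)
print(codes)  # [1.5, 0.0, 0.0]
```

In this toy case only the first basis is active: the other two correlate too weakly with the point to exceed the threshold $\beta/2$, so their activations are set exactly to zero.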

Once the data is in the new representation, standard supervised methods can be used to classify the data.

### A Classifier Specific to Sparse-Coding

In the algorithm described above, once the data is represented in the new bases, typical classifiers can be used.

However, it might be advantageous to create a classifier specific to sparse-coding with hopes of improving accuracy. Specifically, there is a natural kernel that can be used to measure similarity given that the data lies in a space constructed with sparse-coding.

Namely, there are two assumptions that can be made:

• Assume that $\,x$ is corrupted by Gaussian noise $\,\eta$ (i.e. $\,\eta \sim N(0, \sigma^2)$ and $P(x = \sum_j a_j b_j + \eta|b,a) \propto \exp(-||\eta||^2_2/(2\sigma^2))$)
• Impose a Laplacian prior on $\,\mathbf{a}$ (typical for sparseness) with $P(a) \propto \exp(-\beta \sum_j |a_j|)$. Under these assumptions, expression (1) learns the bases $\,\mathbf{b}$ of a linear generative model of $\,x$
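The link between these two assumptions and the sparse-coding objective can be made explicit with a short derivation. The following is a sketch of the standard maximum a posteriori argument, written out here for clarity rather than quoted from the paper. Combining the likelihood and the prior gives

$P(a|x,b) \propto P(x|b,a)\,P(a) \propto \exp\left(-\frac{||x - \sum_j a_j b_j||_2^2}{2\sigma^2}\right)\exp\left(-\beta \sum_j |a_j|\right)$

and taking the negative logarithm yields

$-\log P(a|x,b) = \frac{1}{2\sigma^2}||x - \sum_j a_j b_j||_2^2 + \beta||a||_1 + \mathrm{const}$

Multiplying through by $2\sigma^2$ does not change the minimizer, so the most probable activation vector minimizes $||x - \sum_j a_j b_j||_2^2 + 2\sigma^2\beta||a||_1$. This has the same form as (2), with the penalty weight rescaled by the factor $2\sigma^2$.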

Therefore, a Fisher kernel can be used to measure similarity in the new representation<ref name="Jaakkola">Jaakkola, T., & Haussler, D. (1998). Exploiting generative models in discriminative classifiers. NIPS.</ref>. The algorithm is as follows:

• Use (1) and unlabeled data to learn $\,\mathbf{b}$
• Compute features $\hat{\mathbf{a}}(x)$ by solving (2)
• Compute kernels, $K(x^{(s)},x^{(t)}) = \left(\hat{a}(x^{(s)})^T\hat{a}(x^{(t)})\right) \cdot \left({r^{(s)}}^T r^{(t)}\right)$ where $r^{(s)} = x^{(s)} - \sum_j \hat{a}_j(x^{(s)}) b_j$ is the residual from our estimate in the new representation.
• Use a kernel-based classifier, such as an SVM, with this new kernel.
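The kernel computation in the third step can be sketched as follows. The helper names and the toy inputs are hypothetical; in practice the activation vectors would come from solving (2) rather than being written by hand.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def residual(x, a_hat, B):
    """r = x - sum_j a_hat_j * b_j, the part of x the bases fail to explain."""
    return [xd - sum(aj * b[d] for aj, b in zip(a_hat, B))
            for d, xd in enumerate(x)]

def sparse_coding_kernel(x_s, a_s, x_t, a_t, B):
    """K(x_s, x_t) = (a_s^T a_t) * (r_s^T r_t)."""
    return dot(a_s, a_t) * dot(residual(x_s, a_s, B), residual(x_t, a_t, B))

def gram_matrix(points, codes, B):
    """Matrix of pairwise kernel values for a list of points and their codes."""
    m = len(points)
    return [[sparse_coding_kernel(points[i], codes[i], points[j], codes[j], B)
             for j in range(m)] for i in range(m)]

# Toy example: two points in R^2 with hand-written activation vectors.
bases = [[1.0, 0.0], [0.0, 1.0]]
points = [[2.0, 0.5], [1.0, 1.0]]
codes = [[1.5, 0.0], [0.5, 0.5]]
K = gram_matrix(points, codes, bases)
print(K)  # [[1.125, 0.375], [0.375, 0.25]]
```

The resulting Gram matrix can then be handed to any kernel classifier that accepts a precomputed kernel.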

## Results and Discussion

This method of learning applies best to classification of images, text, and sound, so most discussion will focus on classification in this context.

A natural first comparison is to set this method against PCA applied to the unlabeled data to construct a new basis. PCA differs in two major ways: it looks for linear structure and it requires that new bases be orthogonal (thus limiting the number of dimensions in the new space to, at most, the number of dimensions in the original data). As a result, it is hard for PCA to form bases that represent the basic patterns that underlie the data; sparse-coding does not have this limitation.

In each of the experiments below there are three stages of processing:

• Raw data with no processing
• Data that have been reduced using PCA
• Data in the new representation after sparse-coding was applied to the PCA data.

Additionally, in some data sets, it is helpful to use the original features in addition to the new features.

In each case, both SVM and Gaussian Discriminant Analysis (GDA) were used and the best result was taken.

### Recognizing Font Characters

The first experiment examined using handwritten digits and handwritten letters to classify font characters. Table 1 shows the results for digits and Table 2 for handwritten letters.

Table 1

| Training Size | Raw | PCA | Sparse-Coding |
|---|---|---|---|
| 100 | 39.8% | 25.3% | 39.7% |
| 500 | 54.8% | 54.8% | 58.5% |
| 1000 | 61.9% | 64.5% | 65.3% |

Table 2

| Training Size | Raw | PCA | Sparse-Coding | Raw and Sparse-Coding |
|---|---|---|---|---|
| 100 | 8.2% | 5.7% | 7.0% | 9.2% |
| 500 | 17.9% | 14.5% | 16.6% | 20.2% |
| 1000 | 25.6% | 23.7% | 23.2% | 28.3% |

<references />