self-Taught Learning
Theory
Introduction
Self-taught learning is a new paradigm in machine learning that builds on ideas from existing supervised, semi-supervised and transfer learning algorithms. The differences in these methods depend on the availability of labeled and unlabeled data:
- Supervised Learning - All data is labeled and of the same type (shares the same class labels).
- Semi-supervised learning - Some of the data is labeled but all of it is the same type.
- Transfer-learning All data is labeled but some is of some other type (i.e. has class labels that do not apply to data set that we wish to classify).
Self-taught learning combines the latter two ideas. It uses labeled data with labels from the desired classes and unlabeled data from other, but somehow similar, classes. It is important to emphasize that the unlabeled data need not belong to the class labels we wish to assign.
The additional unlabeled data can then be used to learn a "higher-level feature representation on the inputs" which will make classification simpler. After this representation is found, it can be used on the labeled data. Classification is subsequently done in the new representation.
The main advantage of this approach is that unlabeled similar data is often easier and cheaper to obtain than labeled data belonging to our classes.
Self-Taught Learning Algorithm
The self-taught learning algorithm can be summarized as follows:
- Use unlabeled data to construct a new representation (typically a sparse high-dimensional one)
- Express labeled data in this new representation
- Use existing classification methods in this new space
There is much freedom as to how to accomplish each of the three steps, so there is a lot of room for new techniques. For steps 1 and 2, a specific method will be discussed.
Problem Specification
Formally, suppose we have [math]\displaystyle{ \,m }[/math] labeled training points [math]\displaystyle{ \{(x_{\ell}^{(i)}, y^{(i)}), i = 1,\dots,m\} }[/math], where [math]\displaystyle{ x_{\ell}^{(i)} \in \mathbb{R}^n }[/math] and [math]\displaystyle{ \,y^{(1)} }[/math] is a class label. Further assume that the data are independently and identically taken from some distribution. In addition, we also have a set of unlabeled data [math]\displaystyle{ \{x_u^{i)},i=1,\dots,k\} }[/math], [math]\displaystyle{ x_u^{i} \in \mathbb{R}^n }[/math]. It is not required that this data comes from the same distribution, which differentiates this method from semi-supervised learning; however, the data should somehow be relevant, as it is with transfer learning.
Constructing Bases Using Unlabeled Data
Given this data and a self-taught learning paradigm, there are many techniques that can be used for classification. A particular one, the sparse-coding algorithm by Olshausen and Field <ref name="OF1996">B. A. Olshausen and D.J. Field. (1996) "Emergence of simple-cell receptive field properties by learning a sparse code for natural images". Nature, 381:607-609.</ref>, will be discussed here. This method aims to solve the optimization problem
subject to the constraints [math]\displaystyle{ ||b_j||_2 \leq 1 }[/math] for all [math]\displaystyle{ j=1,\dots,s }[/math]. Note that:
- [math]\displaystyle{ b = \{b_1,\dots,b_s\} }[/math] are 'basis vectors', with [math]\displaystyle{ b_j \in \mathbb{R}^n }[/math]
- [math]\displaystyle{ a = \{a^{(1)},\dots, a^{(k)}\} }[/math] are 'activations' with each [math]\displaystyle{ a^{(i)} \in \mathbb{R}^s }[/math]
- [math]\displaystyle{ \,s }[/math] is the number of bases that are used in the new representation. [math]\displaystyle{ \,s }[/math] can be large than [math]\displaystyle{ \,n }[/math]
The first term in the cost function has the purpose of creating [math]\displaystyle{ \,b_j }[/math] and [math]\displaystyle{ a_j^{(1)} }[/math] such that each unlabeled point can be expressed as a linear combination of these bases and activations (weights). The [math]\displaystyle{ \beta }[/math] term acts as a penalty to ensure that the activation terms are sparse. REF.
This problem is convex in either [math]\displaystyle{ \,\textbf{a} }[/math] or [math]\displaystyle{ \,\textbf{b} }[/math] if the other one is constant. Since each of these cases can be solved efficiently , a solution to this optimization problem can be found by alternating between optimizing [math]\displaystyle{ \,\textbf{a} }[/math] and [math]\displaystyle{ \,\textbf{b} }[/math], while fixing the other.
Projecting Labeled Data in New Bases and Classification
Once the optimization problem is solved, the set of bases can be used on labeled data (recall that the bases were constructed using unlabeled data). That is, given the bases [math]\displaystyle{ b_1,\dots,b_s }[/math], for each [math]\displaystyle{ x_{\ell}^{(i)} }[/math], we need to find the corresponding weight $\hat{a}(x_{\ell}^{(i)})$. This is achieved as follows:
This is also a convex problem with an efficient solution, yielding an approximate representation of our point in the constructed bases. Note that due to the second term, the activation vector is sparse.
Once the data is in the new representation, standard supervised methods can be used to classify the data.
A Classifier Specific to Sparse-Coding
Results and Discussion
This method of learning applies best to classification of images, text, and sound, so most discussion will focus on classification in this context.
One natural first comparison might be look at this method against using PCA on the unlabeled data to construct a new bases. PCA differs in two major ways: it looks for linear structure and requires that new bases be orthogonal (thus limiting the number of dimensions in the new space to, at most, the number of dimensions in the original data). So, it is hard for PCA to form bases that represent the basic patterns that underlie data; however, this is not a problem for sparse-coding.
Experimental Results
References
<references />