Learning Fast Approximations of Sparse Coding

From statwiki

Revision as of 17:03, 21 November 2015

Overview

In contrast to dimensionality-reduction approaches to feature extraction, sparse coding is an unsupervised method that constructs a new representation of the data by mapping it linearly into a higher-dimensional space. The goal is a feature space in which each input vector can be largely reconstructed from a small subset of the features. This amounts to case-specific feature extraction: for a given input, most of the weight in its representation is assigned to a few features, and the remaining coefficients are negligibly small. The resulting procedure can flexibly represent unseen instances from the input space.

The larger set of spanning vectors follows from the objective of producing accurate reconstructions across a broad range of possible inputs from the original space. However, linear algebra tells us that, with more spanning vectors than input dimensions, an input vector no longer has a unique representation in the higher-dimensional feature space. This shortcoming is alleviated by the requirement that most of the influence be assigned to only a subset of the new features. We implement this goal through the notion of sparsity; namely, we penalize large weight values.

Unfortunately, implementation issues prevent the use of sparse coding in certain contexts. Because a representation must be found for each new case presented to the system, the procedure is often infeasible for online processing tasks, where new data must be handled in real time. Several approximation algorithms have been proposed to improve processing speed. However, these methods are limited in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture that adapts some of these approximation schemes, giving a new procedure that empirical testing shows to be roughly 10 times more efficient than the previous state-of-the-art approximation.

Review of Sparse Coding

For an input [math]\displaystyle{ X \in \mathbb{R}^n }[/math], we seek a new representation [math]\displaystyle{ Z \in \mathbb{R}^m }[/math] that satisfies the previously stated objective. To find an optimal code [math]\displaystyle{ Z }[/math] of [math]\displaystyle{ X }[/math], we also require a dictionary [math]\displaystyle{ W_d \in \mathbb{R}^{n \times m} }[/math], the matrix whose columns are the normalized vectors with respect to which the coordinates of [math]\displaystyle{ Z }[/math] are defined. Given a training set, we estimate the optimal sparse code for each training case in order to learn the dictionary matrix used to code novel inputs.

These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:

[math]\displaystyle{ E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 }[/math], for some chosen sparsity penalty [math]\displaystyle{ \alpha }[/math].

Using this, the optimal code for input [math]\displaystyle{ X }[/math] is naturally defined as [math]\displaystyle{ Z^* = \arg\min_Z E_{W_d}(X, Z) }[/math].
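As a small illustration of the loss above, the following sketch evaluates the energy for two codes that reconstruct the same input exactly. The function name `sparse_coding_energy` and the toy dictionary are assumptions made for this example and do not come from the paper; the point is only that, among codes with equal reconstruction error, the L1 term favors the sparser one.

```python
import numpy as np

def sparse_coding_energy(X, Z, W_d, alpha):
    """Evaluate 0.5 * ||X - W_d Z||_2^2 + alpha * ||Z||_1 (illustrative helper)."""
    residual = X - W_d @ Z
    return 0.5 * float(residual @ residual) + alpha * float(np.sum(np.abs(Z)))

# Toy overcomplete dictionary (n = 2, m = 3) with unit-norm columns.
# These values are made up for the example.
W_d = np.array([[1.0, 0.0, 0.6],
                [0.0, 1.0, 0.8]])
X = np.array([0.6, 0.8])
alpha = 0.5

# Two codes that both reconstruct X exactly, so only the L1 term differs.
Z_dense = np.array([0.6, 0.8, 0.0])
Z_sparse = np.array([0.0, 0.0, 1.0])

energy_dense = sparse_coding_energy(X, Z_dense, W_d, alpha)
energy_sparse = sparse_coding_energy(X, Z_sparse, W_d, alpha)
print(energy_dense, energy_sparse)
```

Both codes have zero reconstruction error here, so the code that uses a single dictionary element attains the lower energy.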

From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss [math]\displaystyle{ E_{W_d}(X, Z^*) }[/math] across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.
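One possible form of such a stochastic gradient step is sketched below. For a fixed code, the gradient of the loss with respect to the dictionary is the negative outer product of the reconstruction residual and the code. Everything else here is an assumption made for illustration rather than the authors' procedure: the names `infer_code` and `dictionary_sgd_step`, the learning rate, the use of a simple proximal-gradient loop to approximate the optimal code, and the column renormalization after each step.

```python
import numpy as np

def shrink(V, theta):
    # Component-wise soft threshold.
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def infer_code(X, W_d, alpha, n_iter=50):
    # Hypothetical helper: approximates the optimal code with a plain
    # proximal-gradient loop; any sparse coder could be substituted.
    L = np.linalg.eigvalsh(W_d.T @ W_d).max()
    Z = np.zeros(W_d.shape[1])
    for _ in range(n_iter):
        Z = shrink(Z - W_d.T @ (W_d @ Z - X) / L, alpha / L)
    return Z

def dictionary_sgd_step(X, W_d, alpha, lr=0.1):
    # One SGD step on E_{W_d}(X, Z*) with the code Z* held fixed.
    Z_star = infer_code(X, W_d, alpha)
    grad = -np.outer(X - W_d @ Z_star, Z_star)  # dE/dW_d at fixed Z*
    W_new = W_d - lr * grad
    # Assumption: keep columns at unit norm, since the dictionary is
    # described as a matrix of normalized vectors.
    norms = np.linalg.norm(W_new, axis=0)
    return W_new / np.maximum(norms, 1e-12)

# Toy run on synthetic data (n = 4, m = 8); the data is random, not real.
rng = np.random.default_rng(0)
W_d = rng.standard_normal((4, 8))
W_d /= np.linalg.norm(W_d, axis=0)
for X in rng.standard_normal((20, 4)):
    W_d = dictionary_sgd_step(X, W_d, alpha=0.1)
```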

Pre-existing Approximations: Iterative Shrinkage Algorithms

ISTA & Fast ISTA

The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant FISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we take a gradient step that shifts the current code in the direction that reduces the reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing [math]\displaystyle{ Z^{(0)} = 0 }[/math], we have the recursive update rule:

[math]\displaystyle{ Z^{(k)} = h_{\theta}\left(Z^{(k - 1)} - \frac{1}{L}W_d^T(W_dZ^{(k - 1)} - X)\right) }[/math]

Here, [math]\displaystyle{ L }[/math] is an upper bound on the largest eigenvalue of [math]\displaystyle{ W_d^TW_d }[/math], and [math]\displaystyle{ h_{\theta}( ) }[/math] is the shrinkage function with components [math]\displaystyle{ h_{\theta}(V)_i = \operatorname{sign}(V_i) \max(|V_i| - \theta_i, 0) }[/math], where [math]\displaystyle{ \theta \in \mathbb{R}^m }[/math] consists of the sparsity thresholds for the components of the code. Thresholds are typically set to [math]\displaystyle{ \theta_i =\frac{\alpha}{L} }[/math].

Depending on implementation choices, the time complexity of using ISTA to construct a code for a new input is [math]\displaystyle{ O(m^2) }[/math], [math]\displaystyle{ O(nm) }[/math], or [math]\displaystyle{ O(km) }[/math], where [math]\displaystyle{ k }[/math] is the average sparsity across samples and iterations.
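The ISTA update can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `shrink` and `ista`, the fixed iteration count, and the choice of the largest eigenvalue of the Gram matrix as the bound L are all choices made for this example.

```python
import numpy as np

def shrink(V, theta):
    # h_theta(V)_i = sign(V_i) * max(|V_i| - theta_i, 0)
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def ista(X, W_d, alpha, n_iter=100):
    # Assumption: take L to be the largest eigenvalue of W_d^T W_d,
    # which is the tightest valid upper bound.
    L = np.linalg.eigvalsh(W_d.T @ W_d).max()
    theta = alpha / L                # typical threshold choice
    Z = np.zeros(W_d.shape[1])       # Z^(0) = 0
    for _ in range(n_iter):
        # Gradient step on the reconstruction term, then shrinkage.
        Z = shrink(Z - W_d.T @ (W_d @ Z - X) / L, theta)
    return Z

# Toy dictionary with unit-norm columns (values made up for the example).
W_d = np.array([[1.0, 0.0, 0.6],
                [0.0, 1.0, 0.8]])
X = np.array([0.6, 0.8])
Z_hat = ista(X, W_d, alpha=0.5)
print(Z_hat)
```

With an identity dictionary, L equals 1 and the iteration reaches its fixed point after a single step, namely the shrinkage of the input itself. This gives a quick sanity check for the implementation.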

Fast ISTA ensures faster convergence than standard ISTA through the addition of a momentum term.
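The momentum idea can be illustrated with the standard FISTA recursion of Beck and Teboulle, which applies the shrinkage step at an extrapolated point rather than at the current code. The page does not spell out the momentum schedule, so the recursion below is the commonly cited form and is an assumption about what is intended; the names `shrink` and `fista` are likewise chosen for this example.

```python
import numpy as np

def shrink(V, theta):
    # Component-wise soft threshold, as in ISTA.
    return np.sign(V) * np.maximum(np.abs(V) - theta, 0.0)

def fista(X, W_d, alpha, n_iter=100):
    # Assumption: L is the largest eigenvalue of W_d^T W_d.
    L = np.linalg.eigvalsh(W_d.T @ W_d).max()
    theta = alpha / L
    Z_prev = np.zeros(W_d.shape[1])  # Z^(0) = 0
    Y = Z_prev.copy()                # extrapolated point
    t = 1.0                          # momentum parameter
    for _ in range(n_iter):
        # ISTA-style step taken from the extrapolated point Y.
        Z = shrink(Y - W_d.T @ (W_d @ Y - X) / L, theta)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Momentum: extrapolate along the most recent change in the code.
        Y = Z + ((t - 1.0) / t_next) * (Z - Z_prev)
        Z_prev, t = Z, t_next
    return Z_prev

# Toy dictionary with unit-norm columns (values made up for the example).
W_d = np.array([[1.0, 0.0, 0.6],
                [0.0, 1.0, 0.8]])
X = np.array([0.6, 0.8])
Z_fast = fista(X, W_d, alpha=0.5)
print(Z_fast)
```

The only difference from the plain ISTA loop is the extrapolated point `Y`; setting the momentum coefficient to zero recovers ISTA.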