Learning Fast Approximations of Sparse Coding
Background
In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space.
The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible inputs from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This shortcoming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.
Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.
Review of Sparse Coding
For an input [math]\displaystyle{ X \in \mathbb{R}^n }[/math], we seek a new representation [math]\displaystyle{ Z \in \mathbb{R}^m }[/math] which satisfies the previously-stated objective. In order to find an optimal code [math]\displaystyle{ \, Z }[/math] of [math]\displaystyle{ \, X }[/math], we also require a dictionary [math]\displaystyle{ W_d \in \mathbb{R}^{n \times m} }[/math], whose columns are the normalized basis vectors with respect to which the coordinates of [math]\displaystyle{ \, Z }[/math] are defined. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:
- [math]\displaystyle{ E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 }[/math], for some chosen sparsity penalty [math]\displaystyle{ \alpha }[/math].
Using this, the optimal code for input [math]\displaystyle{ X }[/math] is naturally defined as [math]\displaystyle{ \, Z^* = \arg\min_Z E_{W_d}(X, Z) }[/math].
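As a small illustration of the loss above, the following sketch evaluates it for two candidate codes of the same input. The function name `energy`, the toy dictionary, and the value of the sparsity penalty are all assumptions chosen for illustration; they do not come from the paper.

```python
def energy(W_d, X, Z, alpha):
    """Sparse-coding loss: 0.5 * ||X - W_d Z||_2^2 + alpha * ||Z||_1.

    W_d is a list of n rows, each of length m; X has length n; Z has length m.
    """
    # Reconstruction W_d Z, one entry per input dimension.
    recon = [sum(w * z for w, z in zip(row, Z)) for row in W_d]
    squared_error = sum((x - r) ** 2 for x, r in zip(X, recon))
    l1_norm = sum(abs(z) for z in Z)
    return 0.5 * squared_error + alpha * l1_norm


# Hypothetical overcomplete dictionary: n = 2 inputs, m = 3 unit-norm columns.
W_d = [[1.0, 0.0, 0.6],
       [0.0, 1.0, 0.8]]
X = [0.6, 0.8]

dense = energy(W_d, X, [0.6, 0.8, 0.0], alpha=0.1)   # two active components
sparse = energy(W_d, X, [0.0, 0.0, 1.0], alpha=0.1)  # one active component
print(dense, sparse)
```

Both candidate codes reconstruct this input exactly, which reflects the non-uniqueness of representations in the higher-dimensional space; the penalty term is what makes the code with a single active component the cheaper of the two.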
From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss [math]\displaystyle{ E_{W_d}(X, Z^*) }[/math] across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.
Pre-existing Approximations: Iterative Shrinkage Algorithms
Iterative Shrinkage & Thresholding (ISTA)
The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we take a gradient step that reduces the squared reconstruction error of the current code, and then apply a component-wise shrinkage function to enforce sparsity. Initializing [math]\displaystyle{ \, Z^{(0)} = 0 }[/math], we have the recursive update rule:
- [math]\displaystyle{ Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) }[/math]
Here, [math]\displaystyle{ L }[/math] is an upper bound on the largest eigenvalue of [math]\displaystyle{ W_d^TW_d }[/math], and [math]\displaystyle{ \, h_{\theta}( ) }[/math] is the shrinkage function with components [math]\displaystyle{ \, h_{\theta}(V)_i = \mathrm{sign}(V_i) \max(|V_i| - \theta_i, 0) }[/math], where [math]\displaystyle{ \theta \in \mathbb{R}^m }[/math] consists of the sparsity thresholds for the components of the code. Thresholds are typically set to [math]\displaystyle{ \theta_i =\frac{\alpha}{L} }[/math].
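The update rule and shrinkage function above can be sketched in plain Python as follows. This is a minimal sketch rather than a reference implementation: the names `shrink` and `ista`, the fixed iteration count, and the use of a single shared threshold [math]\displaystyle{ \alpha / L }[/math] for every component are assumptions made for illustration.

```python
import math


def shrink(V, theta):
    """Component-wise shrinkage: sign(V_i) * max(|V_i| - theta, 0)."""
    return [math.copysign(max(abs(v) - theta, 0.0), v) for v in V]


def ista(W_d, X, alpha, L, num_iters=200):
    """Run ISTA from Z = 0, assuming the shared threshold theta_i = alpha / L.

    W_d is a list of n rows of length m; L must upper-bound the largest
    eigenvalue of W_d^T W_d.
    """
    n, m = len(W_d), len(W_d[0])
    Z = [0.0] * m
    for _ in range(num_iters):
        # Residual W_d Z - X (length n).
        residual = [sum(W_d[i][j] * Z[j] for j in range(m)) - X[i]
                    for i in range(n)]
        # Gradient of the squared-error term: W_d^T (W_d Z - X) (length m).
        grad = [sum(W_d[i][j] * residual[i] for i in range(n))
                for j in range(m)]
        # Gradient step followed by shrinkage.
        Z = shrink([z - g / L for z, g in zip(Z, grad)], alpha / L)
    return Z


# Hypothetical toy problem: n = 2, m = 3, unit-norm dictionary columns.
W_d = [[1.0, 0.0, 0.6],
       [0.0, 1.0, 0.8]]
X = [0.6, 0.8]
# For this particular dictionary the largest eigenvalue of W_d^T W_d is 2.
Z = ista(W_d, X, alpha=0.1, L=2.0)
print(Z)
```

Each iteration here costs [math]\displaystyle{ \, O(nm) }[/math] because the residual is recomputed from the dictionary; precomputing [math]\displaystyle{ W_d^TW_d }[/math] and [math]\displaystyle{ W_d^TX }[/math] would trade this for an [math]\displaystyle{ \, O(m^2) }[/math] update.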
Time Complexity & Fast ISTA
Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be [math]\displaystyle{ \, O(m^2) }[/math], [math]\displaystyle{ \, O(nm) }[/math], or [math]\displaystyle{ \, O(km) }[/math], with [math]\displaystyle{ \, k }[/math] being the average sparsity across samples and iterations.
Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:
- [math]\displaystyle{ Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k - 1)} - h_{\theta}^{(k - 2)}) }[/math]
In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term captures the rate at which the approximated code is changing.
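For comparison, here is a minimal sketch of the momentum idea using the widely used Beck and Teboulle formulation of FISTA. Note that this is a swapped-in variant and not the simplified rule stated above: it applies the momentum to an auxiliary point before the gradient step, and it replaces the fixed multiplier [math]\displaystyle{ \lambda }[/math] with a schedule driven by a scalar sequence `t`. The function names, the shared threshold [math]\displaystyle{ \alpha / L }[/math], and the toy problem are assumptions made for illustration.

```python
import math


def shrink(V, theta):
    """Component-wise shrinkage: sign(V_i) * max(|V_i| - theta, 0)."""
    return [math.copysign(max(abs(v) - theta, 0.0), v) for v in V]


def gradient_step(W_d, X, Y, L):
    """Return Y - (1/L) * W_d^T (W_d Y - X)."""
    n, m = len(W_d), len(Y)
    residual = [sum(W_d[i][j] * Y[j] for j in range(m)) - X[i]
                for i in range(n)]
    return [Y[j] - sum(W_d[i][j] * residual[i] for i in range(n)) / L
            for j in range(m)]


def fista(W_d, X, alpha, L, num_iters=200):
    """Beck-Teboulle FISTA from Z = 0 (assumed variant, see the lead-in)."""
    m = len(W_d[0])
    Z_prev = [0.0] * m   # previous code estimate
    Y = list(Z_prev)     # auxiliary point that carries the momentum
    t = 1.0
    for _ in range(num_iters):
        # Same shrinkage step as ISTA, but evaluated at Y instead of Z.
        Z = shrink(gradient_step(W_d, X, Y, L), alpha / L)
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        momentum = (t - 1.0) / t_next
        # Extrapolate along the most recent change in the code.
        Y = [z + momentum * (z - zp) for z, zp in zip(Z, Z_prev)]
        Z_prev, t = Z, t_next
    return Z_prev


# Hypothetical toy problem: n = 2, m = 3, unit-norm dictionary columns.
W_d = [[1.0, 0.0, 0.6],
       [0.0, 1.0, 0.8]]
X = [0.6, 0.8]
Z = fista(W_d, X, alpha=0.1, L=2.0)
print(Z)
```

The extrapolation term `momentum * (z - zp)` plays the role of the second term in the rule above: it is proportional to how much the code changed over the last iteration.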
Coordinate Descent
Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent takes this approach, and as a result yields a superior approximation to the parallel ISTA methods in the same order of time.
In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes [math]\displaystyle{ \, O(m) }[/math] operations, and the single component-wise update that follows is no more expensive, so each iteration requires [math]\displaystyle{ \, O(m) }[/math] steps. So, we see that if we were to iterate this procedure [math]\displaystyle{ \, O(n) }[/math] or [math]\displaystyle{ \, O(m) }[/math] times, we would end up with a runtime complexity of the same magnitude as a single iteration of ISTA. However,