learning Fast Approximations of Sparse Coding

From statwiki
Revision as of 04:39, 22 November 2015 by Derek (talk | contribs) (→‎Background)
Jump to navigation Jump to search

Background

In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed with the objective of obtaining a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space.

The introduction of a larger set of spanning vectors is a consequence of the desire to produce accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.

Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is demonstrated to be ~10 times more efficient than the previous state-of-the-art approximation, in empirical testing.

Review of Sparse Coding

For an input [math]\displaystyle{ X \epsilon \mathbb{R}^n }[/math], we seek a new representation [math]\displaystyle{ Z \epsilon \mathbb{R}^m }[/math] which satisfies the previously-stated objective. In order to find an optimal code [math]\displaystyle{ \, Z }[/math] of [math]\displaystyle{ \, X }[/math], we also require a dictionary [math]\displaystyle{ W_d \epsilon \mathbb{R}^{m x n} }[/math], the matrix of normalized vectors that the coordinates of [math]\displaystyle{ \, Z }[/math] are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.

These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:

[math]\displaystyle{ E_{W_d}(X, Z) = \frac{1}{2}\|X - W_dZ\|_2^2 + \alpha \|Z\|_1 }[/math],

where [math]\displaystyle{ \, \alpha }[/math] is the specified sparsity penalty. Using this, the optimal code for input [math]\displaystyle{ X }[/math] is naturally defined as [math]\displaystyle{ \, Z^* = argmin_Z E(X, Z) }[/math].

From here, the dictionary is learned in an unsupervised manner, typically through the application of stochastic gradient descent to the minimization of the average loss [math]\displaystyle{ E_{W_d}(X, Z^*) }[/math] across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.

Pre-existing Approximations: Iterative Shrinkage Algorithms

Iterative Shrinkage & Thresholding (ISTA)

The Iterative Shrinkage & Thresholding Algorithm (ISTA) and its sped-up variant Fast ISTA approximate the optimal code vector of an input by updating all the code components in parallel. The idea is that, at each iteration, we should shift our current code in the direction of greatest reconstruction error, and then apply a component-wise shrinkage function to enforce sparsity. Initializing [math]\displaystyle{ \, Z^{(0)} = 0 }[/math], we have the recursive update rule:

[math]\displaystyle{ Z^{(k + 1)} = h_{\theta}(Z^{(k)} - \frac{1}{L}W_d^T(W_dZ^{(k)} - X)) = h_{\theta}(W_eX + SZ^{(k)}) }[/math] (**),
where [math]\displaystyle{ W_e = \frac{1}{L}W_d^T }[/math] is the filter matrix, and [math]\displaystyle{ S = I - \frac{1}{L}W_d^TW_d }[/math] is the mutual-inhibition matrix.

Here, [math]\displaystyle{ \, L }[/math] is an upper-bound on the size of the eigenvalues of [math]\displaystyle{ W_d^TW_d }[/math], and [math]\displaystyle{ \, h_{\theta}( ) }[/math] is the shrinkage function with components [math]\displaystyle{ \, h_{\theta}(V)_i = sign(V_i) }[/math] [math]\displaystyle{ \, max(|V_i| - \theta_i, }[/math] [math]\displaystyle{ \, 0) }[/math], where [math]\displaystyle{ \theta \epsilon \mathbb{R}^m }[/math] consists of the sparsity thresholds for the components of the code. Thresholds are typically set to [math]\displaystyle{ \theta_i =\frac{\alpha}{L} }[/math].

Fast ISTA

Depending on a few possible choices to be made in implementing this scheme, the per-iteration time complexity in using ISTA to construct a code for a new input will be [math]\displaystyle{ \, O(m^2) }[/math], [math]\displaystyle{ \, O(nm) }[/math], or [math]\displaystyle{ \, O(km) }[/math], with [math]\displaystyle{ \, k }[/math] being the average sparsity across samples and iterations.

Fast ISTA is a modification of this approximation which ensures faster convergence through the addition of a momentum term. Here, our update rule becomes:

[math]\displaystyle{ Z^{(k + 1)} = h_{\theta}(Z^{(k)}) + \lambda (h_{\theta}^{(k-1)} - h_{\theta}^{(k - 2)}) }[/math]

In other words, the updated code is the shrinkage function applied to the current code, plus a multiple of the difference of the outputs of the shrinkage function for the preceding two iterations. This second term the rate at which the approximated code is changing.

Coordinate Descent

Instead of automatically updating all the entries of the code in parallel, we might consider strategically selecting a single component to be updated each iteration. Coordinate Descent adopts this mentality, and resultantly yields a superior approximation to the parallel ISTA methods in the same order of time. In fact, Coordinate Descent was widely believed to be the most efficient algorithm available for approximating sparse codes.

In each iteration of the procedure, we search for the entry of the current code which, if updated so as to minimize the corresponding loss while holding the other components constant, results in the largest change in comparison to the current code. This search step takes [math]\displaystyle{ \, O(m) }[/math] operations, and, so in also accounting for each component-wise optimization performed (which follows a similar process to that of the parallel case), each iteration requires [math]\displaystyle{ \, O(m^2) }[/math] steps. Alternatively, we could repeat the update process [math]\displaystyle{ \, O(n) }[/math] times instead, to achieve a per-iteration complexity of [math]\displaystyle{ \, O(nm) }[/math], which is again comparable to ISTA. Either way, it turns out that, deploying both for an approximately equal amount of time, Coordinate Descent will out-perform the ISTA methods in its approximation to an optimal code.

Encoders for Sparse Code Approximation

In seeking an approach to further improve upon the efficiency of Coordinate Descent, the authors present the use of feed-forward networks for real-time sparse code inference. Essentially, for the training phase, we will perform learning for a neural network which takes our original input [math]\displaystyle{ \, X \epsilon \mathbb{R}^n }[/math] and generates a prediction of its optimal code with respect to the previously-estimated dictionary. The training set will consist of the original [math]\displaystyle{ \, X }[/math] as our input, and their sparse codes estimated via Coordinate Descent as the target values. In learning the network weights, we use stochastic gradient descent to minimize the average squared-error between the network's predictions and these estimated sparse codes. The size of the network will be chosen with consideration of its feasibility in applying it to online processing.

A Simplistic Architecture and its Limitations

The most straight-forward approach to this task would be to use a single-layer feed-forward network. However, since we have the additional requirement that the network output is to be sparse, special consideration of the activation function must be made. The authors consider three such candidates: double tanh, a learned non-linearity, and the shrinkage function [math]\displaystyle{ \, h_{\theta}( ) }[/math] used for ISTA. The three approaches perform comparably in the authors' empirical testing, and so they opt to use [math]\displaystyle{ \, h_{\theta}( ) }[/math] to maintain a strong basis for comparison with the previous methods.

Despite the appeal of its simplicity, this network configuration is unable to learn "explaining-away" phenomena, a conditional-independence structure relevant to this task. Essentially, this means that if the learned weight matrix happens to contain two highly similar rows, the network will uniformly represent two components of the code as nearly-equal. However, this inability to select only one of the components and suppress the other redundant one is clearly indicative of a limitation of the network's ability to produce sparse encodings. Consequently, a more sophisticated architecture is proposed.

Learned ISTA & Learned Coordinate Descent

To address explaining-away phenomena, interaction terms between components are introduced. To implement these terms controlling redundancy amongst the components, the authors design a sequence of feed-forward networks structured so as to be analogous to executing some number of steps of ISTA or a pre-determined number of steps of Coordinate Descent. The use of ISTA versus Coordinate Descent corresponds to two distinct encoders, both of which can be viewed as a "time-unfolded" recurrent network.

Before understanding the rationale behind this approach, we must recognize a few relevant values which are inherently fixed in the process of executing ISTA or Coordinate Descent. The recursive equations iterated over in these procedures can both be re-expressed in terms parameters including a threshold vector [math]\displaystyle{ \, \theta }[/math], a filter matrix [math]\displaystyle{ \, W_e }[/math], and a mutual-inhibition matrix [math]\displaystyle{ \, S }[/math], where these terms are defined differently for the two procedures. Now, instead of using these fully-determined parameter forms and iterating the procedure until it converges, this encoder-driven approach purposes to learn [math]\displaystyle{ \, \theta }[/math], [math]\displaystyle{ \, W_e }[/math], [math]\displaystyle{ \, S }[/math], and then execute only a fixed number of steps of one of these procedures, in order to reduce the total computational cost. Using the available data to adaptively set these parameters allows the corresponding network to handle the issues pertaining to explaining-away.

In Learned ISTA (LISTA), the encoder structure takes the form defined by the recursive update for ISTA (**), iterated for a fixed number of times T. We learn the parameters [math]\displaystyle{ \, W_e }[/math], [math]\displaystyle{ \, \theta }[/math], and [math]\displaystyle{ \, S }[/math] by using stochastic gradient descent to minimize the previously-described squared-error loss for sparse code prediction. From its definition, we can see that [math]\displaystyle{ \, S }[/math] is shared across the [math]\displaystyle{ T }[/math] time steps, and so we use back-propagation through time to compute the error gradient for gradient descent.

Similarly, in Learned Coordinate Descent (LCoD), the network architecture is defined by the recursive update for Coordinate Descent (omitted for brevity) iterated for a fixed number of time steps. The parameters [math]\displaystyle{ \, W_e }[/math], [math]\displaystyle{ \, \theta }[/math], and S are learned analogous to the procedure for LISTA, except for the technicality that sub-gradients are propagated, resulting from the fact that we search for the component inducing the largest update in the code Z.

Empirical Results

Two sets of experiments were undertaken:

  • The procedures were compared in their performance on exact sparse code inference, using the Berkeley image database.
  • The MNIST digits dataset was used in assessing whether improved error-rates in code-prediction yields superior performance in recognition tasks.

Overall, the results indicate that this proposal to construct encoders from a set of iterations of ISTA or Coordinate Descent yields a significantly-reduced runtime when compared to the pre-existing procedures, at a pre-specified error-rate.

Berkeley Image Database

From this database, 10x10-pixel image patches were randomly drawn to compare (Fast) ISTA with LISTA, and Coordinate Descent with LCoD, in sparse code prediction. Performance was tested on dictionaries of sizes m = 100 and m = 400, using a sparsity penalty of [math]\displaystyle{ \alpha = 0.5 }[/math], and the squared-error loss for the distance between the estimate and the optimal code, as computed by Coordinate Descent.

Figure 1 suggests that, for a small number of iterations, LISTA notably out-performs FISTA here. Furthermore, 18 iterations of FISTA are required to achieve the error-rate produced by 1 iteration of LISTA when m = 100, and 35 iterations when m = 400. This leads the authors to conclude that LISTA is ~20 times faster than FISTA.

Figure 1: Comparison of LISTA and FISTA. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes m = 100, and "4x" for m = 400

Figure 2 indicates that, for a moderately-large number of iterations, LCoD has superior accuracy to Coordinate Descent. When the number of iterations is larger, Coordinate Descent out-performs LCoD, but this can be reversed by initializing matrices with their LCoD values prior to training.

Figure 2: Comparison of LCoD and Coordinate Descent. Squared-error in code prediction plotted against number of iterations. In the legend, "1x" denotes m = 100, and "4x" for m = 400. Open Circles indicate that matrices have been initialized with their LCoD values prior to training

MNIST Digits

Here, the authors examined whether these coding approximations could be effectively applied in classification tasks. Evaluation was conducted using both the entire 28x28-pixel images to create 784-dimensional codes, as well as extracted 16x16-pixel patches for codes with 256 components. In addition to the finding that increasing the number of iterations reduced classification error, it was observed that Learned Coordinate Descent approached the optimal error rate after only 10 iterations.