learning Fast Approximations of Sparse Coding: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 13: Line 13:
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:
These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:


:: <math> E_{W_d}(X, Z) = \frac{1}{2}|\X - W_dZ\|_2^2 + \alpha |Z\|_1 </math>,
:: <math> E_{W_d}(X, Z) = \frac{1}{2}|/X - W_dZ\|_2^2 + \alpha |Z\|_1 </math>,
for some chosen sparsity penalty <math> \alpha </math>.
for some chosen sparsity penalty <math> \alpha </math>.



Revision as of 03:01, 21 November 2015

Overview

In contrast to a dimensionality-reduction approach to feature-extraction, sparse coding is an unsupervised method which attempts to construct a novel representation of the data by mapping it linearly into a higher-dimensional space. This transformation is performed in the hope of being able to obtain a new feature space in which, for each vector, we can find a smaller subset of features that largely reconstruct it. Essentially, this allows us to perform case-specific feature-extraction, as for a given input, we seek a smaller subset of features to which we will assign the majority of the weight for its representation, with the remainder being negligibly small. This provides us with a procedure which attempts to flexibly represent unseen instances of the input space.

The introduction of a larger set of spanning vectors is a consequence of the objective of producing accurate reconstructions across a broad range of possible input from the original space. However, the algebra of linear transformations tells us that input vectors will no longer have a unique representation in the higher-dimensional feature space. This short-coming is alleviated by the fact that we would like to assign the majority of influence to only a subset of the new features. We implement this goal using the notion of sparsity; namely, we will penalize large weight values.

Unfortunately, there are some implementation issues which prevent the use of sparse coding in certain contexts. The fact that we must find a representation for each new case the system is provided with often renders the procedure infeasible for online processing tasks, as new data must be handled in real-time. Several approximation algorithms have been proposed to address issues in processing speed. However, these methods suffer from deficiencies in their ability to take into account some relevant conditional-independence structures in the data. To resolve these limitations, the authors introduce a feed-forward architecture which adapts some of these approximation schemes, giving a new procedure which is ~10 times more efficient than the previous state-of-the-art approximation.

Review of Sparse Coding

For an input [math]\displaystyle{ X \epsilon \mathbb{R}^n }[/math], we seek a new representation [math]\displaystyle{ Z \epsilon \mathbb{R}^m }[/math] which satisfies the previously-stated objective. In order to find an optimal code [math]\displaystyle{ Z }[/math] of [math]\displaystyle{ X }[/math], we also require a dictionary [math]\displaystyle{ W_d \epsilon \mathbb{R}^{m x n} }[/math], the matrix of normalized vectors that the coordinates of [math]\displaystyle{ Z }[/math] are defined in relation to. Given a training set, we will estimate the optimal sparse codes for each training case, in pursuit of the dictionary matrix to be used in coding novel input.

These solutions will be found based on a loss function taking into account the squared reconstruction error and the complexity of the code:

[math]\displaystyle{ E_{W_d}(X, Z) = \frac{1}{2}|/X - W_dZ\|_2^2 + \alpha |Z\|_1 }[/math],

for some chosen sparsity penalty [math]\displaystyle{ \alpha }[/math].

The optimal code for input [math]\displaystyle{ X }[/math] is [math]\displaystyle{ Z^* = argmin_Z E(X, Z) }[/math].

The dictionary is learned in an unsupervised manner, typically through the use of stochastic gradient descent in minimizing the average loss [math]\displaystyle{ E_{W_d}(X, Z^*) }[/math] across a subset of the training cases. The dictionary to be applied to new cases is learned prior to the execution of the approximation proposed here.